CN113806610A - Data clustering method, system and storage medium - Google Patents

Data clustering method, system and storage medium Download PDF

Info

Publication number
CN113806610A
CN113806610A CN202111156414.XA CN202111156414A CN113806610A CN 113806610 A CN113806610 A CN 113806610A CN 202111156414 A CN202111156414 A CN 202111156414A CN 113806610 A CN113806610 A CN 113806610A
Authority
CN
China
Prior art keywords
clustering
data
result
entropy
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111156414.XA
Other languages
Chinese (zh)
Inventor
邓少冬
盛龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Mix Intelligent Technology Co ltd
Original Assignee
Xi'an Mix Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Mix Intelligent Technology Co ltd filed Critical Xi'an Mix Intelligent Technology Co ltd
Priority to CN202111156414.XA priority Critical patent/CN113806610A/en
Priority to PCT/CN2021/123007 priority patent/WO2023050461A1/en
Publication of CN113806610A publication Critical patent/CN113806610A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data clustering method, a system and a storage medium, comprising the following steps: determining a data clustering condition; clustering data according to the data clustering conditions to obtain at least one first clustering result, and calculating the entropy load of each first clustering result; the entropy carriers represent the size of average information quantity carried by the corresponding first clustering results; and taking the maximum entropy load in the entropy loads, wherein the corresponding first clustering result is a data clustering result. The invention carries out clustering from the whole data, realizes the integrity of data clustering, and obtains more complete and accurate clustering results; in addition, dependence and processing on any special data do not exist in the clustering process, and any data type is not limited, so that the method is generally suitable for clustering of any data; the maximum average information bearing amount is used as a basis for determining the clustering result, and for a computer system with a certain storage space, the larger the information amount which can be stored, the higher the information expression efficiency is.

Description

Data clustering method, system and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data clustering method, a data clustering system and a storage medium.
Background
With the development and popularization of the internet in recent years, the quantity of data such as images, videos, texts and the like and the dimensionality of representation data are increased, and in order to utilize the massive data, the high-dimensional data needs to be clustered quickly and effectively, so that a large number of clustering algorithms are derived.
Clustering algorithms have been widely used in important fields such as data mining, face recognition, medical image analysis, and image segmentation, as one of the important research subjects in the field of machine learning. The image clustering is to divide and classify target data of completely unknown labels into different clusters, belongs to an exploratory technology for grouping through data characteristics, can be generally used for sorting image information or generating training sample labels and the like, and belongs to a common image processing means.
The conventional image Clustering method generally performs image Clustering Based on image features extracted from images by using a conventional Clustering algorithm, for example, Clustering by using algorithms such as a K-Means Clustering algorithm (K-Means Clustering algorithm) or a Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
Taking a K-Means algorithm as an example, the conventional K-Means algorithm needs to input a sample set, a clustered cluster tree K, a maximum iteration number N, and finally output cluster division, and the general process is as follows: selecting K objects from the data as initial clustering centers; calculating the distance from each clustering object to the clustering center for division; calculating each cluster center again; and calculating the standard measure function until the maximum iteration times are reached, stopping, and otherwise, continuing to operate.
However, based on the above algorithm process, the K-Means algorithm has the following major disadvantages:
a.K values are difficult to determine because it is not possible to determine in advance what category a given sample set should be divided into to be optimal;
b.K-Means adopts the iteration method, the obtained result is only the local optimal clustering result, and the integrity is lacked;
c. sensitive to outliers and outliers;
d. the mean value of the sample set is required to exist, and the data category is limited;
e. the clustering effect relies on the initialization of the cluster centers, while the initial cluster centers are randomly selected.
The applicant also carries out sufficient research on other clustering algorithms, and finds that other traditional clustering algorithms contain too much dependence and processing on special data besides the K-Means algorithm, so that the algorithms do not have universal applicability and integrity to data clustering, and the data clustering field lacks sufficient exploration on a clustering method for overcoming the defects of lack of universal applicability and integrity.
Disclosure of Invention
The invention aims to provide a data clustering method, a data clustering system and a data clustering storage medium, which solve the technical problem that the traditional clustering algorithm in the prior art in the field is lack of integrity and universal applicability.
In order to achieve the above object, an embodiment of the present invention provides a data clustering method, where the method includes:
determining a data clustering condition;
clustering data according to the data clustering condition to obtain at least one first clustering result, wherein each first clustering result in the at least one first clustering result comprises at least one data set; calculating an entropy load corresponding to each first clustering result, wherein the entropy load represents the size of the average information quantity carried by the corresponding first clustering result;
and taking the maximum entropy load in the entropy loads corresponding to each first clustering result, wherein the first clustering result corresponding to the maximum entropy load is a data clustering result.
Preferably, the data clustering condition is determined according to similarity between data.
Preferably, clustering data according to the data clustering condition includes: and clustering the data according to the combination relation of the data with different dimensions.
Further preferably, the combination relationship of the data with different dimensions is determined according to the dimensions concerned by the data clustering, and includes: and fixing the dimension data which is not concerned, and combining and traversing the concerned dimension data.
Further preferably, the clustering data according to the combination relationship of the different dimensional data specifically includes:
(v1,v2,v3,……,vj),
vj={amj}=a1j,a2j,……,amj
wherein v isjFor data of j dimension, data vjThe difference values are arranged in a sequence from small to large { a }mj},amjIs a sequence { amjThe m-th item in (a)mjRepresentative data vjMaximum difference between a1jRepresentative data vjA minimum difference between; when v isjFor dimensional data that is not of interest for data clustering, then vjIs taken as the sequence { amjAny at least one of }; when v isjFor data clustering of the dimensional data of interest, vjIs taken as the sequence { amjThe precedence order of the items in the } goes through each item, and vjClustering based on v when taking the latter itemjAnd taking the first clustering result obtained by the previous item for further clustering.
Preferably, the method for calculating the entropy vector comprises the following steps:
Figure BDA0003288758600000021
Figure BDA0003288758600000022
wherein, amjIs a sequence { amjThe m-th item in { a }mjIs the data v of the jth dimensionjThe difference values between the two are arranged in a sequence from small to large, a is the base number of a logarithmic function, a is more than 1, and the entropy is loaded
Figure BDA0003288758600000031
Denotes vjGet the sequence { amjThe mth item a inmjCarrying out clustering to obtain the size of the average information quantity borne by the first clustering result; n is vjGet the sequence { amjThe mth item a inmjThe first clustering result obtained by clustering containsThe number of data sets; k is a radical ofiIs the number of elements in the ith data set, N is the total number of data, piIs the ratio of the number of elements in the ith data set to the total number of data.
Further preferably, a is 2, and the entropy carrier calculated thereby represents bits, which are binary and represent a measure of the average amount of information.
Preferably, the method comprises the step of forming an information structure tree comprising:
re-determining data clustering conditions, further clustering a certain data set in the data clustering results according to a new data clustering condition by executing the clustering method according to the new data clustering condition to obtain a new maximum entropy load, wherein the clustering result corresponding to the new maximum entropy load comprises a plurality of diversity combinations, the information corresponding to the diversity combinations is subdivision information of the certain data set, the certain data set is used as a father node, the diversity combinations are used as child nodes, and an information structure tree is formed step by step.
Preferably, the method comprises the step of forming a clustering process tree comprising:
when said v isjIs taken as the sequence { amjV when the sequence of the items in the item is traversed each item for clustering, vjGet aqjThe first clustering result obtained by clustering is arranged in the q-th level of the clustering process tree, q is more than or equal to 1 and less than or equal to m, vjGet amjThe first clustering result obtained by clustering is the root node of the clustering process tree vjGet a1jClustering to obtain a first clustering result which is a leaf node of the clustering process tree, wherein the degree of the leaf node is zero; the q-th level set is used as a father node, and all elements forming the set by the q-1-th level clustering are used as child nodes of the set, so that a clustering process tree is formed step by step.
In order to achieve the above object, another embodiment of the present invention provides a data clustering system, which includes a memory, a processor, and a program stored in the memory and executable on the processor, where the program, when executed by the processor, implements the steps of the above data clustering method.
To achieve the above object, another embodiment of the present invention provides a computer-readable storage medium, wherein: the storage medium stores at least one program executable by at least one processor, and the at least one program, when executed by the at least one processor, implements the steps of a method for clustering data as described above.
The data clustering method, the data clustering system and the data clustering storage medium have the following beneficial effects:
(1) the data clustering method, the data clustering system and the storage medium provided by the invention cluster from the whole data according to the data clustering condition to obtain at least one first clustering result, and the data clustering result is obtained through the first clustering result with the largest bearing average information quantity, so that the integrity of data clustering is realized, and the obtained clustering result is more complete and accurate; in addition, dependence and processing on any special data do not exist in the clustering process, and any data type is not limited, so that the method is generally suitable for clustering of any data, and has very high practicability; the maximum average information bearing amount is used as a basis for determining a clustering result, and for a computer system with a certain storage space, the larger the information amount which can be stored, the higher the information expression efficiency is;
(2) the data clustering method, the data clustering system and the storage medium provided by the invention cluster the at least one first clustering result again based on the at least one first clustering result to obtain the local subdivision information of the at least one first clustering result, so that the coordination and unification of the integrity and the locality of the data clustering are realized;
(3) the method, the system and the storage medium for clustering data form an information structure tree, the entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, and the information quantity which can be stored in a computer system with a certain storage space is the maximum, so that the information expression efficiency is also the highest;
(4) the clustering method, the system and the storage medium of the data provided by the invention also form a clustering process tree in the clustering process, the clustering process tree clusters and distinguishes the data continuously from coarse to fine according to the granularity of the concerned dimension data, all information of single data point step by step can be reflected visually, and all clustering information of the data is traced and traceable.
Drawings
FIG. 1 is a schematic flow chart of a data clustering method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an application scenario of 12 data points in a storage medium and a method and a system for clustering data according to a second embodiment of the present invention;
FIG. 3 is a diagram illustrating a data clustering method, system and result of clustering with a data value difference of 1 in a storage medium according to a second embodiment of the present invention;
FIG. 4 is a diagram illustrating a data clustering method, system and result of clustering with a data value difference of 2 in a storage medium according to a second embodiment of the present invention;
FIG. 5 is a diagram illustrating a data clustering method, system and result of clustering with a data value difference of 3 in a storage medium according to a second embodiment of the present invention;
FIG. 6 is a diagram illustrating a data clustering method, system and result of clustering with a data value difference of 4 in a storage medium according to a second embodiment of the present invention;
FIG. 7 is a schematic diagram of a data clustering method, system and application scenario of 11 ordered data points in a storage medium according to a third embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a clustering process tree of a data clustering method, a data clustering system and a storage medium according to a third embodiment of the present invention;
FIG. 9 is a schematic diagram of a data clustering method, system and result of clustering with a data value difference of 1 in a storage medium according to a third embodiment of the present invention;
FIG. 10 is a schematic diagram of a data clustering method, system and result of clustering with a data value difference of 2 in a storage medium according to a third embodiment of the present invention;
FIG. 11 is a schematic diagram of a data clustering method, system and result of clustering with a data value difference of 3 in a storage medium according to a third embodiment of the present invention;
FIG. 12 is a schematic diagram of a data clustering method, system and result of clustering with a data value difference of 4 in a storage medium according to a third embodiment of the present invention;
FIG. 13 is a schematic diagram of an application scenario of a data clustering method, system and storage medium according to a fourth embodiment of the present invention;
FIG. 14 is a schematic diagram of a data clustering method, system and application scenario of 156 ordered data points in a storage medium according to a fourth embodiment of the present invention;
FIG. 15 is a schematic structural diagram of a clustering process tree of a data clustering method, a system and a storage medium according to a fourth embodiment of the present invention;
FIG. 16 is a diagram illustrating a result of clustering with a data value difference of 0 in a data clustering method, system and storage medium according to a fourth embodiment of the present invention;
FIG. 17 is a diagram illustrating a data clustering method, system and result of clustering with a data value difference of 1 in a storage medium according to a fourth embodiment of the present invention;
FIG. 18 is a diagram illustrating a data clustering method, system and result of clustering with a data value difference of 2 in a storage medium according to a fourth embodiment of the present invention;
FIG. 19 is a diagram illustrating a data clustering method, system and result of clustering with a data value difference of 3 in a storage medium according to a fourth embodiment of the present invention;
FIG. 20 is a diagram illustrating a data clustering method, system and result of clustering with a data value difference of 4 in a storage medium according to a fourth embodiment of the present invention;
FIG. 21 is a diagram illustrating a data clustering method, system and result of clustering data of a "cup" set with a data value difference of 0 in a storage medium according to a fourth embodiment of the present invention;
FIG. 22 is a diagram illustrating a data clustering method, system and result of clustering data of a "cup" set with a data value difference of 2 in a storage medium according to a fourth embodiment of the present invention;
FIG. 23 is a diagram illustrating a data clustering method, system and result of clustering data of a "cup" set with a data value difference of 4 in a storage medium according to a fourth embodiment of the present invention;
fig. 24 is a schematic structural diagram of an information structure tree of a data clustering method, a data clustering system, and a storage medium according to four embodiments of the present invention.
Detailed Description
The technical solutions of the present invention are described in further detail below with reference to the accompanying drawings and specific examples, and the following is a detailed description of the present invention with reference to specific embodiments, and the specific embodiments of the present invention should not be construed as being limited to these descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Example one
An embodiment of the present invention provides a data clustering method, as shown in fig. 1, including the following steps:
(1) determining a data clustering condition, comprising the following steps:
determining factors affecting similarity between data;
determining a data dimension of interest for a data cluster from a plurality of factors;
determining a combination relation of different dimensional data;
and determining the clustering condition of the data according to the combination relation of the dimension data.
The determination of the data clustering condition is based on the similarity between data, and the similarity between data is often influenced by factors of multiple dimensions, so that the data is clustered according to the following combination relationship of different dimension data in the embodiment of the present invention, specifically as follows:
(v1,v2,v3,……,vj),
vj={amj}=a1j,a2j,……,amj
the combinatorial relationship is determined according to the concerned dimensionality of the data clustering, and comprises the following steps: and fixing the dimension data which is not concerned, and combining and traversing the concerned dimension data.
Wherein v isjFor data of j dimension, data vjThe difference values are arranged in a sequence from small to large { a }mj},amjIs a sequence { amjThe m-th item in (a)mjRepresentative data vjMaximum difference between a1jRepresentative data vjA minimum difference between; when v isjFor dimensional data that is not of interest for data clustering, then vjIs taken as the sequence { amjAny at least one of }; when v isjFor data clustering of the dimensional data of interest, vjIs taken as the sequence { amjThe precedence order of the items in the } goes through each item, and vjClustering based on v when taking the latter itemjAnd taking the first clustering result obtained by the previous item for further clustering.
(2) And clustering the data according to the data clustering conditions to obtain at least one first clustering result, wherein each first clustering result comprises at least one data set. And calculating the entropy load corresponding to each first clustering result, wherein the entropy load represents the average information quantity carried by the corresponding first clustering result. The calculation method of the entropy load comprises the following steps:
Figure BDA0003288758600000061
Figure BDA0003288758600000062
wherein, amjIs a sequence { amjThe m-th item in { a }mjIs the data v of the jth dimensionjThe difference values between the two are arranged in a sequence from small to large, a is the base number of a logarithmic function, a is more than 1, and the entropy is loaded
Figure BDA0003288758600000071
Denotes vjGet the sequence { amjThe mth item a inmjCarrying out clustering to obtain the size of the average information quantity borne by the first clustering result; n is vjGet the sequence { amjThe mth item a inmjClustering to obtain the number of data sets contained in a first clustering result; k is a radical ofiIs the number of elements in the ith data set, N is the total number of data, piIs the ratio of the number of elements in the ith data set to the total number of data.
In the embodiment a of the present invention, a is preferably 2, and the entropy carrier obtained by the calculation represents bits, where the bits are binary and represent a unit of measurement of the average information amount.
The result obtained by each clustering is a plurality of data sets, each data set corresponds to a data category, when the computer system stores the clustering result, each data category has a code with fixed length corresponding to the data category, the average information quantity which can be stored by each code is fixed, correspondingly, the information expression efficiency of each code is also fixed, and the code with fixed length is expected to store the maximum average information quantity, so that the information expression efficiency is highest.
Entropy carrier
Figure BDA0003288758600000072
Represents the average information quantity carried by the clustering result obtained by the current clustering,
Figure BDA0003288758600000073
the larger the average information amount of each data type in the current clustering result is, the larger the average information amount that can be stored by the code corresponding to each data type is, and the higher the information expression efficiency of the code corresponding to each data type is, and thus, for a computer system with a certain storage space, the larger the information amount that can be stored is, and therefore, the higher the information expression efficiency thereof is.
(3) Taking the maximum entropy load I in the entropy loads of the first clustering results calculated in the step (2)maxAccording to the maximum entropy carrier ImaxObtaining a data clustering result, specifically:
Figure BDA0003288758600000074
wherein, ImaxThe maximum entropy load represents the maximum average information amount carried by the clustering result obtained by clustering according to the clustering condition, and the computer system with a certain storage space has the maximum information amount and the highest information expression efficiency, so the maximum entropy load ImaxThe corresponding clustering result is expected.
After the step (1), the step (2), and the step (3) are executed to complete one-time clustering, the data clustering method according to the embodiment of the present invention may further include a step of forming an information structure tree, which specifically includes:
re-determining data clustering conditions, further clustering a certain data set in the data clustering results according to a new data clustering condition by executing the clustering method according to the new data clustering condition to obtain a new maximum entropy load, wherein the clustering result corresponding to the new maximum entropy load comprises a plurality of diversity combinations, the information corresponding to the diversity combinations is subdivision information of the certain data set, the certain data set is used as a father node, the diversity combinations are used as child nodes, and an information structure tree is formed step by step.
The entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, and the amount of information which can be stored in a computer system with a certain storage space is the maximum, so that the information expression efficiency is the highest.
The data clustering method according to the first embodiment of the present invention may further include a step of forming a clustering process tree, which specifically includes:
when v isjIs taken in the sequence { a }mjV when the sequence of the items in the item is traversed each item for clustering, vjGet aqjThe first clustering result obtained by clustering is arranged in the q-th level of the clustering process tree, q is more than or equal to 1 and less than or equal to m, vjGet amjThe first clustering result obtained by clustering is the root node of the clustering process tree vjGet a1jClustering to obtain a first clustering result which is a leaf node of the clustering process tree, wherein the degree of the leaf node is zero; set of q-th hierarchyAnd as a parent node, clustering all elements forming the set by the q-1 th level to be child nodes of the set, and gradually forming a clustering process tree. When v isjIs taken in the sequence { a }mjWhen the sequence of the items in the tree is traversed to each item for clustering, the clustering process tree is embodied according to concerned dimension data vjThe data are clustered and distinguished continuously from coarse to fine according to the granularity of the data, all information clustered by a single data point step by step can be reflected visually by a clustering process tree, and all clustering information of the data can be traced and traced actively.
Example two
HSV is a color space created according to the intuitive nature of color, also known as a hexagonal pyramid model, in which the parameters of the color are: hue (h), saturation(s), lightness (v), the value ranges are respectively: h is 0-180, S is 0-255, V is 0-255, the image is composed of a plurality of data points, and each data point has an H value, an S value and a V value.
As shown in fig. 2, a second embodiment of the present invention provides a data clustering method, which aims at data of 12 scattered and unordered data points: the hue h value is clustered by the following method, and the method comprises the following steps:
(1) determining the condition of data clustering, specifically:
the similarity between the data of the embodiment is only influenced by the factors of one dimension: the difference Δ h between the hue h values, so the condition of data clustering in this embodiment is to cluster the data according to Δ h:
v1=Δh={am1}=a11,a21,……,am1
=1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25;
wherein v is1For data in dimension 1: Δ h, data v1The difference between Δ h is arranged in the order of smaller to larger as the sequence { a }m1},am1Is a sequence { am1The m-th item in (a)m125 represents a maximum difference between the data h of 25, a111 represents that the minimum difference between the data h is 1.Δ h is dimension data concerned by the second data clustering in the embodiment of the present invention, and the value of Δ h is in the sequence { a }m1And traversing each item according to the sequence of the items in the item, and further clustering based on a clustering result obtained by taking the previous item by the delta h when the latter item is taken by the delta h for clustering.
Therefore, the condition of the second data clustering in the embodiment of the invention is as follows: a sequence Δ h ═ a according to the difference Δ h between the tone h values of the data pointsm1The items in 1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23 and 25 are sequentially traversed through each item, and when the latter item is clustered by Δ h, the clustering result obtained by taking the former item by Δ h is further clustered.
(2) Clustering data according to data clustering conditions, and calculating entropy carrier after clustering
Figure BDA0003288758600000091
Figure BDA0003288758600000092
Figure BDA0003288758600000093
Wherein, am1Is a sequence { am1The m-th item in { a }m1Is the data v of the 1 st dimension1A sequence in which differences Δ h are arranged in descending order; a is the base number of the logarithmic function, and a is more than 1; entropy carrier
Figure BDA0003288758600000094
Denotes v1Get the sequence { am1The m-th item a inm1Carrying out clustering to obtain the size of the average information quantity borne by the first clustering result; n is v1Get the sequence { am1The m-th item a inm1Clustering to obtain the number of data sets contained in a first clustering result; k is a radical ofiIs the number of elements in the ith data set, N is the total number of data, piFor elements in the ith data setThe ratio of the number of elements to the total number of data.
The preferred value of the second embodiment a of the present invention is a ═ 2, and the entropy carrier obtained by this calculation represents bits, and the bits are binary and represent the measurement unit of the average information amount.
The result obtained by each clustering is a plurality of data sets, each data set corresponds to a data category, when the computer system stores the clustering result, each data category has a code with fixed length corresponding to the data category, the average information quantity which can be stored by each code is fixed, correspondingly, the information expression efficiency of each code is also fixed, and the code with fixed length is expected to store the maximum average information quantity, so that the information expression efficiency is highest.
Entropy carrier
Figure BDA0003288758600000096
Represents the average information quantity carried by the clustering result obtained by the current clustering,
Figure BDA0003288758600000097
the larger the average information amount of each data type in the current clustering result is, the larger the average information amount that can be stored by the code corresponding to each data type is, and the higher the information expression efficiency of the code corresponding to each data type is, and thus, for a computer system with a certain storage space, the larger the information amount that can be stored is, and therefore, the higher the information expression efficiency thereof is.
The embodiment of the invention specifically clusters data according to two data clustering conditions as follows:
s201, clustering when Δ h is 1, clustering data points representing hue h values with a difference of 1 in one set, obtaining a first clustering result after clustering, where the first clustering result includes five data sets, as shown in fig. 3, where N is 5 and N is 12, and calculating an entropy carrier I at this time1
Figure BDA0003288758600000095
S202, clustering based on the clustering result of Δ h ═ 1, when Δ h ═ 2, clustering data points representing hue h values with a difference of 2 in one set, and obtaining a first clustering result after clustering, where the first clustering result includes three data sets, as shown in fig. 4, where N is 3 and N is 12, and calculating the entropy load I at that time2
Figure BDA0003288758600000101
S203, clustering based on the clustering result of Δ h ═ 2, when Δ h ═ 3, clustering data points representing hue h values with a difference of 3 in one set, and obtaining a first clustering result after clustering, where the first clustering result includes two data sets, as shown in fig. 5, where N is 2 and N is 12, and calculating the entropy load I at that time3
Figure BDA0003288758600000102
S204, based on the clustering result of delta h-3, clustering is carried out when delta h is 4, data points representing hue h value difference of 4 are clustered in one set, a first clustering result is obtained after clustering, the first clustering result comprises one data set, as shown in figure 6, all the data points are all clustered into one set, and the entropy I at the moment is calculated4
Figure BDA0003288758600000103
In this embodiment, the result of clustering with Δ h ═ 5 to 25 is the same as the result of clustering with Δ h ═ 4 in step S204, and the entropy loading is also the same, so the details are not repeated.
(3) Taking the maximum entropy load I in the entropy loads of the first clustering results calculated in the step (2)maxAccording to the maximum entropy carrier ImaxObtaining a data clustering result, specifically:
Figure BDA0003288758600000111
wherein, ImaxAnd the maximum entropy load represents the maximum average information quantity carried by the clustering result obtained by clustering according to the clustering condition. I is3When the entropy loading obtained by clustering by the method of "Δ h ═ 3" is maximum, I is used for a computer system with a constant storage space3The corresponding clustering method has the largest amount of information which can be stored and the highest expression efficiency of the information, so the maximum entropy I3The corresponding clustering result is expected.
In the second embodiment of the present invention, the method for clustering one-dimensional data is exemplarily illustrated only by the difference Δ h between the hue h values of the data points, and substantially, the method, the system and the storage medium for clustering data are applicable to clustering any one-dimensional data.
EXAMPLE III
As shown in fig. 7, a third embodiment of the present invention provides a data clustering method, which is directed to data of 11 ordered data points in an orthogonal coordinate system: the hue h value, the x coordinate value and the y coordinate value are clustered by the following method, and the method comprises the following steps:
(1) determining the condition of data clustering, specifically:
the similarity between the three data in the embodiment of the invention is influenced by the factors of two dimensions: therefore, the condition of the three data clustering of the embodiment of the invention is that the data are clustered according to the combination relation of the delta h and the delta x:
(v1,v2),
v1=Δh,
v2=Δx;
in the embodiment of the invention, the concerned dimension data of the three-data clustering is delta h, and the irrelevant dimension data is delta x, so that the combination relation is fixed delta x, the data delta h clustering is traversed, and for delta h:
v1=Δh={am1}=a11,a21,……,am1
=0,1,2,3,4,5,6,7;
wherein v is1For data in dimension 1: Δ h, data v1The difference between Δ h is arranged in the order of smaller to larger as the sequence { a }m1},am1Is a sequence { am1The m-th item in (a)m17 represents the maximum difference between the data h as 7, a110 represents that the minimum difference between the data h is 0.Δ h is dimension data concerned by the three-data clustering in the embodiment of the present invention, and the value of Δ h is in the sequence { a }m1And traversing each item according to the sequence of the items in the item, and further clustering based on a clustering result obtained by taking the previous item by the delta h when the latter item is taken by the delta h for clustering.
For Δ x:
v2=Δx={am2}=a12,a22,……,am2
=1,2,3,4,5,6,7,8,9,10;
wherein v is2Data for the 2 nd dimension: Δ x, data v2The difference between Δ x is arranged in the order of the sequence { a } from small to largem2},am2Is a sequence { am2The m-th item in (a)m210 represents a maximum difference between the data Δ x of 10, a121 represents that the minimum difference between the data Δ x is 1. If Δ x is dimension data that is not concerned by the three-data clustering in the embodiment of the present invention, the value of Δ x is the sequence { a }m2At least one of them, the embodiment of the present invention takes the sequence { a } threem2The first term in (1), so Δ x ═ 1.
Therefore, the condition of the three-data clustering in the embodiment of the invention is as follows: the fixed Δ x is 1, and the sequence Δ h is { a } according to the difference Δ h between the tone h values of the respective data pointsm1And traversing each item in the sequence of items 0,1,2,3,4,5,6 and 7, and further clustering based on a clustering result obtained by taking the previous item based on the Δ h when the latter item is taken as the cluster.
Δ h take on the values in the sequence { a }m1The sequence of the items in the tree is traversed through each item for clustering, and delta h is a817, placing a clustering result obtained by clustering in the 7 th level of the clustering process tree, wherein the clustering result obtained by clustering is a root node of the clustering process tree; Δ h is taken to be a11The clustering result obtained by clustering is 0, the clustering result is arranged in the level 1 of the clustering process tree, the clustering result is a leaf node of the clustering process tree, and the degree of the leaf node is zero; when a certain set in the level 2 is used as a parent node, all elements forming the set in the level 1 are used as child nodes of the set, so as to gradually form a clustering process tree, as shown in fig. 8, when Δ h takes a value in a sequence { a }m1When the sequence of the items in the data is traversed through each item for clustering, the process that the clustering process tree continuously clusters and distinguishes the data from coarse to fine according to the granularity of the concerned dimension data delta h is embodied, the clustering process tree can visually reflect all information of single data point step by step clustering, and all clustering information of the data is traced and traceable, and the source traceability is realized.
(2) Clustering data according to data clustering conditions, and calculating entropy carrier after clustering
Figure BDA0003288758600000121
Figure BDA0003288758600000122
Figure BDA0003288758600000123
Wherein, am1Is a sequence { am1The m-th item in { a }m1Is the data v of the 1 st dimension1A sequence of differences Δ h arranged in descending order, a being the base of the logarithmic function, a > 1; entropy carrier
Figure BDA0003288758600000124
Denotes v1Get the sequence { am1The m-th item a inm1Carrying out clustering to obtain the size of the average information quantity borne by the first clustering result; n is v1Get the sequence { am1The m-th item a inm1Clustering to obtain the number of data sets contained in a first clustering result; k is a radical ofiIs the number of elements in the ith data set, N is the total number of data, piIs the ratio of the number of elements in the ith data set to the total number of data.
The preferred value of embodiment three a of the present invention is a ═ 2, and the entropy carrier obtained by this calculation represents bits, and the bits are binary and represent the measurement unit of the average information amount.
The result obtained by each clustering is a plurality of data sets, each data set corresponds to a data category, when the computer system stores the clustering result, each data category has a code with fixed length corresponding to the data category, the average information quantity which can be stored by each code is fixed, correspondingly, the information expression efficiency of each code is also fixed, and the code with fixed length is expected to store the maximum average information quantity, so that the information expression efficiency is highest.
Entropy carrier
Figure BDA0003288758600000131
Represents the average information quantity carried by the clustering result obtained by the current clustering,
Figure BDA0003288758600000132
the larger the average information amount of each data type in the current clustering result is, the larger the average information amount that can be stored by the code corresponding to each data type is, and the higher the information expression efficiency of the code corresponding to each data type is, and thus, for a computer system with a certain storage space, the larger the information amount that can be stored is, and therefore, the higher the information expression efficiency thereof is.
The data clustering according to the three data clustering conditions of the embodiment of the invention specifically comprises the following steps:
s301, fixing delta x to be 1, clustering when delta h is 0, and clustering data points representing the same hue h value in one set, wherein no data point meeting the clustering condition exists, and no clustering occurs to the data points, so that the entropy carrier I at the moment0=0。
S302, fixing delta x to be 1, clustering when delta h is 1, and clustering data points representing hue h values with difference of 1Clustering one set to obtain a first clustering result, where the first clustering result includes eight data sets, as shown in fig. 9, where N is 8 and N is 11, and calculating the entropy loading I at that time1
Figure BDA0003288758600000133
S303, fixing Δ x to 1, clustering when Δ h is 2 based on the clustering result of Δ h to 1, clustering to obtain a first clustering result, where the first clustering result includes four data sets, as shown in fig. 10, where N is 4 and N is 11, and calculating the entropy carrier I at that time2
Figure BDA0003288758600000134
S304, fixing Δ x to 1, clustering when Δ h is 3 based on the clustering result of Δ h to 2, clustering to obtain a first clustering result, where the first clustering result includes two data sets, as shown in fig. 11, where N is 2 and N is 11, and calculating the entropy carrier I at that time3
Figure BDA0003288758600000141
S305, fixing Δ x to 1, clustering when Δ h is 4 based on the clustering result of Δ h to 3, clustering to obtain a first clustering result, where the first clustering result includes a data set, as shown in fig. 12, where N is 1 and N is 11, and calculating the entropy carrier I at that time4
Figure BDA0003288758600000142
In the third embodiment of the present invention, the result of clustering with Δ h ═ 5 to 7 is the same as the result of clustering with Δ h ═ 4 in step S305, and the entropy loading is also the same, so that details are not repeated.
(3) Taking the maximum entropy load I in the entropy loads of the first clustering results calculated in the step (2)maxAccording to the maximum entropy carrier ImaxObtaining a data clustering result, specifically:
Figure BDA0003288758600000143
wherein, ImaxAnd the maximum entropy load represents the maximum average information quantity carried by the clustering result obtained by clustering according to the clustering condition. I is2When the entropy carrier obtained by clustering by the method of "fixed Δ x is 1 and Δ h is 2" is the maximum, the amount of information that can be stored is the maximum for a computer system with a fixed storage space, and the expression efficiency of the information is the highest, so the maximum entropy carrier I is the maximum2The corresponding clustering result is expected.
In the third embodiment of the present invention, the two-dimensional data clustering method is exemplarily illustrated only by the difference Δ h between hue h values and the difference Δ x between x coordinate values, and substantially, the data clustering method, system and storage medium of the present invention are applicable to any two-dimensional data clustering.
Example four
The fourth embodiment of the present invention takes the image segmentation field as an example to explain the data clustering method of the present invention, and the application scenario of image segmentation is shown in fig. 13.
As shown in fig. 14, the fourth embodiment is an image, and for the data of 156 ordered data points in the image: the hue h value, the x coordinate value and the y coordinate value are clustered by the following method:
(1) determining the condition of data clustering, specifically:
the similarity between the four data in the embodiment of the invention is only influenced by the factors of three dimensions: therefore, the four data clustering conditions in the embodiment of the invention are that the data are clustered according to the combination relationship of the delta h, the delta x and the delta y:
(v1,v2,v3),
v1=Δh,
v2=Δx,
v3=Δy;
in the embodiment of the invention, the concerned dimension data of the four-data clustering is delta h, and the unconcerned dimension data is delta x and delta y, so that the combination relation is that the delta x and the delta y are fixed, the data delta h clustering is traversed, and for the delta h:
v1=Δh={am1}=a11,a21,……,am1
=0,1,2,3,4,5,158,159,160,161,162,163;
wherein v is1For data in dimension 1: Δ h, data v1The difference between Δ h is arranged in the order of smaller to larger as the sequence { a }m1},am1Is a sequence { am1The m-th item in (a)m1163 represents the maximum difference between data h as 163, a110 represents that the minimum difference between the data h is 0.Δ h is the dimensional data of interest for implementing the quad-like data clustering of the present invention, so the value of Δ h is in the sequence { a }m1And traversing each item according to the sequence of the items in the item, and further clustering based on a clustering result obtained by taking the previous item by the delta h when the latter item is taken by the delta h for clustering.
For Δ x:
v2=Δx={am2}=a12,a22,……,am2
=1,2,3,4,5,6,7,8,9,10,11;
wherein v is2Data for the 2 nd dimension: Δ x, data v2The difference between Δ x is arranged in the order of the sequence { a } from small to largem2},am2Is a sequence { am2The m-th item in (a)m2The maximum difference between 11 representative data Δ x is 11, a121 represents that the minimum difference between the data Δ x is 1. If Δ x is dimension data that is not concerned by the four-data clustering in the embodiment of the present invention, the value of Δ x is the sequence { a }m2At least one of them, the fourth embodiment of the present inventionColumn { am2The first term in (1), so Δ x ═ 1.
For Δ y:
v3=Δy={am3}=a13,a23,……,am3
=1,2,3,4,5,6,7,8,9,10,11,12;
wherein v is3For data in dimension 3: Δ y, data v3The difference between Δ y is arranged in the order of smaller to larger as the sequence { a }m3},am3Is a sequence { am3The m-th item in (a)m312 represents a maximum difference between data Δ y of 12, a131 represents that the minimum difference between the data Δ y is 1. If Δ y is dimension data that is not concerned by the four-data clustering in the embodiment of the present invention, the value of Δ y is the sequence { a }m3At least one of them, the sequence { a } is taken four times in the embodiment of the present inventionm3The first term in (1), so Δ y ═ 1.
Therefore, the four data clustering conditions in the embodiment of the present invention are as follows: a fixed Δ x and a fixed Δ y are 1, and a sequence Δ h and { a) is determined according to the difference Δ h between the hue h values of the data pointsm1And traversing each item according to the sequence of the items in 0,1,2,3,4,5,158,159,160,161,162 and 163, and further clustering based on the clustering result obtained by taking the previous item by the delta h when the latter item is clustered by the delta h.
Δ h is taken as sequence am1Traversing each item in the sequence of the middle items for clustering, taking a clustering result obtained by clustering 163 by delta h, and placing the clustering result in the 163 th level of the clustering process tree, wherein the clustering result obtained by clustering is a root node of the clustering process tree; Δ h is taken as 0 to perform clustering to obtain a clustering result, the clustering result is arranged in the level 1 of the clustering process tree, the clustering result is a leaf node of the clustering process tree, and the degree of the leaf node is zero; when a certain set in the level 2 is used as a parent node, all elements forming the set in the level 1 are used as child nodes of the set, so as to gradually form a clustering process tree, as shown in fig. 15, when Δ h is valued in the sequence am1When the sequence of the middle items traverses each item for clustering, the method reflects that the clustering process tree carries out clustering from coarse to fine according to the granularity of the concerned dimension data delta h,The continuous clustering and distinguishing can visually reflect that the data points of the single image are clustered step by step to form distinguishable objects, and further cluster to form all information of the whole image, thereby realizing that all clustering information of the data is traceable and has source traceability.
(2) Clustering data according to data clustering conditions, and calculating entropy load of the data after each clustering
Figure BDA0003288758600000161
Figure BDA0003288758600000162
Figure BDA0003288758600000163
Wherein, am1Is a sequence { am1The m-th item in { a }m1Is the data v of the 1 st dimension1A sequence of differences Δ h arranged in descending order, a being the base of the logarithmic function, a > 1; entropy carrier
Figure BDA0003288758600000164
Denotes v1Get the sequence { am1The m-th item a inm1Carrying out clustering to obtain the size of the average information quantity borne by the first clustering result; n is v1Get the sequence { am1The m-th item a inm1Clustering to obtain the number of data sets contained in a first clustering result; k is a radical ofiIs the number of elements in the ith data set, N is the total number of data, piIs the ratio of the number of elements in the ith data set to the total number of data.
In the fourth embodiment of the present invention, a is 2, and the entropy carrier thus calculated represents bits, and the bits are binary and represent a unit of measure of the average amount of information, so that it is more appropriate to take a to 2.
The result obtained by each clustering is a plurality of data sets, each data set corresponds to a data category, when the computer system stores the clustering result, each data category has a code with fixed length corresponding to the data category, the average information quantity which can be stored by each code is fixed, correspondingly, the information expression efficiency of each code is also fixed, and the code with fixed length is expected to store the maximum average information quantity, so that the information expression efficiency is highest.
Entropy carrier
Figure BDA0003288758600000171
Represents the average information quantity carried by the clustering result obtained by the current clustering,
Figure BDA0003288758600000172
the larger the average information amount of each data type in the current clustering result is, the larger the average information amount that can be stored by the code corresponding to each data type is, and the higher the information expression efficiency of the code corresponding to each data type is, and thus, for a computer system with a certain storage space, the larger the information amount that can be stored is, and therefore, the higher the information expression efficiency thereof is.
The four conditions for data clustering in the embodiment of the invention specifically comprise:
s401, fixing Δ x to 1, fixing Δ y to 1, clustering when Δ h to 0, clustering data points representing the same hue h value together in one set, and obtaining a first clustering result after clustering, where the first clustering result includes eighteen data sets, as shown in fig. 16, where N is 18 and N is 156, and calculating the entropy carrier I at that time0
Figure BDA0003288758600000173
S402, setting Δ x to 1, setting Δ y to 1, clustering based on the clustering result of Δ h to 0, clustering when Δ h to 1, clustering data points representing hue h values with a difference of 1, and obtaining a first clustering result after clustering, where the first clustering result includes fifteen data sets, as shown in fig. 17, where N is 15 and N is 156, and calculating the entropy load I at that time1
Figure BDA0003288758600000181
S403, fixing Δ x to 1, fixing Δ y to 1, clustering based on the clustering result of Δ h to 1, clustering when Δ h to 2, clustering data points representing hue h values with a difference of 2 into a set, and clustering to obtain a first clustering result, where the first clustering result includes nine data sets, as shown in fig. 18, where N is 9 and N is 156, and calculating the entropy load I at that time2
Figure BDA0003288758600000182
S404, fixing Δ x to 1, fixing Δ y to 1, clustering based on the clustering result of Δ h to 2, clustering when Δ h to 3, clustering data points representing hue h values with a difference of 3 into one set, and clustering to obtain a first clustering result, where the first clustering result includes six data sets, as shown in fig. 19, where N is 6 and N is 156, and calculating the entropy load I at that time3
Figure BDA0003288758600000183
S405, setting Δ x to 1, setting Δ y to 1, clustering based on a clustering result of Δ h to 3, clustering when Δ h to 4, clustering data points representing hue h values with a difference of 4, and obtaining a first clustering result after clustering, where the first clustering result includes four data sets, as shown in fig. 20, where N is 4 and N is 156, and calculating the entropy load I at that time4
Figure BDA0003288758600000191
S406, fixing delta x to 1, fixing delta y to 1, clustering based on a clustering result that delta h to 4 when delta h to 158, clustering data points representing hue h values with difference of 158 in a set, and obtaining a first clustering result after clustering, wherein the first clustering result is obtainedContains a data set: the background of the clustered images and the other sets on the images form a set corresponding to the entire image, as shown in fig. 14, where N is 1 and N is 156, and the entropy carrier I at that time is calculated158
Figure BDA0003288758600000192
In this embodiment, the result of clustering with Δ h ═ 5 is the same as the result of clustering with Δ h ═ 4 in step S405, and the result of clustering with Δ h ═ 159 to 163 is the same as the result of clustering with Δ h ═ 158 in step S406, and therefore, the description thereof is omitted.
(3) Taking the maximum entropy load I in the entropy loads of the first clustering results calculated in the step (2)maxAccording to the maximum entropy carrier ImaxObtaining a data clustering result, specifically:
Figure BDA0003288758600000193
wherein, ImaxAnd the maximum entropy load represents the maximum average information quantity carried by the clustering result obtained by clustering according to the clustering condition. I is4When the average information amount obtained by clustering by the method of "fixed Δ x is 1, fixed Δ y is 1, and Δ h is 4" is the largest, the information amount that can be stored in a computer system with a fixed storage space is the largest, and the expression efficiency of the information is also the highest, so that the maximum entropy carrier I is the largest4The corresponding data clustering result is expected.
In the fourth embodiment of the invention, the three-dimensional data clustering method is exemplarily illustrated only by the difference value delta h between hue h values, the difference value delta x between x coordinate values and the difference value delta y between y coordinate values, and in essence, the data clustering method, the system and the storage medium are suitable for clustering data in any three dimensions and more than three dimensions; in addition, as can be seen from fig. 20, after Δ h-4 clustering, four clearly distinguishable objects have been formed by clustering on the image: the four sets of safety helmet, water cup, gloves and image background are adopted, so that image segmentation is accurately realized.
Performing the step (1), the step (2) and the step (3) to complete one-time clustering, and it can be seen from the corresponding figure that at least one first clustering result can be obtained for each time of clustering, and each first clustering result comprises at least one set, as shown in fig. 20, the maximum entropy I is4The corresponding clustering results are four sets: assuming that the subdivision information of the 'cup' set data needs to be known and the expected maximum entropy load is obtained, the fourth embodiment of the present invention re-determines the data clustering condition, further clusters the 'cup' set data according to the new data clustering condition by repeating the steps (1), (2) and (3) to obtain a new maximum entropy load, and the clustering result corresponding to the new maximum entropy load includes two subsets: the cup cover and the cup body correspond to information which is subdivision information of the cup set data. The cup is collected as a father node, and the cup cover and the cup body are collected as child nodes, so that an information structure tree is gradually formed. The entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, and the amount of information which can be stored in a computer system with a certain storage space is the maximum, so that the information expression efficiency is the highest, specifically:
the fourth embodiment of the invention is directed at the data values of 6 ordered data points in the 'cup' set: determining a new data clustering condition according to the hue h value, the x coordinate value and the y coordinate value, and then repeatedly executing the step (1), the step (2) and the step (3) for further clustering, wherein the specific steps are as follows:
(1) determining a new data clustering condition, specifically:
the similarity between these 6 ordered data points is only affected by factors in two dimensions: the difference value delta h between the hue h values and the difference value delta y between the y coordinate values, so that the condition for clustering the data in the 'cup' set is that the data are clustered according to the combination relation of the delta h and the delta y:
(v1,v2),
v1=Δh,
v2=Δy;
in the fourth pair of 'water cup' set of the embodiment of the invention, the concerned dimension data of the data cluster is delta h, and the irrelevant dimension data is delta y, so that the combination relationship is fixed delta y, the data delta h cluster is traversed, and for delta h:
v1=Δh={am1}=a11,a21,……,am1
=0,2,4;
wherein v is1For data in dimension 1: Δ h, data v1The difference between Δ h is arranged in the order of smaller to larger as the sequence { a }m1},am1Is a sequence { am1The m-th item in (a)m14 represents a maximum difference between the data h of 4, a110 represents that the minimum difference between the data h is 0. Delta h is dimension data concerned by four pairs of 'cup' set data clustering in the embodiment of the invention, so that the value of Delta h is in a sequence { am1And traversing each item according to the sequence of the items in the item, and further clustering based on a clustering result obtained by taking the previous item by the delta h when the latter item is taken by the delta h for clustering.
For Δ y:
v2=Δy={am2}=a12,a22,……,am2
=1,2,3,4,5;
wherein v is2Data for the 2 nd dimension: Δ y, data v2The difference between Δ y is arranged in the order of smaller to larger as the sequence { a }m2},am2Is a sequence { am2The m-th item in (a)m25 represents a maximum difference between data Δ y of 5, a121 represents that the minimum difference between the data Δ y is 1. Delta y is dimension data which is not concerned by four pairs of 'cup' set data clustering in the embodiment of the invention, and the value of the Delta y is a sequence { am2At least one arbitrary item in the sequence, the four pairs of 'water cup' set data clustering and taking sequence { a }in the embodiment of the inventionm2The first term in (1), so Δ y ═ 1.
Therefore, the condition of the four pairs of 'cup' set data clustering in the embodiment of the invention is as follows: fixed Δ y 1, rootA sequence Δ h ═ a according to the difference Δ h between the hue h values of the data pointsm1And traversing each item according to the sequence of the items in 0,2 and 4, and further clustering based on a clustering result obtained by taking the previous item by the delta h when the latter item is taken by the delta h.
(2) Clustering data according to data clustering conditions, and calculating entropy load of the data after each clustering
Figure BDA0003288758600000211
Figure BDA0003288758600000212
Figure BDA0003288758600000213
Wherein, am1Is a sequence { am1The m-th item in { a }m1Is the data v of the 1 st dimension1A sequence of differences Δ h arranged in descending order, a being the base of the logarithmic function, a > 1; entropy carrier
Figure BDA0003288758600000214
Denotes v1Get the sequence { am1The m-th item a inm1Carrying out clustering to obtain the size of the average information quantity borne by the first clustering result; n is v1Get the sequence { am1The m-th item a inm1Clustering to obtain the number of data sets contained in a first clustering result; k is a radical ofiIs the number of elements in the ith data set, N is the total number of data, piIs the ratio of the number of elements in the ith data set to the total number of data.
In the fourth embodiment of the present invention, a is 2, and the entropy carrier thus calculated represents bits, and the bits are binary and represent a unit of measure of the average amount of information, so that it is more appropriate to take a to 2.
The result obtained by each clustering is a plurality of data sets, each data set corresponds to a data category, when the computer system stores the clustering result, each data category has a code with fixed length corresponding to the data category, the average information quantity which can be stored by each code is fixed, correspondingly, the information expression efficiency of each code is also fixed, and the code with fixed length is expected to store the maximum average information quantity, so that the information expression efficiency is highest.
Entropy carrier
Figure BDA0003288758600000221
Represents the average information quantity carried by the clustering result obtained by the current clustering,
Figure BDA0003288758600000222
the larger the average information amount of each data type in the current clustering result is, the larger the average information amount that can be stored by the code corresponding to each data type is, and the higher the information expression efficiency of the code corresponding to each data type is, and thus, for a computer system with a certain storage space, the larger the information amount that can be stored is, and therefore, the higher the information expression efficiency thereof is.
The four embodiments of the present invention further cluster the 'cup' set data according to the new data clustering conditions specifically:
s407, fixing Δ y to 1, clustering when Δ h is 0, clustering data points representing that hue h values in a "cup" set are the same, obtaining a first clustering result after clustering, where the first clustering result includes five data sets, as shown in fig. 21, where N is 5 and N is 6, and calculating the entropy load I at this time0
Figure BDA0003288758600000223
S408, fixing Δ y equal to 1, clustering when Δ h equal to 2 based on the clustering result when Δ h equal to 0, representing that data points with hue h value difference of 2 in the "cup" set are clustered into one set, obtaining a first clustering result after clustering, where the first clustering result includes two data sets, as shown in fig. 22, where N equal to 2 and N equal to 6, and calculating entropy at this time, where N equal to 2 and N equal to 6Carrier I2
Figure BDA0003288758600000224
S409, fixing Δ y to 1, clustering when Δ h is 4 based on a clustering result of Δ h to 2, representing that data points with hue h value difference of 4 in a "cup" set are clustered into a set, and obtaining a first clustering result after clustering, where the first clustering result includes a data set, as shown in fig. 23, where N is 1 and N is 6, and calculating entropy load I at this time4
Figure BDA0003288758600000231
(3) Taking the maximum entropy load I in the entropy loads of the first clustering results calculated in the step (2)maxAccording to the maximum entropy carrier ImaxObtaining a data clustering result, ImaxThe maximum entropy load that can be obtained from the clustering result after each clustering is finished is represented as follows:
Figure BDA0003288758600000232
wherein the maximum entropy carries ImaxThe maximum average information amount carried by the clustering result obtained by clustering according to the clustering condition is shown. I is2The entropy load obtained by clustering the 'water cup' set data by the 'fixed delta y-1 and delta h-2' method is maximum, the amount of information which can be stored in a computer system with a certain storage space is maximum, the expression efficiency of the information is also maximum, and therefore the maximum entropy load I is maximum4The corresponding clustering result is expected.
First, it can be seen from fig. 22 that the segmentation information of the "cup" set data is obtained by further clustering under the new clustering condition: the cup cover and the cup body have the advantages that the amount of information which can be stored in a computer system with a certain storage space is maximum, the information expression efficiency is also highest, and therefore the maximum entropy I is carried2Is correspondingly provided withThe clustering result of (2) is the segmentation information of the 'water cup' set data expected to be obtained.
Secondly, the cup set is used as a father node, the cup covers and the cups are combined together to be used as child nodes, and an information structure tree is gradually formed, as shown in fig. 24, the information structure tree embodies the information that original image data are clustered into a safety helmet set, a cup set and a glove set according to the size of delta h value granularity in a coarse-grained mode, and the cup set data are further clustered and distinguished in a fine-grained mode. Therefore, the entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, and the amount of information which can be stored in a computer system with a certain storage space is the maximum, so that the information expression efficiency is the highest.
Finally, in the fourth embodiment of the present invention, if the cup lid and the cup body are obtained by further clustering from the cup set data, they are obviously separated, as shown in fig. 24; compared with a water cup, a cup cover and a cup body, the whole image data is only local data, and the local data is incomplete and inaccurate clustering information, so that the whole clustering data of the whole image is expected to be obtained firstly, and the clustering result of the whole data is further clustered to obtain local subdivision information, as shown in fig. 20, so that the method carries out clustering from the whole data to obtain at least one first clustering result, obtains the data clustering result according to each first clustering result, and realizes the integrity of data clustering; and based on the at least one first clustering result, clustering the at least one first clustering result again to obtain the local subdivision information of the at least one first clustering result, so that the coordination and unification of the integrity and the locality of the data clustering are realized, and the coordination and unification of the integrity and the locality of the data clustering are realized, so that the obtained clustering result is more complete and accurate.
The above four embodiments only use x coordinate value, y coordinate value, and hue h value as data to perform exemplary clustering, so as to illustrate the specific implementation method of the present invention, the present invention is not exhaustive for various combination relationships of other kinds of data and various dimension data, because the present invention does not depend on and process any special data, and is generally applicable to clustering of any data.
EXAMPLE five
An embodiment of the present invention provides a data clustering system, where the system includes: a memory, a processor, and a program stored on the memory and executable on the processor, the program when executed by the processor implementing a method of clustering data, the method comprising the steps of:
(1) determining a data clustering condition, specifically:
the data clustering condition is determined according to the similarity between data, and the similarity between the data is often influenced by factors of multiple dimensions, so that the data are clustered according to the following combination relationship of different dimension data by the five-data clustering condition in the embodiment of the invention:
(v1,v2,v3,……,vj),
vj={amj}=a1j,a2j,……,amj
the combinatorial relationship is determined according to the concerned dimensionality of the data clustering, and comprises the following steps: and fixing the dimension data which is not concerned, and combining and traversing the concerned dimension data.
Wherein v isjFor data of j dimension, data vjThe difference values are arranged in a sequence from small to large { a }mj},amjIs a sequence { amjThe m-th item in (a)mjRepresentative data vjMaximum difference between a1jRepresentative data vjA minimum difference between; when v isjFor dimensional data that is not of interest for data clustering, then vjIs taken as the sequence { amjAny at least one of }; when v isjFor data clustering of the dimensional data of interest, vjIs taken as the sequence { amjThe precedence order of the items in the } goes through each item, and vjClustering based on v when taking the latter itemjAnd taking the first clustering result obtained by the previous item for further clustering.
(2) And clustering the data according to the data clustering conditions to obtain at least one first clustering result, wherein each first clustering result comprises at least one data set. And calculating the entropy load corresponding to each first clustering result, wherein the entropy load represents the average information quantity carried by the corresponding first clustering result. The calculation method of the entropy load comprises the following steps:
Figure BDA0003288758600000241
Figure BDA0003288758600000251
wherein, amjIs a sequence { amjThe m-th item in { a }mjIs the data v of the jth dimensionjThe difference values between the two are arranged in a sequence from small to large, a is the base number of a logarithmic function, a is more than 1, and the entropy is loaded
Figure BDA0003288758600000252
Denotes vjGet the sequence { amjThe mth item a inmjCarrying out clustering to obtain the size of the average information quantity borne by the first clustering result; n is vjGet the sequence { amjThe mth item a inmjClustering to obtain the number of data sets contained in a first clustering result; k is a radical ofiIs the number of elements in the ith data set, N is the total number of data, piIs the ratio of the number of elements in the ith data set to the total number of data.
The preferred value of the fifth a in the embodiment of the present invention is a ═ 2, and the entropy carrier obtained by this calculation represents bits, and the bits are binary and represent the measurement unit of the average information amount.
The result obtained by each clustering is a plurality of data sets, each data set corresponds to a data category, when the computer system stores the clustering result, each data category has a code with fixed length corresponding to the data category, the average information quantity which can be stored by each code is fixed, correspondingly, the information expression efficiency of each code is also fixed, and the code with fixed length is expected to store the maximum average information quantity, so that the information expression efficiency is highest.
Entropy carrier
Figure BDA0003288758600000253
Represents the average information quantity carried by the clustering result obtained by the current clustering,
Figure BDA0003288758600000254
the larger the average information amount of each data type in the current clustering result is, the larger the average information amount that can be stored by the code corresponding to each data type is, and the higher the information expression efficiency of the code corresponding to each data type is, and thus, for a computer system with a certain storage space, the larger the information amount that can be stored is, and therefore, the higher the information expression efficiency thereof is.
(3) Taking the maximum entropy load I in the entropy loads of the first clustering results calculated in the step (2)maxAccording to the maximum entropy carrier ImaxObtaining a data clustering result, specifically:
Figure BDA0003288758600000255
wherein, ImaxThe maximum entropy load represents the maximum average information amount carried by the clustering result obtained by clustering according to the clustering condition, and the computer system with a certain storage space has the maximum information amount and the highest information expression efficiency, so the maximum entropy load ImaxThe corresponding clustering result is expected.
After the step (1), the step (2), and the step (3) are executed to complete the primary clustering, the data clustering method in the fifth embodiment of the present invention may further include a step of forming an information structure tree, which specifically includes:
re-determining data clustering conditions, further clustering a certain data set in the data clustering results according to a new data clustering condition by executing the clustering method according to the new data clustering condition to obtain a new maximum entropy load, wherein the clustering result corresponding to the new maximum entropy load comprises a plurality of diversity combinations, the information corresponding to the diversity combinations is subdivision information of the certain data set, the certain data set is used as a father node, the diversity combinations are used as child nodes, and an information structure tree is formed step by step.
The entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, and the amount of information which can be stored in a computer system with a certain storage space is the maximum, so that the information expression efficiency is the highest.
The fifth data clustering method in the embodiment of the present invention may further include a step of forming a clustering process tree, which specifically includes:
when v isjIs taken in the sequence { a }mjV when the sequence of the items in the item is traversed each item for clustering, vjGet aqjThe first clustering result obtained by clustering is arranged in the q-th level of the clustering process tree, q is more than or equal to 1 and less than or equal to m, vjGet amjThe first clustering result obtained by clustering is the root node of the clustering process tree vjGet a1jClustering to obtain a first clustering result which is a leaf node of the clustering process tree, wherein the degree of the leaf node is zero; the q-th level set is used as a father node, and all elements forming the set by the q-1-th level clustering are used as child nodes of the set, so that a clustering process tree is formed step by step. When v isjIs taken in the sequence { a }mjWhen the sequence of the items in the tree is traversed to each item for clustering, the clustering process tree is embodied according to concerned dimension data vjThe data are clustered and distinguished continuously from coarse to fine according to the granularity of the data, all information clustered by a single data point step by step can be reflected visually by a clustering process tree, and all clustering information of the data can be traced and traced actively.
EXAMPLE six
An embodiment of the present invention further provides a computer-readable storage medium, where the storage medium stores at least one program, the program is executable by at least one processor, and the at least one program, when executed by the at least one processor, implements a method for clustering data, where the method for clustering data includes:
(1) determining a data clustering condition, specifically:
the data clustering condition is determined according to the similarity between data, and the similarity between the data is often influenced by factors of multiple dimensions, so that the data are clustered according to the following combination relationship of different dimension data by the five-data clustering condition in the embodiment of the invention:
(v1,v2,v3,……,vj),
vj={amj}=a1j,a2j,……,amj
the combinatorial relationship is determined according to the concerned dimensionality of the data clustering, and comprises the following steps: and fixing the dimension data which is not concerned, and combining and traversing the concerned dimension data.
Wherein v isjFor data of j dimension, data vjThe difference values are arranged in a sequence from small to large { a }mj},amjIs a sequence { amjThe m-th item in (a)mjRepresentative data vjMaximum difference between a1jRepresentative data vjA minimum difference between; when v isjFor dimensional data that is not of interest for data clustering, then vjIs taken as the sequence { amjAny at least one of }; when v isjFor data clustering of the dimensional data of interest, vjIs taken as the sequence { amjThe precedence order of the items in the } goes through each item, and vjClustering based on v when taking the latter itemjAnd taking the first clustering result obtained by the previous item for further clustering.
(2) And clustering the data according to the data clustering conditions to obtain at least one first clustering result, wherein each first clustering result comprises at least one data set. And calculating the entropy load corresponding to each first clustering result, wherein the entropy load represents the average information quantity carried by the corresponding first clustering result. The calculation method of the entropy load comprises the following steps:
Figure BDA0003288758600000271
Figure BDA0003288758600000272
wherein, amjIs a sequence { amjThe m-th item in { a }mjIs the data v of the jth dimensionjThe difference values between the two are arranged in a sequence from small to large, a is the base number of a logarithmic function, a is more than 1, and the entropy is loaded
Figure BDA0003288758600000273
Denotes vjGet the sequence { amjThe mth item a inmjCarrying out clustering to obtain the size of the average information quantity borne by the first clustering result; n is vjGet the sequence { amjThe mth item a inmjClustering to obtain the number of data sets contained in a first clustering result; k is a radical ofiIs the number of elements in the ith data set, N is the total number of data, piIs the ratio of the number of elements in the ith data set to the total number of data.
The preferred value of the sixth a of the embodiment of the present invention is a ═ 2, and the entropy carrier obtained by this calculation represents bits, and the bits are binary and represent the measurement unit of the average information amount.
The result obtained by each clustering is a plurality of data sets, each data set corresponds to a data category, when the computer system stores the clustering result, each data category has a code with fixed length corresponding to the data category, the average information quantity which can be stored by each code is fixed, correspondingly, the information expression efficiency of each code is also fixed, and the code with fixed length is expected to store the maximum average information quantity, so that the information expression efficiency is highest.
Entropy carrier
Figure BDA0003288758600000274
Representing the average information content carried by the clustering result obtained by the current clusteringThe size of (a) is (b),
Figure BDA0003288758600000275
the larger the average information amount of each data type in the current clustering result is, the larger the average information amount that can be stored by the code corresponding to each data type is, and the higher the information expression efficiency of the code corresponding to each data type is, and thus, for a computer system with a certain storage space, the larger the information amount that can be stored is, and therefore, the higher the information expression efficiency thereof is.
(3) Taking the maximum entropy load I in the entropy loads of the first clustering results calculated in the step (2)maxAccording to the maximum entropy carrier ImaxObtaining a data clustering result, specifically:
Figure BDA0003288758600000276
wherein, ImaxThe maximum entropy load represents the maximum average information amount carried by the clustering result obtained by clustering according to the clustering condition, and the computer system with a certain storage space has the maximum information amount and the highest information expression efficiency, so the maximum entropy load ImaxThe corresponding clustering result is expected.
After the step (1), the step (2) and the step (3) are executed to complete the primary clustering, the data clustering method in the sixth embodiment of the present invention may further include a step of forming an information structure tree, which specifically includes:
re-determining data clustering conditions, further clustering a certain data set in the data clustering results according to a new data clustering condition by executing the clustering method according to the new data clustering condition to obtain a new maximum entropy load, wherein the clustering result corresponding to the new maximum entropy load comprises a plurality of diversity combinations, the information corresponding to the diversity combinations is subdivision information of the certain data set, the certain data set is used as a father node, the diversity combinations are used as child nodes, and an information structure tree is formed step by step.
The entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, and the amount of information which can be stored in a computer system with a certain storage space is the maximum, so that the information expression efficiency is the highest.
The data clustering method in the sixth embodiment of the present invention may further include a step of forming a clustering process tree, which specifically includes:
when v isjIs taken in the sequence { a }mjV when the sequence of the items in the item is traversed each item for clustering, vjGet aqjThe first clustering result obtained by clustering is arranged in the q-th level of the clustering process tree, q is more than or equal to 1 and less than or equal to m, vjGet amjThe first clustering result obtained by clustering is the root node of the clustering process tree vjGet a1jClustering to obtain a first clustering result which is a leaf node of the clustering process tree, wherein the degree of the leaf node is zero; the q-th level set is used as a father node, and all elements forming the set by the q-1-th level clustering are used as child nodes of the set, so that a clustering process tree is formed step by step. When v isjIs taken in the sequence { a }mjWhen the sequence of the items in the tree is traversed to each item for clustering, the clustering process tree is embodied according to concerned dimension data vjThe data are clustered and distinguished continuously from coarse to fine according to the granularity of the data, all information clustered by a single data point step by step can be reflected visually by a clustering process tree, and all clustering information of the data can be traced and traced actively.
In summary, the data clustering method, the data clustering system and the storage medium provided by the invention cluster from the whole data according to the data clustering conditions to obtain at least one first clustering result, and the data clustering result is obtained through the first clustering result with the largest bearing average information amount, so that the integrity of data clustering is realized, and the obtained clustering result is more complete and accurate; in addition, dependence and processing on any special data do not exist in the clustering process, and any data type is not limited, so that the method is generally suitable for clustering of any data, and has very high practicability; the maximum average information bearing amount is used as a basis for determining a clustering result, and for a computer system with a certain storage space, the larger the information amount which can be stored, the higher the information expression efficiency is;
the data clustering method, the data clustering system and the storage medium provided by the invention cluster the at least one first clustering result again based on the at least one first clustering result to obtain the local subdivision information of the at least one first clustering result, so that the coordination and unification of the integrity and the locality of the data clustering are realized;
the method, the system and the storage medium for clustering data form an information structure tree, the entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, and the information quantity which can be stored in a computer system with a certain storage space is the maximum, so that the information expression efficiency is also the highest;
the clustering method, the system and the storage medium of the data provided by the invention also form a clustering process tree in the clustering process, the clustering process tree clusters and distinguishes the data continuously from coarse to fine according to the granularity of the concerned dimension data, all information of single data point step by step can be reflected visually, and all clustering information of the data is traced and traceable.

Claims (11)

1. A method for clustering data, the method comprising:
determining a data clustering condition;
clustering data according to the data clustering condition to obtain at least one first clustering result, wherein each first clustering result in the at least one first clustering result comprises at least one data set; calculating an entropy load corresponding to each first clustering result, wherein the entropy load represents the size of the average information quantity carried by the corresponding first clustering result;
and taking the maximum entropy load in the entropy loads corresponding to each first clustering result, wherein the first clustering result corresponding to the maximum entropy load is a data clustering result.
2. The method according to claim 1, wherein the data clustering condition is determined based on similarity between data.
3. The method of claim 1, wherein clustering data according to the data clustering condition comprises: and clustering the data according to the combination relation of the data with different dimensions.
4. The method according to claim 3, wherein the combination relationship of the data with different dimensions is determined according to the dimension concerned by the data clustering, and comprises: and fixing the dimension data which is not concerned, and combining and traversing the concerned dimension data.
5. The method according to claim 3, wherein the clustering data according to the combination relationship of the data with different dimensions specifically comprises:
(v1,v2,v3,……,vj),
vj={amj}=a1j,a2j,……,amj
wherein v isjFor data of j dimension, data vjThe difference values are arranged in a sequence from small to large { a }mj},amjIs a sequence { amjThe m-th item in (a)mjRepresentative data vjMaximum difference between a1jRepresentative data vjA minimum difference between; when v isjFor dimensional data that is not of interest for data clustering, then vjIs taken as the sequence { amjAny at least one of }; when v isjFor data clustering of the dimensional data of interest, vjIs taken as the sequence { amjThe precedence order of the items in the } goes through each item, and vjClustering based on v when taking the latter itemjAnd taking the first clustering result obtained by the previous item for further clustering.
6. The method for clustering data according to claim 1, wherein the entropy carriers are calculated by:
Figure FDA0003288758590000011
Figure FDA0003288758590000012
wherein, amjIs a sequence { amjThe m-th item in { a }mjIs the data v of the jth dimensionjThe difference values between the two are arranged in a sequence from small to large, a is the base number of a logarithmic function, a is more than 1, and I is carried by entropyamjDenotes vjGet the sequence { amjThe mth item a inmjCarrying out clustering to obtain the size of the average information quantity borne by the first clustering result; n is vjGet the sequence { amjThe mth item a inmjClustering to obtain the number of data sets contained in a first clustering result; k is a radical ofiIs the number of elements in the ith data set, N is the total number of data, piIs the ratio of the number of elements in the ith data set to the total number of data.
7. A method for clustering data according to claim 6, wherein a is 2, and the entropy obtained by calculation represents bits, and the bits are binary and represent the measurement unit of the average information amount.
8. A method for clustering data according to claim 1, the method comprising the step of forming an information structure tree comprising:
re-determining data clustering conditions, further clustering a certain data set in the data clustering results according to a new data clustering condition by executing the clustering method according to the new data clustering condition to obtain a new maximum entropy load, wherein the clustering result corresponding to the new maximum entropy load comprises a plurality of diversity combinations, the information corresponding to the diversity combinations is subdivision information of the certain data set, the certain data set is used as a father node, the diversity combinations are used as child nodes, and an information structure tree is formed step by step.
9. A method for clustering data according to claim 5, the method comprising a step of forming a clustering process tree comprising:
when said v isjIs taken as the sequence { amjV when the sequence of the items in the item is traversed each item for clustering, vjGet aqjThe first clustering result obtained by clustering is arranged in the q-th level of the clustering process tree, q is more than or equal to 1 and less than or equal to m, vjGet amjThe first clustering result obtained by clustering is the root node of the clustering process tree vjGet a1jClustering to obtain a first clustering result which is a leaf node of the clustering process tree, wherein the degree of the leaf node is zero; the q-th level set is used as a father node, and all elements forming the set by the q-1-th level clustering are used as child nodes of the set, so that a clustering process tree is formed step by step.
10. A system for clustering data, the system comprising a memory, a processor and a program stored in the memory and executable on the processor, the program, when executed by the processor, implementing the steps of the method for clustering data according to any one of claims 1 to 9.
11. A computer-readable storage medium characterized by: the storage medium stores at least one program executable by at least one processor, the at least one program when executed by the at least one processor implementing the steps of the method of clustering data according to any one of claims 1 to 9.
CN202111156414.XA 2021-09-30 2021-09-30 Data clustering method, system and storage medium Pending CN113806610A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111156414.XA CN113806610A (en) 2021-09-30 2021-09-30 Data clustering method, system and storage medium
PCT/CN2021/123007 WO2023050461A1 (en) 2021-09-30 2021-10-11 Data clustering method and system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111156414.XA CN113806610A (en) 2021-09-30 2021-09-30 Data clustering method, system and storage medium

Publications (1)

Publication Number Publication Date
CN113806610A true CN113806610A (en) 2021-12-17

Family

ID=78939055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111156414.XA Pending CN113806610A (en) 2021-09-30 2021-09-30 Data clustering method, system and storage medium

Country Status (2)

Country Link
CN (1) CN113806610A (en)
WO (1) WO2023050461A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678417B (en) * 2012-09-25 2017-11-24 华为技术有限公司 Human-machine interaction data treating method and apparatus
CN107909478A (en) * 2017-11-27 2018-04-13 苏州点对点信息科技有限公司 FOF mutual fund portfolio system and methods based on social network clustering and information gain entropy index
CN109657695A (en) * 2018-11-05 2019-04-19 中国电子科技集团公司电子科学研究院 A kind of fuzzy division clustering method and device based on definitive operation
CN111539443B (en) * 2020-01-22 2024-02-09 北京小米松果电子有限公司 Image recognition model training method and device and storage medium

Also Published As

Publication number Publication date
WO2023050461A1 (en) 2023-04-06

Similar Documents

Publication Publication Date Title
Nasir et al. Deep learning-based classification of fruit diseases: An application for precision agriculture
Moallem et al. Optimal threshold computing in automatic image thresholding using adaptive particle swarm optimization
Trujillo et al. Automated design of image operators that detect interest points
US8015125B2 (en) Multi-scale segmentation and partial matching 3D models
Chen et al. A self-balanced min-cut algorithm for image clustering
CN104616029B (en) Data classification method and device
Al-Sahaf et al. Image descriptor: A genetic programming approach to multiclass texture classification
CN103544499B (en) The textural characteristics dimension reduction method that a kind of surface blemish based on machine vision is detected
CN108334805B (en) Method and device for detecting document reading sequence
CN103745201B (en) A kind of program identification method and device
CN109766469A (en) A kind of image search method based on the study optimization of depth Hash
CN113761259A (en) Image processing method and device and computer equipment
Buvana et al. Content-based image retrieval based on hybrid feature extraction and feature selection technique pigeon inspired based optimization
WO2014195782A2 (en) Differential evolution-based feature selection
CN108764351A (en) A kind of Riemann manifold holding kernel learning method and device based on geodesic distance
CN109271544B (en) Method and device for automatically selecting painter representatives
Rahman et al. Seed-Detective: A Novel Clustering Technique Using High Quality Seed for K-Means on Categorical and Numerical Attributes.
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment
CN104572930B (en) Data classification method and device
CN113806610A (en) Data clustering method, system and storage medium
CN110222778B (en) Online multi-view classification method, system and device based on deep forest
CN109472319B (en) Three-dimensional model classification method and retrieval method
Akhtar et al. Big data mining based on computational intelligence and fuzzy clustering
Ju et al. A novel neutrosophic logic svm (n-svm) and its application to image categorization
Histograms Bi-level classification of color indexed image histograms for content based image retrieval

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20211217