CN116383680A - Clustering method, clustering device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN116383680A
CN116383680A (application CN202310351295.6A)
Authority
CN
China
Prior art keywords
node
target
nodes
new
new sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310351295.6A
Other languages
Chinese (zh)
Inventor
王洪波
余涛
杨贵锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd filed Critical Ping An Bank Co Ltd
Priority to CN202310351295.6A priority Critical patent/CN116383680A/en
Publication of CN116383680A publication Critical patent/CN116383680A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/231Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a clustering method, a clustering device, an electronic device and a computer-readable storage medium. The clustering process mainly includes the following steps: first, a cluster feature tree is constructed and its threshold parameters are determined; a target leaf node closest to a new sample point is searched downwards from the root node of the cluster feature tree; the Gaussian probability that the new sample point belongs to each CF node in the target leaf node is calculated; the CF node corresponding to the maximum Gaussian probability is determined as the target CF node; and finally the new sample point is added to the target CF node. In this way, the CF node closest to the new sample point is found by computing probabilities, so the method adapts more easily to concave or convex distributions and is not limited to hypersphere-shaped clusters. The optimized clustering model can effectively classify object sets with complex spatial distributions and avoids the cluster-adaptation problem caused by the spatial attributes of service data, thereby obtaining a matching model with lower deviation and better economy.

Description

Clustering method, clustering device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of data storage, and in particular, to a clustering method, a clustering device, an electronic device, and a computer readable storage medium.
Background
The BIRCH hierarchical clustering algorithm (Balanced Iterative Reducing and Clustering using Hierarchies) is an algorithm that combines balanced iterative reduction with hierarchical clustering; it runs fast, and clustering can be performed with only a single scan of the data set.
The algorithm involves 2 concepts. First, the CF (Clustering Feature): a triplet that can be represented by (N, LS, SS), where N is the number of sample points held in this CF, LS is the linear sum of the sample features, and SS is the sum of squares of the sample features.
Second, the CF Tree (Clustering Feature Tree): this tree can be divided into 3 classes of node, root nodes, branch nodes, and leaf nodes, where each node is made up of multiple CFs. A CF Tree has 3 important parameters: first, the maximum number of CFs in an internal node, called the branch balance factor B; second, the maximum number of CFs in a leaf node, called the leaf balance factor L; third, the spatial threshold T of a leaf node: the spatial distance between a new sample point and a CF is calculated, and the sample is absorbed into a CF node only if this distance is smaller than the threshold.
The main process of the BIRCH algorithm is the process of building a CF Tree, briefly described as follows: search downwards from the root node for the nearest leaf node and the nearest CF node within that leaf node; if the radius of the hypersphere corresponding to the CF node is still smaller than the threshold T after the new sample is added, update all CF triplets on the path and finish the insertion.
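The radius test above can be performed directly on the CF triplet, without the raw points, since R² = SS/N − ‖LS/N‖². The following Python sketch illustrates this; the function names `cf_radius` and `fits` are illustrative, not part of the patent.

```python
import math

def cf_radius(n, ls, ss):
    """Radius of the hypersphere summarised by a CF triplet (N, LS, SS).

    Uses R^2 = SS/N - ||LS/N||^2, the mean squared distance of the
    N points from their centroid, computed from the triplet alone.
    """
    centroid_sq = sum((c / n) ** 2 for c in ls)
    return math.sqrt(max(ss / n - centroid_sq, 0.0))

def fits(n, ls, ss, x, threshold_t):
    """Would the CF still satisfy the spatial threshold T after absorbing x?"""
    n2 = n + 1
    ls2 = [a + b for a, b in zip(ls, x)]
    ss2 = ss + sum(v * v for v in x)
    return cf_radius(n2, ls2, ss2) <= threshold_t
```

For example, a CF holding the points (1, 1) and (2, 2) is (2, (3, 3), 10) and has radius √0.5; absorbing the point (1, 2) gives radius 2/3, so the sample is accepted only if T ≥ 2/3.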
When clustering objects, this clustering algorithm must rely on the hypersphere radius, which requires that the distribution clusters of the data set formed by the objects be approximately hypersphere-shaped; otherwise the BIRCH algorithm cannot cluster effectively. It can be seen that the clustering factor of the existing BIRCH algorithm is single, so its applicability is poor.
Disclosure of Invention
In order to solve the technical problems, embodiments of the present application provide a clustering method, a clustering device, an electronic device, and a computer readable storage medium.
In a first aspect, an embodiment of the present application provides a clustering method, where the method includes:
constructing a cluster feature tree and determining threshold parameters of the cluster feature tree, wherein the cluster feature tree comprises root nodes, branch nodes and leaf nodes, each root node, branch node and leaf node stores at least one CF node, and the threshold parameters comprise a first threshold corresponding to the maximum sample radius of the CF node in the leaf node and a second threshold corresponding to the maximum number of storable CF nodes in the leaf node;
searching a target leaf node closest to a new sample point downwards from the root node in the cluster feature tree;
Calculating Gaussian probability that the new sample point belongs to each CF node in the target leaf node;
and determining a target CF node corresponding to the maximum value in all Gaussian probabilities, and adding the new sample point into the target CF node.
According to a specific embodiment of the present application, the step of determining a target CF node corresponding to a maximum value of all gaussian probabilities and adding the new sample point to the target CF node includes:
determining a target CF node corresponding to the maximum value in all Gaussian probabilities, and calculating the corresponding hypersphere radius of the new sample point after the new sample point is added into the target CF node;
if the hypersphere radius is smaller than or equal to the third threshold value, adding the new sample point into the target CF node, and updating all CF nodes from the root node to the target CF node;
if the hypersphere radius is larger than the third threshold value, a new CF node is created in the target leaf node, the new sample point is added to the new CF node, and all CF nodes between the root node and the target CF node are updated.
According to one specific embodiment of the present application, if the radius of the hypersphere is greater than the third threshold value, a new CF node is created in the target leaf node, and the step of adding the new sample point to the new CF node includes:
If the hypersphere radius of the target CF node is larger than the third threshold value, creating a new CF node in the target leaf node and adding the new sample point into the new CF node;
judging whether the number of all CF nodes in the target leaf node is smaller than or equal to the second threshold value;
if the number of all CF nodes in the target leaf node is less than or equal to the second threshold value, adding the new CF node into the target leaf node;
if the number of all CF nodes in the target leaf node is larger than the second threshold value, splitting the target leaf node into two new leaf nodes, and adding all CF nodes in the target leaf node into corresponding new leaf nodes according to the node distance.
According to one specific embodiment of the present application, the step of calculating the gaussian probability that the new sample point belongs to each CF node in the target leaf node includes:
initializing each CF node in the target leaf node into a corresponding cluster, and determining a Gaussian distribution function of each cluster;
calculating a weight parameter when the probability of the Gaussian distribution function is maximized according to the information of each data point in the CF node;
and calculating the Gaussian probability of the new sample point belonging to each CF node according to the Gaussian distribution function and the weight parameter corresponding to each CF node.
According to one embodiment of the present application, the step of determining the gaussian distribution function of each cluster includes:
initializing the Gaussian distribution function of each cluster as:

$$p(x) = \sum_{k=1}^{K} \alpha_k \,\mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2}\lvert\Sigma_k\rvert^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)\right)$$

wherein $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ represents the Gaussian probability density of the $k$-th cluster, $\alpha_k$ represents the mixture weight, $\mu_k$ represents the mean, and $\Sigma_k$ represents the covariance.
According to one specific embodiment of the present application, the step of calculating the weight parameter when the probability of the gaussian distribution function is maximized according to the information of each data point in the CF node includes:
substituting the initialized Gaussian distribution function into the probability maximization according to the information of each data point, the weight parameters are obtained as:

$$\alpha_k^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N}\gamma_{ik}^{(t)}$$

$$\mu_k^{(t+1)} = \frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\,x_i}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$

$$\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\,(x_i-\mu_k^{(t+1)})(x_i-\mu_k^{(t+1)})^{\top}}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$

wherein $\gamma_{ik}^{(t)}$ is the probability that data point $x_i$ belongs to cluster $k$ at iteration $t$, $\alpha_k^{(t+1)}$ represents the updated mixture weight, $\mu_k^{(t+1)}$ represents the updated mean, and $\Sigma_k^{(t+1)}$ represents the updated covariance.
According to a specific embodiment of the present application, after the step of constructing the cluster feature tree, the method further includes at least one of the following steps:
deleting first-class abnormal CF nodes whose number of sample points is smaller than a point threshold, and updating all CF nodes on the path from the root node to the abnormal CF node;
merging at least two second-class abnormal CF nodes whose node distance is smaller than a distance threshold.
In a second aspect, an embodiment of the present application provides a clustering apparatus, where the apparatus includes:
the construction module is used for constructing a cluster feature tree and determining threshold parameters of the cluster feature tree, wherein the cluster feature tree comprises root nodes, branch nodes and leaf nodes, each root node, branch node and leaf node stores at least one CF node, and the threshold parameters comprise a first threshold corresponding to the maximum sample radius of the CF nodes in the leaf nodes and a second threshold corresponding to the maximum number of storable CF nodes in the leaf nodes;
the searching module is used for searching the target leaf node closest to the new sample point downwards from the root node in the cluster feature tree;
the calculation module is used for calculating Gaussian probability that the new sample point belongs to each CF node in the target leaf node;
and the adding module is used for determining a target CF node corresponding to the maximum value in all Gaussian probabilities and adding the new sample point into the target CF node.
According to a specific embodiment of the present application, the adding module is configured to:
determining a target CF node corresponding to the maximum value in all Gaussian probabilities, and calculating the corresponding hypersphere radius of the new sample point after the new sample point is added into the target CF node;
if the hypersphere radius is smaller than or equal to the third threshold value, adding the new sample point into the target CF node, and updating all CF nodes from the root node to the target CF node;
If the hypersphere radius is larger than the third threshold value, a new CF node is created in the target leaf node, the new sample point is added to the new CF node, and all CF nodes between the root node and the target CF node are updated.
According to a specific embodiment of the present application, the adding module is configured to:
if the hypersphere radius of the target CF node is larger than the third threshold value, creating a new CF node in the target leaf node and adding the new sample point into the new CF node;
judging whether the number of all CF nodes in the target leaf node is smaller than or equal to the second threshold value;
if the number of all CF nodes in the target leaf node is less than or equal to the second threshold value, adding the new CF node into the target leaf node;
if the number of all CF nodes in the target leaf node is larger than the second threshold value, splitting the target leaf node into two new leaf nodes, and adding all CF nodes in the target leaf node into corresponding new leaf nodes according to the node distance.
According to one specific embodiment of the present application, the computing module is configured to:
initializing each CF node in the target leaf node into a corresponding cluster, and determining a Gaussian distribution function of each cluster;
Calculating a weight parameter when the probability of the Gaussian distribution function is maximized according to the information of each data point in the CF node;
and calculating the Gaussian probability of the new sample point belonging to each CF node according to the Gaussian distribution function and the weight parameter corresponding to each CF node.
According to one embodiment of the present application, the computing module is specifically configured to:
initializing the Gaussian distribution function of each cluster as:

$$p(x) = \sum_{k=1}^{K} \alpha_k \,\mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2}\lvert\Sigma_k\rvert^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)\right)$$

wherein $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ represents the Gaussian probability density of the $k$-th cluster, $\alpha_k$ represents the mixture weight, $\mu_k$ represents the mean, and $\Sigma_k$ represents the covariance.
According to one embodiment of the present application, the computing module is specifically configured to:
substituting the initialized Gaussian distribution function into the probability maximization according to the information of each data point, the weight parameters are obtained as:

$$\alpha_k^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N}\gamma_{ik}^{(t)}$$

$$\mu_k^{(t+1)} = \frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\,x_i}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$

$$\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\,(x_i-\mu_k^{(t+1)})(x_i-\mu_k^{(t+1)})^{\top}}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$

wherein $\gamma_{ik}^{(t)}$ is the probability that data point $x_i$ belongs to cluster $k$ at iteration $t$, $\alpha_k^{(t+1)}$ represents the updated mixture weight, $\mu_k^{(t+1)}$ represents the updated mean, and $\Sigma_k^{(t+1)}$ represents the updated covariance.
According to one embodiment of the present application, the construction module is further configured to:
delete first-class abnormal CF nodes whose number of sample points is smaller than a point threshold, and update all CF nodes on the path from the root node to the abnormal CF node;
merge at least two second-class abnormal CF nodes whose node distance is smaller than a distance threshold.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program that, when executed by the processor, performs the clustering method of any one of the first aspects.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program that, when run on a processor, performs the clustering method of any one of the first aspects.
The clustering method, the clustering device, the electronic device and the computer-readable storage medium provided by the application mainly include the following clustering process: first, a cluster feature tree is constructed and its threshold parameters are determined; a target leaf node closest to a new sample point is searched downwards from the root node of the cluster feature tree; the Gaussian probability that the new sample point belongs to each CF node in the target leaf node is calculated; the CF node corresponding to the maximum Gaussian probability is determined as the target CF node; and finally the new sample point is added to the target CF node. In this way, the CF node closest to the new sample point is found by computing probabilities, so the method adapts more easily to concave or convex distributions and is not limited to hypersphere-shaped clusters. The optimized clustering model can effectively classify object sets with complex spatial distributions, avoids the cluster-adaptation problem caused by the spatial attributes of service data, and retains the memory-saving advantage of the original BIRCH model, thereby obtaining a matching model with lower deviation and better economy.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are required for the embodiments will be briefly described, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope of protection of the present application. Like elements are numbered alike in the various figures.
Fig. 1 shows a schematic flow chart of a clustering method according to an embodiment of the present application;
FIG. 2 is a schematic view of a part of a clustering method according to an embodiment of the present application;
fig. 3 shows a schematic structural diagram of a clustering device according to an embodiment of the present application;
fig. 4 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments.
The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
As used herein, the terms "comprises", "comprising", "having" and their cognates, which may appear in various embodiments of the present application, are intended only to indicate a particular feature, number, step, operation, element, component, or combination of the foregoing, and should not be interpreted as excluding the existence, or the possible addition, of one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
Furthermore, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of this application belong. The terms (such as those defined in commonly used dictionaries) will be interpreted as having a meaning that is identical to the meaning of the context in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in connection with the various embodiments.
Example 1
Referring to fig. 1, a flow chart of a clustering method provided in an embodiment of the present application is shown. As shown in fig. 1, the clustering method mainly includes the following steps:
step S101, constructing a cluster feature tree, and determining threshold parameters of the cluster feature tree; the threshold parameters comprise a first threshold corresponding to the maximum sample radius of the CF nodes in the leaf nodes and a second threshold corresponding to the maximum number of storable CF nodes in the leaf nodes;
the clustering algorithm provided by the embodiment is applied to a clustering feature tree, wherein the clustering feature tree comprises root nodes, branch nodes below the root nodes and leaf nodes below the branch nodes from top to bottom, and each root node, branch node and leaf node stores at least one CF node. The provided clustering algorithm process is a process of adding new sample points into CF nodes in a certain matched leaf node.
After creating the cluster feature tree, the relevant threshold parameters of the cluster feature tree need to be determined, and the relevant threshold parameters mainly comprise three types of threshold parameters:
firstly, a spatial threshold T of a leaf node, namely a first threshold corresponding to the maximum sample radius of a CF node in the leaf node;
secondly, the maximum number of CF nodes that a leaf node can store is the second threshold, also called the leaf balance factor L;
Thirdly, the maximum number of CF that a branch node can store is called branch balance factor B.
The three parameters are used for properly limiting the updating expansion of the CF and the expansion of the leaf nodes in the clustering process of the clustering feature tree, so that the influence on the clustering effect caused by overlarge super-sphere radius of the CF nodes or overlarge number of the leaf nodes is avoided.
Step S102, searching a target leaf node closest to a new sample point downwards from the root node in the cluster feature tree;
the new sample point is a multidimensional data point which is not stored in the clustering feature tree, and the matched leaf nodes need to be searched first in the process of clustering the new sample point to the clustering feature tree. Specifically, traversing the root node, the branch node connected with the root node and the leaf node connected with the branch node in sequence along the path from the root node until the leaf node closest to the new sample point is found out, and defining the leaf node as a target leaf node.
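The downward search in step S102 can be sketched as follows; the `Leaf`/`Branch` classes and the use of a summary CF per child are illustrative assumptions, not part of the patent.

```python
import math

class Leaf:
    """A leaf of the CF tree, holding CF triplets (N, LS, SS)."""
    def __init__(self, cfs):
        self.cfs = cfs

class Branch:
    """A root or branch node: each entry pairs a summary CF with a child subtree."""
    def __init__(self, entries):
        self.entries = entries  # list of ((N, LS, SS), child)

def centroid(cf):
    n, ls, _ = cf
    return [c / n for c in ls]

def nearest_leaf(node, x):
    """Descend from the root, at each level following the child whose
    summary-CF centroid is closest to the new sample point x."""
    while isinstance(node, Branch):
        _, node = min(node.entries, key=lambda e: math.dist(x, centroid(e[0])))
    return node
```

The loop terminates at a leaf, which is taken as the target leaf node for the subsequent Gaussian-probability calculation.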
A plurality of CF nodes are stored in the target leaf node, and each CF node is a CF triplet. The clustering feature CF summarizes the information describing each cluster with a triplet, which can be represented as $CF = (N, LS, SS)$, where $N$ is the number of points in the cluster, $LS$ is the linear sum of the points, and $SS$ is the square sum of the points. Assuming there are $N$ $D$-dimensional data points $x_1, \dots, x_N$ in a cluster, the vector $LS$ is the linear sum of the points; it has both magnitude and direction, and its formula is as follows:

$$LS = \sum_{i=1}^{N} x_i$$

The scalar $SS$ is the sum of squares of the data points, and its formula is as follows:

$$SS = \sum_{i=1}^{N} \lVert x_i \rVert^2$$
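The two sums above can be computed in a few lines of Python; `make_cf` is an illustrative helper name, not part of the patent.

```python
def make_cf(points):
    """Build the clustering feature (N, LS, SS) for a list of D-dimensional points."""
    n = len(points)
    dim = len(points[0])
    ls = [sum(p[d] for p in points) for d in range(dim)]   # linear sum: a vector
    ss = sum(v * v for p in points for v in p)             # square sum: a scalar
    return n, ls, ss
```

For the two points (1, 1) and (2, 2) used in the worked example below, this yields CF = (2, (3, 3), 10).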
The clustering feature CF is additive, which is embodied both between CF nodes and in adding sample points to an existing CF node. The additivity between nodes is expressed as follows: when two clusters with features $CF_1 = (N_1, LS_1, SS_1)$ and $CF_2 = (N_2, LS_2, SS_2)$ are merged into one large cluster, the clustering feature of the large cluster is:

$$CF_1 + CF_2 = (N_1 + N_2,\ LS_1 + LS_2,\ SS_1 + SS_2)$$

The additivity of sample points to an existing CF node is expressed as follows. Suppose data points B(1, 1) and C(2, 2) have been stored in the CF node, so that CF = (2, ((1+2), (1+2)), (1² + 1² + 2² + 2²)) = (2, (3, 3), 10).
If the new sample point is A(1, 2), the CF node after adding the new sample point is updated to:
CF = (3, ((3+1), (3+2)), (10 + 1² + 2²)) = (3, (4, 5), 15).
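The additivity property and the worked example above can be checked mechanically; `merge_cf` and `add_point` are illustrative names, not part of the patent.

```python
def merge_cf(cf1, cf2):
    """CF additivity: merging two clusters adds the triplets component-wise."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

def add_point(cf, x):
    """A single point is itself the CF (1, x, ||x||^2), so absorption is a merge."""
    return merge_cf(cf, (1, list(x), sum(v * v for v in x)))

# The worked example from the text: B(1,1) and C(2,2) give CF = (2, (3,3), 10);
# absorbing A(1,2) yields CF = (3, (4,5), 15).
cf_bc = merge_cf((1, [1, 1], 2), (1, [2, 2], 8))
cf_abc = add_point(cf_bc, (1, 2))
```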
In addition, the cluster represented by a CF node has two further characteristic parameters, the radius and the diameter, which reflect the degree of compactness within the cluster. The CF structure summarizes the basic information of a cluster and is highly compressed, storing the cluster information in far less space than the actual data points. The process of clustering new sample points with the cluster feature tree provided by this application mainly makes use of the additivity of the clustering feature CF nodes and the node distance.
And searching each branch node and the leaf node under the branch node downwards from the root node in the cluster feature tree, judging the leaf node closest to the new sample point, and defining the leaf node as a target leaf node.
Step S103, calculating Gaussian probability that the new sample point belongs to each CF node in the target leaf node;
and after finding the target leaf node closest to the new sample point in the cluster feature tree according to the steps, calculating the Gaussian probability of the new sample point belonging to each CF node according to the data information of the cluster represented by each CF node in the target leaf node.
According to one specific embodiment of the present application, the step of calculating the gaussian probability that the new sample point belongs to each CF node in the target leaf node includes:
initializing each CF node in the target leaf node into a corresponding cluster, and determining a Gaussian distribution function of each cluster;
calculating a weight parameter when the probability of the Gaussian distribution function is maximized according to the information of each data point in the CF node;
and calculating the Gaussian probability of the new sample point belonging to each CF node according to the Gaussian distribution function and the weight parameter corresponding to each CF node.
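The three steps above can be sketched as follows. For simplicity this sketch recovers an isotropic Gaussian per CF node from its triplet (mean = LS/N, variance = SS/N − ‖mean‖²), which is an illustrative assumption; the names `gaussian_pdf` and `best_cf` are likewise hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Isotropic Gaussian density N(x | mean, var * I)."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return math.exp(-sq / (2 * var)) / ((2 * math.pi * var) ** (d / 2))

def best_cf(x, cfs, weights=None):
    """Pick the CF node whose (weighted) Gaussian probability for x is largest.

    Each triplet (N, LS, SS) yields mean = LS/N and variance
    SS/N - ||mean||^2, so no raw points are needed.
    """
    if weights is None:
        weights = [1.0] * len(cfs)
    best_i, best_p = -1, -1.0
    for i, ((n, ls, ss), w) in enumerate(zip(cfs, weights)):
        mean = [c / n for c in ls]
        var = max(ss / n - sum(m * m for m in mean), 1e-9)  # guard zero variance
        p = w * gaussian_pdf(x, mean, var)
        if p > best_p:
            best_i, best_p = i, p
    return best_i, best_p
```

The index returned identifies the target CF node of step S104.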
Further, the step of determining a gaussian distribution function for each cluster includes:
initializing the Gaussian distribution function of each cluster as:

$$p(x) = \sum_{k=1}^{K} \alpha_k \,\mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2}\lvert\Sigma_k\rvert^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)\right)$$

wherein $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ represents the Gaussian probability density of the $k$-th cluster, $\alpha_k$ represents the mixture weight, $\mu_k$ represents the mean, and $\Sigma_k$ represents the covariance.
Specifically, the step of calculating the weight parameter when the probability of the gaussian distribution function is maximized according to the information of each data point in the CF node includes:
substituting the initialized Gaussian distribution function into the probability maximization according to the information of each data point, the weight parameters are obtained as:

$$\alpha_k^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N}\gamma_{ik}^{(t)}$$

$$\mu_k^{(t+1)} = \frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\,x_i}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$

$$\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\,(x_i-\mu_k^{(t+1)})(x_i-\mu_k^{(t+1)})^{\top}}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$

wherein $\gamma_{ik}^{(t)}$ is the probability that data point $x_i$ belongs to cluster $k$ at iteration $t$, $\alpha_k^{(t+1)}$ represents the updated mixture weight, $\mu_k^{(t+1)}$ represents the updated mean, and $\Sigma_k^{(t+1)}$ represents the updated covariance.
In the specific adding process, after a new sample point reaches a target leaf node, the Gaussian probability that the new sample point belongs to the cluster corresponding to each CF node is calculated. The number k, i.e., the number of components of the model, is set first; the parameters of the Gaussian distribution of each cluster, namely the weight, mean and variance parameters, are randomly initialized; and the Gaussian probability is then calculated. The closer the new sample point is to the centre of a Gaussian distribution, the greater the probability, i.e., the higher the probability that it belongs to that cluster.
According to the calculation formula of the Gaussian function, the weight parameters α, μ and σ are computed so as to maximize the probability of the data points. The new parameters are computed as probability-weighted averages over the data points, where each weight is the probability that the data point belongs to the cluster, and the iteration is repeated until convergence.
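The iterate-until-convergence procedure described above is one EM step repeated. A minimal sketch for an isotropic-Gaussian mixture follows; the isotropic covariance and the function name `em_step` are simplifying assumptions, not the patent's exact formulation.

```python
import math

def em_step(points, alphas, means, varis):
    """One EM iteration: the new alpha, mu, sigma are probability-weighted
    re-estimates, the weights being each point's cluster-membership probability."""
    k, d, n = len(alphas), len(points[0]), len(points)
    # E-step: responsibilities gamma[i][j] = P(cluster j | point i)
    gamma = []
    for x in points:
        dens = [alphas[j] * math.exp(-math.dist(x, means[j]) ** 2 / (2 * varis[j]))
                / ((2 * math.pi * varis[j]) ** (d / 2)) for j in range(k)]
        total = sum(dens)
        gamma.append([g / total for g in dens])
    # M-step: weighted updates of weight, mean, variance per cluster
    new_alphas, new_means, new_varis = [], [], []
    for j in range(k):
        nk = sum(gamma[i][j] for i in range(n))
        new_alphas.append(nk / n)
        mu = [sum(gamma[i][j] * points[i][t] for i in range(n)) / nk
              for t in range(d)]
        new_means.append(mu)
        var = sum(gamma[i][j] * math.dist(points[i], mu) ** 2
                  for i in range(n)) / (nk * d)
        new_varis.append(max(var, 1e-9))
    return new_alphas, new_means, new_varis
```

Calling `em_step` repeatedly until the parameters stop changing implements the "repeat until convergence" loop of the text.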
And step S104, determining a target CF node corresponding to the maximum value in all Gaussian probabilities, and adding the new sample point into the target CF node.
And after the Gaussian probability that the new sample point belongs to each CF node is calculated according to the steps, selecting the CF node corresponding to the maximum value of the Gaussian probability as a target CF node, and adding the new sample point into the target CF node to finish clustering of the new sample point.
On the basis of the foregoing embodiment, referring to fig. 2, the step of determining a target CF node corresponding to a maximum value in all gaussian probabilities and adding the new sample point to the target CF node includes:
step S201, determining a target CF node corresponding to the maximum value in all Gaussian probabilities, and calculating the corresponding hypersphere radius of the new sample point after the new sample point is added into the target CF node;
step S202, if the hypersphere radius is smaller than or equal to the third threshold value, adding the new sample point into the target CF node, and updating all CF nodes from the root node to the target CF node;
Step S203, if the hypersphere radius is greater than the third threshold, a new CF node is created in the target leaf node, the new sample point is added to the new CF node, and all CF nodes from the root node to the target CF node are updated.
Further, if the radius of the hypersphere is greater than the third threshold value, creating a new CF node in the target leaf node, and adding the new sample point to the new CF node, including:
if the hypersphere radius of the target CF node is larger than the third threshold value, creating a new CF node in the target leaf node and adding the new sample point into the new CF node;
judging whether the number of all CF nodes in the target leaf node is smaller than or equal to the second threshold value;
if the number of all CF nodes in the target leaf node is less than or equal to the second threshold value, adding the new CF node into the target leaf node;
if the number of all CF nodes in the target leaf node is larger than the second threshold value, splitting the target leaf node into two new leaf nodes, and adding all CF nodes in the target leaf node into corresponding new leaf nodes according to the node distance.
After the target CF node is determined, the hypersphere radius the target CF node would have after absorbing the new sample point is calculated, and whether it satisfies the threshold T is judged; this is consistent with comparing the probabilities that the elements of the CF nodes belong to the K clusters. If the threshold is satisfied, all CF triplets on the path are updated and the insertion is finished.
If the number of all CF nodes in the target leaf node is smaller than or equal to the second threshold L, a new CF node is created, the new sample point is placed in it, the new CF node is placed in the target leaf node, all CF triples on the path are updated, and the insertion ends.
If the number of all CF nodes in the target leaf node is greater than the second threshold L, the current target leaf node is split into two new leaf nodes, and the CF nodes of the target leaf node are distributed between them. For the split, the two CF triples that are farthest apart in hypersphere distance and differ most in membership probability can be selected as the first CF node of each new leaf node. The remaining triples, together with the triple holding the new sample, are then placed into the nearer of the two new leaf nodes. Afterwards, it must be checked in turn whether each parent node also exceeds its maximum CF count, i.e. the maximum CF number of a branch node or of the root node; a branch node or root node that exceeds its maximum CF number is likewise split into two new nodes, in the same way as a leaf node.
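The split described above might look like the sketch below, in one dimension with a caller-supplied `centroid` function; seeding the two new leaves purely by the farthest centroid pair is a simplification of the patent's combined farthest-distance and largest-membership-probability-difference criterion:

```python
def split_leaf(cfs, centroid):
    """Split an over-full leaf: the two CF entries whose centroids are
    farthest apart seed the two new leaves, and every other entry goes
    to the seed with the closer centroid."""
    pairs = [(abs(centroid(a) - centroid(b)), i, j)
             for i, a in enumerate(cfs) for j, b in enumerate(cfs) if i < j]
    _, i, j = max(pairs)                      # farthest pair becomes the seeds
    left, right = [cfs[i]], [cfs[j]]
    for k, cf in enumerate(cfs):
        if k in (i, j):
            continue
        d_left = abs(centroid(cf) - centroid(cfs[i]))
        d_right = abs(centroid(cf) - centroid(cfs[j]))
        (left if d_left <= d_right else right).append(cf)
    return left, right
```

As the text notes, the same routine applies unchanged when a branch node or the root node overflows.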
According to a specific embodiment of the present application, after the step of constructing the cluster feature tree, the method further includes at least one of the following steps:
deleting first-class abnormal CF nodes whose number of sample points is less than a point threshold, and updating all CF nodes from the root node to the abnormal CF node;
merging at least two second-class abnormal CF nodes whose node distance is less than a distance threshold, and updating all CF nodes from the root node to the merged CF node.
In a specific implementation, the constructed cluster feature tree can be used to remove or merge abnormal points: for example, removing CF nodes that contain few sample points, or merging CF nodes whose hyperspheres are very close. Both operations require updating all CF nodes on the upward path.
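A minimal sketch of these two clean-up passes over flat (N, LS) one-dimensional CF pairs (the square sum and the upward path update are omitted); `prune_and_merge` and its threshold names are illustrative, not the patent's:

```python
def prune_and_merge(cfs, point_threshold, distance_threshold):
    """Post-construction clean-up: drop CF nodes holding fewer samples than
    point_threshold, then fold together surviving nodes whose centroids lie
    closer than distance_threshold. Each cf is a (count, linear_sum) pair."""
    centroid = lambda cf: cf[1] / cf[0]
    kept = [cf for cf in cfs if cf[0] >= point_threshold]   # remove sparse CFs
    merged = []
    for cf in kept:
        for m, other in enumerate(merged):
            if abs(centroid(cf) - centroid(other)) < distance_threshold:
                merged[m] = (other[0] + cf[0], other[1] + cf[1])  # merge close CFs
                break
        else:
            merged.append(cf)
    return merged
```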
According to the clustering method provided by the embodiments of the application, a cluster feature tree is first constructed and its threshold parameters are determined; the target leaf node closest to a new sample point is searched downward from the root node of the cluster feature tree; the Gaussian probability that the new sample point belongs to each CF node in the target leaf node is calculated; the target CF node corresponding to the maximum Gaussian probability is determined; and finally the new sample point is added to the target CF node. In this way, the CF node closest to the new sample point is found by computing probabilities, so concave or convex distributions are adapted to more easily and the model is not limited to hyperspheres. The optimized clustering model can effectively classify object sets with complex spatial distributions, avoids the cluster-adaptation problems caused by the spatial attributes of service data, and retains the memory-saving advantage of the original BIRCH model, yielding a lower-deviation, more economical matching model.
By introducing the idea of probability measures, the influence of irregular service-data distributions on the model during computation is effectively handled; the combined distance-measure and probability-measure model overcomes the single-measure limitation that the original clustering model inherits from pure distance measures, effectively extending the range of service scenarios the model can serve.
Embodiment 2
Referring to fig. 3, a block diagram of a clustering device according to an embodiment of the present application is provided. As shown in fig. 3, the clustering device 300 mainly includes:
the construction module 301 is configured to construct a cluster feature tree, and determine a threshold parameter of the cluster feature tree, where the cluster feature tree includes a root node, a branch node, and a leaf node, each root node, branch node, and leaf node stores at least one CF node, and the threshold parameter includes a first threshold corresponding to a maximum sample radius of the CF node in the leaf node, and a second threshold corresponding to a maximum number of storable CF nodes in the leaf node;
a searching module 302, configured to search downward from the root node in the cluster feature tree for a target leaf node closest to a new sample point;
a calculating module 303, configured to calculate gaussian probabilities that the new sample point belongs to each CF node in the target leaf node;
And the adding module 304 is configured to determine a target CF node corresponding to a maximum value in all gaussian probabilities, and add the new sample point to the target CF node.
According to a specific embodiment of the present application, the adding module 304 is configured to:
determining a target CF node corresponding to the maximum value in all Gaussian probabilities, and calculating the corresponding hypersphere radius of the new sample point after the new sample point is added into the target CF node;
if the hypersphere radius is smaller than or equal to the third threshold value, adding the new sample point into the target CF node, and updating all CF nodes from the root node to the target CF node;
if the hypersphere radius is larger than the third threshold value, a new CF node is created in the target leaf node, the new sample point is added to the new CF node, and all CF nodes between the root node and the target CF node are updated.
According to a specific embodiment of the present application, the adding module 304 is configured to:
if the hypersphere radius of the target CF node is larger than the third threshold value, creating a new CF node in the target leaf node and adding the new sample point into the new CF node;
Judging whether the number of all CF nodes in the target leaf node is smaller than or equal to the second threshold value;
if the number of all CF nodes in the target leaf node is less than or equal to the second threshold value, adding the new CF node into the target leaf node;
if the number of all CF nodes in the target leaf node is larger than the second threshold value, splitting the target leaf node into two new leaf nodes, and adding all CF nodes in the target leaf node into corresponding new leaf nodes according to the node distance.
According to one embodiment of the present application, the calculating module 303 is configured to:
initializing each CF node in the target leaf node into a corresponding cluster, and determining a Gaussian distribution function of each cluster;
calculating a weight parameter when the probability of the Gaussian distribution function is maximized according to the information of each data point in the CF node;
and calculating the Gaussian probability of the new sample point belonging to each CF node according to the Gaussian distribution function and the weight parameter corresponding to each CF node.
According to one embodiment of the present application, the calculating module 303 is specifically configured to:
initializing the Gaussian distribution function of each cluster as:
$$\mathcal{N}(x\mid\mu_k,\Sigma_k)=\frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu_k)^{\mathsf T}\Sigma_k^{-1}(x-\mu_k)\right)$$

wherein $\mathcal{N}(x\mid\mu_k,\Sigma_k)$ represents the Gaussian probability, $\alpha_k^{(t+1)}$ represents the mixing weight, $\mu_k^{(t+1)}$ represents the mean value, and $\Sigma_k^{(t+1)}$ represents the covariance matrix of cluster $k$.
According to one embodiment of the present application, the computing module is specifically configured to:
substituting the initialized Gaussian distribution function into the probability maximization according to the information of each data point to obtain the weight parameters as follows:
$$\alpha_k^{(t+1)}=\frac{1}{N}\sum_{i=1}^{N}\gamma_{ik}^{(t)}$$

$$\mu_k^{(t+1)}=\frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\,x_i}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$

$$\Sigma_k^{(t+1)}=\frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\left(x_i-\mu_k^{(t+1)}\right)\left(x_i-\mu_k^{(t+1)}\right)^{\mathsf T}}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$

wherein $\gamma_{ik}^{(t)}$ denotes the responsibility of cluster $k$ for data point $x_i$ at iteration $t$, $\alpha_k^{(t+1)}$ represents the mixing weight, $\mu_k^{(t+1)}$ represents the mean value, and $\Sigma_k^{(t+1)}$ represents the covariance matrix.
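Read as the standard EM procedure for a Gaussian mixture, one iteration of the weight-parameter computation can be sketched in one dimension as follows; `em_step` and `gaussian` are illustrative names, and the closed-form M-step shown here is an assumption about the patent's maximization step:

```python
import math

def gaussian(x, mu, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(points, weights, means, variances):
    """One EM iteration: the E-step computes the responsibility of each
    cluster for each point, and the M-step re-estimates weight, mean,
    and variance from those soft assignments."""
    K, N = len(weights), len(points)
    resp = []
    for x in points:                      # E-step: responsibilities
        dens = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(K)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    new_w, new_mu, new_var = [], [], []
    for k in range(K):                    # M-step: closed-form updates
        nk = sum(resp[i][k] for i in range(N))
        mu = sum(resp[i][k] * points[i] for i in range(N)) / nk
        var = sum(resp[i][k] * (points[i] - mu) ** 2 for i in range(N)) / nk
        new_w.append(nk / N)
        new_mu.append(mu)
        new_var.append(var + 1e-9)        # guard against zero variance
    return new_w, new_mu, new_var
```

On two well-separated groups of points, the means land near the group averages after a single step.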
According to one embodiment of the present application, the building block 301 is further configured to:
deleting abnormal CF nodes with sample points less than a point threshold value, and updating all CF nodes from the root node to the abnormal CF nodes;
merging at least two second-class abnormal CF nodes whose node distance is less than a distance threshold.
The clustering process of the clustering device provided by the application mainly comprises: first constructing a cluster feature tree and determining its threshold parameters; searching downward from the root node of the cluster feature tree for the target leaf node closest to a new sample point; calculating the Gaussian probability that the new sample point belongs to each CF node in the target leaf node; determining the target CF node corresponding to the maximum Gaussian probability; and finally adding the new sample point to the target CF node. In this way, the CF node closest to the new sample point is found by computing probabilities, so concave or convex distributions are adapted to more easily and the model is not limited to hyperspheres; the optimized clustering model can effectively classify object sets with complex spatial distributions, avoids the cluster-adaptation problems caused by the spatial attributes of service data, and retains the memory-saving advantage of the original BIRCH model, yielding a lower-deviation, more economical matching model. For the specific implementation of the clustering device, reference may be made to the clustering method provided in the foregoing embodiment; details are not repeated here.
Embodiment 3
Furthermore, an embodiment of the present disclosure provides an electronic device, including a memory and a processor, where the memory stores a computer program that, when run on the processor, performs the clustering method provided in the above method embodiment 1.
Specifically, as shown in fig. 4, the electronic device 400 provided in this embodiment includes:
radio frequency unit 401, network module 402, audio output unit 403, input unit 404, sensor 405, display unit 406, user input unit 407, interface unit 408, memory 409, processor 410, and power source 411. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 is not limiting of the electronic device and that the electronic device may include more or fewer components than shown, or may combine certain components, or a different arrangement of components. In the embodiment of the application, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, a pedometer and the like.
It should be understood that, in the embodiment of the present application, the radio frequency unit 401 may be used to receive and send information or signals during a call, specifically, receive downlink data from a base station, and then process the downlink data with the processor 410; and, the uplink data is transmitted to the base station. Typically, the radio frequency unit 401 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 401 may also communicate with networks and other devices through a wireless communication system.
The electronic device provides wireless broadband internet access to the user through the network module 402, such as helping the user to send and receive e-mail, browse web pages, and access streaming media, etc.
The audio output unit 403 may convert audio data received by the radio frequency unit 401 or the network module 402 or stored in the memory 409 into an audio signal and output as sound. Also, the audio output unit 403 may also provide audio output (e.g., a call signal reception sound, a message reception sound, etc.) related to a specific function performed by the electronic device 400. The audio output unit 403 includes a speaker, a buzzer, a receiver, and the like.
The input unit 404 is used to receive audio or video signals. The input unit 404 may include a graphics processor (Graphics Processing Unit, GPU) 4041 and a microphone 4042; the graphics processor 4041 processes image data of still pictures or video obtained by an image capture device (e.g., a camera) in video capture mode or image capture mode. The processed image frames may be displayed on the display unit 406. The image frames processed by the graphics processor 4041 may be stored in the memory 409 (or other storage medium) or transmitted via the radio frequency unit 401 or the network module 402. The microphone 4042 may receive sound and process it into audio data. In telephone call mode, the processed audio data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 401 and output.
The electronic device 400 also includes at least one sensor 405, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 4061 according to the brightness of ambient light, and a proximity sensor that can turn off the display panel 4061 and/or the backlight when the electronic device 400 is moved to the ear. As one kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), and can detect the magnitude and direction of gravity when stationary; it can be used for recognizing the posture of the electronic device (such as horizontal/vertical screen switching, related games, and magnetometer posture calibration) and for vibration-recognition related functions (such as a pedometer and tap detection). The sensor 405 may further include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which are not described herein.
The display unit 406 is used to display information input by the user or information provided to the user. The display unit 406 may include a display panel 4061, and the display panel 4061 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), or the like.
The user input unit 407 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 407 includes a touch panel 4071 and other input devices 4072. The touch panel 4071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations performed on or near the touch panel 4071 with a finger, a stylus, or any other suitable object). The touch panel 4071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 410, and receives and executes commands sent by the processor 410. The touch panel 4071 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 4071, the user input unit 407 may include other input devices 4072, which may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick; details are not repeated here.
Further, the touch panel 4071 may be overlaid on the display panel 4061, and when the touch panel 4071 detects a touch operation thereon or thereabout, the touch operation is transferred to the processor 410 to determine the type of touch event, and then the processor 410 provides a corresponding visual output on the display panel 4061 according to the type of touch event. Although in fig. 4, the touch panel 4071 and the display panel 4061 are two independent components for implementing the input and output functions of the electronic device, in some embodiments, the touch panel 4071 may be integrated with the display panel 4061 to implement the input and output functions of the electronic device, which is not limited herein.
The interface unit 408 is an interface to which an external electronic device is connected with the electronic device 400. For example, the external electronic device may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting to an electronic device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and so forth. The interface unit 408 may be used to receive input (e.g., data information, power, etc.) from an external electronic device and to transmit the received input to one or more elements within the electronic device 400 or may be used to transmit data between the electronic device 400 and an external electronic device.
Memory 409 may be used to store software programs as well as various data. The memory 409 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, memory 409 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 410 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 409 and invoking data stored in the memory 409, thereby performing overall monitoring of the electronic device. Processor 410 may include one or more processing units; preferably, the processor 410 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 410.
The electronic device 400 may also include a power supply 411 (e.g., a battery) for powering the various components, and preferably the power supply 411 may be logically connected to the processor 410 via a power management system that performs functions such as managing charging, discharging, and power consumption.
In addition, the electronic device 400 includes some functional modules, which are not shown, and are not described herein.
The electronic device provided in this embodiment can perform the clustering method shown in embodiment 1; to avoid repetition, details are not repeated here.
Embodiment 4
The present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the clustering method provided by the above-mentioned embodiments.
In the present embodiment, the computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or the like.
The computer readable storage medium provided in this embodiment can perform the clustering method shown in embodiment 1; to avoid repetition, details are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal comprising the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative, not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application, which fall within the protection of the present application.

Claims (10)

1. A method of clustering, the method comprising:
Constructing a cluster feature tree and determining threshold parameters of the cluster feature tree, wherein the cluster feature tree comprises root nodes, branch nodes and leaf nodes, each root node, branch node and leaf node stores at least one CF node, and the threshold parameters comprise a first threshold corresponding to the maximum sample radius of the CF node in the leaf node and a second threshold corresponding to the maximum number of storable CF nodes in the leaf node;
searching a target leaf node closest to a new sample point downwards from the root node in the cluster feature tree;
calculating Gaussian probability that the new sample point belongs to each CF node in the target leaf node;
and determining a target CF node corresponding to the maximum value in all Gaussian probabilities, and adding the new sample point into the target CF node.
2. The method according to claim 1, wherein determining a target CF node corresponding to a maximum value of all gaussian probabilities and adding the new sample point to the target CF node comprises:
determining a target CF node corresponding to the maximum value in all Gaussian probabilities, and calculating the corresponding hypersphere radius of the new sample point after the new sample point is added into the target CF node;
if the hypersphere radius is smaller than or equal to a third threshold value, adding the new sample point into the target CF node, and updating all CF nodes from the root node to the target CF node;
If the hypersphere radius is larger than the third threshold value, a new CF node is created in the target leaf node, the new sample point is added to the new CF node, and all CF nodes between the root node and the target CF node are updated.
3. The method of claim 2, wherein if the hypersphere radius is greater than the third threshold, creating a new CF node within the target leaf node, adding the new sample point to the new CF node, comprises:
if the hypersphere radius of the target CF node is greater than the third threshold, creating a new CF node in the target leaf node and adding the new sample point to the new CF node;
Judging whether the number of all CF nodes in the target leaf node is smaller than or equal to the second threshold value;
if the number of all CF nodes in the target leaf node is less than or equal to the second threshold value, adding the new CF node into the target leaf node;
if the number of all CF nodes in the target leaf node is larger than the second threshold value, splitting the target leaf node into two new leaf nodes, and adding all CF nodes in the target leaf node into corresponding new leaf nodes according to the node distance.
4. A method according to any one of claims 1 to 3, wherein said calculating gaussian probabilities that the new sample points belong to CF nodes within the target leaf node comprises:
initializing each CF node in the target leaf node into a corresponding cluster, and determining a Gaussian distribution function of each cluster;
calculating a weight parameter when the probability of the Gaussian distribution function is maximized according to the information of each data point in the CF node;
and calculating the Gaussian probability of the new sample point belonging to each CF node according to the Gaussian distribution function and the weight parameter corresponding to each CF node.
5. The method of claim 4, wherein said determining a gaussian distribution function for each cluster comprises:
initializing the Gaussian distribution function of each cluster as:
$$\mathcal{N}(x\mid\mu_k,\Sigma_k)=\frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu_k)^{\mathsf T}\Sigma_k^{-1}(x-\mu_k)\right)$$

wherein $\mathcal{N}(x\mid\mu_k,\Sigma_k)$ represents the Gaussian probability, $\alpha_k^{(t+1)}$ represents the mixing weight, $\mu_k^{(t+1)}$ represents the mean value, and $\Sigma_k^{(t+1)}$ represents the covariance matrix of cluster $k$.
6. The method of claim 5, wherein calculating the weight parameter for maximizing the probability of the gaussian distribution function based on the information of each data point in the CF node comprises:
substituting the initialized Gaussian distribution function into the probability maximization according to the information of each data point to obtain the weight parameters as follows:
$$\alpha_k^{(t+1)}=\frac{1}{N}\sum_{i=1}^{N}\gamma_{ik}^{(t)}$$

$$\mu_k^{(t+1)}=\frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\,x_i}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$

$$\Sigma_k^{(t+1)}=\frac{\sum_{i=1}^{N}\gamma_{ik}^{(t)}\left(x_i-\mu_k^{(t+1)}\right)\left(x_i-\mu_k^{(t+1)}\right)^{\mathsf T}}{\sum_{i=1}^{N}\gamma_{ik}^{(t)}}$$

wherein $\gamma_{ik}^{(t)}$ denotes the responsibility of cluster $k$ for data point $x_i$ at iteration $t$, $\alpha_k^{(t+1)}$ represents the mixing weight, $\mu_k^{(t+1)}$ represents the mean value, and $\Sigma_k^{(t+1)}$ represents the covariance matrix.
7. The method of claim 1, wherein after the step of building a cluster feature tree, the method further comprises at least one of:
deleting abnormal CF nodes with sample points less than a point threshold value, and updating all CF nodes from the root node to the abnormal CF nodes;
merging at least two second-class abnormal CF nodes whose node distance is less than a distance threshold.
8. A clustering device, the device comprising:
a construction module, configured to construct a cluster feature tree and determine threshold parameters of the cluster feature tree, wherein the cluster feature tree comprises a root node, branch nodes and leaf nodes, each root node, branch node and leaf node stores at least one CF node, and the threshold parameters comprise a first threshold corresponding to the maximum sample radius of a CF node in a leaf node and a second threshold corresponding to the maximum number of CF nodes storable in a leaf node;
the searching module is used for searching the target leaf node closest to the new sample point downwards from the root node in the cluster feature tree;
the calculation module is used for calculating Gaussian probability that the new sample point belongs to each CF node in the target leaf node;
And the adding module is used for determining a target CF node corresponding to the maximum value in all Gaussian probabilities and adding the new sample point into the target CF node.
9. An electronic device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, performs the clustering method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when run on a processor, performs the clustering method of any one of claims 1 to 7.
CN202310351295.6A 2023-03-29 2023-03-29 Clustering method, clustering device, electronic equipment and computer readable storage medium Pending CN116383680A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310351295.6A CN116383680A (en) 2023-03-29 2023-03-29 Clustering method, clustering device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116383680A true CN116383680A (en) 2023-07-04

Family

ID=86962967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310351295.6A Pending CN116383680A (en) 2023-03-29 2023-03-29 Clustering method, clustering device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116383680A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034051A (en) * 2023-07-27 2023-11-10 广东省水利水电科学研究院 Water conservancy information aggregation method, device and medium based on BIRCH algorithm
CN117034051B (en) * 2023-07-27 2024-05-03 广东省水利水电科学研究院 Water conservancy information aggregation method, device and medium based on BIRCH algorithm
CN118068228A (en) * 2024-04-24 2024-05-24 山东泰开电力电子有限公司 High-efficiency detection method and system for short circuit of extra-high voltage reactor


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination