CN112287193A

CN112287193A - Data clustering method and device, computer equipment and storage medium

Info

Publication number: CN112287193A
Application number: CN202011189435.7A
Authority: CN
Inventors: 郑思晓; 罗泽坤; 王亚彪; 汪铖杰; 李季檩
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-10-30
Filing date: 2020-10-30
Publication date: 2021-01-29
Anticipated expiration: 2040-10-30
Also published as: CN112287193B

Abstract

The embodiment of the application discloses a data clustering method and device, computer equipment and a storage medium, and belongs to the technical field of computers. The method comprises the following steps: according to the first clustering clusters to which the data belong and the first centroids corresponding to the first clustering clusters, creating target relation data of the first centroids, and respectively determining the correlation degree between each data and the first centroids according to the data and the target relation data of the first centroids; and allocating each data to a first centroid corresponding to the maximum correlation degree, and forming a second clustering cluster by the data allocated by the same first centroid to obtain a plurality of second clustering clusters. Based on the target relation data created for each centroid, the correlation degree between the data and each centroid can be determined, and the data are clustered by taking the correlation degree as a reference standard, so that the high similarity of the data in the same cluster is ensured, the accuracy of the cluster is ensured, and the accuracy of data clustering is improved.

Description

Data clustering method and device, computer equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a data clustering method, a data clustering device, computer equipment and a storage medium.

Background

With the development of computer technology, more and more data is in the network. To facilitate management of data in a network, the data is typically clustered.

When a plurality of data are clustered, each data is allocated to a centroid with the closest distance according to the distance between each data and each centroid, and the data allocated to each centroid form a cluster.

Since the above method only performs clustering processing according to the distance between the data and the centroid, clustering accuracy is poor.

Disclosure of Invention

The embodiment of the application provides a data clustering method, a data clustering device, computer equipment and a storage medium, which can improve the accuracy of data clustering. The technical scheme is as follows:

in one aspect, a data clustering method is provided, where the method includes:

according to a first clustering cluster to which a plurality of data belong and a first centroid corresponding to each first clustering cluster, creating target relation data of the plurality of first centroids, wherein the target relation data is used for indicating a relation among any centroid, any data and a correlation degree, and the correlation degree represents the possibility that any data belong to the clustering cluster corresponding to any centroid;

respectively determining the correlation degree between each datum and the plurality of first centroids according to the plurality of data and the target relation data of the plurality of first centroids;

assigning each of the data to a first centroid corresponding to a maximum degree of correlation;

and forming a second clustering cluster by the data distributed by the same first centroid to obtain a plurality of second clustering clusters.

In another aspect, an apparatus for clustering data is provided, the apparatus comprising:

a creating module, configured to create target relation data of a plurality of first centroids according to the first clustering clusters to which a plurality of data belong and the first centroid corresponding to each first clustering cluster, wherein the target relation data is used for indicating the relation among any centroid, any data and the correlation degree, and the correlation degree represents the possibility that any data belongs to the clustering cluster corresponding to any centroid;

the determining module is used for respectively determining the correlation degree between each datum and the plurality of first centroids according to the plurality of data and the target relation data of the plurality of first centroids;

a first assignment module, configured to assign each data to a first centroid corresponding to a maximum correlation;

and the first forming module is used for forming a second clustering cluster by the data distributed by the same first centroid to obtain a plurality of second clustering clusters.

In one possible implementation, the apparatus further includes:

and the updating module is used for updating the first centroid corresponding to the second clustering cluster according to the data in the second clustering cluster to obtain an updated second centroid.

In another possible implementation manner, the update module includes:

and the updating unit is used for determining the average value of the data in the second clustering cluster as the updated second centroid.

In another possible implementation manner, the apparatus further includes:

and the round switching module is used for responding to the fact that the distance between at least one second centroid and the corresponding first centroid is not smaller than a second reference distance, and re-clustering the plurality of data in the next round according to the second clustering clusters to which the plurality of data belong and the second centroid corresponding to each second clustering cluster.
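The centroid update and the round-switching condition described above can be sketched as follows. This is a minimal illustration only: the function names `update_centroids` and `needs_next_round` and the handling of an empty cluster (keeping its previous centroid) are assumptions, not taken from the application.

```python
import numpy as np

def update_centroids(data, labels, first_centroids):
    """Second centroid of each cluster = average value of the data in that cluster."""
    second_centroids = np.asarray(first_centroids, dtype=float).copy()
    for j in range(len(second_centroids)):
        members = data[labels == j]
        if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
            second_centroids[j] = members.mean(axis=0)
    return second_centroids

def needs_next_round(first_centroids, second_centroids, second_reference_distance):
    """True if at least one centroid moved by no less than the second reference distance."""
    shifts = np.linalg.norm(second_centroids - first_centroids, axis=1)
    return bool(np.any(shifts >= second_reference_distance))

data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
first = np.array([[0.0, 0.0], [10.0, 10.0]])
second = update_centroids(data, labels, first)
```

With this toy input the first centroid moves by a distance of 1 and the second does not move, so another round is triggered only when the second reference distance is at most 1.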

In another possible implementation manner, the determining module includes:

a distance determination unit for determining a distance between any of the plurality of data and any of the plurality of first centroids;

and the correlation determination unit is used for determining the correlation between the data and the first centroid according to the distance corresponding to the data and the target relation data of the first centroid.

In another possible implementation manner, the apparatus further includes:

a second assigning module, configured to assign each data to a closest first centroid according to a distance between the each data and the first centroid;

and the second forming module is used for forming the data distributed by the same first centroid into a first cluster group to obtain the first cluster groups of the data.

In another aspect, a computer device is provided, which includes a processor and a memory, where at least one computer program is stored, and the at least one computer program is loaded by the processor and executed to implement the operations performed in the data clustering method according to the above aspect.

In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement the operations performed in the data clustering method according to the above aspect.

In yet another aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer readable storage medium. The processor of the computer device reads the computer program code from the computer-readable storage medium, and executes the computer program code, so that the computer device implements the operations performed in the data clustering method according to the above aspect.

The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:

the method, the device, the computer equipment and the storage medium provided by the embodiment of the application can determine the correlation degree between the data and each centroid based on the target relation data created for each centroid, and cluster a plurality of data by taking the correlation degree as a reference standard, so that the data in the cluster corresponding to any one clustered centroid has the maximum correlation degree with the centroid, namely, the high similarity of the data in the same cluster is ensured, the accuracy of the cluster is ensured, and the accuracy of data clustering is improved.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;

FIG. 2 is a flowchart of a data clustering method provided in an embodiment of the present application;

FIG. 3 is a flowchart of a data clustering method provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of distances between a centroid and data in other clustering clusters provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a data clustering device according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a data clustering device according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.

The terms "first," "second," and the like used herein may describe various concepts, but these concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, a first centroid may be referred to as a second centroid, and similarly, a second centroid may be referred to as a first centroid without departing from the scope of the present application.

As used herein, for the terms "at least one," "a plurality," "each," and "any": "at least one" includes one, two, or more than two; "a plurality" includes two or more than two; "each" refers to every one of the corresponding plurality; and "any" refers to any one of the plurality. For example, if the plurality of centroids includes 3 centroids, "each centroid" refers to every one of the 3 centroids, and "any centroid" refers to any one of the 3 centroids, which may be the first centroid, the second centroid, or the third centroid.

The data clustering method provided by the embodiment of the application can be used in computer equipment. Optionally, the computer device is a terminal or a server. Optionally, the server is an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like. Optionally, the terminal is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart television, a smart watch, and the like, but is not limited thereto.

Fig. 1 is a schematic structural diagram of an implementation environment provided in an embodiment of the present application, and as shown in fig. 1, the system includes a terminal 101 and a server 102, where the terminal 101 and the server 102 are directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.

The terminal 101 is used for acquiring data and uploading the data to the server 102 through a communication connection with the server 102. The server 102 is configured to perform clustering processing on the data to obtain a cluster group to which a plurality of data belong.

Alternatively, a plurality of terminals 101 are all connected with a server 102 directly or indirectly through wired or wireless communication, each terminal 101 is installed with a target application, and the server 102 provides service for the target application. Each terminal 101 uploads data to the server 102 through the target application, and the server 102 is configured to perform clustering processing on the data uploaded by the plurality of terminals 101 and store the plurality of data according to the obtained cluster.

Fig. 2 is a flowchart of a data clustering method provided in an embodiment of the present application, which is applied to a computer device, and as shown in fig. 2, the method includes:

201. The computer device creates target relation data of a plurality of first centroids according to the first cluster to which the plurality of data belong and the first centroid corresponding to each first cluster.

In the embodiment of the application, the data to be clustered is multiple, the first clustering clusters to which the multiple data belong are also multiple, and the data in different first clustering clusters are different. The first centroid is for representing a center of the corresponding first cluster. The target relation data is used for indicating the relation among any centroid, any data and the correlation degree, and the correlation degree represents the possibility that any data belongs to the cluster corresponding to any centroid.

By creating target relationship data for each first centroid, the relevance of each datum to multiple first centroids can subsequently be determined.

202. The computer device determines the correlation degree between each data and the plurality of first centroids according to the plurality of data and the target relation data of the plurality of first centroids.

The correlation between the data and the first centroid represents the possibility that the data belongs to the cluster corresponding to the first centroid, the higher the correlation is, the higher the possibility that the data belongs to the cluster corresponding to the first centroid is, and the lower the correlation is, the lower the possibility that the data belongs to the cluster corresponding to the first centroid is. By determining the relevance of each datum to the first plurality of centroids, the plurality of data can be subsequently re-clustered according to the determined relevance.

203. The computer device assigns each data to a first centroid corresponding to a maximum degree of correlation.

After determining a plurality of correlation degrees corresponding to each data, determining the maximum correlation degree corresponding to each data, and allocating each data to the first centroid corresponding to the maximum correlation degree to ensure the accuracy of clustering.

204. The computer device forms the data allocated to the same first centroid into a second clustering cluster, obtaining a plurality of second clustering clusters.

And according to a plurality of correlation degrees corresponding to each data, after each data in the plurality of data is allocated to a first centroid, the data allocated to the same first centroid form a second clustering cluster, and each first centroid corresponds to a second clustering cluster, so that a plurality of second clustering clusters are obtained, and the data in different second clustering clusters are different.

And re-clustering the plurality of data through the correlation degree between each data and the plurality of first centroids, so that the correlation degree between the data in the second clustering cluster and the corresponding first centroids is the maximum after re-clustering, and the accuracy of the re-clustered second clustering cluster is ensured.

The method provided by the embodiment of the application can determine the correlation degree between the data and each centroid based on the target relation data created for each centroid, and clusters the data by taking the correlation degree as a reference standard, so that the data in the cluster corresponding to any one clustered centroid has the maximum correlation degree with the centroid, that is, the data in the same cluster has high similarity, the cluster accuracy is ensured, and the data clustering accuracy is improved.
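Steps 203 and 204 can be sketched as follows, assuming the correlation degrees from step 202 are already available as a matrix. The function name `recluster_by_correlation` and the matrix layout (one row per data, one column per first centroid) are illustrative assumptions, and the correlation values are made up for the example.

```python
import numpy as np

def recluster_by_correlation(correlation):
    """correlation[i, j] is the correlation degree between data i and first centroid j."""
    correlation = np.asarray(correlation, dtype=float)
    # Step 203: allocate each data to the first centroid with the maximum correlation degree.
    labels = correlation.argmax(axis=1)
    # Step 204: data allocated to the same first centroid form one second clustering cluster.
    clusters = {j: np.flatnonzero(labels == j).tolist()
                for j in range(correlation.shape[1])}
    return labels, clusters

corr = [[0.9, 0.1],
        [0.2, 0.7],
        [0.6, 0.3]]
labels, second_clusters = recluster_by_correlation(corr)
```

Here data 0 and data 2 are allocated to the first centroid with index 0 and data 1 to the first centroid with index 1.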

Fig. 3 is a flowchart of a data clustering method provided in an embodiment of the present application, which is applied to a computer device, and as shown in fig. 3, the method includes:

301. The computer device assigns each data to the closest first centroid based on the distance between each data and each first centroid.

In the embodiment of the application, when the plurality of data are clustered, the plurality of data are divided into a plurality of first clustering clusters according to the distance between the data and each first centroid, and then the plurality of data can be re-clustered to obtain a plurality of second clustering clusters.

The plurality of data are data to be clustered, and optionally, the data are image features, user features, pixel features in an image, or the like. For example, when clustering is performed on a plurality of images, the plurality of data are image features of the plurality of images; when clustering is carried out on a plurality of videos, the plurality of data are video characteristics of the plurality of videos; when clustering a plurality of user representations, the plurality of data is a plurality of user characteristics. Optionally, the data is represented in the form of a vector. For example, if the data is image features, each image feature is represented in the form of a feature vector, the plurality of image features are mapped into a multidimensional feature space to obtain feature vectors of the plurality of images, and then the plurality of feature vectors are clustered in the multidimensional feature space.

The first centroid is used to represent the center of a clustering cluster. Optionally, the number of first centroids is set arbitrarily, for example to 10 or 7. In addition, in the embodiment of the application, the data is clustered in multiple rounds: when the current round is the first of the multiple rounds, a plurality of data are randomly selected from the plurality of data to serve as the first centroids; when the current round is not the first of the multiple rounds, the centroids updated in the previous round are taken as the first centroids of the current round.
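For the first round, the random selection of first centroids from the data can be sketched as below. The function name `initial_centroids`, the fixed `seed`, and the choice to sample without replacement (so that the centroids are distinct data) are assumptions made for illustration.

```python
import numpy as np

def initial_centroids(data, num_centroids, seed=0):
    """Randomly pick num_centroids distinct data items to serve as the first centroids."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(data), size=num_centroids, replace=False)
    return data[chosen]

data = np.arange(10, dtype=float).reshape(5, 2)
first_centroids = initial_centroids(data, num_centroids=2)
```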

The distance between the data and the first centroid represents the likelihood that the data belongs to the cluster to which the first centroid corresponds. The smaller the distance, the greater the likelihood that the data belongs to the cluster corresponding to the first centroid, and the greater the distance, the smaller that likelihood. Optionally, the data is represented in the form of a vector, the first centroid is also represented in the form of a vector, and the distance between the data and the first centroid is determined according to the data vector and the first centroid vector. Optionally, a difference vector between the data vector and the first centroid vector is determined, and the modulus of the difference vector is determined as the distance between the data and the first centroid.

Optionally, the data is represented in the form of coordinates, and the first centroid is also represented in the form of coordinates, and the distance between the data and the first centroid is determined according to the coordinates of the data and the coordinates of the first centroid.
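As a small illustration of the distance described above, the modulus of the difference vector is the Euclidean norm. The helper name `distance` is an assumption for this sketch.

```python
import numpy as np

def distance(x, centroid):
    """Distance between data x and a centroid: modulus of the difference vector."""
    difference = np.asarray(x, dtype=float) - np.asarray(centroid, dtype=float)
    return float(np.linalg.norm(difference))

d = distance([3.0, 4.0], [0.0, 0.0])  # difference vector (3, 4) has modulus 5
```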

In the embodiment of the application, there are a plurality of data to be clustered and a plurality of first centroids. To ensure the accuracy of data allocation, for any data, the distance between the data and each first centroid is determined, the minimum of the plurality of distances corresponding to the data is determined, and the data is allocated to the first centroid closest to it. Every data is allocated to its closest first centroid in the same way, and a cluster is formed from the data allocated to each first centroid. Because each data is allocated to its closest first centroid, the similarity between each first centroid and the data allocated to it is high, and data allocated to the same centroid is more likely to belong to the same category; that is, the data in the same first clustering cluster is more likely to belong to the same category, which ensures the accuracy of data clustering.

In one possible implementation, there is a cluster label for each first centroid, and the cluster labels for different first centroids are different, this step 301 includes: the computer device assigns a cluster label corresponding to the closest first centroid to each data according to the distance between each data and the plurality of first centroids.

When a cluster label is allocated to any data, the smallest distance in a plurality of distances corresponding to the data is determined according to the distances between the data and a plurality of first centroids, and the cluster label corresponding to the first centroid corresponding to the smallest distance is allocated to the data.

Each data is assigned a cluster label, so that the data with the same cluster label form a cluster group through the assigned cluster label.

302. The computer device forms the data allocated to the same first centroid into a first cluster group, obtaining the first cluster group to which the plurality of data belong.

After each data in the plurality of data is allocated to a first centroid, the data allocated to the same first centroid forms a first cluster, and each first centroid corresponds to one first cluster, so that a plurality of first clusters are obtained, and the data in different first clusters are completely different.

In one possible implementation, each data has a cluster label, then this step 302 includes: and forming a first cluster group by the data belonging to the same cluster label to obtain the first cluster group to which a plurality of data belong.
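Steps 301 and 302 can be sketched as follows, using the index of each first centroid as its cluster label. The function names `assign_to_nearest` and `group_by_label` and the use of integer indices as labels are assumptions for illustration.

```python
import numpy as np

def assign_to_nearest(data, first_centroids):
    """Step 301: give each data the cluster label of its closest first centroid."""
    data = np.asarray(data, dtype=float)
    first_centroids = np.asarray(first_centroids, dtype=float)
    # dists[i, j] = distance between data i and first centroid j
    dists = np.linalg.norm(data[:, None, :] - first_centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), dists

def group_by_label(labels, num_centroids):
    """Step 302: data with the same cluster label form one first cluster."""
    return {j: np.flatnonzero(labels == j).tolist() for j in range(num_centroids)}

data = [[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 10.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]
labels, dists = assign_to_nearest(data, centroids)
first_clusters = group_by_label(labels, len(centroids))
```

Each data appears in exactly one first cluster, so the data in different first clusters never overlap.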

303. The computer device creates initial relationship data of a third centroid, wherein the third centroid is any one of the plurality of first centroids, and the initial relationship data comprises parameters with undetermined values.

In the embodiment of the application, in order to ensure the accuracy of data clustering, the target relation data is created for each first centroid to determine the correlation degree of each data and a plurality of first centroids, so that a plurality of data can be subsequently re-clustered according to the correlation degree.

The initial relationship data is used for indicating the relationship among any centroid, any data and the correlation degree, and the correlation degree represents the possibility that any data belongs to the cluster corresponding to any centroid. Because the initial relationship data comprises a parameter with an undetermined value, the value of the parameter needs to be determined subsequently, so that the target relationship data corresponding to the third centroid can be obtained.

By creating initial relationship data for the third centroid, target relationship data for the third centroid can subsequently be determined. In the embodiment of the present application, only the third centroid is taken as an example, and the process of determining the target relationship data of the third centroid is described, but the process of determining the target relationship data for the other first centroids is similar to the process of determining the target relationship data for the third centroid.

In one possible implementation, the initial relationship data includes a plurality of parameters whose values are not determined.

Optionally, the initial relationship data satisfies the following relationship:

where $P(x, \theta_j)$ represents the correlation degree between data $x$ and the third centroid $\theta_j$; $\lVert x - \theta_j \rVert_2$ represents the distance between data $x$ and the third centroid $\theta_j$; $\mu$ is a position parameter, $\sigma$ is a scale parameter, and $\xi$ is a shape parameter, and the position parameter $\mu$, the scale parameter $\sigma$ and the shape parameter $\xi$ are parameters with undetermined values. When $\xi \geq 0$, $x \geq \mu$; when $\xi < 0$,

304. The computer device selects a plurality of reference data from the first cluster clusters corresponding to the other first centroids.

And the reference data is data in the first clustering clusters corresponding to other first centroids. And selecting reference data from the first clustering clusters corresponding to other first centroids to enable the parameter values in the initial relation data of the third centroid to be determined through the reference data subsequently.

In one possible implementation, this step 304 includes: determining the distance between each datum in the first clustering clusters corresponding to the other first centroids and the third centroid, and selecting a reference number of reference data according to the determined distances.

And in the first clustering clusters corresponding to the other first centroids, the distance between the reference data and the third centroid is smaller than the distances between the other data and the third centroid.

The reference number is any number, for example 10 or 20. Optionally, a reference proportion of the selected reference data is determined, and the product of the total number of data in the first clustering clusters corresponding to the other first centroids and the reference proportion is determined as the reference number. For example, if the reference proportion is 20% and the total number of data in the first clustering clusters corresponding to the other first centroids is 1000, the reference number is 200.
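The selection in step 304 can be sketched as below. The sketch pools the data of all other first clusters before picking the nearest ones, which is one reading of the text; that pooling, the function name `select_reference_data`, the rounding of the product, and the lower bound of one reference datum are all assumptions.

```python
import numpy as np

def select_reference_data(data, labels, centroids, j, reference_proportion=0.2):
    """Pick the data from the other first clusters that lie closest to centroid j."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    others = data[labels != j]  # data in the clusters of the other first centroids
    dists = np.linalg.norm(others - np.asarray(centroids[j], dtype=float), axis=1)
    # reference number = total number of such data * reference proportion
    reference_number = max(1, int(round(len(others) * reference_proportion)))
    nearest = np.argsort(dists)[:reference_number]
    return others[nearest], dists[nearest]

data = [[0, 0], [1, 0], [4, 0], [6, 0], [8, 0], [9, 0], [10, 0]]
labels = [0, 0, 1, 1, 1, 1, 1]
centroids = [[0.5, 0.0], [7.4, 0.0]]
ref_data, ref_dists = select_reference_data(data, labels, centroids, j=0,
                                            reference_proportion=0.4)
```

With five data outside cluster 0 and a reference proportion of 40%, the reference number is 2, and the two selected data are those at distances 3.5 and 5.5 from the centroid of cluster 0.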

305. The computer device performs fitting processing on the initial relation data according to the plurality of reference data to determine the value of the parameter.

The initial relation data is fitted with the plurality of reference data to determine the value of the parameter in the initial relation data. This ensures that the cluster corresponding to the third centroid can be distinguished from the clusters corresponding to the other first centroids, and thereby ensures the accuracy of the data covered by the determined cluster.

In one possible implementation, this step 305 includes: determining the maximum distance in the distances between the plurality of reference data and the third centroid as a first reference distance, determining a distance difference between the distance corresponding to each reference data and the first reference distance, fitting the initial relationship data according to the distance difference corresponding to each reference data, and determining the value of the parameter.

Optionally, the parameter values in the initial relationship data are obtained by Maximum Likelihood Estimation (MLE), fitted to the distance difference corresponding to each reference data.

Optionally, a distance difference between the distance corresponding to each reference datum and the first reference distance satisfies the following relationship:

$$\Delta_i = -d_{ij} - u_j, \qquad -d_{ij} > u_j$$

where $\Delta_i$ represents the distance difference between the distance corresponding to the $i$-th reference data and the first reference distance; $u_j$ represents the negative of the first reference distance; and $d_{ij}$ represents the distance between the $i$-th reference data and the third centroid $\theta_j$.
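The distance difference defined above can be sketched as follows. The first reference distance is the maximum of the reference distances, so its negative is the threshold, and each remaining difference equals that maximum minus the reference distance. The function name `distance_differences` is an assumption.

```python
import numpy as np

def distance_differences(reference_distances):
    """Distance difference of each reference datum relative to the first reference distance."""
    d = np.asarray(reference_distances, dtype=float)
    u_j = -d.max()    # negative of the first reference distance (the maximum distance)
    keep = -d > u_j   # strict condition: the datum at the maximum distance is excluded
    return -d[keep] - u_j

deltas = distance_differences([3.5, 5.5, 7.5])
```

For reference distances 3.5, 5.5 and 7.5 the first reference distance is 7.5, and the two retained differences are 4.0 and 2.0.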

In a possible implementation manner, logarithmic transformation processing is performed on the initial relationship data to obtain fitting relationship data, wherein the fitting relationship data comprises parameters with undetermined values; and fitting the fitting relation data according to the distance difference corresponding to each reference data to determine the value of the parameter.

In one possible implementation, the fitting relationship data satisfies the following relationship:

where $\mu$ is a position parameter, $\sigma$ is a scale parameter, and $\xi$ is a shape parameter, and the position parameter $\mu$, the scale parameter $\sigma$ and the shape parameter $\xi$ are parameters with undetermined values; $n$ represents the total number of the reference data, and $n$ is a positive integer not less than 1; $i$ represents the serial number of the reference data, and $i$ is a positive integer greater than or equal to 1 and less than or equal to $n$; $\Delta_i$ represents the distance difference corresponding to the $i$-th reference data; and $\Gamma(\sigma, \mu, \xi)$ represents the loss value of the fitting relation data.

Optionally, according to the fitting relationship data, determining gradient relationship data of a position parameter μ, gradient relationship data of a scale parameter σ, and gradient relationship data of a shape parameter ξ, respectively; acquiring initial values of a position parameter mu, a scale parameter sigma and a shape parameter xi; determining a loss value of fitting relation data according to initial values of the position parameter mu, the scale parameter sigma and the shape parameter xi, a distance difference value corresponding to each reference data and the fitting relation data; updating values of the position parameter mu, the scale parameter sigma and the shape parameter xi according to initial values of the position parameter mu, the scale parameter sigma and the shape parameter xi, a distance difference value corresponding to each reference data, gradient relation data of the position parameter mu, gradient relation data of the scale parameter sigma and gradient relation data of the shape parameter xi; determining a loss value of the fitting relation data according to the updated values of the position parameter mu, the scale parameter sigma and the shape parameter xi, the distance difference value corresponding to each reference data and the fitting relation data; and repeatedly determining the loss value of the fitting relation data according to the steps, responding to the convergence of the loss value of the fitting relation data, and determining the values of the current position parameter mu, the scale parameter sigma and the shape parameter xi as the final values of the position parameter mu, the scale parameter sigma and the shape parameter xi.

Optionally, the fitting relation data is differentiated with respect to the position parameter μ to obtain the gradient relation data of the position parameter μ; the fitting relation data is differentiated with respect to the scale parameter σ to obtain the gradient relation data of the scale parameter σ; and the fitting relation data is differentiated with respect to the shape parameter ξ to obtain the gradient relation data of the shape parameter ξ.

Optionally, the process of updating the value of the position parameter μ includes: determining the gradient value of the position parameter μ according to the initial values of the scale parameter σ and the shape parameter ξ and the distance difference corresponding to each reference data; and determining the product of the learning rate and the gradient value of the position parameter μ, and determining the difference between the initial value of the position parameter μ and the product as the updated value of the position parameter μ. The learning rate is an arbitrary value, for example, 0.1 or 0.2.

Optionally, the process of updating the value of the scale parameter σ includes: determining the gradient value of the scale parameter σ according to the initial values of the position parameter μ and the shape parameter ξ and the distance difference corresponding to each reference data; and determining the product of the learning rate and the gradient value of the scale parameter σ, and determining the difference between the initial value of the scale parameter σ and the product as the updated value of the scale parameter σ.

Optionally, the process of updating the value of the shape parameter ξ includes: determining the gradient value of the shape parameter ξ according to the initial values of the position parameter μ and the scale parameter σ and the distance difference corresponding to each reference data; and determining the product of the learning rate and the gradient value of the shape parameter ξ, and determining the difference between the initial value of the shape parameter ξ and the product as the updated value of the shape parameter ξ.
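The three update rules above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the fitting relation data is the negative log-likelihood of a generalized Pareto distribution with ξ ≠ 0; the function names `gpd_nll`, `gpd_nll_grads` and `update_step` and the sample values are hypothetical and are not taken from the embodiment.

```python
import numpy as np

def gpd_nll(delta, mu, sigma, xi):
    """Assumed loss: GPD negative log-likelihood (xi != 0) over the distance differences."""
    t = 1.0 + xi * (delta - mu) / sigma
    return delta.size * np.log(sigma) + (1.0 + 1.0 / xi) * np.sum(np.log(t))

def gpd_nll_grads(delta, mu, sigma, xi):
    """Partial derivatives of gpd_nll with respect to mu, sigma and xi."""
    t = 1.0 + xi * (delta - mu) / sigma
    g_mu = -(1.0 + xi) / sigma * np.sum(1.0 / t)
    g_sigma = delta.size / sigma - (1.0 + xi) / sigma**2 * np.sum((delta - mu) / t)
    g_xi = (-np.sum(np.log(t)) / xi**2
            + (1.0 + 1.0 / xi) * np.sum((delta - mu) / (sigma * t)))
    return g_mu, g_sigma, g_xi

def update_step(delta, mu, sigma, xi, lr=0.1):
    """One update: each parameter minus learning rate times its gradient,
    with all gradients evaluated at the current (not yet updated) values."""
    g_mu, g_sigma, g_xi = gpd_nll_grads(delta, mu, sigma, xi)
    return mu - lr * g_mu, sigma - lr * g_sigma, xi - lr * g_xi

# Made-up distance differences, for illustration only.
delta = np.array([0.5, 1.0, 2.0, 3.5])
new_mu, new_sigma, new_xi = update_step(delta, mu=0.0, sigma=1.0, xi=0.5, lr=0.01)
```

The sketch does not guard against parameter values that leave the support of the distribution (where the logarithm is undefined); a practical implementation would need such a check.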

Optionally, gradient relation data of the position parameter μ, gradient relation data of the scale parameter σ, and gradient relation data of the shape parameter ξ are respectively determined according to the fitting relation data; initial values of the position parameter μ, the scale parameter σ and the shape parameter ξ are acquired; a loss value of the fitting relation data is determined according to the initial values of μ, σ and ξ, the distance difference corresponding to each reference data, and the fitting relation data; the values of μ, σ and ξ are updated according to their initial values, the distance difference corresponding to each reference data, and the gradient relation data of μ, σ and ξ; a loss value of the fitting relation data is determined again according to the updated values of μ, σ and ξ, the distance difference corresponding to each reference data, and the fitting relation data; and the updating and the determination of the loss value are iterated according to the above steps until, in response to the number of iterations reaching a reference number, the current values of μ, σ and ξ are determined as the final values of the position parameter μ, the scale parameter σ and the shape parameter ξ.
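The two stopping conditions described above (convergence of the loss value, or the number of iterations reaching a reference number) can be combined in one loop. The following is a minimal sketch in which the loss and its gradient are passed in as callables; `fit_parameters`, `tol` and `max_iters` are hypothetical names, and the quadratic loss in the usage lines is only a stand-in for the fitting relation data.

```python
import numpy as np

def fit_parameters(loss_fn, grad_fn, init, lr=0.1, tol=1e-8, max_iters=1000):
    """Gradient descent that stops when the loss change falls below tol
    or when max_iters iterations (the reference number) have been run."""
    params = np.asarray(init, dtype=float)
    prev_loss = loss_fn(params)
    for iteration in range(1, max_iters + 1):
        params = params - lr * grad_fn(params)   # update all parameters
        loss = loss_fn(params)                   # loss at the updated values
        if abs(prev_loss - loss) < tol:          # loss value has converged
            return params, loss, iteration
        prev_loss = loss
    return params, prev_loss, max_iters          # reference number reached

# Stand-in loss with a known minimum at (1.0, 2.0, 0.5).
target = np.array([1.0, 2.0, 0.5])
params, loss, iterations = fit_parameters(
    lambda p: float(np.sum((p - target) ** 2)),
    lambda p: 2.0 * (p - target),
    init=[0.0, 0.0, 0.0],
)
```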

306. The computer device determines the relation data obtained after the values of the parameters are determined as the target relation data of the third centroid.

After the values of the parameters are determined, the values are substituted into the initial relation data to obtain the target relation data of the third centroid. Subsequently, the correlation degree between each data and the third centroid can be determined through the determined target relation data.

Through the step 303-.

It should be noted that, in the embodiment of the present application, the target relation data of the third centroid is determined by using the created initial relation data; in another embodiment, steps 303-306 need not be executed, and other manners can be adopted to create the target relation data of the plurality of first centroids according to the first clusters to which the plurality of data belong and the first centroid corresponding to each first cluster.

307. The computer device determines the correlation degree between each data and the plurality of first centroids according to the plurality of data and the target relation data of the plurality of first centroids.

After the target relation data of each first centroid is determined, the correlation degrees between each data and the plurality of first centroids can be determined according to the target relation data of the plurality of first centroids. A plurality of correlation degrees is thus obtained for each data, and the plurality of data is subsequently re-clustered according to the correlation degrees corresponding to each data.

In one possible implementation, step 307 includes: determining the distance between any data of the plurality of data and any first centroid, and determining the correlation degree between the data and the first centroid according to the distance corresponding to the data and the target relation data of the first centroid. That is, after the distance between a data and a first centroid is determined, the distance is substituted into the target relation data of that first centroid to obtain the correlation degree between the data and that first centroid; repeating this for each first centroid yields a plurality of correlation degrees for the data. Thus, for the plurality of data and the plurality of first centroids, a plurality of correlation degrees is determined for each data, and each correlation degree corresponds to one first centroid and represents the likelihood that the data belongs to the cluster corresponding to that first centroid.
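A minimal sketch of this step follows. It assumes that the data and centroids are vectors compared by Euclidean distance and that the target relation data of each first centroid is available as a callable mapping distances to correlation degrees. The names `correlation_matrix` and `placeholder_relation` are hypothetical, and the exponential placeholder merely stands in for the fitted relation of the embodiment.

```python
import numpy as np

def placeholder_relation(distance, scale=1.0):
    """Hypothetical stand-in for target relation data: closer means more correlated."""
    return np.exp(-distance / scale)

def correlation_matrix(data, centroids, relations):
    """Return corr[i, j], the correlation degree of data i with centroid j."""
    # distances[i, j] = Euclidean distance between data i and centroid j
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    corr = np.empty_like(distances)
    for j, relation in enumerate(relations):
        # Substitute the distances into the relation of centroid j.
        corr[:, j] = relation(distances[:, j])
    return corr

data = np.array([[0.0, 0.0], [3.0, 4.0]])
centroids = np.array([[0.0, 0.0], [3.0, 0.0]])
corr = correlation_matrix(data, centroids, [placeholder_relation] * 2)
```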

308. The computer device assigns each data to a first centroid corresponding to a maximum degree of correlation.

After the plurality of correlation degrees corresponding to each data is determined, the maximum correlation degree corresponding to each data is determined, and each data is allocated to the first centroid corresponding to its maximum correlation degree, to ensure the accuracy of the allocation.

In one possible implementation, each first centroid has a cluster label, and the cluster labels of different first centroids are different; step 308 then includes: allocating, to each data, the cluster label of the first centroid corresponding to the maximum correlation degree, according to the plurality of correlation degrees corresponding to each data.

Optionally, a cluster label is allocated to each data according to the plurality of correlation degrees corresponding to each data, and the following relationship is satisfied:

$$\lambda_i = \arg\max_{j \in \{1, 2, \dots, k\}} P_{ij}$$

where λ_i represents the cluster label of the i-th data of the plurality of data; j represents the serial number of a first centroid; k represents the total number of the plurality of first centroids; and P_ij represents the correlation degree between the i-th data and the j-th first centroid.
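The label allocation above is a row-wise arg max over the correlation degrees. A minimal sketch, assuming the correlation degrees are held in an array `corr` with one row per data and one column per first centroid (the function name `assign_labels` is hypothetical, and the labels are 0-based rather than 1 to k):

```python
import numpy as np

def assign_labels(corr):
    """Return, for each row i, the column j with the largest correlation degree."""
    # np.argmax returns the first index when several columns tie.
    return np.argmax(corr, axis=1)

corr = np.array([[0.9, 0.1],
                 [0.2, 0.7],
                 [0.5, 0.5]])
labels = assign_labels(corr)
```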

309. The computer device forms a second cluster from the data allocated to the same first centroid, to obtain a plurality of second clusters.

After each data is allocated to a first centroid according to the plurality of correlation degrees corresponding to each data, the data allocated to the same first centroid form a second cluster, and each first centroid corresponds to one second cluster, so that a plurality of second clusters is obtained, and the data in different second clusters are different.

In one possible implementation, each data has a cluster label, and step 309 then includes: forming a second cluster from the data having the same cluster label, to obtain the second clusters to which the plurality of data belong.

In one possible implementation, the second clusters obtained by re-clustering the plurality of data satisfy the following relationship:

$$\{C_1, C_2, \dots, C_n\}, \qquad C_i \cap C_j = \varnothing \quad (i \neq j)$$

where n represents the total number of the plurality of second clusters, and n is a positive integer not less than 1; i and j each represent the serial number of a second cluster, and i and j are positive integers greater than 0 and not greater than n; C_1 represents the 1st second cluster; C_2 represents the 2nd second cluster; and C_n represents the n-th second cluster.

The condition i ≠ j indicates that, among the plurality of second clusters, the intersection of the i-th second cluster C_i and the j-th second cluster C_j is empty, that is, the i-th second cluster C_i and the j-th second cluster C_j do not include the same data.
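Forming the clusters from cluster labels can be sketched as follows. Each data is identified by its index and the labels are assumed to be 0-based integers; the name `form_clusters` is hypothetical. Because each data carries exactly one label, the resulting clusters are pairwise disjoint by construction.

```python
def form_clusters(labels, k):
    """Group data indices by cluster label; returns a list of k index lists."""
    clusters = [[] for _ in range(k)]
    for index, label in enumerate(labels):
        clusters[label].append(index)
    return clusters

clusters = form_clusters([0, 1, 0, 2], k=3)
```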

310. The computer device updates the first centroid corresponding to the second cluster according to the data in the second cluster, to obtain an updated second centroid.

Because the data in each second cluster may differ from the data in the corresponding first cluster, the centroid of each second cluster may have changed. Therefore, after the plurality of second clusters is determined, the first centroid corresponding to each second cluster is updated using the data in that second cluster to obtain the corresponding second centroid, which ensures the accuracy of the centroid corresponding to each cluster.

In one possible implementation, step 310 includes: determining the average value of the data in the second cluster as the updated second centroid.

Optionally, each data is represented in the form of a vector, and the second centroid is also represented in the form of a vector; for any second cluster, the mean of the vectors of the data in the second cluster is determined as the second centroid corresponding to that second cluster.

Optionally, the second centroids corresponding to the plurality of second clusters satisfy the following relationship:

$$\Phi = \{\theta_1, \theta_2, \dots, \theta_n\}, \qquad \theta_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$$

where n represents the total number of the plurality of second clusters, and n is a positive integer not less than 1; j represents the serial number of a second cluster, and j is a positive integer greater than 0 and not greater than n; Φ represents the set of the plurality of second centroids; θ_1 represents the 1st second centroid of the plurality of second centroids; θ_2 represents the 2nd second centroid of the plurality of second centroids; θ_n represents the n-th second centroid of the plurality of second centroids; C_j represents the second cluster corresponding to the j-th second centroid; x represents data in the second cluster C_j; and |C_j| represents the number of data in the second cluster C_j.
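The centroid update can be sketched as a per-cluster mean. The handling of an empty cluster (keeping its previous centroid) is an assumption added for robustness and is not described in the embodiment; `update_centroids` is a hypothetical name.

```python
import numpy as np

def update_centroids(data, labels, old_centroids):
    """Second centroid of each cluster = mean of the data vectors assigned to it."""
    new_centroids = old_centroids.astype(float).copy()
    for j in range(len(old_centroids)):
        members = data[labels == j]
        if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
            new_centroids[j] = members.mean(axis=0)
    return new_centroids

data = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
second_centroids = update_centroids(data, labels, np.array([[1.5, 0.0], [8.0, 0.0]]))
```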

311. In response to the distance between at least one second centroid and the corresponding first centroid being not less than a second reference distance, the computer device re-clusters the plurality of data in a next round according to the second clusters to which the plurality of data belong and the second centroid corresponding to each second cluster.

The step 301-. After the second centroids corresponding to the second clusters are determined, the distance between each second centroid and the corresponding first centroid is determined. If at least one of the determined distances is not less than the second reference distance, the currently obtained second clusters are not yet accurate, and a next round of clustering is performed to obtain new clusters.

In addition, when the next round of clustering is performed using the plurality of second centroids, one cluster label is allocated to each second centroid, and the cluster labels of different second centroids are different. When the plurality of data is clustered into new clusters according to the distance between each data and each second centroid, each data is allocated the cluster label of its closest second centroid, so that the data having the same cluster label subsequently form a new cluster.

In one possible implementation, after step 310, the method further includes: in response to the distance between each second centroid and the corresponding first centroid being less than the second reference distance, no longer performing a next round of re-clustering of the plurality of data according to the second clusters to which the plurality of data belong and the second centroid corresponding to each second cluster.

After the second centroids corresponding to the second clusters are determined, the distance between each second centroid and the corresponding first centroid is determined. If every determined distance is less than the second reference distance, the currently obtained second clusters are accurate, no next round of clustering is needed, and the currently obtained second clusters are taken as the final clusters of the plurality of data.
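The decision between running another round and stopping can be sketched as a single comparison of centroid shifts against the second reference distance. Euclidean distance between centroids is assumed, and `needs_another_round` is a hypothetical name.

```python
import numpy as np

def needs_another_round(first_centroids, second_centroids, second_reference_distance):
    """True if at least one centroid moved by no less than the reference distance."""
    shifts = np.linalg.norm(second_centroids - first_centroids, axis=1)
    return bool(np.any(shifts >= second_reference_distance))

first = np.array([[0.0, 0.0], [1.0, 1.0]])
second = np.array([[0.0, 0.05], [1.0, 1.0]])
keep_going = needs_another_round(first, second, second_reference_distance=0.1)
```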

In the embodiment of the application, to ensure the accuracy of data clustering, multiple rounds of clustering are performed on the plurality of data. After the current round obtains the clusters of the plurality of data, the centroid of each new cluster is updated, so that the updated centroid can subsequently be compared with the centroid from the previous round to determine whether the currently obtained clusters are accurate. When the currently obtained clusters are determined to be inaccurate, steps 301-309 are executed again according to the updated centroids and the plurality of data, until accurate clusters are obtained.

In one possible implementation, after step 309, the method further includes: in response to the number of iteration rounds reaching a reference number, no longer updating the first centroid corresponding to the second cluster according to the data in the second cluster.

The reference number is any number, such as 20 or 15. After the plurality of data is divided into new clusters, if the number of completed iteration rounds reaches the reference number, the currently obtained clusters are considered to meet the requirement and no next round of clustering is needed; therefore, the corresponding centroids are not updated from the newly obtained clusters, and no further round of clustering is executed.

In addition, to verify the accuracy of the second clusters obtained by clustering the plurality of data, the sum variance J of the plurality of second clusters is determined: the smaller the sum variance J, the higher the similarity of the data within each second cluster, that is, the higher the accuracy of the second clusters.

The sum variance of the plurality of second clusters satisfies the following relationship:

$$J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \theta_j \rVert_2^2$$

where k represents the total number of the plurality of second clusters; C_j represents the j-th second cluster of the plurality of second clusters; x represents data in the j-th second cluster C_j; and ‖x − θ_j‖₂ represents the distance between the data x and the first centroid θ_j.
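A minimal sketch of the sum variance follows, assuming it is the sum of squared Euclidean distances between each data and the centroid of its cluster (the usual sum-of-squared-errors measure). The name `sum_variance` and the sample values are hypothetical.

```python
import numpy as np

def sum_variance(data, labels, centroids):
    """Sum over all data of the squared distance to the centroid of its cluster."""
    diffs = data - centroids[labels]  # centroids[labels][i] is the centroid of data i
    return float(np.sum(diffs ** 2))

data = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[1.0, 0.0], [10.0, 0.0]])
J = sum_variance(data, labels, centroids)
```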

When the values of the parameters in the initial relation data of any centroid are determined, the initial relation data is fitted using reference data in the clusters corresponding to the other centroids, so that the target relation data obtained after the values of the parameters are determined can distinguish the data of the clusters corresponding to the other centroids. This ensures the accuracy of the target relation data and improves the accuracy of the obtained clusters.

In addition, the initial relation data is fitted using reference data that is selected from the clusters corresponding to the other centroids and is close to the current centroid, so that the obtained target relation data can distinguish even the data that belongs to the clusters corresponding to the other centroids but lies closest to the current centroid. This improves the accuracy of the target relation data and therefore the accuracy of data clustering.

Moreover, the plurality of data is clustered over multiple rounds to improve the accuracy of the obtained clusters. During the multiple rounds, iterative clustering continues only while certain conditions are met, so that accurate clusters are obtained in time, unnecessary iteration rounds are avoided, and clustering efficiency is improved.

In addition, the method provided by the embodiment of the application can be applied to an online clustering scene. When the data in the online data set is clustered online, the method comprises the following steps:

1. The online data set includes a large amount of data. b data are randomly selected from the data set as the data to be clustered, the number of centroids corresponding to the data to be clustered is determined to be k, and the number of iteration rounds of clustering is determined to be t.

2. k centroids are randomly selected from the b data, and each data is allocated to the closest centroid according to the distance between each data and the plurality of centroids, to obtain k first clusters.

3. The b data are re-clustered based on steps 303-309 to obtain k second clusters.

4. The update count of the centroid corresponding to each second cluster is increased by 1 to obtain the new update count corresponding to each centroid.

5. The learning rate corresponding to each centroid is determined according to the new update count corresponding to that centroid, and a new centroid is obtained by updating according to the learning rate, the current centroid and the second cluster corresponding to the current centroid, which completes one iteration round.

6. In response to the t iteration rounds being completed, clustering of the b data is finished, and the final k clusters are obtained.
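The six steps above can be sketched as follows, under several stated assumptions: the correlation-based re-clustering of step 3 is replaced by a plain nearest-centroid assignment as a stand-in; the learning rate is taken as the reciprocal of the update count; and the new centroid is taken as a weighted blend of the current centroid and the cluster mean. None of these three choices is specified by the embodiment, and `online_cluster` is a hypothetical name.

```python
import numpy as np

def online_cluster(dataset, b, k, t, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select b data to be clustered.
    batch = dataset[rng.choice(len(dataset), size=b, replace=False)]
    # Step 2: randomly select k of them as the initial centroids.
    centroids = batch[rng.choice(b, size=k, replace=False)].astype(float)
    counts = np.zeros(k, dtype=int)
    for _ in range(t):
        # Stand-in for steps 2-3: nearest-centroid assignment instead of
        # the correlation-based re-clustering of the embodiment.
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            members = batch[labels == j]
            if len(members) == 0:
                continue                 # assumption: skip empty clusters
            counts[j] += 1               # step 4: update count + 1
            lr = 1.0 / counts[j]         # step 5: assumed learning-rate schedule
            centroids[j] = (1.0 - lr) * centroids[j] + lr * members.mean(axis=0)
    # Step 6: final assignment after t iteration rounds.
    dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, np.argmin(dists, axis=1)
```

With the reciprocal schedule, a centroid that has been updated many times moves less in each round, which is one common way to stabilise online updates.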

The method provided by the embodiment of the application can be applied to an image segmentation scene. In an image segmentation scene, pixel characteristics of pixel points in an image are equivalent to data in the embodiment, and a feature cluster to which a plurality of pixel characteristics belong is obtained by clustering the pixel characteristics of a plurality of pixel points in the image, so that the image is segmented according to the obtained feature cluster. The flow of image segmentation comprises the following steps:

1. Pixel features of a plurality of pixel points in the image to be segmented are acquired.

2. Target relation data of a plurality of feature centers is created according to the first feature clusters to which the plurality of pixel features belong and the feature center corresponding to each first feature cluster.

The first feature cluster is equivalent to the first cluster in the above embodiment, and the feature center is equivalent to the first centroid in the above embodiment.

3. The correlation degree between each pixel feature and the plurality of feature centers is respectively determined according to the plurality of pixel features and the target relation data of the plurality of feature centers.

4. Each pixel feature is allocated to the feature center corresponding to the maximum correlation degree, and the pixel features allocated to the same feature center form a second feature cluster, to obtain a plurality of second feature clusters.

The second feature cluster is equivalent to the second cluster in the above embodiment.

5. According to the second feature clusters to which the plurality of pixel features belong, the second feature cluster corresponding to the pixel points belonging to the background area and the second feature cluster corresponding to the pixel points belonging to the foreground area are determined, so that the background area and the foreground area in the image are distinguished according to the second feature clusters, and the foreground area or the background area in the image is extracted.
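Step 5 can be illustrated with a minimal sketch that turns per-pixel cluster labels into a foreground mask. It assumes that the labels are listed in row-major pixel order and that the set of clusters regarded as foreground is already known; `foreground_mask`, `extract_foreground` and the fill value are hypothetical.

```python
import numpy as np

def foreground_mask(pixel_labels, height, width, foreground_clusters):
    """Boolean mask that is True where the pixel's cluster counts as foreground."""
    label_map = np.asarray(pixel_labels).reshape(height, width)
    return np.isin(label_map, list(foreground_clusters))

def extract_foreground(image, mask, fill=0):
    """Copy foreground pixels and replace everything else with the fill value."""
    out = np.full_like(image, fill)
    out[mask] = image[mask]
    return out

image = np.arange(12, dtype=np.uint8).reshape(2, 2, 3) + 1  # tiny 2x2 RGB image
mask = foreground_mask([0, 1, 1, 0], height=2, width=2, foreground_clusters={1})
foreground = extract_foreground(image, mask)
```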

The method provided by the embodiment of the application can also be applied to a user portrait clustering scene. In the user portrait clustering scene, the user features of users are equivalent to the data in the above embodiment; the user features of a plurality of users are clustered so that similar user features fall into one feature cluster, and user recommendation can then be performed according to the obtained feature clusters. The flow of user portrait clustering includes the following steps:

1. User features of a plurality of users are acquired.

2. Target relation data of a plurality of feature centers is created according to the first feature clusters to which the plurality of user features belong and the feature center corresponding to each first feature cluster.

3. The correlation degree between each user feature and the plurality of feature centers is respectively determined according to the plurality of user features and the target relation data of the plurality of feature centers.

4. Each user feature is allocated to the feature center corresponding to the maximum correlation degree, and the user features allocated to the same feature center form a second feature cluster, to obtain a plurality of second feature clusters.

5. The second feature cluster corresponding to each user is determined according to the second feature clusters to which the plurality of user features belong, and any user is recommended to the other users in the corresponding second feature cluster.
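The recommendation in step 5 amounts to pairing a user with the other users in the same second feature cluster. A minimal sketch, with hypothetical user identifiers and a hypothetical function name `recommend_users`:

```python
def recommend_users(user_ids, labels, target_user):
    """Return the other users that share the target user's feature cluster."""
    target_label = labels[user_ids.index(target_user)]
    return [user for user, label in zip(user_ids, labels)
            if label == target_label and user != target_user]

user_ids = ["u1", "u2", "u3", "u4"]
labels = [0, 1, 0, 0]
peers = recommend_users(user_ids, labels, "u1")
```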

In addition, the method provided by the embodiment of the application can be used in other scenes.

For example, in an image clustering scenario:

When a plurality of images is to be clustered, image features of the plurality of images are obtained, and the data clustering method provided by the embodiment of the application is used to cluster the image features into a plurality of clusters, thereby obtaining the clusters to which the plurality of images belong. The images in each cluster are then managed according to these clusters; for example, the plurality of images is divided into a character image type, a landscape image type, an animation image type, and the like.

As another example, in a video clustering scenario:

A plurality of video frames is extracted from each of a plurality of video data, and the video frame feature of each video frame is obtained. The video frame features of the video frames corresponding to each video data are fused to obtain the video feature of that video data. The video features are then clustered by the data clustering method provided by the embodiment of the application to obtain the clusters to which the plurality of video data belong, and the plurality of video data is managed according to those clusters.
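The fusion of frame features is not specified further. As one hedged possibility, the following sketch assumes that fusion is a simple average of the frame feature vectors of each video; `fuse_frame_features` and `video_features` are hypothetical names.

```python
import numpy as np

def fuse_frame_features(frame_features):
    """Assumed fusion: average the frame feature vectors of one video."""
    return np.mean(np.asarray(frame_features, dtype=float), axis=0)

def video_features(videos):
    """One fused feature vector per video; videos may have different frame counts."""
    return np.stack([fuse_frame_features(frames) for frames in videos])

videos = [
    np.array([[1.0, 2.0], [3.0, 4.0]]),              # video with 2 frames
    np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]),  # video with 3 frames
]
features = video_features(videos)  # shape (2, 2), ready to be clustered
```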

The above embodiments relate to initial relationship data and fitting relationship data, and on the basis of the above embodiments, the following embodiments will describe the creation process of the above two types of relationship data in detail:

1. Initial relation data of a third centroid is created.

According to the Pickands-Balkema-de Haan theorem (the second theorem of extreme value theory), given a sequence of independent and identically distributed random variables χ = {x₁, x₂, …, x_n} whose cumulative distribution function is F(·), a threshold u is determined for the sequence, and the distribution of the excess χ − u of each random variable in the sequence over the threshold u satisfies the following conditional excess distribution relation data:

$$F_u(x) = \Pr(\chi - u \le x \mid \chi > u)$$

where F_u(x) represents the conditional cumulative distribution relation data for the threshold u and the random variable x; and x is a random variable.

When the threshold u is large enough, the conditional excess distribution relation data F_u(x) is approximated by a generalized Pareto distribution, that is, F_u(x) → G(x; σ, μ, ξ) as u → ∞.

Therefore, based on the above conditional excess distribution relation data, a GPD (Generalized Pareto Distribution) function G(x) is defined as:

$$G(x; \sigma, \mu, \xi) = \begin{cases} 1 - \left(1 + \dfrac{\xi (x - \mu)}{\sigma}\right)^{-1/\xi}, & \xi \neq 0 \\ 1 - \exp\left(-\dfrac{x - \mu}{\sigma}\right), & \xi = 0 \end{cases}$$

where μ is a position parameter, which is related to the position of the GPD function G(x) on the coordinate axis; σ is a scale parameter, which is related to the value range of the random variable x; ξ is a shape parameter, which is related to the shape of the curve of the GPD function G(x); and x is a random variable, which in the embodiment of the present application is the distance between each data and the third centroid.
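For reference, the standard cumulative distribution function of the generalized Pareto distribution can be evaluated as in the following sketch. The function name `gpd_cdf` is hypothetical, and inputs are assumed to lie inside the support of the distribution; values outside the support are not handled.

```python
import numpy as np

def gpd_cdf(x, mu, sigma, xi):
    """Standard GPD cumulative distribution function for x inside the support."""
    z = (np.asarray(x, dtype=float) - mu) / sigma
    if xi == 0.0:
        return 1.0 - np.exp(-z)                 # exponential limit case
    return 1.0 - np.power(1.0 + xi * z, -1.0 / xi)

probabilities = gpd_cdf([0.0, 1.0, 2.0], mu=0.0, sigma=1.0, xi=0.5)
```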

The centroid separation distance is defined as the minimum distance between the centroid corresponding to one cluster and the data in the other clusters. As shown in fig. 4, the distances between centroid 1 and the data in the other clusters are determined, and the minimum of the determined distances is the centroid separation distance corresponding to centroid 1. The centroid separation distance D_j satisfies the following relationship:

$$D_j = \min_{x_i \notin C_j} d_{ij}, \qquad d_{ij} = \lVert x_i - \theta_j \rVert_2$$

where D_j represents the centroid separation distance; θ_j represents the third centroid; C_j represents the cluster corresponding to the third centroid θ_j; x_i ∉ C_j indicates that the i-th data x_i of the plurality of data is data in the clusters corresponding to other centroids; d_ij represents the distance between the i-th data in the clusters corresponding to other centroids and the third centroid θ_j; ‖x − θ_j‖₂ represents the distance between data x and the third centroid θ_j; and ‖·‖₂ represents the two-norm.
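The centroid separation distance can be computed directly from its definition, as in the following minimal sketch. The data are assumed to be row vectors with 0-based integer cluster labels, and `centroid_separation_distance` is a hypothetical name.

```python
import numpy as np

def centroid_separation_distance(data, labels, centroids, j):
    """Minimum distance from centroid j to any data belonging to another cluster."""
    others = data[labels != j]                       # data outside cluster j
    distances = np.linalg.norm(others - centroids[j], axis=1)
    return float(distances.min())

data = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [9.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5, 0.0], [7.0, 0.0]])
d0 = centroid_separation_distance(data, labels, centroids, j=0)
```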

Since the Pickands-Balkema-de Haan theorem is used to fit the distribution of the maximum values of samples, in order to fit the distribution of the minimum values of samples, the centroid separation distance D_j is negated, that is,

$$D'_j = -D_j$$

so that D′_j can be fitted with a GPD distribution. The initial relation data of the third centroid can be determined from the GPD distribution, and satisfies the following relationship:

2. Fitting relation data is created according to the initial relation data of the third centroid.

According to the initial relation data of the third centroid determined above, density relation data g (x; mu, sigma, xi) of the third centroid is determined, and the density relation data g (x; mu, sigma, xi) satisfies the following relation:

where μ is a position parameter, σ is a scale parameter, and ξ is a shape parameter, and the position parameter μ, the scale parameter σ and the shape parameter ξ are parameters whose values are not yet determined; z represents the free variable in the density relation data, and in the embodiment of the present application the free variable z corresponds to the distance difference corresponding to each reference data. When ξ is greater than 0, z is greater than μ; when ξ is less than 0, μ ≤ z ≤ μ − σ/ξ.

Fitting relation data corresponding to the third centroid is determined according to the density relation data, and the fitting relation data satisfies the following relationship:

where μ is a position parameter, σ is a scale parameter, and ξ is a shape parameter, and the position parameter μ, the scale parameter σ and the shape parameter ξ are parameters whose values are not yet determined; n represents the number of reference data, and n is a positive integer not less than 1; i represents the serial number of the reference data, and i is a positive integer not less than 1 and not greater than n; δ_i represents the distance difference corresponding to the i-th reference data; and Γ(σ, μ, ξ) represents the loss value of the fitting relation data.
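The density relation and the fitting relation themselves are not reproduced in the text above. As a hedged illustration only, if the density relation data is taken to be the standard GPD density and the fitting relation data is taken to be its negative log-likelihood over the distance differences δ_i (both are assumptions, not confirmed by the text), they would read, for ξ ≠ 0:

```latex
g(z; \mu, \sigma, \xi)
  = \frac{1}{\sigma}\left(1 + \frac{\xi (z - \mu)}{\sigma}\right)^{-\left(\frac{1}{\xi} + 1\right)}

\Gamma(\sigma, \mu, \xi)
  = -\sum_{i=1}^{n} \ln g(\delta_i; \mu, \sigma, \xi)
  = n \ln \sigma + \left(1 + \frac{1}{\xi}\right) \sum_{i=1}^{n}
    \ln\left(1 + \frac{\xi (\delta_i - \mu)}{\sigma}\right)
```

Under these assumptions, minimizing Γ with respect to μ, σ and ξ is equivalent to maximum likelihood fitting of the GPD to the distance differences.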

Fig. 5 is a schematic structural diagram of a data clustering device according to an embodiment of the present application, and as shown in fig. 5, the device includes:

a creating module 501, configured to create target relationship data of the multiple first centroids according to the first cluster to which the multiple data belong and the first centroid corresponding to each first cluster, where the target relationship data is used to indicate a relationship among any centroid, any data, and a degree of correlation, where the degree of correlation indicates a possibility that any data belongs to the cluster corresponding to any centroid;

a determining module 502, configured to determine, according to the multiple data and the target relationship data of the multiple first centroids, a correlation between each data and the multiple first centroids, respectively;

a first assigning module 503, configured to assign each data to a first centroid corresponding to the maximum correlation;

the first forming module 504 is configured to form a second cluster from the data allocated by the same first centroid, so as to obtain a plurality of second clusters.

In one possible implementation, as shown in fig. 6, the creating module 501 includes:

the creating unit 5101 is configured to create initial relationship data of a third centroid, where the third centroid is any one of the plurality of first centroids, and the initial relationship data includes a parameter whose value is not determined;

the selecting unit 5102 is configured to select a plurality of reference data from the first cluster clusters corresponding to the other first centroids;

the value determination unit 5103 is configured to perform fitting processing on the initial relationship data according to the plurality of reference data, and determine a value of a parameter;

the relationship data determining unit 5104 is configured to determine the relationship data obtained after the value of the parameter is determined as the target relationship data of the third centroid.

In another possible implementation manner, in the first clusters corresponding to the other first centroids, the distance between the reference data and the third centroid is smaller than the distance between the other data and the third centroid.

In another possible implementation manner, the value determining unit 5103 is configured to determine a maximum distance of the distances between the plurality of reference data and the third centroid as the first reference distance; determining a distance difference value between the distance corresponding to each reference data and the first reference distance; and fitting the initial relation data according to the distance difference corresponding to each reference data to determine the value of the parameter.
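The selection of reference data and the computation of the distance differences can be sketched as follows. The number of reference data `m` and the sign convention (first reference distance minus each distance, giving non-negative differences) are assumptions made for illustration; `reference_distance_differences` is a hypothetical name.

```python
import numpy as np

def reference_distance_differences(data, labels, centroids, j, m):
    """Distance differences of the m data from other clusters nearest to centroid j."""
    others = data[labels != j]
    distances = np.linalg.norm(others - centroids[j], axis=1)
    reference = np.sort(distances)[:m]            # the m closest data from other clusters
    first_reference_distance = reference.max()    # largest of the reference distances
    # Assumed sign convention: differences are non-negative.
    return first_reference_distance - reference

data = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [9.0, 0.0], [6.5, 0.0]])
labels = np.array([0, 0, 1, 1, 1])
centroids = np.array([[0.5, 0.0], [7.0, 0.0]])
differences = reference_distance_differences(data, labels, centroids, j=0, m=2)
```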

In another possible implementation manner, as shown in fig. 6, the apparatus further includes:

and an updating module 505, configured to update the first centroid corresponding to the second cluster according to the data in the second cluster, so as to obtain an updated second centroid.

In another possible implementation, as shown in fig. 6, the updating module 505 includes:

an updating unit 5501 is configured to determine the average value of the data in the second clustering cluster as the updated second centroid.

In another possible implementation, the apparatus further includes:

a round switching module 506, configured to, in response to the distance between at least one second centroid and the corresponding first centroid being not less than the second reference distance, perform a next round of clustering on the plurality of data according to the second clusters to which the plurality of data belong and the second centroid corresponding to each second cluster.
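The centroid update and the round-switching condition can be sketched as below, assuming each data item is a tuple of coordinates and distances are Euclidean. The function names and sample values are hypothetical.

```python
import math


def update_centroid(cluster):
    """Return the coordinate-wise average of the data in a second cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def needs_next_round(first_centroids, second_centroids, second_reference_distance):
    """True if at least one centroid moved by no less than the threshold."""
    return any(
        math.dist(old, new) >= second_reference_distance
        for old, new in zip(first_centroids, second_centroids)
    )


second_clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 2)]]
first_centroids = [(0.5, 0.0), (11.0, 1.0)]
second_centroids = [update_centroid(c) for c in second_clusters]
again = needs_next_round(first_centroids, second_centroids, 0.1)
```

In this example the first centroid moves by 0.5 and the second does not move, so another round is triggered for a second reference distance of 0.1 but not for 1.0.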

In another possible implementation, as shown in fig. 6, the determining module 502 includes:

a distance determining unit 5201 for determining a distance between any of the plurality of data and any of the plurality of first centroids;

the correlation determining unit 5202 is configured to determine the correlation between the data and the first centroid according to the distance corresponding to the data and the target relationship data of the first centroid.
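The two units of the determining module 502 can be sketched as follows. The sketch assumes that the target relationship data of each first centroid can be represented as a function mapping a distance to a correlation; the exponential functions and their scales used in the example are hypothetical placeholders, not the form used by the embodiment.

```python
import math


def correlation_matrix(data, centroids, relations):
    """corr[i][j] = relations[j](distance between data[i] and centroids[j]).

    relations[j] is an assumed callable standing in for the target
    relationship data of first centroid j.
    """
    return [
        [relations[j](math.dist(x, c)) for j, c in enumerate(centroids)]
        for x in data
    ]


# Hypothetical relationship functions with different scales per centroid.
relations = [lambda d: math.exp(-d / 1.0), lambda d: math.exp(-d / 4.0)]
corr = correlation_matrix([(1, 0)], [(0, 0), (3, 0)], relations)
```

With these placeholder functions, the point (1, 0) is nearer to centroid 0 yet has a higher correlation with centroid 1, which illustrates why assigning by correlation can differ from assigning by distance alone.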

In another possible implementation, the apparatus further includes:

a second assigning module 507, configured to assign each data to the closest first centroid according to the distance between each data and each first centroid;

a second forming module 508, configured to form a first cluster from the data assigned to the same first centroid, so as to obtain the first clusters to which the plurality of data belong.
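The initial distance-based assignment performed by modules 507 and 508 can be sketched as below, assuming Euclidean distance and data given as coordinate tuples; the function name and sample values are illustrative.

```python
import math


def form_first_clusters(data, centroids):
    """Assign each data item to its nearest first centroid."""
    clusters = {j: [] for j in range(len(centroids))}
    for x in data:
        # Index of the first centroid with the smallest distance to x.
        nearest = min(
            range(len(centroids)), key=lambda j: math.dist(x, centroids[j])
        )
        clusters[nearest].append(x)
    return clusters


first_clusters = form_first_clusters(
    [(0, 0), (1, 0), (9, 0), (10, 0)], [(0, 0), (10, 0)]
)
```

The resulting first clusters are the starting point from which the target relationship data of each first centroid is created.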

It should be noted that the data clustering device provided in the above embodiment is illustrated only by the division of the above functional modules. In practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above. In addition, the data clustering device and the data clustering method provided by the above embodiments belong to the same concept; the specific implementation processes are described in detail in the method embodiments and are not repeated here.

The embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, where the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the operations performed in the data clustering method according to the foregoing embodiment.

Optionally, the computer device is provided as a terminal. Fig. 7 shows a block diagram of a terminal 700 according to an exemplary embodiment of the present application. The terminal 700 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 700 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so on.

The terminal 700 includes: a processor 701 and a memory 702.

The processor 701 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor, also called a CPU (Central Processing Unit), processes data in an awake state, and the coprocessor is a low-power processor that processes data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 702 may include one or more computer-readable storage media, which may be non-transitory. Memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 702 is used to store at least one computer program for execution by the processor 701 to implement the data clustering methods provided by the method embodiments herein.

In some embodiments, the terminal 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 704, a display screen 705, a camera assembly 706, an audio circuit 707, a positioning component 708, and a power source 709.

The peripheral interface 703 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 701 and the memory 702. In some embodiments, processor 701, memory 702, and peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 704 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 704 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 704 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the World Wide Web, metropolitan area networks, intranets, the successive generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.

The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 701 as a control signal for processing. In this case, the display screen 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 705, disposed on a front panel of the terminal 700; in other embodiments, there may be at least two display screens 705, respectively disposed on different surfaces of the terminal 700 or in a folded design; in still other embodiments, the display screen 705 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 700. The display screen 705 may even be arranged in a non-rectangular irregular shape, i.e. a specially shaped screen. The display screen 705 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 706 is used to capture images or video. Optionally, camera assembly 706 includes a front camera and a rear camera. The front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 706 may also include a flash. The flash can be a single-color-temperature flash or a dual-color-temperature flash. The dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

The audio circuit 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electrical signals, and inputting the electrical signals to the processor 701 for processing or to the radio frequency circuit 704 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 700. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The speaker can be a conventional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal not only into a sound wave audible to humans, but also into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 707 may also include a headphone jack.

The positioning component 708 is used to locate the current geographic location of the terminal 700 for navigation or LBS (Location Based Service). The positioning component 708 can be a positioning component based on the Global Positioning System (GPS) of the United States, the Beidou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

Power supply 709 is provided to supply power to various components of terminal 700. The power source 709 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power source 709 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, terminal 700 also includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyro sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.

The acceleration sensor 711 can detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the terminal 700. For example, the acceleration sensor 711 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 701 may control the display screen 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 711. The acceleration sensor 711 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 712 may detect a body direction and a rotation angle of the terminal 700, and may cooperate with the acceleration sensor 711 to capture the user's 3D motion on the terminal 700. From the data collected by the gyro sensor 712, the processor 701 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

The pressure sensor 713 may be disposed on a side frame of terminal 700 and/or at a lower layer of display screen 705. When the pressure sensor 713 is disposed on a side frame of the terminal 700, a user's grip signal on the terminal 700 may be detected, and the processor 701 performs left/right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed at a lower layer of the display screen 705, the processor 701 controls an operable control on the UI according to the user's pressure operation on the display screen 705. The operable control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 714 is used for collecting a fingerprint of a user, and the processor 701 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the identity of the user according to the collected fingerprint. When the user identity is identified as a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 714 may be disposed on the front, back, or side of the terminal 700. When a physical button or a vendor Logo is provided on the terminal 700, the fingerprint sensor 714 may be integrated with the physical button or the vendor Logo.

The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the display screen 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the ambient light intensity is high, the display brightness of the display screen 705 is increased; when the ambient light intensity is low, the display brightness of the display screen 705 is adjusted down. In another embodiment, processor 701 may also dynamically adjust the shooting parameters of camera assembly 706 based on the ambient light intensity collected by optical sensor 715.

A proximity sensor 716, also referred to as a distance sensor, is disposed on a front panel of the terminal 700. The proximity sensor 716 is used to collect the distance between the user and the front surface of the terminal 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually decreases, the processor 701 controls the display screen 705 to switch from the screen-on state to the screen-off state; when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually increases, the processor 701 controls the display screen 705 to switch from the screen-off state to the screen-on state.

Those skilled in the art will appreciate that the configuration shown in fig. 7 does not limit terminal 700, which may include more or fewer components than those shown, combine some components, or use a different arrangement of components.

Optionally, the computer device is provided as a server. Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application. The server 800 may vary considerably in configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 801 and one or more memories 802, where the memory 802 stores at least one computer program, and the at least one computer program is loaded and executed by the processor 801 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may also include other components for implementing the functions of the device, which are not described herein again.

The embodiment of the present application further provides a computer-readable storage medium, where at least one computer program is stored in the computer-readable storage medium, and the at least one computer program is loaded and executed by a processor to implement the operations performed in the data clustering method according to the foregoing embodiment.

Embodiments of the present application also provide a computer program product or a computer program comprising computer program code stored in a computer readable storage medium. The processor of the computer apparatus reads the computer program code from the computer-readable storage medium, and the processor executes the computer program code, so that the computer apparatus implements the operations performed in the data clustering method as in the above-described embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only an alternative embodiment of the present application and should not be construed as limiting the present application, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method for clustering data, the method comprising:

2. The method of claim 1, wherein creating target relationship data for a plurality of first centroids from the first cluster to which the plurality of data belongs and the first centroid corresponding to each first cluster comprises:

creating initial relation data of a third centroid, wherein the third centroid is any one of the plurality of first centroids, and the initial relation data comprises a parameter whose value is undetermined;

selecting a plurality of reference data from the first clustering clusters corresponding to other first centroids;

fitting the initial relation data according to the plurality of reference data to determine the value of the parameter;

and determining the relation data obtained after the value of the parameter is determined as the target relation data of the third centroid.

3. The method of claim 2, wherein, in the first clusters corresponding to the other first centroids, the distance between the reference data and the third centroid is smaller than the distance between the other data and the third centroid.

4. The method of claim 2, wherein the fitting the initial relationship data to determine the values of the parameters according to the reference data comprises:

determining a maximum distance of distances between the plurality of reference data and the third centroid as a first reference distance;

determining a distance difference value between the distance corresponding to each reference data and the first reference distance;

and fitting the initial relation data according to the distance difference corresponding to each reference data, and determining the value of the parameter.

5. The method of claim 1, wherein after forming a second cluster from the data assigned to the same first centroid, the method further comprises:

and updating the first centroid corresponding to the second clustering cluster according to the data in the second clustering cluster to obtain an updated second centroid.

6. The method of claim 5, wherein the updating the first centroid corresponding to the second cluster according to the data in the second cluster to obtain an updated second centroid comprises:

determining an average of the data in the second clustered cluster as the updated second centroid.

7. The method of claim 5, wherein after updating the first centroid corresponding to the second cluster according to the data in the second cluster to obtain the updated second centroid, the method further comprises:

and in response to the distance between at least one second centroid and the corresponding first centroid being not less than a second reference distance, re-clustering the plurality of data for the next round according to the second clustering clusters to which the plurality of data belong and the second centroids corresponding to each second clustering cluster.

8. The method of claim 1, wherein determining the correlation between each data and the first centroids respectively according to the data and the target relationship data of the first centroids comprises:

determining a distance between any of the plurality of data and any of the plurality of first centroids;

and determining the correlation degree between the data and the first centroid according to the distance corresponding to the data and the target relation data of the first centroid.

9. The method of claim 1, wherein before creating the target relationship data for the plurality of first centroids based on the first cluster to which the plurality of data belongs and the first centroid corresponding to each first cluster, the method further comprises:

assigning each data to the closest first centroid according to the distance between said each data and each first centroid;

and forming a first cluster from the data assigned to the same first centroid, to obtain the first clusters to which the plurality of data belong.

10. An apparatus for clustering data, the apparatus comprising:

11. The apparatus of claim 10, wherein the creation module comprises:

a creating unit, configured to create initial relation data of a third centroid, wherein the third centroid is any one of the plurality of first centroids, and the initial relation data comprises a parameter whose value is undetermined;

a selecting unit, configured to select a plurality of reference data from the first clusters corresponding to other first centroids;

a value determining unit, configured to perform fitting processing on the initial relationship data according to the multiple reference data, and determine a value of the parameter;

and a relation data determining unit, configured to determine the relation data obtained after the value of the parameter is determined as the target relation data of the third centroid.

12. The apparatus of claim 11, wherein, in the first clusters corresponding to the other first centroids, the distance between the reference data and the third centroid is smaller than the distance between the other data and the third centroid.

13. The apparatus according to claim 11, wherein the value determining unit is configured to determine a maximum distance among distances between the plurality of reference data and the third centroid as the first reference distance; determining a distance difference value between the distance corresponding to each reference data and the first reference distance; and fitting the initial relation data according to the distance difference corresponding to each reference data, and determining the value of the parameter.

14. A computer device, characterized in that the computer device comprises a processor and a memory, in which at least one computer program is stored, which is loaded and executed by the processor to implement the operations performed in the data clustering method according to any one of claims 1 to 9.

15. A computer-readable storage medium, having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor to perform the operations performed in the data clustering method according to any one of claims 1 to 9.