CN113850281A

CN113850281A - Data processing method and device based on MEANSHIFT optimization

Info

Publication number: CN113850281A
Application number: CN202110161944.7A
Authority: CN
Inventors: 吕超; 张继东; 沈志平; 吴浩宇; 吴风蛟
Original assignee: Tianyi Smart Family Technology Co Ltd
Current assignee: Tianyi Digital Life Technology Co Ltd
Priority date: 2021-02-05
Filing date: 2021-02-05
Publication date: 2021-12-28
Anticipated expiration: 2041-02-05
Also published as: WO2022166380A1; CN113850281B

Abstract

The invention provides a data processing method and device based on mean shift. The method comprises the following steps: collecting user behavior data in real time as an original sample set; initializing a cluster center according to the number of clusters and the original sample set; for each sample in the original sample set, determining whether two or more cluster centers are closest to the sample, if so, calculating a local density gradient direction of the sample by using mean shift, calculating a similarity between the local density gradient direction of the sample and a direction of the sample towards each of the two or more cluster centers, and dividing the sample into the cluster corresponding to the maximum similarity; otherwise, dividing the sample into a cluster closest to the center of the cluster; and pushing related data to each user group in real time according to the clustering result.

Description

Data processing method and device based on MEANSHIFT optimization

Technical Field

The invention relates to the field of data mining and machine learning, in particular to a data processing method and device based on MEANSHIFT optimization.

Background

With the rapid development of modern information technology, the world has entered the era of the Internet plus big data. Big data is profoundly changing how people think, work and live, and its deep integration with every industry is producing unprecedented social and commercial value. Many data processing methods based on data mining and machine learning have emerged in the course of big data development. Among them, the traditional K-means algorithm, given N samples, randomly selects K samples as initial cluster centers and assigns each original sample to the cluster whose center is closest to it under a minimum-distance rule. When the distance between a sample and the centers of one or more other clusters is close to that minimum distance, the K-means clustering result is not ideal. How to improve the clustering effect in this scenario is a problem that urgently needs to be solved.

Chinese patent application 'A K-means clustering method based on density Canopy' (CN201911127104.8) proposes a K-means clustering method that uses density Canopy as a preprocessing step of the K-means algorithm. This improves clustering accuracy compared with the traditional K-means algorithm, but the method does not consider the relationship between an original sample and the other clusters, so only a local optimum is guaranteed and a global optimum cannot be obtained.

The Chinese patent application 'K-means clustering method based on a neural network' (CN201810570097.8) provides a K-means clustering method based on a neural network, which solves the problems that the prior K-means iteratively optimizes clustering centers and label distribution by two independent steps, so that the inference speed is slow, new data, large-scale data and online data cannot be processed, and the prior K-means is sensitive to an initial value.

Therefore, to divide samples more reasonably and further improve clustering accuracy when a sample is at the smallest, or nearly the smallest, distance from several clusters, it is desirable to provide an improved data processing method.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

The invention provides a data processing method and device based on mean shift optimization, which consider the relationship between an original sample and other clusters, so that the edges and peripheral regions of each cluster are divided more reasonably, the cluster is compact, and the clustering precision and speed are greatly improved.

According to an aspect of the present invention, there is provided a data processing method, the method including:

collecting user behavior data in real time as an original sample set;

initializing a cluster center according to the number of clusters and the original sample set;

determining, for each sample in the original sample set, whether two or more cluster centers are closest in distance to the sample;

if so,

calculating the local density gradient direction of the sample using mean shift,

calculating a similarity between the local density gradient direction of the sample and the direction from the sample toward each of the two or more cluster centers, and

assigning the sample to the cluster corresponding to the maximum similarity;

otherwise, assigning the sample to the cluster whose center is closest to the sample; and

pushing related data to each user group in real time according to the clustering result.

According to one embodiment of the present invention, determining whether two or more cluster centers are closest in distance to the sample further comprises:

calculating the Euclidean distances from the sample to the centers of the K clusters to obtain a distance set for the sample, wherein K is the number of clusters;

calculating the ratio of the distance between the sample and each other cluster center c_q to the smallest distance in said distance set, to obtain a corresponding set of distance ratios;

wherein, if a subset of the distance ratios exists whose members are all less than or equal to ε, it is determined that the corresponding cluster centers, together with the nearest cluster center, are closest to the sample, where ε is a threshold set empirically.

According to a further embodiment of the present invention, calculating the local density gradient direction of the sample using mean shift further comprises:

calculating the local mean shift vector of the sample, wherein the vector points from the sample itself in the direction of the greatest increase in estimated density.

According to a further embodiment of the present invention, calculating the similarity further comprises:

calculating the similarity between the local density gradient direction of the sample and the direction from the sample toward each of the two or more cluster centers using a cosine similarity algorithm, wherein the greater the cosine value, the higher the similarity.

According to a further embodiment of the present invention, the cluster centers are initialized by a K-means++ clustering algorithm, such that the distances between the respective cluster centers are as large as possible.

According to another aspect of the present invention, there is provided a data processing apparatus, the apparatus comprising:

a data collection module configured to collect user behavior data in real-time as an original sample set;

an initializing cluster center module configured to initialize a cluster center according to a number of clusters and the original sample set;

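a data clustering module configured to:

determine, for each sample in the original sample set, whether two or more cluster centers are closest in distance to the sample,

if so, calculate the local density gradient direction of the sample using mean shift, calculate a similarity between the local density gradient direction of the sample and the direction from the sample toward each of the two or more cluster centers, and assign the sample to the cluster corresponding to the maximum similarity,

otherwise, assign the sample to the cluster whose center is closest to the sample; and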

a data push module configured to push relevant data in real-time to respective user groups associated with respective class clusters based on the clustering results.

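According to one embodiment of the present invention, the data clustering module determines whether two or more cluster centers are closest in distance to the sample by comparing, against an empirically set threshold ε, the ratios of the sample's distances to the other cluster centers to its smallest distance.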

Compared with the scheme in the prior art, the data processing method and device based on the meanshift optimization provided by the invention at least have the following advantages:

(1) By considering the relationship between the original sample and the other clusters, the edges and peripheral regions of each cluster are divided more reasonably, the clusters are compact, the clustering effect is improved, and the global optimum is achieved.

(2) Compared with the traditional K-means algorithm, the method can more accurately estimate the central positions of the K clusters, so that the K clusters are quickly converged, and the iteration times are reduced.

These and other features and advantages will become apparent upon reading the following detailed description and upon reference to the accompanying drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of aspects as claimed.

Drawings

So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only some typical aspects of this invention and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.

Fig. 1 shows an exemplary architecture diagram of a data processing apparatus based on meanshift optimization according to an embodiment of the present invention.

Fig. 2 shows a flowchart of a data processing method based on meanshift optimization according to an embodiment of the present invention.

Fig. 3 shows a flowchart of a meanshift-based clustering algorithm according to an embodiment of the present invention.

Fig. 4 shows an example of a two-dimensional region around a central sample according to one embodiment of the invention.

Detailed Description

The present invention will be described in detail below with reference to the attached drawings, and the features of the present invention will be further apparent from the following detailed description.

Fig. 1 is an exemplary architecture diagram of a data processing apparatus 100 based on meanshift optimization according to an embodiment of the present invention. As shown in Fig. 1, the apparatus 100 of the present invention comprises a data collection module 101, an initialize cluster center module 102, a data clustering module 103 and a data pushing module 104.

The data collection module 101 may collect user data in real time as an original sample set and store it in a big data platform according to the data characteristics. As an example, the data collection module 101 may collect, in real time, behavior data on the TV programs watched by users as the original sample set. Each day, the viewing history of user i over the preceding 30 days is counted; for each of the T program types, the viewing time is accumulated and normalized into a score, that is, score_t = time_t/(time_1+time_2+…+time_T). The scores of each user for the T program types are stored as the original sample x_i.
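The score normalization described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `normalize_scores`, the use of seconds as the time unit, and the all-zero fallback for a user with no viewing history are assumptions not stated in the text.

```python
def normalize_scores(watch_seconds):
    """Turn accumulated viewing time per program type into scores that sum to 1."""
    total = sum(watch_seconds)
    if total == 0:
        # Assumption: a user with no viewing history gets an all-zero sample.
        return [0.0] * len(watch_seconds)
    return [t / total for t in watch_seconds]


# Hypothetical 30-day totals for T = 3 program types.
sample = normalize_scores([3600, 1800, 600])
```

Each resulting list plays the role of one original sample x_i with T components.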

The initialize cluster center module 102 may initialize the cluster centers based on the number of clusters and the original sample set. As an example, the initialize cluster center module 102 may use the K-means++ clustering algorithm to initialize K cluster centers that are as far apart from one another as possible. The specific steps of the K-means++ algorithm are: (1) randomly select a sample point x_i from the original sample set X as the first initial cluster center c_1; (2) for each sample point x_i, calculate the shortest distance D(x) between it and the cluster centers selected so far, then calculate the probability P(x) that each sample point is selected as the next cluster center, and finally select the sample point x_i with the maximum probability value as the next cluster center; and (3) repeat step (2) until K cluster centers have been selected.
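The three initialization steps above might look like the sketch below. The text does not give the formula for P(x); the weighting D(x)^2 / sum of D(x)^2 used here is the usual K-means++ choice and is an assumption, as are the function name `init_centers` and the `seed` parameter. Following the text, the sample with the maximum probability is selected rather than being drawn at random.

```python
import math
import random


def init_centers(samples, k, seed=0):
    """Select k initial cluster centers following steps (1) to (3)."""
    rng = random.Random(seed)
    centers = [rng.choice(samples)]  # step (1): random first center
    while len(centers) < k:
        # D(x): shortest distance from each sample to the centers chosen so far.
        d = [min(math.dist(x, c) for c in centers) for x in samples]
        total = sum(v * v for v in d)
        if total == 0:
            break  # every sample coincides with an already chosen center
        # Assumed weighting: P(x) = D(x)^2 / sum of D(x)^2.
        p = [v * v / total for v in d]
        # Step (2): take the sample with the maximum probability value.
        centers.append(samples[p.index(max(p))])
    return centers


samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
centers = init_centers(samples, k=3)
```

Because the maximum-probability sample is always the one farthest from the existing centers, this variant is deterministic once the first center is fixed.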

For each sample in the original sample set that is at the nearest, or nearly the nearest, distance from two or more cluster centers, the data clustering module 103 may calculate the local density gradient direction of the sample using the mean shift algorithm, calculate the similarity between that local density gradient direction and the direction from the sample toward each of the two or more cluster centers, and assign the sample to the cluster corresponding to the maximum similarity. Specifically, the data clustering module 103 may calculate the Euclidean distances from each sample x_i in the original sample set X to the centers of the K clusters (see Fig. 4, in which each arrow points from the central sample point to a cluster center), obtaining a distance set for each sample x_i. From the distance set it calculates the corresponding set of distance ratios for x_i and judges whether two or more cluster centers are at the nearest or nearly the nearest distance from x_i. If so, it records the corresponding cluster centers and calculates the local mean shift vector of x_i, which points from x_i in the direction of the greatest increase in estimated density (the density gradient direction for short). It then computes the similarity between the local density gradient direction of x_i and the directions from x_i to those cluster centers, and assigns x_i to the cluster with the maximum similarity.

The data pushing module 104 may push related data to each user group in real time according to the clustering result. In one example, the tv users may be automatically divided into K groups by a clustering algorithm, then T attributes (program types) in the centers of the clusters of each group are sorted, and the background directionally pushes related programs for each group according to the respective Top-N attributes (program types).
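As an illustration of the Top-N selection described above, the sketch below ranks the T attributes of one cluster center and returns the N highest-scoring program types. The function name `top_n_types` and the example type names are hypothetical.

```python
def top_n_types(center, type_names, n):
    """Return the n program types with the highest score in a cluster center."""
    ranked = sorted(zip(center, type_names), key=lambda pair: pair[0], reverse=True)
    return [name for _, name in ranked[:n]]


# Hypothetical cluster center over T = 3 program types.
push_types = top_n_types([0.1, 0.6, 0.3], ["news", "sports", "drama"], n=2)
```

The backend would then push programs of the returned types to the user group associated with that cluster.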

For convenience of explanation, the embodiments of the present invention are described below using the K-means++ clustering algorithm optimized with mean shift as an example, but those skilled in the art will understand that the present invention is also applicable to other clustering algorithms.

Fig. 2 is a flow diagram of a data processing method 200 based on meanshift optimization according to an embodiment of the invention. The method begins at step 201 with the data collection module 101 collecting user behavior data in real time as a raw sample set X.

In step 202, the initialize cluster center module 102 initializes the cluster centers based on the number of clusters and the original sample set. Algorithms for initializing cluster centers include, but are not limited to, K-means++, K-means, Canopy, and the like.

In step 203, the data clustering module 103 determines, for each sample in the original sample set, whether two or more cluster centers are at the nearest or nearly the nearest distance from the sample; if so, it calculates the local density gradient direction of the sample using the non-parametric mean shift estimation algorithm, calculates the similarity between the local density gradient direction of the sample and the direction from the sample toward each of the two or more cluster centers, and assigns the sample to the cluster corresponding to the maximum similarity; otherwise, it assigns the sample to the cluster whose center is closest to the sample. The specific implementation steps of the algorithm are described in further detail below with reference to Fig. 3.

In step 204, the data pushing module 104 pushes relevant data to each user group in real time according to the clustering result.

Fig. 3 shows a flow diagram of a meanshift-based clustering algorithm 300 according to one embodiment of the invention. The detailed steps of the algorithm 300 are as follows:

Step 1: Input the number of clusters K and the original sample set X, i.e.

$$X = \{x_1, x_2, \dots, x_N\}$$

Step 2: Initialize the K cluster centers using the K-means++ algorithm, namely

$$\{c_1, c_2, \dots, c_K\}$$

Step 3: Calculate the Euclidean distance from each original sample x_i in the original sample set X to the centers of the K clusters, denoted d(x_i, c_k), where k = 1, 2, 3, …, K. The Euclidean distance is given by the following equation: for points x and y in n-dimensional space,

$$d(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}$$

Thus, for each sample x_i, a distance set is obtained:

$$\{d(x_i, c_1), d(x_i, c_2), \dots, d(x_i, c_K)\}$$
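A direct transcription of the Euclidean distance and the per-sample distance set is sketched below; the function name `distance_set` is illustrative only.

```python
import math


def distance_set(x, centers):
    """Euclidean distance d(x, c_k) from sample x to each cluster center c_k."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))) for c in centers]


dists = distance_set((0.0, 0.0), [(3.0, 4.0), (0.0, 1.0)])
```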

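Step 4: Calculate the ratio of the distance from the original sample x_i to each other cluster center c_q to its distance from the nearest cluster center, denoted here c_p, to obtain the corresponding set of distance ratios

$$\left\{ \frac{d(x_i, c_q)}{d(x_i, c_p)} \;\middle|\; q = 1, \dots, K,\ q \neq p \right\}$$

where the distance d(x_i, c_p) from the original sample x_i to the cluster center c_p is the smallest in the distance set.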

Step 5: If the distance ratios d(x_i, c_q)/d(x_i, c_p) to all other cluster centers c_q are greater than a threshold ε, where c_p is the cluster center nearest to x_i, the sample is assigned by minimum distance, i.e. sample x_i is assigned to the cluster whose center is closest to it. Here ε may be a threshold set empirically.
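The ratio test that separates the minimum-distance case from the ambiguous case can be sketched as follows. The function name `near_centers`, the example threshold 1.1, and the handling of a sample that coincides with a center are assumptions; the text only states that ε is set empirically.

```python
def near_centers(distances, eps):
    """Indices of the nearest center and of every center whose ratio is <= eps."""
    p = min(range(len(distances)), key=distances.__getitem__)
    d_min = distances[p]
    if d_min == 0:
        # Assumption: a sample lying exactly on a center belongs to that center.
        return [p]
    near = [q for q, d in enumerate(distances) if q != p and d / d_min <= eps]
    return [p] + near


ambiguous = near_centers([2.0, 2.1, 5.0], eps=1.1)    # ratio 1.05 passes the test
unambiguous = near_centers([2.0, 3.0, 5.0], eps=1.1)  # all ratios exceed eps
```

A result with a single index corresponds to the minimum-distance assignment; two or more indices mean the mean shift comparison is needed.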

Step 6: If there exists a set of cluster centers {c_v} whose distance ratios satisfy

$$\frac{d(x_i, c_v)}{d(x_i, c_p)} \le \varepsilon$$

where c_p is the cluster center nearest to x_i, this indicates that the centers {c_v}, together with c_p, are at the nearest or nearly the nearest distance from sample x_i. In this case mean shift is used to decide which cluster x_i belongs to. The specific steps are as follows:

a) Taking sample x_i as the center and h as the radius, form a p-dimensional sphere, denoted S_h(x_i):

$$S_h(x_i) = \{\, y : \lVert y - x_i \rVert \le h \,\}$$

b) Find the mean shift vector of x_i, denoted M_h(x_i):

$$M_h(x_i) = \frac{1}{Z} \sum_{x_j \in S_h(x_i)} (x_j - x_i)$$

where Z is the number of samples x_j in S_h(x_i). Note that if Z is 0, x_i is regarded as an outlier and removed.

c) Find the directions from sample x_i to c_p and to each center in {c_v}, i.e.

$$c_p - x_i, \qquad c_v - x_i$$

d) Calculate the similarity between M_h(x_i) and each of the directions c_p - x_i and c_v - x_i using the cosine similarity algorithm, and assign x_i to the cluster with the maximum similarity. The cosine similarity algorithm evaluates the similarity of two vectors by calculating the cosine of the angle between them,

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

and the greater the cosine value, the higher the similarity.
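Sub-steps a) to d) can be sketched as below, using a flat (uniform) kernel over the sphere. All function names are illustrative. Two points are assumptions: the sample itself is not counted in Z, so that Z can actually be 0 for an isolated sample, and a zero-length vector is given similarity 0.

```python
import math


def mean_shift_vector(i, samples, h):
    """Mean offset from samples[i] to the other samples within radius h."""
    x = samples[i]
    nbrs = [y for j, y in enumerate(samples) if j != i and math.dist(x, y) <= h]
    if not nbrs:
        return None  # Z = 0: the sample is treated as an outlier
    return [sum(y[d] - x[d] for y in nbrs) / len(nbrs) for d in range(len(x))]


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either has zero length)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def pick_cluster(i, samples, centers, candidates, h):
    """Among candidate center indices, pick the one best aligned with M_h(x_i)."""
    m = mean_shift_vector(i, samples, h)
    if m is None:
        return None
    x = samples[i]
    sims = {v: cosine(m, [c - a for c, a in zip(centers[v], x)]) for v in candidates}
    return max(sims, key=sims.get)


samples = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.5), (1.0, -0.5)]
centers = [(2.0, 0.0), (-2.0, 0.0)]  # equidistant from samples[0]
chosen = pick_cluster(0, samples, centers, candidates=[0, 1], h=1.5)
```

In this toy case the neighbours of the first sample all lie to its right, so the mean shift vector points toward the first center and that cluster is chosen even though both centers are equally distant.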

Step 7: After every sample x_i in the original sample set X has been assigned, update the center of each cluster to obtain

$$c_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i, \qquad k = 1, 2, \dots, K$$

where C_k is the set of samples assigned to cluster k. Then calculate the objective function of the whole clustering, denoted E^(1), where the objective function is expressed as follows:

$$E = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2$$

Step 8: When E^(t+1) is approximately equal to E^(t), the algorithm has converged and the clustering result is output; otherwise, Steps 3 to 7 are executed again.
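The center update and convergence check of Steps 7 and 8 might be sketched as follows. The objective is assumed to be the usual K-means sum of squared errors, since the expression itself is not legible in the text. The tolerance `tol`, the function names, and keeping the old center for an empty cluster are also assumptions.

```python
def update_centers(samples, labels, centers):
    """Recompute each center as the mean of the samples assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [x for x, lab in zip(samples, labels) if lab == k]
        if not members:
            new_centers.append(list(old))  # assumption: keep an empty cluster's center
            continue
        new_centers.append(
            [sum(x[d] for x in members) / len(members) for d in range(len(old))]
        )
    return new_centers


def objective(samples, labels, centers):
    """Sum of squared distances from each sample to its assigned center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, centers[lab]))
        for x, lab in zip(samples, labels)
    )


def has_converged(e_prev, e_new, tol=1e-6):
    """E^(t+1) is treated as approximately equal to E^(t) within tol."""
    return abs(e_new - e_prev) <= tol


samples = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 1]
centers = update_centers(samples, labels, [(0.0, 0.0), (9.0, 0.0)])
e = objective(samples, labels, centers)
```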

What has been described above includes examples of aspects of the claimed subject matter. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but one of ordinary skill in the art may recognize that many further combinations and permutations of the claimed subject matter are possible. Accordingly, the disclosed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims.

Claims

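1. A method of data processing, the method comprising:

collecting user behavior data in real time as an original sample set;

initializing cluster centers according to the number of clusters and the original sample set;

determining, for each sample in the original sample set, whether two or more cluster centers are closest in distance to the sample,

if so, calculating the local density gradient direction of the sample using mean shift, calculating a similarity between the local density gradient direction of the sample and the direction from the sample toward each of the two or more cluster centers, and assigning the sample to the cluster corresponding to the maximum similarity;

otherwise, assigning the sample to the cluster whose center is closest to the sample; and

pushing related data to each user group in real time according to the clustering result.

2. The method of claim 1, wherein determining whether two or more cluster centers are closest in distance to the sample further comprises:

calculating the Euclidean distances from the sample to the centers of the K clusters to obtain a distance set for the sample, wherein K is the number of clusters;

calculating the ratio of the distance between the sample and each other cluster center c_q to the smallest distance in said distance set, to obtain a corresponding set of distance ratios;

wherein, if a subset of the distance ratios exists whose members are all less than or equal to ε, it is determined that the corresponding cluster centers, together with the nearest cluster center, are closest to the sample, where ε is a threshold set empirically.

3. The method of claim 1, wherein calculating the local density gradient direction of the sample using mean shift further comprises:

calculating the local mean shift vector of the sample, wherein the vector points from the sample itself in the direction of the greatest increase in estimated density.

4. The method of claim 1, wherein calculating the similarity further comprises:

calculating the similarity between the local density gradient direction of the sample and the direction from the sample toward each of the two or more cluster centers using a cosine similarity algorithm, wherein the greater the cosine value, the higher the similarity.

5. The method of claim 1, wherein the cluster centers are initialized by a K-means++ clustering algorithm, such that the distances between the respective cluster centers are as large as possible.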

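6. A data processing apparatus, characterized in that the apparatus comprises:

a data collection module configured to collect user behavior data in real time as an original sample set;

an initialize cluster center module configured to initialize cluster centers according to the number of clusters and the original sample set;

a data clustering module configured to:

determine, for each sample in the original sample set, whether two or more cluster centers are closest in distance to the sample,

if so, calculate the local density gradient direction of the sample using mean shift, calculate a similarity between the local density gradient direction of the sample and the direction from the sample toward each of the two or more cluster centers, and assign the sample to the cluster corresponding to the maximum similarity,

otherwise, assign the sample to the cluster whose center is closest to the sample; and

a data push module configured to push relevant data in real time to the respective user groups associated with the respective clusters based on the clustering result.

7. The apparatus of claim 6, wherein determining whether two or more cluster centers are closest in distance to the sample further comprises:

calculating the Euclidean distances from the sample to the centers of the K clusters to obtain a distance set for the sample, wherein K is the number of clusters;

calculating the ratio of the distance between the sample and each other cluster center c_q to the smallest distance in said distance set, to obtain a corresponding set of distance ratios;

wherein, if a subset of the distance ratios exists whose members are all less than or equal to ε, it is determined that the corresponding cluster centers, together with the nearest cluster center, are closest to the sample, where ε is a threshold set empirically.

8. The apparatus of claim 6, wherein calculating the local density gradient direction of the sample using mean shift further comprises:

calculating the local mean shift vector of the sample, wherein the vector points from the sample itself in the direction of the greatest increase in estimated density.

9. The apparatus of claim 6, wherein calculating the similarity further comprises:

calculating the similarity between the local density gradient direction of the sample and the direction from the sample toward each of the two or more cluster centers using a cosine similarity algorithm, wherein the greater the cosine value, the higher the similarity.

10. The apparatus of claim 6, wherein the cluster centers are initialized by a K-means++ clustering algorithm, such that the distances between the respective cluster centers are as large as possible.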