CN113850281B - MEANSHIFT optimization-based data processing method and device

Info

Publication number: CN113850281B
Application number: CN202110161944.7A
Authority: CN
Inventors: 吕超; 张继东; 沈志平; 吴浩宇; 吴风蛟
Original assignee: Tianyi Digital Life Technology Co Ltd
Current assignee: Tianyi Digital Life Technology Co Ltd
Priority date: 2021-02-05
Filing date: 2021-02-05
Publication date: 2024-03-12
Anticipated expiration: 2041-02-05
Also published as: CN113850281A; WO2022166380A1

Abstract

The invention provides a data processing method and device based on mean shift. The method comprises the following steps: collecting user behavior data in real time as an original sample set; initializing a class cluster center according to the number of class clusters and the original sample set; determining, for each sample in the original set of samples, whether there are two or more cluster centers closest to the sample, if so, calculating a local density gradient direction of the sample using mean shift, calculating a similarity between the local density gradient direction of the sample and a direction of the sample toward each of the two or more cluster centers, and classifying the sample into a cluster corresponding to a maximum similarity; otherwise, dividing the sample into a class cluster closest to the center of the class cluster; and pushing related data to each user group in real time according to the clustering result.

Description

MEANSHIFT optimization-based data processing method and device

Technical Field

The present invention relates to the field of data mining and machine learning, and more particularly to a MEANSHIFT-based optimized data processing method and apparatus.

Background

With the rapid development of modern information technology, the world has entered the era of Internet plus big data. Big data is profoundly changing the way people think, produce and live, and its deep integration with various industries is generating unprecedented social and commercial value. Many data processing methods based on data mining and machine learning have emerged in the course of big data development. Among them, the traditional K-means algorithm randomly selects K samples from N samples as initial cluster centers and divides each original sample into the cluster closest to it based on a minimum distance rule. When the distances between a sample and the centers of one or more other clusters are close to the minimum distance, the clustering effect of K-means is not ideal. How to improve the clustering effect in this scenario has become an urgent problem to be solved.

The Chinese patent application (CN 201911127104.8) proposes a K-means clustering method based on density Canopy, which takes the density Canopy cluster as a preprocessing step of a K-means algorithm, and compared with the traditional K-means algorithm, the clustering accuracy is improved, but the method does not consider the relation between an original sample and other clusters, only ensures local optimization, but cannot obtain global optimization.

The Chinese patent application (CN 201810570097.8) proposes a K-means clustering method based on a neural network. It addresses the problems that existing K-means iteratively optimizes the cluster centers and the label assignment in two independent steps, which makes inference slow, that it cannot process new data, large-scale data or online data, and that it is sensitive to initial values. However, the method does not consider the scenario in which a sample is nearly equally close to a plurality of clusters, and the sample cannot be reasonably divided in that scenario.

Therefore, in order to divide samples more reasonably and further improve clustering accuracy when a sample is nearly equally close to a plurality of class clusters, it is desirable to provide an improved data processing method.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

The invention provides a data processing method and a data processing device based on mean shift optimization, which consider the relation between an original sample and other clusters, so that the edges of each cluster and the peripheral areas of the clusters are divided more reasonably, the clusters are compact, and the clustering precision and speed are greatly improved.

According to an aspect of the present invention, there is provided a data processing method, the method comprising:

collecting user behavior data in real time as an original sample set;

initializing a class cluster center according to the number of class clusters and the original sample set;

for each sample in the original sample set, determining whether there are two or more class cluster centers closest to the sample,

if present, then

calculating a local density gradient direction of the sample using mean shift,

calculating a similarity between the local density gradient direction of the sample and a direction of the sample toward each of the two or more class cluster centers, and

dividing the sample into the class cluster corresponding to the maximum similarity;

otherwise, dividing the sample into the class cluster whose center is closest; and

pushing relevant data to each user group in real time according to the clustering result.

According to one embodiment of the invention, determining whether there are two or more class cluster centers closest to the sample further comprises:

calculating Euclidean distances from the sample to the K class cluster centers to obtain a distance set for the sample, wherein K is the number of class clusters;

calculating the ratio of the distance from the sample to each other class cluster center $c_q$ to the smallest distance in the distance set, to obtain a corresponding set of distance ratios;

wherein if there is a set of class cluster centers whose distance ratios are not greater than ε, it is determined that those class cluster centers, together with the nearest class cluster center, are closest to the sample, where ε is a threshold set empirically.

According to a further embodiment of the invention, calculating the local density gradient direction of the sample using mean shift further comprises:

a local mean shift vector of the sample is calculated, wherein the vector represents the direction of maximum increase of the estimated density relative to the sample itself.

According to a further embodiment of the present invention, calculating the similarity further comprises:

a cosine similarity algorithm is utilized to calculate a similarity between a local density gradient direction of the sample and a direction of the sample toward each of the two or more class cluster centers, wherein the greater the cosine value, the higher the similarity.

According to a further embodiment of the present invention, the initializing cluster centers is performed by a K-means++ clustering algorithm, wherein the distance between the cluster centers is as large as possible.

According to another aspect of the present invention, there is provided a data processing apparatus, the apparatus comprising:

the data acquisition module is configured to acquire user behavior data in real time as an original sample set;

an initialization class cluster center module configured to initialize class cluster centers according to a number of class clusters and the original sample set;

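a data clustering module configured to:

for each sample in the original sample set, determine whether there are two or more class cluster centers closest to the sample,

if present, then

calculate a local density gradient direction of the sample using mean shift, calculate a similarity between the local density gradient direction of the sample and a direction of the sample toward each of the two or more class cluster centers, and divide the sample into the class cluster corresponding to the maximum similarity;

otherwise, divide the sample into the class cluster whose center is closest; and

a data pushing module configured to push related data to each user group associated with each class cluster in real time based on the clustering result.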

Compared with the scheme in the prior art, the data processing method and device based on the meanshift optimization provided by the invention have at least the following advantages:

(1) By considering the relation between the original sample and other clusters, the edges of each cluster and the peripheral areas of the clusters are divided more reasonably, the clusters are compact, and the clustering effect is improved, so that the global optimum is achieved.

(2) Compared with the traditional K-means algorithm, the center positions of K class clusters can be estimated more accurately, so that the K class clusters can be converged rapidly, and the iteration times are reduced.

These and other features and advantages will become apparent upon reading the following detailed description and upon reference to the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of aspects as claimed.

Drawings

So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only certain typical aspects of this invention and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.

FIG. 1 illustrates an exemplary architecture diagram of a data processing apparatus based on a meanshift optimization in accordance with one embodiment of the present invention.

FIG. 2 shows a flow chart of a method of data processing based on a meanshift optimization in accordance with one embodiment of the present invention.

FIG. 3 shows a flow chart of a mean shift based clustering algorithm according to one embodiment of the invention.

FIG. 4 illustrates an example of a central sample two-dimensional region according to one embodiment of the invention.

Detailed Description

The features of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings.

FIG. 1 is an exemplary architecture diagram of a mean shift optimization-based data processing apparatus 100 in accordance with one embodiment of the present invention. As shown in FIG. 1, the apparatus 100 includes a data acquisition module 101, an initialize class cluster center module 102, a data clustering module 103 and a data pushing module 104.

The data acquisition module 101 may acquire user data in real time as an original sample set and store it in a big data platform, classified according to data characteristics. As an example, the data acquisition module 101 may collect, in real time, behavior data of users watching television programs as the original sample set. Each day, the history of television programs watched by user i over the preceding 30 days is counted, the viewing time is accumulated for each of the T program types, and the result is normalized to a score, i.e., $\text{time}_t / (\text{time}_1 + \text{time}_2 + \dots + \text{time}_T)$. The scores of each user for the program types are stored as an original sample $x_i$.
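The normalization described above can be illustrated with a minimal Python sketch. The function name `viewing_scores`, the record layout (program type, seconds watched) and the sample program types are assumptions made for illustration only; they do not come from the embodiment.

```python
from collections import defaultdict


def viewing_scores(history, program_types):
    """Turn viewing records into one normalized score per program type.

    history: iterable of (program_type, seconds_watched) pairs for one user
             over the look-back window (30 days in the example above).
    program_types: ordered list of the T program types.
    Returns a list of T scores summing to 1, or all zeros if nothing was watched.
    """
    totals = defaultdict(float)
    for program_type, seconds in history:
        totals[program_type] += seconds
    grand_total = sum(totals[t] for t in program_types)
    if grand_total == 0:
        return [0.0] * len(program_types)
    return [totals[t] / grand_total for t in program_types]


# Hypothetical data: three program types, three viewing records.
types = ["news", "sports", "drama"]
history = [("news", 600), ("drama", 1800), ("news", 600)]
x_i = viewing_scores(history, types)
print(x_i)  # [0.4, 0.0, 0.6]
```

Each user then contributes one T-dimensional vector, so users with similar viewing proportions lie close together regardless of their total viewing time.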

The initialize class cluster center module 102 may initialize the class cluster centers based on the number of class clusters and the original sample set. As one example, the initialize class cluster center module 102 may initialize K class cluster centers with a K-means++ clustering algorithm so that the distances between them are as large as possible. The K-means++ algorithm comprises the following specific steps: (1) first, randomly select a sample point $x_i$ from the original sample set X as the first initial cluster center $c_1$; (2) then calculate the shortest distance D(x) between each sample point $x_i$ and the currently existing cluster centers, calculate the probability P(x) of each sample point $x_i$ being selected as the next cluster center, and finally select the sample point $x_i$ corresponding to the maximum probability value as the next cluster center; and (3) repeat step (2) until K cluster centers are selected.
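The three initialization steps can be sketched as follows. The text does not give the expression for P(x), so the sketch assumes the usual K-means++ form, D(x)^2 divided by the sum of D(x)^2 over all samples. The function name `init_centers` and the `seed` parameter are hypothetical.

```python
import math
import random


def init_centers(samples, k, seed=None):
    """Pick k initial centers: the first at random, then repeatedly the
    sample with the largest selection probability P(x)."""
    rng = random.Random(seed)
    centers = [list(rng.choice(samples))]  # step (1)
    while len(centers) < k:                # step (3)
        # Step (2): D(x) is the shortest distance to any existing center.
        d = [min(math.dist(x, c) for c in centers) for x in samples]
        total = sum(v * v for v in d)
        if total == 0:
            raise ValueError("fewer than k distinct samples")
        # Assumed form of P(x): D(x)^2 normalized over all samples.
        p = [v * v / total for v in d]
        best = max(range(len(samples)), key=p.__getitem__)
        centers.append(list(samples[best]))
    return centers


# Hypothetical two-dimensional samples.
samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 9.0)]
centers = init_centers(samples, 3, seed=0)
print(centers)
```

The commonly described K-means++ draws the next center at random with probability proportional to D(x)^2. The sketch follows the text above and takes the maximum instead, which always picks the sample farthest from the existing centers.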

The data clustering module 103 may, for each sample in the original sample set that is nearly equally close to two or more class cluster centers, calculate a local density gradient direction for that sample using a mean shift algorithm; calculate a similarity between the local density gradient direction of the sample and the direction from the sample toward each of the two or more class cluster centers; and assign the sample to the class cluster corresponding to the maximum similarity. In particular, the data clustering module 103 may calculate the Euclidean distance from each sample $x_i$ in the original sample set X to the K class cluster centers (see FIG. 4, in which each arrow points from the central sample point to a class cluster center), obtaining a distance set for each sample $x_i$. From the distance set it calculates the corresponding distance ratio set of sample $x_i$ and judges whether two or more class cluster centers are nearly equally close to sample $x_i$. If so, it records the corresponding class cluster centers and calculates the local mean shift vector of sample $x_i$, which represents the direction of maximum increase in estimated density relative to the sample itself (referred to simply as the density gradient direction). It then calculates the similarity between the local density gradient direction of sample $x_i$ and the direction from sample $x_i$ to each of the recorded class cluster centers, and divides sample $x_i$ into the class cluster with the maximum similarity.
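The check for nearly equidistant class cluster centers can be sketched in Python as below. The sketch assumes that a center counts as tied when its distance ratio to the minimum distance does not exceed the threshold; the function name `near_tie_centers` and the value 1.1 for the threshold are illustrative assumptions.

```python
import math


def near_tie_centers(x, centers, eps):
    """Return (m_star, tied): the index of the nearest center and the indices
    of the other centers whose distance ratio to the minimum is within eps."""
    dists = [math.dist(x, c) for c in centers]  # distance set of the sample
    m_star = min(range(len(centers)), key=dists.__getitem__)
    d_min = dists[m_star]
    tied = []
    for q, d_q in enumerate(dists):
        if q == m_star:
            continue
        # Same test as d_q / d_min <= eps, written without a division so
        # that a sample lying exactly on its nearest center is handled.
        if d_q <= eps * d_min:
            tied.append(q)
    return m_star, tied


centers = [(1.0, 0.0), (0.0, 1.05), (5.0, 5.0)]
print(near_tie_centers((0.0, 0.0), centers, eps=1.1))  # (0, [1])
print(near_tie_centers((0.0, 0.0), centers, eps=1.0))  # (0, [])
```

An empty `tied` list corresponds to the ordinary minimum distance rule; a non-empty list marks the sample for the mean shift comparison.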

The data pushing module 104 may push relevant data to each user group in real time according to the clustering result. In one example, television users may be automatically divided into K groups by the clustering algorithm, the T attributes (program types) of each group's class cluster center are then sorted, and the background system pushes relevant programs to each group in a targeted manner according to that group's Top-N attributes (program types).
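Selecting the Top-N program types of each class cluster center can be sketched as follows. The function name `top_n_types`, the program type names and the center values are made up for illustration.

```python
def top_n_types(center, program_types, n):
    """Return the n program types with the highest score in a cluster center."""
    ranked = sorted(zip(program_types, center),
                    key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Hypothetical cluster centers over four program types.
types = ["news", "sports", "drama", "kids"]
centers = [[0.1, 0.6, 0.2, 0.1], [0.5, 0.05, 0.15, 0.3]]
for k, c in enumerate(centers):
    print(k, top_n_types(c, types, 2))
# 0 ['sports', 'drama']
# 1 ['news', 'kids']
```

Because each center is the mean of its members' normalized scores, its largest components indicate the program types that the group as a whole watches most.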

For ease of illustration, embodiments of the present invention will be described below using a mean shift-based K-means++ clustering algorithm as an example, but those skilled in the art will appreciate that the present invention is equally applicable to other clustering algorithms.

FIG. 2 is a flow chart of a method 200 of data processing based on a meanshift optimization in accordance with one embodiment of the present invention. The method starts in step 201 with the data acquisition module 101 acquiring user behavior data in real time as an original sample set X.

In step 202, the initialize class cluster center module 102 initializes the class cluster center based on the number of class clusters and the original sample set. Algorithms for initializing cluster-like centers include, but are not limited to, K-means++, K-means, canopy, and the like.

At step 203, the data clustering module 103 determines, for each sample in the original sample set, whether there are two or more class cluster centers that are nearly equally close to the sample. If so, it calculates a local density gradient direction of the sample using the non-parametric mean shift estimation algorithm, calculates a similarity between the local density gradient direction of the sample and the direction from the sample toward each of the two or more class cluster centers, and divides the sample into the class cluster corresponding to the maximum similarity. Otherwise, the sample is divided into the class cluster whose center is closest. The specific implementation steps of this algorithm are described in further detail below with reference to FIG. 3.
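The tie-breaking part of step 203 can be sketched in Python as below. The sketch assumes a flat-kernel mean shift vector, that is, the average offset from the sample to its neighbors within a radius h, and it excludes points that coincide with the sample. The function names and the example data are hypothetical.

```python
import math


def mean_shift_vector(x, samples, h):
    """Average offset from x to the samples within radius h of x.
    Returns None when x has no neighbors, in which case it is treated
    as an outlier."""
    neighbors = [s for s in samples if 0 < math.dist(x, s) <= h]
    if not neighbors:
        return None
    return [sum(s[j] - x[j] for s in neighbors) / len(neighbors)
            for j in range(len(x))]


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def break_tie(x, samples, centers, candidates, h):
    """Among the candidate center indices, pick the one whose direction from
    x best matches the mean shift vector. Returns None for an outlier."""
    m = mean_shift_vector(x, samples, h)
    if m is None:
        return None

    def score(k):
        direction = [c - a for c, a in zip(centers[k], x)]
        return cosine(m, direction)

    return max(candidates, key=score)


# The sample at the origin is equally far from both centers, but its
# neighbors lie on the right, so the density gradient points to center 0.
samples = [(0.0, 0.0), (0.5, 0.1), (0.6, -0.1), (0.4, 0.0), (-3.0, 0.0)]
centers = [(1.0, 0.0), (-1.0, 0.0)]
print(break_tie((0.0, 0.0), samples, centers, [0, 1], h=1.0))  # 0
```

If the neighbors are perfectly symmetric, the mean shift vector is zero and every candidate scores 0.0; `max` then returns the first candidate, so callers may want to order the candidates by distance.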

In step 204, the data pushing module 104 pushes relevant data to each user group in real time according to the clustering result.

FIG. 3 shows a flowchart of a mean shift-based clustering algorithm 300 according to one embodiment of the invention. The detailed steps of the algorithm 300 are as follows:

Step 1: Input the number of class clusters K and the original sample set X, namely $X = \{x_1, x_2, \dots, x_N\}$.

Step 2: Initialize K class cluster centers using a K-means++ algorithm, i.e., $\{c_1, c_2, \dots, c_K\}$.

Step 3: computing each original sample X in the original sample set X _i The Euclidean distance to the center of the K cluster-like clusters is denoted as d (x _i ,c _k ) Where k=1, 2,3, …, K, where the euclidean distance is found by the following formula: for points x and y in the n-dimensional space,thus, for each sample x _i Obtaining a distance set

Step 4: calculation of the original sample x _i From the center c of other clusters _q Distance from cluster centerTo obtain a corresponding distance ratio set +.>Wherein the original sample x _i Distance from cluster center->Is the smallest.

Step 5: if it isAll greater than the threshold epsilon, then the rule of minimum distance is used to divide, i.e. sample x _i Into the cluster closest to the center of the cluster, where ε may be a threshold set by human experience.

Step 6: if there is a setThen indicate +.>Cluster center and sample x _i The distance is nearest and approximate, and the sample x is determined by means of mean shift _i Belonging to which cluster. The method comprises the following specific steps:

a) Take the p-dimensional sphere centered at sample $x_i$ with radius h, denoted $S_h(x_i)$.

b) Calculate the mean shift vector of $x_i$, denoted $M_h(x_i)$.

Note that if Z is 0, i.e., no other samples fall within $S_h(x_i)$, then $x_i$ is considered an outlier and is rejected.

c) Obtain the directions from sample $x_i$ to $c_{m^*}$ and to each center in $\{c_v\}$, i.e., the vectors $c_{m^*} - x_i$ and $c_v - x_i$.

d) Calculate the similarity between $M_h(x_i)$ and each of these direction vectors with a cosine similarity algorithm, and divide $x_i$ into the class cluster with the maximum similarity. The cosine similarity algorithm evaluates the similarity of two vectors by calculating the cosine of the angle between them; the larger the cosine value, the higher the similarity.

Step 7: After each sample $x_i$ in the original sample set X has been divided, update the center of each class cluster and calculate the objective function of the whole clustering, denoted $E^{(1)}$.

Step 8: When $E^{(t+1)}$ is approximately equal to $E^{(t)}$, the clustering result has converged and is output; otherwise, steps 3 to 7 continue to be executed.

What has been described above includes examples of aspects of the claimed subject matter. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but one of ordinary skill in the art may recognize that many further combinations and permutations of the claimed subject matter are possible. Accordingly, the disclosed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims.

Claims

1. A method of data processing based on MEANSHIFT optimization, the method comprising:

collecting user behavior data in real time as an original sample set;

if so, the local density gradient direction of the sample is calculated using mean shift,

2. The method of claim 1, wherein determining whether there are two or more class cluster centers closest to the sample further comprises:

calculating the ratio of the distance from the sample to each other class cluster center $c_q$ to the distance, in the distance set, from the sample to the class cluster center $c_{m^*}$, to obtain a corresponding distance ratio set

3. The method of claim 1, wherein calculating the local density gradient direction of the sample using mean shift further comprises:

4. The method of claim 1, wherein calculating a similarity further comprises:

a cosine similarity algorithm is utilized to calculate a similarity between a local density gradient direction of the sample and a direction of the sample toward each of the two or more class cluster centers.

5. The method of claim 1, wherein initializing class cluster centers is performed by a K-means++ clustering algorithm.

6. A MEANSHIFT optimized data processing device, the device comprising:

a data clustering module configured to:

7. The apparatus of claim 6, wherein determining whether there are two or more class cluster centers closest to the sample further comprises:

calculating the ratio of the distance from the sample to each other class cluster center $c_q$ to the distance, in the distance set, from the sample to the class cluster center $c_{m^*}$, to obtain a corresponding distance ratio set

8. The apparatus of claim 6, wherein calculating the local density gradient direction of the sample using mean shift further comprises:

9. The apparatus of claim 6, wherein calculating a similarity further comprises:

10. The apparatus of claim 6, wherein the initializing class cluster centers is performed by a K-means++ clustering algorithm.