
CN104965863A - Object clustering method and apparatus

Info

Publication number: CN104965863A
Application number: CN201510303335.5A
Authority: CN
Inventors: 吕俊; 杨诗; 邓宇; 吕鹏; 罗维
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Hongxiang Technical Service Co Ltd
Priority date: 2015-06-05
Filing date: 2015-06-05
Publication date: 2015-10-07
Anticipated expiration: 2035-06-05
Also published as: CN104965863B

Abstract

The present invention discloses an object clustering method and apparatus, relating to the field of computer technologies. The method comprises: obtaining to-be-clustered samples and an access weight of each sample, wherein the access weight is the importance of the sample during accessing, and the sample comprises brand data; classifying each sample as a classification object, setting the coordinates of the corresponding sample as center coordinates of the classification object and setting the access weight of the corresponding sample as an access weight of the classification object; and clustering the classification objects according to the access weights and the center coordinates of the classification objects to obtain brand classifications each comprising at least one piece of brand data. The object clustering method and apparatus achieve the beneficial effect that the clustering result is accurate, so that the amount of calculation in subsequent processing is small, and the deviation is small.

Description

Object clustering method and apparatus

Technical field

The present invention relates to the field of computer technology, and in particular to an object clustering method and apparatus.

Background technology

In data processing, the process of dividing a set of physical or abstract objects into multiple classes composed of similar objects is called clustering. A class (cluster) generated by clustering is a set of data objects that are similar to the other objects in the same class (cluster) and different from the objects in other classes (clusters). The term "class" is used below; note that "class" and "cluster" have the same meaning here.

The Internet contains a large amount of brand data, and this brand data needs to be clustered to facilitate subsequent processing such as targeted advertising. The prior art includes a hierarchical clustering method: it uses the center point of each class to calculate the distance between every two classes and merges the two nearest classes into a new class. However, it calculates the center of the new class only from the number of samples in the two classes, then calculates the distances between the classes for the next round, and repeats the clustering until a termination condition is reached.

In this clustering method, the center point of a new class is calculated from the sample counts of the two merged classes, which in practice deviates considerably from the center of gravity of the sample distribution. The brand classification obtained by clustering is therefore not accurate enough, which makes subsequent processing computationally expensive and its results more biased.

Summary of the invention

In view of the above problems, the present invention is proposed to provide an object clustering apparatus and a corresponding object clustering method that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, an object clustering method is provided, comprising:

obtaining samples to be clustered and the access weight of each sample, wherein the access weight is the importance of the sample when it is accessed, and the sample comprises brand data;

classifying each sample as one classification object, using the coordinates of the corresponding sample as the center coordinates of the classification object, and using the access weight of the corresponding sample as the access weight of the classification object;

clustering the classification objects according to the access weight and center coordinates of each classification object, to obtain brand classifications each comprising at least one piece of brand data.

Preferably, the step of clustering the classification objects according to the access weight and center coordinates of each classification object comprises:

for each classification object, calculating the distance between every two classification objects according to the center coordinates of each classification object;

merging the two nearest classification objects into a new classification object, and calculating the center coordinates and access weight of the new classification object according to the center coordinates and access weights of the classification objects;

determining whether a merge termination condition is reached; if it is not reached, returning, with the new classification object together with the classification objects not merged in the current round, to the step of calculating the distance between every two classification objects according to the center coordinates of each classification object, until the merge termination condition is reached.
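The three steps above (pairwise distances, merging the nearest pair, checking for termination) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `ClassificationObject`, `merge`, and `cluster` are hypothetical; Euclidean distance stands in for whichever distance is chosen; the merged center is assumed to be the access-weight-weighted mean of the two centers; the merged access weight is assumed to be the sum of the two weights; and the termination condition is assumed to be a target number of classes.

```python
from dataclasses import dataclass, field
from itertools import combinations
import math


@dataclass
class ClassificationObject:
    center: list                                  # center coordinates
    weight: float                                 # access weight
    members: list = field(default_factory=list)   # brand names in this class


def distance(a, b):
    # Euclidean distance between two centers (assumption; any metric could be used).
    return math.dist(a.center, b.center)


def merge(a, b):
    # Assumed merge rule: weighted mean of the centers, sum of the access weights.
    total = a.weight + b.weight
    center = [(a.weight * x + b.weight * y) / total
              for x, y in zip(a.center, b.center)]
    return ClassificationObject(center, total, a.members + b.members)


def cluster(objects, target_count):
    # Repeat: find the nearest pair, merge it, and continue with the merged
    # object plus all objects that were not merged in this round.
    objects = list(objects)
    while len(objects) > max(target_count, 1):
        i, j = min(combinations(range(len(objects)), 2),
                   key=lambda p: distance(objects[p[0]], objects[p[1]]))
        merged = merge(objects[i], objects[j])
        objects = [o for k, o in enumerate(objects) if k not in (i, j)] + [merged]
    return objects
```

With a weighted mean, a class with a high access weight pulls the merged center toward itself, which is the effect the summary attributes to using access weights instead of sample counts.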

Preferably, the access weight comprises:

a browse weight for the sample being browsed on the network;

and/or a click weight for the sample being clicked on the network;

and/or a time span weight for the interval between the moment the sample was most recently browsed and a reference instant;

and/or a website weight of the website on which the sample is located;

and/or a region weight of the region in which the user's browsing behavior occurred when the sample was most recently browsed;

and/or a search weight for the sample being searched for.

Preferably, the step of calculating, for each classification object, the distance between every two classification objects according to the center coordinates of each classification object comprises:

for each classification object, constructing a center vector from the center coordinates;

calculating the cosine distance between the two center vectors corresponding to every two classification objects.
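The cosine distance between two center vectors can be sketched as follows. The text does not define the term, so this sketch assumes the common convention that cosine distance is one minus cosine similarity; the function name `cosine_distance` is hypothetical.

```python
import math


def cosine_distance(u, v):
    # Assumed convention: cosine distance = 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Under this convention, vectors pointing in the same direction have a distance close to 0 regardless of their length, and orthogonal vectors have a distance of 1.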

Preferably, the step of obtaining the initial samples comprises:

for each sample, obtaining the corresponding access weight according to a service identifier.

Preferably, after the step of clustering the classification objects according to the access weight and center coordinates of each classification object, the method further comprises:

for each user, labeling the user with a brand classification label according to the user's access behavior data for the brand data in each class.

Preferably, after the step of labeling each user with a brand classification label according to the user's access behavior data for the brand data in each class, the method further comprises:

sending, according to the brand classification label of the user, a third object corresponding to the label to the terminal of the user, wherein the third object comprises advertisement data for the brand data.

Preferably, calculating the center coordinates of the new classification object according to the center coordinates and access weights of the classification objects comprises:

calling, according to the service identifier, the corresponding coordinate calculation function to calculate the center coordinates of the new class.

According to another aspect of the present invention, an object clustering apparatus is also disclosed, comprising:

an initial object acquisition module, adapted to obtain samples to be clustered and the access weight of each sample, wherein the access weight is the importance of the sample when it is accessed;

a division module, adapted to classify each sample as one classification object, use the coordinates of the corresponding sample as the center coordinates of the classification object, and use the access weight of the corresponding sample as the access weight of the classification object;

a clustering module, adapted to cluster the classification objects according to the access weight and center coordinates of each classification object.

Preferably, the clustering module comprises:

a distance calculation module, adapted to calculate, for each classification object, the distance between every two classification objects according to the center coordinates of each classification object;

an aggregation module, adapted to merge the two nearest classification objects into a new classification object, and to calculate the center coordinates and access weight of the new classification object according to the center coordinates and access weights of the classification objects;

a judgment module, adapted to determine whether a merge termination condition is reached and, if it is not reached, to return, with the new classification object together with the classification objects not merged in the current round, to the step of calculating the distance between every two classification objects according to the center coordinates of each classification object, until the merge termination condition is reached.

Preferably, the access weight comprises:

a browse weight for the sample being browsed on the network;

and/or a click weight for the sample being clicked on the network;

and/or a website weight of the website on which the sample is located;

and/or a search weight for the sample being searched for.

Preferably, the distance calculation module comprises:

a center vector construction module, adapted to construct, for each classification object, a center vector from the center coordinates;

a cosine distance calculation module, adapted to calculate the cosine distance between the two center vectors corresponding to every two classification objects.

Preferably, the initial object acquisition module comprises:

an access weight acquisition module, adapted to obtain, for each sample, the corresponding access weight according to a service identifier.

Preferably, the apparatus further comprises:

a labeling module, adapted to label each user with a brand classification label according to the user's access behavior data for the brand data in each class.

Preferably, the apparatus further comprises:

an object sending module, adapted to send, according to the brand classification label of the user, a third object corresponding to the label to the terminal of the user, wherein the third object comprises advertisement data for the brand data.

Preferably, the first aggregation module comprises:

a calculation function selection module, adapted to call, according to the service identifier, the corresponding coordinate calculation function to calculate the center coordinates of the new class.

The object clustering method according to the present invention obtains initial brand data together with access weights, where an access weight indicates the importance of the brand data when it is accessed, and then lets the access weights take part in the clustering process, so that brand data with a high access weight contributes more to the clustering. This solves the problem in conventional clustering, where merging according to the amount of brand data in each class leaves the brand data loosely aggregated and the clustering insufficiently accurate, which in turn makes subsequent processing computationally expensive and its results heavily biased. The method thereby achieves the beneficial effects of an accurate clustering result, a small amount of computation in subsequent processing, and low deviation.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features, and advantages of the present invention become more apparent, specific embodiments of the present invention are set forth below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:

Fig. 1 shows a schematic flowchart of an object clustering method according to an embodiment of the present invention;

Fig. 2 shows a schematic flowchart of an object clustering method according to an embodiment of the present invention;

Fig. 3 shows a schematic flowchart of an object clustering method according to an embodiment of the present invention;

Fig. 4 shows a schematic structural diagram of an object clustering apparatus according to an embodiment of the present invention;

Fig. 5 shows a schematic structural diagram of an object clustering apparatus according to an embodiment of the present invention; and

Fig. 6 shows a schematic structural diagram of an object clustering apparatus according to an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.

One of the core ideas of the present invention is as follows. The embodiments of the present invention are directed at brand data, which includes user access data, such as the browsing data, click data, purchase data, and favorites data generated when users access the brand; the brand data of the embodiments is a collective term for such user access data. For each piece of brand data, its access weight is obtained; the access weight represents the importance of the brand data when it is accessed. The brand data and access weights are then clustered as samples. During clustering, brand data with a high access weight contributes more, so that the center of a classification object shifts toward the side with the higher access weight. The final brand data clustering result is therefore more accurate, which reduces both the amount of computation in subsequent processing and the deviation of its results.

Embodiment one

Referring to Fig. 1, a schematic flowchart of an object clustering method of the present invention is shown, which may specifically comprise:

Step 110: obtain samples to be clustered and the access weight of each sample. The access weight is the importance of the sample when it is accessed; the sample comprises brand data.

It can be understood that the embodiments of the present invention need to obtain the information of each sample and the access weight of the corresponding sample.

The embodiments of the present invention are directed at brand data, which includes user access data, such as the browsing data, click data, purchase data, and favorites data generated when users access the brand; the brand data of the embodiments is a collective term for such user access data. For example, the brand data "Hailan Home" includes the user data generated when a user browses the merchandise item "2015 summer new arrival Hailan Home genuine men's simple printed V-neck short-sleeve T-shirt HNTCJ2A101A" on a web page of the Tmall website. Likewise, "Adidas", "Nike", "iphone", and "Samsung" are all brand data of the corresponding merchandise items. For a piece of brand data, features of multiple dimensions can be obtained to construct its coordinates, for example the brand name, brand introduction, brand audience, and the price, visit count, and favorites count of the brand data under the brand, which yields multidimensional coordinates of the form A = {a1, a2, a3, ...}. In the embodiments of the present invention, for a piece of brand data, the web page data of the related merchandise of the brand shown on the network is collected and then analyzed to extract the corresponding features.

It can be understood that the initial parameter of each dimension feature can be of any character type, for example a number, a Chinese character, or a parameter of another type. In the embodiments of the present invention, initial parameters that are not numbers are converted to numbers, and the coordinates A = {a1, a2, a3, ...} of the sample are then constructed.

It can be understood that in the embodiments of the present invention every sample has the same number of dimensions. There is at least one dimension, and more can be configured as needed.

It can be understood that in practice, when a user accesses a website, the server hosting the website generates an access log based on the user's access. The access log records the various parameters of the access, such as title, price, transaction volume, favorites count, the number of times uv the sample was clicked, the number of times pv the sample was recently browsed, and the browsing time span from the reference instant. The information of the above samples can therefore all be obtained from statistics over the server's access logs.

The access weight of each sample, that is, the importance of the sample when it is accessed, can be counted at the same time.

Preferably, in the embodiments of the present invention, the access weight comprises:

(1) a browse weight for the sample being browsed on the network;

In the embodiments of the present invention, the browse weight can be set to the number of views pv of the sample on the network. Taking brand data in web pages as the sample, each time a web page showing a given piece of brand data is opened on any website, the brand data is considered to have been browsed once. Counting the number of times users on the network access each web page containing the brand data therefore yields the number of views pv of the brand data.

In the embodiments of the present invention, the number of views of the brand data can be obtained from the access log. In practice, each merchandise item shown in a web page belongs to one piece of brand data, so the number of views of the page of each merchandise item can be counted from the access log, and the view counts of all merchandise items under the brand are then summed to obtain the number of views of the brand data.

Of course, the browse weight pv can be obtained from statistics over the server's access logs.

In the embodiments of the present invention, the more times a sample has recently been browsed, the higher its weight: W = pv.
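The summation of per-item view counts into a brand-level view count described above can be sketched as follows. This is a minimal sketch under stated assumptions: the access log is assumed to be already reduced to one merchandise-item ID per page open, and `brand_page_views` and the `item_to_brand` mapping are hypothetical names, not part of the described system.

```python
from collections import Counter


def brand_page_views(access_log, item_to_brand):
    # access_log: iterable of merchandise-item IDs, one entry per page open.
    # item_to_brand: dict mapping each item ID to the brand it belongs to.
    pv = Counter()
    for item_id in access_log:
        brand = item_to_brand.get(item_id)
        if brand is not None:       # ignore items with no known brand
            pv[brand] += 1
    return pv
```

The resulting count per brand can be used directly as the browse weight W = pv.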

And/or (2) a click weight for the sample being clicked on the network;

In the embodiments of the present invention, the click weight can be set to the number of times the sample is clicked on the network. Taking brand data in web pages as an example, when a user opens a main page showing the merchandise items of a certain brand and then clicks the link of a specific merchandise item on that page, the brand data is considered to have been clicked once. Counting the clicks by users on the network on the links of the merchandise items under the brand data therefore yields the number of clicks uv of the brand data.

Of course, the click weight uv can be obtained from statistics over the server's access logs.

In the embodiments of the present invention, the more times a sample is clicked, the higher its weight: W = uv.

And/or (3) a time span weight for the interval between the moment the sample was most recently browsed and a reference instant;

In practice, the more recently the browsing behavior on the brand data occurred, the larger the weight. Equivalently, the longer the time span between the moment the brand data was most recently browsed and the reference instant, the larger the weight, where the reference instant is some moment in the past, for example 2000-01-01 00:00:00. For ease of calculation, the present invention takes the logarithm of the browsing time span, which keeps the values compact. For example, with reference instant t(0) = (2000-01-01 00:00:00) and the time the sample was most recently browsed t(A) = (2015-04-13 07:30:30), the result of t(A) - t(0) is expressed in seconds and its logarithm is taken as the access weight, for example W = log(t(A) - t(0)).

In the embodiments of the present invention, the time at which the browsing behavior on the brand data occurred can be obtained by recording the moment at which one or several merchandise items under the brand data were browsed. Alternatively, based on the main page of a given piece of brand data, the moment at which that main page was most recently accessed can be recorded, for example the main page of the "Apple" shop on Tmall.
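The formula W = log(t(A) - t(0)) from the example above can be evaluated as in the following sketch. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the function name `time_span_weight` is hypothetical.

```python
import math
from datetime import datetime

REFERENCE_INSTANT = datetime(2000, 1, 1, 0, 0, 0)   # t(0) from the example


def time_span_weight(last_viewed, reference=REFERENCE_INSTANT):
    # Time span in seconds between the most recent view and the reference instant.
    seconds = (last_viewed - reference).total_seconds()
    if seconds <= 0:
        raise ValueError("last_viewed must be later than the reference instant")
    # Natural logarithm is an assumption; the source only writes "log".
    return math.log(seconds)


w = time_span_weight(datetime(2015, 4, 13, 7, 30, 30))   # t(A) from the example
```

For the example timestamps the span is 482,225,430 seconds, so the weight is roughly 20 under the natural logarithm. Because the logarithm grows slowly, two samples viewed days apart receive nearly equal weights.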

And/or (4) a website weight of the website on which the sample is located;

In practice, different websites differ in access frequency, reputation, and importance, so the corresponding website weights differ. For example, if a piece of brand data has an online shop on the Tmall website, www.tmall.com, it is given a high weight; if it has an online shop only on the Taobao website, www.taobao.com, it is given a low weight.

For example, for the website weight W of the website on which a sample is located, W is assigned an access weight of 0.8 if there is an online shop on the aforementioned Tmall website, and 0.6 if there is a shop only on the Taobao website. Likewise, brand data found only on the JD.com website may be given an access weight of 0.7, and brand data found on Tmall, JD.com, Taobao, and Amazon an access weight of 1.1.

It can be understood that the embodiments of the present invention can preset a website weight table that assigns an access weight to each website and to each combination of websites. Then, based on the identity information of the sample, other websites can be queried for the corresponding sample, and each website that has the sample is recorded. Once all websites on which the sample is located have been obtained, the website weight of the sample is looked up in the website weight table.
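A website weight table keyed by website combination can be sketched as follows, using the example values given above. This is a minimal sketch: the site keys, the name `website_weight`, the exact-combination matching, and the default weight for combinations missing from the table are all assumptions, since the text does not specify how unlisted combinations are handled.

```python
# Example values from the text; keys are the sets of websites carrying the sample.
WEBSITE_WEIGHT_TABLE = {
    frozenset({"tmall"}): 0.8,
    frozenset({"taobao"}): 0.6,
    frozenset({"jd"}): 0.7,
    frozenset({"tmall", "jd", "taobao", "amazon"}): 1.1,
}


def website_weight(sites, table=WEBSITE_WEIGHT_TABLE, default=0.0):
    # Look up the weight for the exact combination of websites.
    # The default for unlisted combinations is an assumption.
    return table.get(frozenset(sites), default)
```

Using `frozenset` keys makes the lookup independent of the order in which the websites were recorded.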

And/or (5) a region weight of the region in which the user's browsing behavior occurred when the sample was most recently browsed.

In practice, when users access a sample from their terminals, they do so from different regions, for example from Hubei or from Beijing.

In the embodiments of the present invention, a region weight table can be set that assigns different weights to different regions. For example, first-tier cities have a high weight such as 1, second-tier cities come next with a weight such as 0.8, and third-tier cities follow with a weight such as 0.6, so that the access weights decrease in turn. Of course, region weights can also be set by other rules.

In practice, the access region of the brand data can be obtained from the visit records of the merchandise items under the brand data.

Of course, the frequency of access behavior on the brand data in each region can be counted to determine the final region weight of the user's browsing behavior. For example, if first-tier cities account for 90% of the access frequency of the brand data and the remaining 10% is spread over cities of other tiers, the region of browsing behavior for the brand data is determined to be first-tier cities, and the region weight is that of first-tier cities. As another example, if first-tier cities account for 50% of the access frequency, second-tier cities for 40%, and cities of other tiers for the remaining 10%, the region of browsing behavior is determined to be first-tier and second-tier cities, and the weights of first-tier and second-tier cities can be merged in proportion into the region weight of the brand data. For this example, the merged region weight is (1 × 0.5 + 0.8 × 0.4) / (0.5 + 0.4).

Thus, in the embodiments of the present invention, for each sample, the region in which the user was located when accessing the sample can be recorded, and its region weight is then determined from the region weight table. In practice, for a mobile terminal, the region in which the user accessed the sample can be determined by GPS (Global Positioning System) and/or the IP address; for a personal computer, it can be determined by the IP address. Of course, the embodiments of the present application do not limit this.
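The proportional merge of region weights in the example above can be sketched as follows. The tier weights are the example values from the text; the function name `merged_region_weight` and the way the dominant regions are passed in explicitly are assumptions, since the text does not state how the dominant regions are selected.

```python
# Example region weight table from the text.
REGION_WEIGHT_TABLE = {"tier1": 1.0, "tier2": 0.8, "tier3": 0.6}


def merged_region_weight(access_share, dominant_regions, table=REGION_WEIGHT_TABLE):
    # access_share: fraction of the brand's accesses coming from each region.
    # dominant_regions: the regions treated as the browsing-behavior region
    # (chosen by the caller; the selection rule is not specified in the text).
    total_share = sum(access_share[r] for r in dominant_regions)
    if total_share <= 0:
        raise ValueError("dominant regions must have a positive access share")
    weighted = sum(table[r] * access_share[r] for r in dominant_regions)
    return weighted / total_share


shares = {"tier1": 0.5, "tier2": 0.4, "tier3": 0.1}
weight = merged_region_weight(shares, ["tier1", "tier2"])   # (1*0.5 + 0.8*0.4) / 0.9
```

For the 50% / 40% example this evaluates to 0.82 / 0.9, or about 0.91, which lies between the first-tier and second-tier weights and closer to the tier with the larger share.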

And/or (6) a search weight for the sample being searched for.

Each sample has keywords. The present invention can collect the number of times the sample is searched for by its keywords in each search engine and/or in each website, and use that number of searches as the search weight. The larger the number of searches, the higher the search weight of the sample.

Sub-step S112: for each sample, obtain the corresponding access weight according to the service identifier.

In the embodiments of the present invention, a table mapping service identifiers to the various access weights can be preset. For the access weights described above, for example, identifier p1 corresponds to the browse weight, identifier p2 to the click weight, identifier p3 to the time span weight between the most recent browsing moment and the reference instant, identifier p4 to the website weight of the website on which the sample is located, and identifier p5 to the region weight of the region in which the user's browsing behavior occurred when the sample was most recently browsed. Alternatively, identifier p1 may correspond to the browse weight together with the click weight and the time span weight, and identifier p2 to the website weight together with the region weight. The embodiments of the present invention can set up this correspondence according to business requirements.

Thus, each time clustering is performed, a clustering task can be received, and a service identifier can be set in that task. When the initial samples are obtained, since the access weight of each sample is needed, the access weight matching the business requirement is obtained according to the service identifier.

It can be understood that in the embodiments of the present invention, the various types of access weight can be calculated by the server in advance, so that the existing access weights are mapped directly to the service identifiers. They can also be obtained in real time: after the service identifier is received, the server is notified to calculate the access weight of each sample. The present invention does not limit this.
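The one-to-one mapping from service identifiers p1 to p5 to access weight types can be sketched as a lookup table. This is a minimal sketch assuming precomputed weights stored per sample in a dictionary; the field names and the function `access_weight_for` are hypothetical, and the alternative mapping in which one identifier covers several weight types is not shown because the text does not say how those weights would be combined.

```python
# Mapping from service identifier to the weight type it selects (first example in the text).
SERVICE_WEIGHT_FIELDS = {
    "p1": "browse",
    "p2": "click",
    "p3": "time_span",
    "p4": "website",
    "p5": "region",
}


def access_weight_for(service_id, sample_weights):
    # sample_weights: precomputed weights of one sample, keyed by weight type.
    if service_id not in SERVICE_WEIGHT_FIELDS:
        raise KeyError(f"unknown service identifier: {service_id}")
    return sample_weights[SERVICE_WEIGHT_FIELDS[service_id]]
```

A clustering task carrying the identifier p2, for instance, would then cluster on click weights without any change to the clustering code itself.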

In the embodiments of the present invention, the obtained samples and their access weights are the basic data for calculation. Once these data are complete, step 120 can be entered.

Step 120: classify each sample as one classification object, use the coordinates of the corresponding sample as the center coordinates of the classification object, and use the access weight of the corresponding sample as the access weight of the classification object.

In the embodiments of the present invention, the first layer of classification objects is generated from the samples and their access weights, which serve as the basic data. The subsequent clustering process is then executed iteratively on the basis of the first layer of classification objects.

When a first-layer classification object is generated, the coordinates of the sample are used as the center coordinates of the classification object and the access weight of the sample as the access weight of the classification object, so that the object can take part in subsequent calculations.

For example, the aforementioned sample A has coordinates A = {a1, a2, a3}, and its access weight is its number of views, that is, W(A) = pv(A). Assigning A = {a1, a2, a3} to the corresponding classification object A_{classification} gives A_{classification} = {a1, a2, a3}; assigning W(A) = pv(A) to A_{classification} gives W(A_{classification}) = pv(A).

Step 130: cluster the classification objects according to the access weight and center coordinates of each classification object, to obtain brand classifications each comprising at least one piece of brand data.

The classification objects can then be clustered according to their access weights and center coordinates, which yields the brand classifications, each comprising at least one piece of brand data. For example, "Adidas", "Nike", and "Li Ning" can be grouped into one brand classification.

Further, when the brand data includes user access data, the method also comprises: labeling each user identifier in the user access data with the classification label of the brand classification it belongs to.

Because classification is performed directly on brand data based on user access data, after classification it is known which class each user identifier belongs to, so the corresponding user can be labeled accordingly. When the server pushes advertisement data corresponding to a piece of brand data, it can look up the user identifiers marked with the corresponding brand data label and push the advertisement data to the account or terminal of those user identifiers.

In the embodiments of the present invention, the access weight of each piece of brand data is obtained; the access weight represents the importance of the brand data when it is accessed. The brand data and access weights are then clustered as samples. During clustering, brand data with a high access weight contributes more, so that the center of a classification object shifts toward the side with the higher access weight. The final brand data clustering result is therefore more accurate, which reduces both the amount of computation in subsequent processing and the deviation of its results.

Embodiment two

Referring to Fig. 2, a schematic flowchart of an object clustering method of the present invention is shown, which may specifically comprise:

Step 210: obtain samples to be clustered and the access weight of each sample. The access weight is the importance of the sample when it is accessed; the sample comprises brand data.

In the embodiments of the present invention, the access weight is the importance of the sample when it is accessed, and comprises:

a browse weight for the sample being browsed on the network;

and/or a click weight for the sample being clicked on the network;

and/or a website weight of the website on which the sample is located;

and/or a search weight for the sample being searched for.

These weights have been described in Embodiment one and are not repeated here.

Sub-step S212: for each sample, obtain the corresponding access weight according to the service identifier.

Of course, in the embodiments of the present invention, when the samples are obtained in step 210, the coordinates of the samples can also be obtained, for example A = {a1, a2, a3} and B = {b1, b2, b3}.

Step 220: classify each sample as one classification object, take the coordinates of the corresponding sample as the center coordinates of the classification object, and take the access weight of the corresponding sample as the access weight of the classification object.

That is, each sample is classified as one classification object, giving the first layer of classification objects. The coordinates of the sample serve as the center coordinates of the classification object and the access weight of the sample serves as the access weight of the classification object, so that both can take part in the subsequent operations.

For example, sample A has coordinates A={a1, a2, a3} and its access weight is its number of views, W(A)=pv(A). Assigning A={a1, a2, a3} to the corresponding classification object A_{classification} gives A_{classification}={a1, a2, a3}; assigning W(A)=pv(A) to A_{classification} gives W(A_{classification})=pv(A).

Sample B has coordinates B={b1, b2, b3} and its access weight is its number of views, W(B)=pv(B). Assigning B={b1, b2, b3} to the corresponding classification object B_{classification} gives B_{classification}={b1, b2, b3}; assigning W(B)=pv(B) to B_{classification} gives W(B_{classification})=pv(B).
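The initialization in step 220 can be sketched as follows. This is only an illustrative sketch: the names `ClassificationObject` and `init_objects`, the `members` field and the sample values are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass

@dataclass
class ClassificationObject:
    center: tuple    # center coordinates, copied from the sample coordinates
    weight: float    # access weight, e.g. the number of views pv
    members: tuple   # names of the samples contained (hypothetical bookkeeping field)

def init_objects(samples):
    """Turn each sample into one first-layer classification object.

    samples maps a sample name to (coordinates, access weight).
    """
    return [
        ClassificationObject(tuple(coords), float(weight), (name,))
        for name, (coords, weight) in samples.items()
    ]

# Made-up sample values for illustration only.
samples = {
    "A": ((1.0, 0.0, 2.0), 300),
    "B": ((0.5, 1.0, 2.0), 100),
}
first_layer = init_objects(samples)
```

Each first-layer object simply carries over the coordinates and the access weight of its sample, so both are available to the later distance and merge steps.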

Step 230: for each classification object, calculate the distance between every two classification objects according to the center coordinates of each classification object.

In this embodiment of the present invention, for first-layer classification objects such as the aforementioned A_{classification}={a1, a2, a3} and B_{classification}={b1, b2, b3}, the distance between the two sets of coordinates can be calculated, which gives the distance between the two classification objects.

If there are N classification objects, calculating the pairwise distances yields an N×N distance matrix I_{N×N}, in which the row and the column of each value correspond to two classification objects. By finding the minimum value in the distance matrix I_{N×N}, the two nearest classification objects can be identified from the classification objects corresponding to its row and its column.
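The distance matrix and the search for the nearest pair can be sketched as below. The sketch assumes the plain two-point (Euclidean) distance, which the embodiment mentions as one permitted option, and it skips the diagonal so that an object is never paired with itself; the function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def distance_matrix(centers, dist=euclidean):
    """Build the N x N matrix of pairwise distances between center coordinates."""
    n = len(centers)
    return [[dist(centers[i], centers[j]) for j in range(n)] for i in range(n)]

def nearest_pair(matrix):
    """Return (distance, row, column) of the smallest off-diagonal entry."""
    best = None
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):   # upper triangle only: skips i == j
            if best is None or matrix[i][j] < best[0]:
                best = (matrix[i][j], i, j)
    return best

centers = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 1.0, 0.0)]
matrix = distance_matrix(centers)
d, i, j = nearest_pair(matrix)   # objects 0 and 2 are nearest in this example
```

Only the upper triangle is scanned because the matrix is symmetric and its diagonal entries are all zero.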

Sub-step S232: for each classification object, build a center vector from its center coordinates.

In this embodiment of the present invention, each classification object has center coordinates, but center coordinates are conceptually not a vector, so the present invention converts the center coordinates into a vector. For convenience, this embodiment builds the vector directly with the origin as the reference. Taking the aforementioned three-dimensional coordinates {a1, a2, a3} as an example, subtracting the origin coordinates in each dimension gives the vector A={a1, a2, a3}. Of course, in practical applications the coordinate values can be taken directly as the vector values.

Sub-step S234: calculate the cosine distance between the two center vectors corresponding to every two classification objects.

For example, for vectors A={a1, a2, a3} and B={b1, b2, b3}, the cosine distance between them is

\frac{A \cdot B}{| A | \cdot | B |} = \frac{a_1 b_1 + a_2 b_2 + a_3 b_3}{\sqrt{a_1^{2} + a_2^{2} + a_3^{2}} \cdot \sqrt{b_1^{2} + b_2^{2} + b_3^{2}}}

This cosine distance can then be written into the distance matrix I_{N×N} at the position corresponding to the pair of classification objects. Of course, the cosine distance for vectors of other dimensions can be obtained by analogy.

It can be understood that the distance between two classification objects may also be calculated in other ways, for example directly with the formula for the distance between two points. The present invention does not limit this.
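The cosine computation of sub-step S234 can be sketched as follows. The function `cosine` evaluates the standard cosine of the angle between two center vectors. The helper `cosine_distance`, which returns one minus that value so that smaller numbers mean closer objects, is an assumption added here to fit the minimum search over the distance matrix; the embodiment does not spell out this conversion.

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b (dot product over norm product)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Assumed conversion: 0 for identical directions, 1 for orthogonal vectors."""
    return 1.0 - cosine(a, b)

same_direction = cosine((1.0, 2.0, 2.0), (2.0, 4.0, 4.0))
orthogonal = cosine((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Because the cosine depends only on direction, two center vectors that differ only in length are treated as coincident; the two-point distance behaves differently in that case.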

Step 240: aggregate the two nearest classification objects into one new classification object, and calculate the center coordinates and access weight of the new classification object according to the center coordinates and access weight of each of the two classification objects.

The two nearest classification objects are the two classification objects corresponding to the minimum value in the aforementioned distance matrix I_{N×N}. They can be merged into one new classification object. Take, for example, the aforementioned classification object A_{classification}={a1, a2, a3} with access weight W(A_{classification})=pv(A), and classification object B_{classification}={b1, b2, b3} with access weight W(B_{classification})=pv(B).

A_{classification} and B_{classification} can be aggregated into a new classification object AB_{classification}. The center coordinates of AB_{classification} are calculated from the respective center coordinates and access weights of A_{classification} and B_{classification}, and the access weight of AB_{classification} is calculated from their respective access weights.

The center coordinates of the new classification object can be calculated in various ways. For the aforementioned AB_{classification}, the center coordinates can be calculated according to the following formula:

AB_{classification} = {

(a1*W(A_{classification}) + b1*W(B_{classification})) / (W(A_{classification}) + W(B_{classification})),

(a2*W(A_{classification}) + b2*W(B_{classification})) / (W(A_{classification}) + W(B_{classification})),

(a3*W(A_{classification}) + b3*W(B_{classification})) / (W(A_{classification}) + W(B_{classification}))

}

(formula 1)

The access weight of AB_{classification} can also be calculated in various ways, for example W(AB_{classification}) = W(A_{classification}) + W(B_{classification}).
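A minimal sketch of this merge, using the weighted center of formula 1 and the summed access weight, might look as follows. The function name `merge` and the numeric values are illustrative only.

```python
def merge(center_a, weight_a, center_b, weight_b):
    """Merge two classification objects.

    The new center is the access-weight-weighted average of the two centers
    (formula 1); the new access weight is the sum of the two weights.
    """
    total = weight_a + weight_b
    if total <= 0:
        raise ValueError("the access weights must sum to a positive value")
    center = tuple(
        (a * weight_a + b * weight_b) / total
        for a, b in zip(center_a, center_b)
    )
    return center, total

# A has three times the access weight of B, so the new center lies closer to A.
center_ab, weight_ab = merge((0.0, 0.0, 3.0), 300, (4.0, 8.0, 3.0), 100)
```

In this example the merged center is one quarter of the way from A to B, whereas an unweighted midpoint would lie halfway between them.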

Preferably, the step of calculating the center coordinates of the new classification object according to the center coordinates of each classification object comprises:

Sub-step S236: according to the service identifier, call the corresponding coordinate calculation function to calculate the center coordinates of the new classification object.

It can be understood that the service identifier of the current clustering task can be obtained before step 210 of this embodiment is executed. For the center coordinates of the new classification object, different calculation methods Y=f(M, N, W(M), W(N)) can be adopted according to service needs, where M and N denote the respective coordinates of the two nearest classification objects, W(M) and W(N) denote their respective access weights, and f is the calculation function. For the aforementioned formula 1, for example, the center coordinates are

Y = f(M, N, W(M), W(N)) = {

(m1*W(M) + n1*W(N)) / (W(M) + W(N)),

(m2*W(M) + n2*W(N)) / (W(M) + W(N)),

(m3*W(M) + n3*W(N)) / (W(M) + W(N))

}

Here (m1, m2, m3) are the center coordinates of classification object M and (n1, n2, n3) are the center coordinates of classification object N. In the calculation, the center coordinates and access weights of A_{classification} and B_{classification} are substituted into the above f(M, N, W(M), W(N)).

Of course, f(M, N, W(M), W(N)) may also be set to other calculation functions as actually needed; the embodiment of the present invention does not limit this.
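One way to picture sub-step S236 is a lookup table from service identifier to coordinate calculation function. The sketch below is an assumption about how such a dispatch could be organized; the identifiers "weighted" and "midpoint" and all function names are hypothetical, since the embodiment does not name concrete service identifiers.

```python
def weighted_center(m, n, w_m, w_n):
    """Formula 1: average of the two centers weighted by access weight."""
    total = w_m + w_n
    return tuple((x * w_m + y * w_n) / total for x, y in zip(m, n))

def midpoint_center(m, n, w_m, w_n):
    """Conventional alternative that ignores the access weights."""
    return tuple((x + y) / 2 for x, y in zip(m, n))

# Hypothetical registry: service identifier -> f(M, N, W(M), W(N)).
COORDINATE_FUNCTIONS = {
    "weighted": weighted_center,
    "midpoint": midpoint_center,
}

def new_center(service_id, m, n, w_m, w_n):
    """Select the coordinate function by service identifier and apply it."""
    try:
        f = COORDINATE_FUNCTIONS[service_id]
    except KeyError:
        raise ValueError(f"no coordinate function for service {service_id!r}")
    return f(m, n, w_m, w_n)

by_weight = new_center("weighted", (0.0, 0.0), (4.0, 8.0), 3, 1)
by_midpoint = new_center("midpoint", (0.0, 0.0), (4.0, 8.0), 3, 1)
```

All functions share the signature f(M, N, W(M), W(N)), so a new service can register its own function without changing the clustering loop.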

Step 250: judge whether the aggregation termination condition is reached; if it is not reached, return the new classification object together with the classification objects not aggregated in the current pass to step 230; if it is reached, end the aggregation process.

In practical applications, after the aggregation in step 240, it can be judged whether the clustering process has reached the aggregation termination condition, for example that the distance between the two nearest classification objects is greater than a threshold, or that everything has finally been aggregated into a single classification object.

If the aggregation process has reached the aggregation termination condition, the aggregation process ends.

If the aggregation process has not reached the aggregation termination condition, the new classification object and the classification objects not aggregated in the current pass are returned together to step 230.

For example, the aforementioned first layer of classification objects comprises A_{classification}, B_{classification}, C_{classification}, D_{classification}, E_{classification}, F_{classification} and G_{classification}. If, as described above, A_{classification} and B_{classification} are merged into AB_{classification}, then the second-layer classification objects AB_{classification}, C_{classification}, D_{classification}, E_{classification}, F_{classification} and G_{classification} are returned to step 230, and the pairwise distances among them are calculated.

Suppose that among AB_{classification}, C_{classification}, D_{classification}, E_{classification}, F_{classification} and G_{classification}, C_{classification} and D_{classification} are merged into one classification object CD_{classification}. If the aggregation termination condition is not reached, AB_{classification}, CD_{classification}, E_{classification}, F_{classification} and G_{classification} are returned to step 230. This cycle continues until the aggregation termination condition is reached.

It can be understood that in the first aggregation pass there are N classification objects to calculate and the distance matrix is I_{N×N}; in the second pass there are N-1 classification objects and the distance matrix is I_{(N-1)×(N-1)}; in the third pass there are N-2 classification objects and the distance matrix is I_{(N-2)×(N-2)}; and so on, until the aggregation ends. The final classification objects are the brand classifications.
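Steps 230 to 250 together form an agglomerative loop, sketched below under stated assumptions: the two-point distance is used, the threshold is checked before the nearest pair is merged (one possible reading of the termination condition), and classification objects are plain (center, weight, members) tuples. The function names and sample values are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def cluster(objects, threshold):
    """Repeatedly merge the two nearest classification objects.

    objects: list of (center, access_weight, members) tuples.
    Stops when one object is left or the nearest pair is farther apart
    than threshold.
    """
    objects = list(objects)
    while len(objects) > 1:
        # Step 230: pairwise distances, keeping the nearest pair.
        best = None
        for i in range(len(objects)):
            for j in range(i + 1, len(objects)):
                d = euclidean(objects[i][0], objects[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:          # aggregation termination condition
            break
        # Step 240: weighted center (formula 1) and summed access weight.
        (ca, wa, ma), (cb, wb, mb) = objects[i], objects[j]
        total = wa + wb
        center = tuple((a * wa + b * wb) / total for a, b in zip(ca, cb))
        merged = (center, total, ma + mb)
        # Step 250: new object plus the objects not merged in this pass.
        objects = [o for k, o in enumerate(objects) if k not in (i, j)]
        objects.append(merged)
    return objects

first_layer = [
    ((0.0, 0.0), 100, ("A",)),
    ((1.0, 0.0), 300, ("B",)),
    ((10.0, 10.0), 50, ("C",)),
    ((10.0, 11.0), 50, ("D",)),
]
result = cluster(first_layer, threshold=2.0)
```

With these values A and B merge first, then C and D, and the loop stops because the two remaining objects are farther apart than the threshold. Each pass shrinks the object list by one, matching the sequence of distance matrices from N×N down to (N-1)×(N-1) and beyond.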

In this embodiment of the present invention, an access weight is obtained for each sample; the access weight represents the importance of the sample when it is accessed. When the samples and their access weights are then used in the clustering process, samples with a high access weight participate more strongly, so that the center of a classification object is biased toward the side with the higher access weight. The inventors found that the higher the access weight is, the closer the center point of the newly merged classification object lies to those samples in the classification object that have a high access frequency, click frequency, search count and the like. The cluster center offset is therefore smaller. The final clustering result is thus more accurate, which reduces the amount of calculation in subsequent processing and also reduces the deviation of the subsequent processing results.

The conventional method of calculating the center coordinates when merging into a new classification object is, for classification object A=[a1, a2, a3] and classification object B=[b1, b2, b3], the formula AB=[(a1+b1)/2, (a2+b2)/2, (a3+b3)/2]. Compared with this method, the cluster center offset during iteration is smaller in the embodiment of the present invention, the clustering effect is more concentrated, and the number of clusters (classifications) can be greatly reduced.

Embodiment three

Referring to Fig. 3, a schematic flowchart of an object clustering method of the present invention is shown, which may specifically comprise:

Step 310: obtain the samples to be clustered and the access weight of each sample; the access weight is the importance of the sample when it is accessed.

In this embodiment of the present invention, the samples to be clustered may be samples from a particular field.

The access weight comprises:

the browse weight with which the sample is browsed in the network;

and/or the click weight with which the sample is clicked in the network;

and/or the website weight of the website where the sample is located;

and/or the search weight with which the sample is searched for.

Step 320: classify each sample as one classification object, take the coordinates of the corresponding sample as the center coordinates of the classification object, and take the access weight of the corresponding sample as the access weight of the classification object.

Step 330: for each classification object, calculate the distance between every two classification objects according to the center coordinates of each classification object.

Step 340: aggregate the two nearest classification objects into one new classification object, and calculate the center coordinates and access weight of the new classification object according to the center coordinates and access weight of each of the two classification objects.

Step 350: judge whether the aggregation termination condition is reached; if it is not reached, return the new classification object together with the classification objects not aggregated in the current pass to step 330; if it is reached, end the aggregation process.

In practical applications, after the aggregation in step 340, it can be judged whether the clustering process has reached the aggregation termination condition, for example that the distance between the two nearest classification objects is greater than a threshold, or that everything has finally been aggregated into a single classification object.

If the aggregation process has reached the aggregation termination condition, step 360 is entered.

If the aggregation process has not reached the aggregation termination condition, the new classification object and the classification objects not aggregated in the current pass are returned together to step 330. The final classification objects are the brand classifications.

Step 360: for each user, label the user with a brand classification label according to the user's access behavior data for the brand data in each classification.

In this embodiment of the present invention, the aforementioned brand data are aggregated into a smaller number of brand classifications, so after clustering ends a brand classification label can be set for each brand classification; each brand classification label corresponds to the pieces of brand data in the corresponding brand classification.

In the network, users may access various brand data, and the present invention can record the user IDs that access each piece of brand data. After clustering ends, each user ID is associated, through the brand data, with the brand classification under which that brand data falls. The frequency of each user's access behavior under each brand classification can therefore be counted. For example, suppose brand classification 1 has brand classification label I and contains brand data A, brand data B, brand data C and brand data D. User Q has accessed brand data A 100 times and brand data B 50 times, so user Q has accessed I 150 times. By analogy, the access frequency of each user for each brand classification can be obtained. Different brand classifications have different brand classification labels. The frequency with which user Q accesses each brand classification can thus be counted, and for each brand classification whose access frequency is greater than a threshold, the label of that brand classification is given to user Q. For example, if the threshold is 120 times, the number of times user Q has accessed brand classification 1 is greater than the threshold, so user Q is labeled I.
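The labeling rule of step 360 can be sketched with the numbers from the example above. The function name `label_users` and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict

def label_users(access_counts, brand_to_label, threshold):
    """Give each user the labels of the brand classifications accessed often enough.

    access_counts: {(user_id, brand): number of accesses}
    brand_to_label: {brand: brand classification label}
    A label is assigned when the summed count is strictly greater than threshold.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for (user, brand), count in access_counts.items():
        label = brand_to_label.get(brand)
        if label is not None:            # ignore brands outside every classification
            totals[user][label] += count
    return {
        user: {label for label, count in per_label.items() if count > threshold}
        for user, per_label in totals.items()
    }

# Brand classification 1 carries label I and contains brand data A, B, C and D.
brand_to_label = {"A": "I", "B": "I", "C": "I", "D": "I"}
# User Q accessed A 100 times and B 50 times, 150 accesses of I in total.
access_counts = {("Q", "A"): 100, ("Q", "B"): 50}
labels = label_users(access_counts, brand_to_label, threshold=120)
```

Since 150 is greater than the threshold of 120, user Q receives label I, as in the example.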

Of course, in this embodiment of the present invention, the access behavior data of each user for the brand data in each classification can also be obtained statistically from access logs. The access behavior may include browsing, clicking, favoriting, purchasing and the like.

Step 370: according to the brand classification label of the user, send the third object corresponding to the label to the terminal of the user; the third object comprises ad data for the brand data.

After the user IDs have been labeled, when a user subsequently accesses the network, the server can actively push the third object related to the label for the user to browse. The third object comprises ad data related to the samples.

The third object comprises ad data for the brand data. When the server needs to deliver ad data to users, in order to target the delivery, it looks up the brand data to which the ad data belongs, then the brand classification label to which that brand data belongs, and then pushes the ad data for display to the terminals of the user IDs carrying that brand classification label.

For example, user Q has been labeled with the aforementioned brand classification label I, which corresponds to brand data A, brand data B, brand data C and brand data D. When a third object to be pushed is received, such as ad data for brand data C, this embodiment looks up the brand classification label I corresponding to brand data C and then, from the user Q associated with label I, finds the terminal of user Q. When user Q uses that terminal, the ad data for brand data C is pushed to it.

In the above process, even if user Q has never accessed brand data C, the aforementioned classification has placed brand data C in the same class as brand data A and brand data B, which user Q often accesses, so the ad data for brand data C can also be pushed to the terminal of user Q according to this correspondence.
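The lookup chain of step 370 (ad data, brand data, brand classification label, labeled users) can be sketched as below. The function name `users_for_ad` and the data layout are hypothetical, and the actual push to a terminal is left out.

```python
def users_for_ad(ad_brand, brand_to_label, user_labels):
    """Return the user IDs that should receive ad data for ad_brand.

    ad_brand: the brand data the ad belongs to
    brand_to_label: {brand: brand classification label}
    user_labels: {user_id: set of brand classification labels}
    """
    label = brand_to_label.get(ad_brand)
    if label is None:                    # brand not in any classification
        return []
    return sorted(user for user, labels in user_labels.items() if label in labels)

brand_to_label = {"A": "I", "B": "I", "C": "I", "D": "I"}
user_labels = {"Q": {"I"}, "R": set()}

# Q never accessed brand C, but C shares label I with the brands Q accesses often.
targets = users_for_ad("C", brand_to_label, user_labels)
```

The ad for brand data C reaches user Q purely through the shared label, which is the effect described in the paragraph above.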

In this embodiment of the present invention, an access weight is obtained for each sample; the access weight represents the importance of the sample when it is accessed. When the samples and their access weights are then used in the clustering process, samples with a high access weight participate more strongly, so that the center of a classification object is biased toward the side with the higher access weight. The inventors found that the higher the access weight is, the closer the center point of the newly merged classification object lies to those samples in the classification object that have a high access frequency, click frequency, search count and the like. The cluster center offset is therefore smaller. The final clustering result is thus more accurate, so that subsequent ad data delivery is more precise, the amount of calculation on the server is reduced, and fewer system resources are occupied.

Embodiment four

Referring to Fig. 4, a schematic structural diagram of an object clustering device of the present invention is shown, which may specifically comprise:

an initial object acquisition module 410, adapted to obtain the samples to be clustered and the access weight of each sample; the access weight is the importance of the sample when it is accessed; the samples comprise brand data;

a division module 420, adapted to classify each sample as one classification object, take the coordinates of the corresponding sample as the center coordinates of the classification object, and take the access weight of the corresponding sample as the access weight of the classification object;

a cluster module 430, adapted to cluster the classification objects according to the access weight and center coordinates of each classification object, obtaining brand classifications each comprising at least one piece of brand data.

Preferably, the access weight comprises:

the browse weight with which the sample is browsed in the network;

and/or the click weight with which the sample is clicked in the network;

and/or the website weight of the website where the sample is located;

and/or the search weight with which the sample is searched for.

Preferably, the distance calculation module comprises:

Preferably, the initial object acquisition module comprises:

Preferably, the first aggregation module comprises:

Embodiment five

Referring to Fig. 5, a schematic structural diagram of an object clustering device of the present invention is shown, which may specifically comprise:

an initial object acquisition module 510, adapted to obtain the samples to be clustered and the access weight of each sample; the access weight is the importance of the sample when it is accessed; the samples comprise brand data;

a division module 520, adapted to classify each sample as one classification object, take the coordinates of the corresponding sample as the center coordinates of the classification object, and take the access weight of the corresponding sample as the access weight of the classification object;

a cluster module 530, specifically comprising:

a distance calculation module 531, adapted to calculate, for each classification object, the distance between every two classification objects according to the center coordinates of each classification object;

an aggregation module 532, adapted to aggregate the two nearest classification objects into one new classification object, and to calculate the center coordinates and access weight of the new classification object according to the center coordinates and access weight of each of the two classification objects;

a judgment module 533, adapted to judge whether the aggregation termination condition is reached; if it is not reached, the new classification object and the classification objects not aggregated in the current pass are returned together to the distance calculation module 531, until the aggregation termination condition is reached and brand classifications each comprising at least one piece of brand data are obtained.

Embodiment six

Referring to Fig. 6, a schematic structural diagram of an object clustering device of the present invention is shown, which may specifically comprise:

an initial object acquisition module 610, adapted to obtain the samples to be clustered and the access weight of each sample; the access weight is the importance of the sample when it is accessed; the samples comprise brand data;

a division module 620, adapted to classify each sample as one classification object, take the coordinates of the corresponding sample as the center coordinates of the classification object, and take the access weight of the corresponding sample as the access weight of the classification object;

a cluster module 630, specifically comprising:

a distance calculation module 631, adapted to calculate, for each classification object, the distance between every two classification objects according to the center coordinates of each classification object;

an aggregation module 632, adapted to aggregate the two nearest classification objects into one new classification object, and to calculate the center coordinates and access weight of the new classification object according to the center coordinates and access weight of each of the two classification objects;

a judgment module 633, adapted to judge whether the aggregation termination condition is reached; if it is not reached, the new classification object and the classification objects not aggregated in the current pass are returned together to the distance calculation module 631, until the aggregation termination condition is reached and brand classifications each comprising at least one piece of brand data are obtained;

a labeling module 634, adapted to label each user with a brand classification label according to the user's access behavior data for the brand data in each classification;

an object sending module 635, adapted to send, according to the brand classification label of the user, the third object corresponding to the label to the terminal of the user, the third object comprising ad data for the brand data.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages may be used to implement the content of the present invention described herein, and that the above description of a specific language is made to disclose the best mode of the present invention.

Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.

In addition, those skilled in the art will understand that although some embodiments described herein include some features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

The component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the object clustering device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.

The invention discloses A1, a kind of clustering objects method, comprising:

A2, method as described in A1, the described access weight according to each object of classification and centre coordinate, comprise the step that each object of classification carries out cluster:

A3, method as described in A1 or A2, described access weight comprises:

Described sample is viewed in a network browses weight;

And/or, the click weight that described sample is clicked in a network;

And/or, the weight of website of website, described sample place;

And/or, search weight when described sample place is searched.

A4, method as described in A2, described for each object of classification, according to the centre coordinate of each object of classification, calculate the step of the distance between every two object of classification, comprising:

A5, method as described in A3, the step of each sample that described acquisition is initial, comprising:

A6, method as described in A1 or A2, according to the access weight of each object of classification and centre coordinate, after each object of classification being carried out the step of cluster, also comprise:

A7, method as described in A6, for each user, according to user to the access behavioral data of user described in each to the branding data in all kinds of, after stamping brand tag along sort step, also comprise described user:

A8, method as described in A2, the described centre coordinate according to each object of classification and access weight, the centre coordinate calculating described new object of classification comprises:

The invention also discloses B9, a kind of clustering objects device, comprising:

B10, device as described in B9, described cluster module comprises:

B11, device as described in B9 or 10, described access weight comprises:

Described sample is viewed in a network browses weight;

And/or, the click weight that described sample is clicked in a network;

And/or, the weight of website of website, described sample place;

And/or, search weight when described sample place is searched.

B12, device as described in B10, described distance calculation module comprises:

B13, device as described in B11, described initial object acquisition module comprises:

B14, device as described in A10 or B11, also comprise:

B15. The apparatus according to B14, further comprising:

B16. The apparatus according to B10, wherein the first aggregation module comprises:

Claims

1. An object clustering method, comprising:

2. The method according to claim 1, characterized in that the step of clustering the classification objects according to the access weight and center coordinates of each classification object comprises:

3. The method according to claim 1 or 2, characterized in that the access weight comprises:

a browsing weight of the sample being browsed on a network;

and/or a click weight of the sample being clicked on a network;

and/or a website weight of the website where the sample is located;

and/or a search weight of the sample when it is searched.

4. The method according to claim 2, characterized in that the step of calculating, for each classification object, the distance between every two classification objects according to the center coordinates of each classification object comprises:

5. The method according to claim 3, characterized in that the step of obtaining the initial samples comprises:

6. The method according to claim 1 or 2, characterized in that, after the step of clustering the classification objects according to the access weight and center coordinates of each classification object, the method further comprises:

7. The method according to claim 6, characterized in that, after the step of tagging each user with a brand classification label according to that user's access behavior data for the brand data in each class, the method further comprises:

8. The method according to claim 2, characterized in that calculating the center coordinates of the new classification object according to the center coordinates and access weight of each classification object comprises:

9. An object clustering apparatus, comprising:

10. The apparatus according to claim 9, characterized in that the clustering module comprises: