CN105335368A - Product clustering method and apparatus

Info

Publication number: CN105335368A
Application number: CN201410250664.3A
Authority: CN
Inventors: 陈海凯
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2014-06-06
Filing date: 2014-06-06
Publication date: 2016-02-17
Anticipated expiration: 2034-06-06
Also published as: CN105335368B

Abstract

The present application relates to electronic commerce technology, and in particular to a product clustering method and apparatus. The method comprises: selecting, based on the similarity between products, products whose similarity meets a preset condition; determining cluster center products among the selected products based on a preset principle; and, based on the cluster center products, assigning each non-cluster-center product to the same cluster as the cluster center product with which it has the highest similarity. The method is not limited by a preset number of clusters: product similarity needs to be calculated only once to construct a similarity network, after which products are clustered step by step using a heuristic algorithm. In this way, the accuracy of the clustering result can be significantly improved, and the time complexity and space complexity of product clustering can be greatly reduced, which avoids placing a heavy operating load on the system and keeps the implementation cost within an acceptable range. The method and apparatus are particularly applicable to large-scale product clustering.

Description

Product clustering method and apparatus

Technical field

The present application relates to e-commerce technology, and in particular to a product clustering method and apparatus.

Background

With the development of e-commerce technology, the number of products displayed on e-commerce websites has become enormous, and computing the similarity between products is very expensive. An e-commerce website typically has hundreds of millions of users whose behavior is very rich. However, because the volume of product data is so large, the operation behaviors of users on products (e.g., clicking, purchasing, collecting) are very sparse. Because of this sparsity, computations such as user preference and user similarity often have low coverage, which in turn affects accuracy.

To address these problems, the prior art usually aggregates sufficiently similar products into one cluster and then aggregates all user behaviors involving the products in that cluster, so as to increase data density. User preference and user similarity are then mined on the denser data, yielding higher coverage and accuracy. Further, other related products can be recommended to the user based on the mining results.

In the prior art there are many clustering algorithms, of which spectral clustering is relatively common. With spectral clustering, the number of clusters K must first be set; the product-to-product similarity is then reduced to K dimensions (that is, a target number of classes is set for the massive set of products), and k-means is used to cluster the dimension-reduced products.

The shortcomings of spectral clustering are as follows:

First, the number of clusters K must be set. Given a data set, however, it is difficult for the user to judge what value of K is appropriate.

Second, dimension reduction must be performed on the products, generally down to K dimensions. However, an e-commerce website has a massive number of products; clustering them into millions of clusters through dimension reduction incurs time and space complexity that places a serious operating load on the system.

Summary of the invention

Embodiments of the present application provide a product clustering method and apparatus, to solve the prior-art problem that accurate clustering of a massive number of products increases the operating load of the system.

The specific technical solutions provided by the embodiments of the present application are as follows:

A product clustering method is provided, comprising:

calculating the similarity between products according to the operation behavior of users;

selecting, based on the similarity between products, products whose similarity meets a preset condition;

determining cluster center products among the selected products based on a preset principle, wherein the preset principle comprises: the number of products linked to a cluster center product reaches a preset threshold, and no link exists between different cluster center products;

for a non-cluster-center product, determining, from among the cluster center products, the cluster center product having the highest similarity to that non-cluster-center product, and assigning the non-cluster-center product and that cluster center product to the same cluster.

Preferably, selecting, based on the similarity between products, products whose similarity meets a preset condition comprises:

according to the calculated similarity between products, saving, for each product, the K products with the highest similarity to it, and establishing a link between each product and each of the similar products saved for it;

deleting links between unidirectionally similar products and retaining only links between bidirectionally similar products.

Preferably, after retaining the links between bidirectionally similar products, the method further comprises:

for each link, calculating the degree of overlap between the similar products of the two products at its ends and determining whether it reaches a preset overlap threshold; if so, retaining the link; otherwise, deleting the link.

Preferably, determining cluster center products among the selected products based on a preset principle comprises:

determining the degree of each product according to the links between products, wherein the degree of a product is the number of other products linked to that product;

selecting all products whose degree is greater than a preset first degree threshold as candidate cluster center products;

sorting the candidate cluster center products in descending order of degree;

traversing the candidate cluster center products in the sorted order and, whenever it is determined that none of the other products linked to a candidate cluster center product is a cluster center product, determining that candidate as a cluster center product.

Preferably, for a non-cluster-center product, determining, from among the cluster center products, the cluster center product having the highest similarity to that non-cluster-center product, and assigning the two to the same cluster, comprises:

determining the degree of a non-cluster-center product and determining whether that degree is greater than a preset second degree threshold, wherein the second degree threshold is less than the first degree threshold;

if the degree of the non-cluster-center product is greater than the preset second degree threshold, obtaining all cluster center products linked to the non-cluster-center product, and assigning the non-cluster-center product to the same cluster as the cluster center product having the largest number of links with it;

if the degree of the non-cluster-center product is not greater than the preset second degree threshold, determining all other products linked to the non-cluster-center product, determining the cluster center product corresponding to each of those other products, and assigning the non-cluster-center product to the same cluster as the cluster center product having the most links with those other products.

A product clustering apparatus is provided, comprising:

a calculating unit, configured to calculate the similarity between products according to the operation behavior of users;

a first processing unit, configured to select, based on the similarity between products, products whose similarity meets a preset condition;

a second processing unit, configured to determine cluster center products among the selected products based on a preset principle, wherein the preset principle comprises: the number of products linked to a cluster center product reaches a preset threshold, and no link exists between different cluster center products;

a clustering unit, configured to, for a non-cluster-center product, determine, from among the cluster center products, the cluster center product having the highest similarity to that non-cluster-center product, and assign the non-cluster-center product and that cluster center product to the same cluster.

Preferably, when selecting, based on the similarity between products, products whose similarity meets a preset condition, the first processing unit is specifically configured to:

Preferably, after retaining the links between bidirectionally similar products, the first processing unit is further configured to:

Preferably, when determining cluster center products among the selected products based on a preset principle, the second processing unit is specifically configured to:

Preferably, when, for a non-cluster-center product, determining from among the cluster center products the cluster center product having the highest similarity to that non-cluster-center product and assigning the two to the same cluster, the clustering unit is specifically configured to:

In the embodiments of the present application, the e-commerce system calculates the similarity between products according to the operation behavior of users and selects, based on that similarity, products whose similarity meets a preset condition. It then determines cluster center products among the selected products based on a preset principle, wherein the preset principle comprises: the number of products linked to a cluster center product reaches a preset threshold, and no link exists between cluster center products. Finally, based on the cluster center products, each non-cluster-center product is assigned to the same cluster as the cluster center product with which it has the highest similarity. This method is not limited by a preset number of clusters: product similarity needs to be calculated only once, after which a similarity network can be built and products can be clustered step by step using a heuristic algorithm. In this way, the accuracy of the clustering result can be substantially improved, and the time complexity and space complexity of product clustering can be greatly reduced, which avoids placing a serious operating load on the system and keeps the implementation cost within an acceptable range. The method is particularly suitable for large-scale product clustering scenarios.

Brief description of the drawings

Fig. 1 is a flowchart of product clustering in an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a product clustering apparatus in an embodiment of the present application.

Detailed description

To solve the prior-art problem that accurate clustering of a massive number of products increases the operating load of the system, in the embodiments of the present application a top similarity network is built from the similarity between products, and a heuristic algorithm is then used on this network to partition the products into clusters.

Of course, implementing the technical solution relies on analyzing a large amount of user behavior data and therefore requires a parallel computing platform such as Hadoop. In addition, the technical solution is applicable not only to product clustering but also to other clustering scenarios such as user clustering and shop clustering, which are not described further here.

Taking products only as an example, preferred embodiments of the present application are described in detail below with reference to the accompanying drawings.

Referring to Fig. 1, in the embodiment of the present application, the product clustering flow is as follows:

Step 100: calculate the similarity between products according to the operation behavior of users.

A product in the present application can be understood as a data object, for example, a piece of commodity information data.

The data object may be commodity information data or multimedia information data (such as audio or video content). An operation behavior of a user on a product is an operation by the user on a data object, including access requests (e.g., browsing the data object), storage requests (e.g., collecting the data object), forwarding requests (e.g., recommending the data object to other users), and so on.

In the embodiment of the present application, it is determined for each product which users have performed a specified operation on it, and the result for each product is used as that product's feature. If different products have had a specified operation (such as browsing or sending a transaction request) performed on them by the same user, those products are in theory considered similar.
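As a rough illustration of the feature described above, the following Python sketch turns a behavior log into a mapping from each product to the set of users who performed a specified operation on it. The event tuple layout, the operation names, and the function name `build_product_features` are assumptions made for this example; the source does not specify a log format.

```python
from collections import defaultdict


def build_product_features(events, operations=("click", "purchase", "collect")):
    """Map each product to the set of users who performed a specified operation on it.

    `events` is an iterable of (user, product, operation) tuples (hypothetical format).
    """
    features = defaultdict(set)
    for user, product, operation in events:
        if operation in operations:
            features[product].add(user)
    return dict(features)


# Hypothetical behavior log.
events = [
    ("u1", "A", "click"),
    ("u2", "A", "purchase"),
    ("u1", "B", "click"),
    ("u3", "B", "view_ad"),  # not a specified operation, so it is ignored
]
features = build_product_features(events)
```

Here products A and B share user u1, which is the situation in which the source treats two products as potentially similar.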

Once the operation behaviors of users have been extracted, the feature of each product (i.e., which users performed the specified operation on it) is obtained, and the similarity between products can be calculated by collaborative filtering. Specifically, any of the cosine, Jaccard, or Pearson correlation coefficient algorithms, among others, may be used. Taking the cosine algorithm as an example, Formula 1 can be used to calculate the similarity between two products:

sim(d_i, d_j) = \frac{\sum_{u} w_{ui} w_{uj}}{\sqrt{\sum_{u} w_{ui}^{2}} \, \sqrt{\sum_{u} w_{uj}^{2}}}

Formula 1

Here, d_i and d_j denote two different products, and sim(d_i, d_j) denotes the similarity between them; w_ui and w_uj indicate whether user u has performed the specified operation on each of the two products. The more users who have performed the specified operation on both products, the higher the similarity. w_ui and w_uj can be quantified as 0 or 1: for example, if user u has performed the specified operation on product i, w_ui is 1; otherwise w_ui is 0.
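A minimal Python sketch of the cosine calculation with 0/1 weights follows. It assumes the standard cosine similarity (with square roots in the denominator) and represents each product by the set of users who performed the specified operation on it; with binary weights the numerator is the number of shared users and each sum of squares is the size of a user set. The function name `cosine_similarity` is chosen for this example only.

```python
import math


def cosine_similarity(users_i, users_j):
    """Cosine similarity between two products given their user sets (0/1 weights)."""
    if not users_i or not users_j:
        return 0.0  # assumption: a product with no users is similar to nothing
    shared = len(users_i & users_j)  # sum over u of w_ui * w_uj
    return shared / math.sqrt(len(users_i) * len(users_j))


sim_ab = cosine_similarity({"u1", "u2"}, {"u1"})
sim_same = cosine_similarity({"u1", "u2"}, {"u1", "u2"})
sim_none = cosine_similarity({"u1"}, {"u2"})
```

Two products operated on by exactly the same users get similarity 1, and products with no user in common get 0.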

Step 110: select, based on the similarity between products, products whose similarity meets a preset condition.

In practice, merely calculating the similarity between products is far from sufficient, because similarities calculated in this way often contain noise (that is, two unrelated products may be wrongly judged similar). Therefore, after the similarity between products is calculated, a network of highly similar products needs to be constructed on top of it, so as to guarantee that the products within each cluster finally obtained are similar.

Accordingly, when performing step 110, the K products most similar to each product are first saved for that product according to the calculated similarities, and a link is established between each product and each of the similar products saved for it. In the embodiment of the present application, the top 40 most similar products are optionally taken for each product, i.e., K = 40.

After this operation, the relationship between some products is unidirectional; for example, product B may be among the top-K similar products of product A while product A is not necessarily among the top-K similar products of product B. Each product also has a degree: the degree of a product is the number of similar products linked to it in the undirected similarity network built between products. If a product is linked to N products, its degree is N.

Next, an undirected top similarity network is built by filtering the result of the previous operation: links between unidirectionally similar products are deleted, and only links between bidirectionally similar products are retained. For example, if product B is among the top-K similar products of product A and product A is also among the top-K similar products of product B, the link between product A and product B is retained.
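The top-K selection and the bidirectional filter can be sketched in Python as follows. The nested-dictionary input format and the function names `top_k_neighbors` and `mutual_links` are assumptions for illustration, and the example uses K = 1 only to keep the data small.

```python
def top_k_neighbors(similarity, k):
    """For each product, keep the k other products with the highest similarity.

    `similarity` maps product -> {other product: score}, with no self entries.
    """
    top = {}
    for product, scores in similarity.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        top[product] = set(ranked[:k])
    return top


def mutual_links(top):
    """Keep a link p-q only if q is in p's top-k and p is in q's top-k."""
    links = {product: set() for product in top}
    for product, neighbors in top.items():
        for other in neighbors:
            if product in top.get(other, ()):
                links[product].add(other)
    return links


# Hypothetical similarity scores.
similarity = {
    "A": {"B": 0.9, "C": 0.5},
    "B": {"A": 0.9, "C": 0.8},
    "C": {"B": 0.8, "A": 0.5},
}
top = top_k_neighbors(similarity, k=1)
links = mutual_links(top)
degree = {product: len(neighbors) for product, neighbors in links.items()}
```

In this data C lists B as its most similar product but B lists A, so the C to B relation is unidirectional and is dropped, leaving only the A-B link.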

After the two operations above, if the noise is considered to have been reduced below a threshold, step 120 can be performed. If the selected products are considered to still contain some noise, further filtering can be done. Preferably, for each link, the degree of overlap between the similar products of the two products at its ends is calculated and compared with a preset overlap threshold; if the threshold is reached, the link is retained; otherwise, the link is deleted, so as to remove noise.

For example, suppose there is a link between product A and product B. The similar products of product A form a set, called similar_auction(A), and product B likewise has a set, called similar_auction(B). The intersection of similar_auction(A) and similar_auction(B) is calculated, and the link between product A and product B is filtered according to the size of the intersection: if the intersection contains fewer than 5 products, the link is deleted; if it contains 5 or more, the link is retained.
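The overlap test can be sketched in Python as below, assuming the degree of overlap is measured as the number of similar products the two endpoints have in common. The function name `filter_by_overlap` and the symmetric adjacency-dictionary format are assumptions for this example, and the threshold is set to 1 rather than 5 only to keep the data small.

```python
def filter_by_overlap(links, min_overlap):
    """Keep a link p-q only if p and q share at least `min_overlap` similar products.

    `links` maps every product to the set of products it is linked to (symmetric).
    Overlap is always computed on the original network, so the result stays symmetric.
    """
    kept = {product: set() for product in links}
    for product, neighbors in links.items():
        for other in neighbors:
            if len(links[product] & links[other]) >= min_overlap:
                kept[product].add(other)
    return kept


# Hypothetical network: A, B, C form a triangle; D hangs off A alone.
links = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
filtered = filter_by_overlap(links, min_overlap=1)
```

The A-D link is removed because A and D have no similar product in common, while every link inside the triangle survives.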

Step 120: determine cluster center products among the selected products based on a preset principle, wherein the preset principle is: the number of products linked to a cluster center product reaches a preset threshold, and no link exists between different cluster center products.

In the embodiment of the present application, after the undirected top similarity network is constructed, a heuristic method is used to find the products that serve as cluster centers, hereinafter referred to as cluster center products. The heuristic follows two basic principles. First, the degree of a cluster center product should be as large as possible, that is, the more products linked to it the better, and the degree must reach a preset threshold. Second, no link may exist between cluster center products, that is, cluster center products are not similar to one another.

Specifically, when performing step 120, the following operations may be performed, without limitation:

A: First, based on the top similarity network, determine the degree of each product, wherein the degree of a product is the number of other products linked to it.

At this point, a cluster center product set center_auction can be created and initialized to the empty set.

B: Second, select all products whose degree is greater than the preset first degree threshold as candidate cluster center products.

Preferably, the first degree threshold may be 10; that is, a product linked to more than 10 other products can serve as a candidate cluster center product.

C: Sort the candidate cluster center products in descending order of degree.

D: Traverse the candidate cluster center products in the sorted order and, whenever it is determined that none of the other products linked to a candidate cluster center product is a cluster center product, determine that candidate as a cluster center product.

For example, when checking all similar products of a candidate cluster center product i (hereinafter product i), that is, all products linked to product i, determine whether none of them is in the cluster center product set center_auction. If so, add product i to center_auction and continue to the next candidate cluster center product; otherwise, continue directly to the next candidate cluster center product.
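Operations A through D can be sketched in Python as follows. This is one possible reading of the heuristic rather than a definitive implementation: the function name `select_cluster_centers` is hypothetical, and breaking degree ties by product name is an assumption added only to make the result deterministic.

```python
def select_cluster_centers(links, first_degree_threshold):
    """Greedy selection of cluster centers from an undirected similarity network.

    `links` maps every product to the set of products it is linked to.
    """
    # A: degree of each product.
    degree = {product: len(neighbors) for product, neighbors in links.items()}
    center_auction = set()
    # B: candidates are products whose degree exceeds the first threshold.
    candidates = [p for p in links if degree[p] > first_degree_threshold]
    # C: sort by degree, largest first (ties broken by name: an assumption).
    candidates.sort(key=lambda p: (-degree[p], str(p)))
    # D: accept a candidate only if none of its linked products is already a center.
    for product in candidates:
        if links[product].isdisjoint(center_auction):
            center_auction.add(product)
    return center_auction


# Hypothetical network: H and G are both well connected but linked to each other.
links = {
    "H": {"a", "b", "c", "G"},
    "G": {"H", "e", "f"},
    "a": {"H"}, "b": {"H"}, "c": {"H"},
    "e": {"G"}, "f": {"G"},
}
centers = select_cluster_centers(links, first_degree_threshold=2)
```

H has the larger degree and is accepted first; G is then rejected because it is linked to H, which keeps cluster centers unlinked from one another.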

Step 130: for a non-cluster-center product, determine, from among the cluster center products, the cluster center product having the highest similarity to that non-cluster-center product, and assign the non-cluster-center product and that cluster center product to the same cluster.

For any non-cluster-center product (hereinafter product x), when performing step 130, the following operations may be performed, without limitation:

First, determine the degree of product x and determine whether it is greater than the set second degree threshold, wherein the second degree threshold is less than the first degree threshold.

Next, if the degree of product x is greater than the set second degree threshold, a link necessarily exists between product x and a cluster center product; all cluster center products linked to product x are obtained, and product x is assigned to the same cluster as the cluster center product having the largest number of links with product x.

If the degree of product x is not greater than the set second degree threshold, a link does not necessarily exist between product x and a cluster center product. In that case, all other products linked to product x are determined, the cluster center product corresponding to each of them is determined, and product x is assigned to the same cluster as the cluster center product having the most links with those other products.

For example, the k-nearest-neighbor (kNN) algorithm can be used: determine which cluster each of the other products linked to product x belongs to (i.e., which cluster center product it corresponds to), then determine which of those products belong to the same cluster; product x is assigned to the cluster to which the largest number of those products belong.

For instance, suppose the other products linked to product x are products a, b, c, and d, where products a and b both belong to the same cluster as cluster center product M, product c belongs to the same cluster as cluster center product N, and product d belongs to the same cluster as cluster center product L. Product x can then be assigned to the same cluster as cluster center product M.
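The assignment step can be sketched in Python as below. Several choices here are assumptions rather than statements from the source: among the cluster centers directly linked to a product, the sketch picks the one with the highest similarity; a product above the second degree threshold that has no direct link to any center falls back to the neighbor vote; and the vote only counts neighbors assigned in the first pass. The function name `assign_clusters` is hypothetical.

```python
from collections import Counter


def assign_clusters(links, similarity, centers, second_degree_threshold):
    """Return a dict mapping each assigned product to its cluster center."""
    cluster_of = {c: c for c in centers}

    # Pass 1: products above the second degree threshold with a direct link to
    # at least one center join the linked center they are most similar to.
    for product, neighbors in links.items():
        if product in centers:
            continue
        linked_centers = neighbors & centers
        if len(neighbors) > second_degree_threshold and linked_centers:
            scores = similarity.get(product, {})
            cluster_of[product] = max(
                linked_centers, key=lambda c: (scores.get(c, 0.0), str(c))
            )

    # Pass 2: each remaining product takes a majority vote over the clusters
    # of its neighbors that were assigned in pass 1.
    snapshot = dict(cluster_of)
    for product, neighbors in links.items():
        if product in snapshot:
            continue
        votes = Counter(snapshot[q] for q in neighbors if q in snapshot)
        if votes:
            cluster_of[product] = votes.most_common(1)[0][0]
    return cluster_of


# Hypothetical data mirroring the example with products a, b, c, d and x.
links = {
    "M": {"a", "b"}, "N": {"c"}, "L": {"d"},
    "a": {"M", "x"}, "b": {"M", "x"}, "c": {"N", "x"}, "d": {"L", "x"},
    "x": {"a", "b", "c", "d"},
}
similarity = {"a": {"M": 0.9}, "b": {"M": 0.8}, "c": {"N": 0.7}, "d": {"L": 0.6}}
clusters = assign_clusters(links, similarity, {"M", "N", "L"}, second_degree_threshold=1)
```

Products a and b join M, c joins N, and d joins L through their direct links; x has no direct link to a center, so it follows the majority of its neighbors and lands in M's cluster. Products whose neighbors are all unassigned stay unassigned in this sketch.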

Through the above flow, the e-commerce system can cluster products accurately. When a user browses a product, the e-commerce system can present highly similar recommended products to the user according to the cluster to which the product belongs. This accurately reflects user preference and user similarity, improves the coverage and accuracy of recommendation information, and shortens the time the user needs to find a target product, thereby improving the user's search efficiency and the service quality of the e-commerce network.

Based on the above embodiment and referring to Fig. 2, in the embodiment of the present application the apparatus for product clustering comprises a calculating unit 20, a first processing unit 21, a second processing unit 22, and a clustering unit 23, wherein:

the calculating unit 20 is configured to calculate the similarity between products according to the operation behavior of users;

the first processing unit 21 is configured to select, based on the similarity between products, products whose similarity meets a preset condition;

the second processing unit 22 is configured to determine cluster center products among the selected products based on a preset principle, wherein the preset principle comprises: the number of products linked to a cluster center product reaches a preset threshold, and no link exists between cluster center products;

the clustering unit 23 is configured to, for a non-cluster-center product, determine, from among the cluster center products, the cluster center product having the highest similarity to that non-cluster-center product, and assign the non-cluster-center product and that cluster center product to the same cluster.

Preferably, when selecting, based on the similarity between products, products whose similarity meets a preset condition, the first processing unit 21 is specifically configured to:

Preferably, after retaining the links between bidirectionally similar products, the first processing unit 21 is further configured to:

Preferably, when determining cluster center products among the selected products based on a preset principle, the second processing unit 22 is specifically configured to:

determine the degree of each product according to the links between products, wherein the degree of a product is the number of other products linked to that product;

traverse the candidate cluster center products in the sorted order and, whenever it is determined that none of the other products linked to a candidate cluster center product is a cluster center product, determine that candidate as a cluster center product.

Preferably, when, for a non-cluster-center product, determining from among the cluster center products the cluster center product having the highest similarity to that non-cluster-center product and assigning the two to the same cluster, the clustering unit 23 is specifically configured to:

determine the degree of a non-cluster-center product and determine whether that degree is greater than a preset second degree threshold, wherein the second degree threshold is less than the first degree threshold;

if the degree of the non-cluster-center product is greater than the preset second degree threshold, obtain all cluster center products linked to the non-cluster-center product, and assign the non-cluster-center product to the same cluster as the cluster center product having the largest number of links with it;

if the degree of the non-cluster-center product is not greater than the preset second degree threshold, determine all other products linked to the non-cluster-center product, determine the cluster center product corresponding to each of those other products, and assign the non-cluster-center product to the same cluster as the cluster center product having the most links with those other products.

In summary, in the embodiments of the present application, the e-commerce system calculates the similarity between products according to the operation behavior of users and selects, based on that similarity, products whose similarity meets a preset condition. It then determines cluster center products among the selected products based on a preset principle, wherein the preset principle comprises: the number of products linked to a cluster center product reaches a preset threshold, and no link exists between cluster center products. Finally, based on the cluster center products, each non-cluster-center product is assigned to the same cluster as the cluster center product with which it has the highest similarity.

This method is not limited by a preset number of clusters: product similarity needs to be calculated only once, after which a similarity network can be built and products can be clustered step by step using a heuristic algorithm. In this way, the accuracy of the clustering result can be substantially improved, and the time complexity and space complexity of product clustering can be greatly reduced, which avoids placing a serious operating load on the system and keeps the implementation cost within an acceptable range. The method is particularly suitable for large-scale product clustering scenarios.

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications falling within the scope of the present application.

Obviously, those skilled in the art can make various changes and variations to the embodiments of the present application without departing from their spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to cover them.

Claims

1. A product clustering method, characterized by comprising:

2. The method of claim 1, characterized in that selecting, based on the similarity between products, products whose similarity meets a preset condition comprises:

3. The method of claim 2, characterized in that, after retaining the links between bidirectionally similar products, the method further comprises:

4. The method of claim 2, characterized in that determining cluster center products among the selected products based on a preset principle comprises:

5. The method of claim 2, characterized in that, for a non-cluster-center product, determining, from among the cluster center products, the cluster center product having the highest similarity to that non-cluster-center product, and assigning the two to the same cluster, comprises:

6. A product clustering apparatus, characterized by comprising:

7. The apparatus of claim 6, characterized in that, when selecting, based on the similarity between products, products whose similarity meets a preset condition, the first processing unit is specifically configured to:

8. The apparatus of claim 7, characterized in that, after retaining the links between bidirectionally similar products, the first processing unit is further configured to:

9. The apparatus of claim 7, characterized in that, when determining cluster center products among the selected products based on a preset principle, the second processing unit is specifically configured to:

10. The apparatus of claim 7, characterized in that, when, for a non-cluster-center product, determining from among the cluster center products the cluster center product having the highest similarity to that non-cluster-center product and assigning the two to the same cluster, the clustering unit is specifically configured to: