CN105447117A - User clustering method and apparatus

Info

Publication number: CN105447117A
Application number: CN201510783263.9A
Authority: CN
Inventors: 牛凯; 杜帅
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2015-11-16
Filing date: 2015-11-16
Publication date: 2016-03-30
Anticipated expiration: 2035-11-16
Also published as: CN105447117B

Abstract

Embodiments of the present invention disclose a user clustering method and apparatus. The method comprises: receiving a clustering request, wherein the clustering request carries a category of to-be-collected user data, and collecting user data according to the clustering request; processing the user data, to obtain main properties and affiliated properties of each user data, and obtaining multi-dimensional data corresponding to each main property according to all the affiliated properties; obtaining relevance between each main property and all the affiliated properties according to multi-dimensional data corresponding to each main property; and performing fuzzy clustering according to the relevance between each main property and all affiliated properties, to obtain a clustering result. According to the method, analysis may be performed on user data from multiple dimensions, so that a clustering result meeting an actual situation can be obtained.

Description

User clustering method and apparatus

Technical field

The present invention relates to data mining in the field of computer applications, and in particular to a user clustering method and apparatus.

Background art

At present, the volume of new data that human society generates every day is growing explosively. Analyzing and processing this mass of data in real time and mining its internal relationships is a problem of particular concern to analysts and decision makers. For example, information science in China is developing very rapidly, and the numbers of scientific research projects, published papers, and patent applications are almost too large to count. Analyzing the relational network among the knowledge data of these research projects, papers, and patents, and predicting the future research hotspots or focal points of a technical field, can help scientific research management departments manage and approve projects more effectively and can open up new research directions for researchers in the field. In the social media field, the number of new users and the exchanges among users keep growing; analyzing friend relationships and community structure among users is of great significance for targeted recommendation and delivery, for user behavior analysis, and for differentiated management of user groups. In the commodity trading field, whatever the sales model, a large number of transactions take place every day; the numbers of users, merchants, and commodities can reach a considerable scale, even hundreds of millions, and both commodities and users fall into many categories. Faced with such numerous and complex data relating to all types of users, analyzing and clustering users on the basis of a single kind of data clearly does not reflect the actual situation.

An existing method sets up a data mining workflow that comprises multiple parallel data processing tasks; these tasks are executed in parallel through a map/reduce mechanism to obtain the corresponding results. In the prior art, however, the mining and clustering of data is limited to a single dimension, and the data cannot be analyzed from multiple dimensions and angles. As a result, the analysis and understanding of the data is not comprehensive enough, which degrades the clustering effect.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a user clustering method and apparatus, so that the clustering result of users conforms to the actual situation.

In a first aspect, an embodiment of the present invention discloses a user clustering method, applied to a clustering server, comprising the steps of:

receiving a clustering request and collecting user data according to the clustering request, wherein the clustering request carries the category of the user data to be collected;

processing the collected user data according to a preset rule to obtain the primary attribute and attached attributes of each piece of user data; determining all attached attributes from the primary attribute and attached attributes obtained for each piece of user data; and obtaining, according to all attached attributes, the multi-dimensional data corresponding to each primary attribute; wherein the primary attribute comprises a user identifier, the attached attributes comprise information about the user obtained from each piece of user data, and the multi-dimensional data indicates, for each of the attached attributes, whether the primary attribute has that attached attribute;

obtaining the degree of correlation between each primary attribute and all attached attributes according to the multi-dimensional data corresponding to each primary attribute;

performing fuzzy clustering according to the degree of correlation between each primary attribute and all attached attributes to obtain a clustering result,

comprising: classifying all primary attributes according to a preset classification rule to obtain a first distribution of each primary attribute in each category, and determining, according to the degree of correlation between each primary attribute and all attached attributes and the first distribution, a second distribution of each attached attribute of the multi-dimensional data in each category, wherein the classification ensures that each category contains at least one primary attribute;

performing iterative computation with a preset fuzzy clustering algorithm according to the second distribution, to obtain the clustering result of the users.

Preferably, processing the collected user data according to the preset rule to obtain the primary attribute and attached attributes of each piece of user data comprises:

performing word segmentation, useless-word filtering, and illegal-character handling on the collected user data;

obtaining one unique primary attribute and at least one attached attribute for each piece of user data.
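The preprocessing step above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation described in the patent: the regular expression that defines "illegal" characters, the whitespace-based segmentation (Chinese text would need a dedicated word segmenter), the stop-word set `USELESS_WORDS`, and the record layout with `user_id` and `text` fields are all hypothetical.

```python
import re

# Hypothetical list of useless words; the source does not specify one.
USELESS_WORDS = {"the", "a", "an", "and", "of", "i", "like"}

# Assumption: anything other than ASCII letters, digits, and whitespace is illegal.
ILLEGAL_CHARS = re.compile(r"[^0-9A-Za-z\s]")


def extract_attributes(record):
    """Return (primary_attribute, attached_attributes) for one user record."""
    cleaned = ILLEGAL_CHARS.sub(" ", record["text"])   # illegal-character handling
    words = cleaned.lower().split()                     # naive word segmentation
    attached = sorted({w for w in words if w not in USELESS_WORDS})
    if not attached:
        # The method requires at least one attached attribute per record.
        raise ValueError("record yields no attached attribute")
    return record["user_id"], attached


record = {"user_id": "A", "text": "I like music, and travel!!"}
primary, attached = extract_attributes(record)
print(primary, attached)
```

The user identifier is taken as the single primary attribute, and every remaining word becomes an attached attribute, which mirrors the "one unique primary attribute and at least one attached attribute" requirement.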

Preferably, after determining the second distribution of each attached attribute of the multi-dimensional data in each category, the method further comprises:

determining, according to the second distribution of each attached attribute of the multi-dimensional data in each category, the weight of each attached attribute of the multi-dimensional data in each category;

in which case performing iterative computation with the preset fuzzy clustering algorithm according to the second distribution, to obtain the clustering result of the users, is:

performing iterative computation with the preset fuzzy clustering algorithm according to the weight of each attached attribute of the multi-dimensional data in each category, to obtain the clustering result of the users.

Preferably, performing iterative computation with the preset fuzzy clustering algorithm according to the weight of each attached attribute of the multi-dimensional data in each category, to obtain the clustering result of the users, comprises:

S1: determine, according to the weight of each attached attribute of the multi-dimensional data in each category, the membership vector of each primary attribute of the multi-dimensional data with respect to the categories; wherein the membership vector of each primary attribute with respect to the categories is determined by all attached attributes;

S2: determine, according to the membership vector of each primary attribute of the multi-dimensional data with respect to the categories, the center vector of the current cluster center of each category; the center vector of the current cluster center of a category is the mean of the membership degrees in that category of all primary attributes present in that category, and the membership vector comprises the membership degree of each primary attribute of the multi-dimensional data in each category;

S3: compare, against a set threshold, the norm of the difference between the center vector of the current cluster center of each category and the center vector of the previous cluster center of that category;

S4: if the norm is less than or equal to the set threshold, determine that the clustering result has converged and end the clustering process;

S5: if the norm is greater than the set threshold, determine that the clustering result has not converged and continue the clustering process: take the membership vector of each primary attribute of the multi-dimensional data with respect to the categories as the new first distribution of each primary attribute in each category for the new round of clustering; determine, according to the degree of correlation between each primary attribute and all attached attributes and the new first distribution, the second distribution of each attached attribute of the multi-dimensional data in each category; determine, according to that second distribution, the weight of each attached attribute of the multi-dimensional data in each category; and return to step S1.
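Steps S1 to S5 can be sketched in Python as follows. The text does not give closed-form formulas, so the sketch rests on several assumptions that are not taken from the patent: the weight of an attached attribute in a category is the sum of the current memberships of the users that have it, a user's membership vector is the normalized sum of the weights of its attached attributes, a user counts as "present" in the category where its membership is largest, and no convergence test is made in the first round because no previous center exists. All function names are hypothetical.

```python
import math


def attribute_weights(W, X):
    """Weight of attribute j in category c: sum of X[i][c] over users i having j."""
    n, m, k = len(W), len(W[0]), len(X[0])
    return [[sum(W[i][j] * X[i][c] for i in range(n)) for c in range(k)]
            for j in range(m)]


def membership_vectors(W, V):
    """S1: membership of each user, accumulated over its attached attributes."""
    m, k = len(V), len(V[0])
    U = []
    for row in W:
        raw = [sum(row[j] * V[j][c] for j in range(m)) for c in range(k)]
        total = sum(raw) or 1.0          # guard against an all-zero row
        U.append([value / total for value in raw])
    return U


def center_vector(U):
    """S2: per category, mean membership of the users assigned to that category."""
    k = len(U[0])
    labels = [row.index(max(row)) for row in U]
    centers = []
    for c in range(k):
        members = [U[i][c] for i, label in enumerate(labels) if label == c]
        centers.append(sum(members) / len(members) if members else 0.0)
    return centers


def fuzzy_cluster(W, X, threshold=1e-3, max_iter=100):
    """Iterate S1-S5 until the center vector moves by at most `threshold`."""
    previous = None
    U = X
    for _ in range(max_iter):
        U = membership_vectors(W, attribute_weights(W, X))    # S1
        current = center_vector(U)                            # S2
        if previous is not None:
            change = math.sqrt(sum((a - b) ** 2
                                   for a, b in zip(current, previous)))  # S3
            if change <= threshold:                           # S4: converged
                return U
        previous, X = current, U                              # S5: next round
    return U


# Five users, five attached attributes, two categories (binary correlation matrix).
W = [[1, 1, 0, 0, 0],
     [0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1]]
X0 = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]   # initial first distribution

U = fuzzy_cluster(W, X0)
labels = [row.index(max(row)) for row in U]
print(labels)
```

Under these assumptions the first three users, which share attached attributes, end up in the first category, while the last two users, which share nothing with the others, stay in the second.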

Preferably, after determining that the clustering result has converged and ending the clustering process, the method further comprises:

taking the membership vector of each primary attribute with respect to the categories in the current clustering process as the probability that the primary attribute belongs to each category, and sorting the primary attributes within each category according to these probabilities.
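The sorting step can be sketched in Python as follows. The membership values are made-up illustration data, `rank_by_category` is a hypothetical name, and dropping users whose probability for a category is zero is an assumption rather than something the text states.

```python
def rank_by_category(user_ids, U):
    """For each category, list (user, probability) pairs in descending probability."""
    k = len(U[0])
    ranking = []
    for c in range(k):
        pairs = [(uid, row[c]) for uid, row in zip(user_ids, U) if row[c] > 0]
        pairs.sort(key=lambda pair: pair[1], reverse=True)
        ranking.append(pairs)
    return ranking


user_ids = ["A", "B", "C", "D", "E"]
# Hypothetical converged membership vectors, one row per user.
U = [[0.7, 0.3], [0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.0, 1.0]]

ranking = rank_by_category(user_ids, U)
for c, pairs in enumerate(ranking):
    print(c, pairs)
```

Because the memberships are fuzzy, a user can appear in the ranking of more than one category, with a different probability in each.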

In a second aspect, an embodiment of the present invention further provides a user clustering apparatus, applied to a clustering server, the apparatus comprising:

a clustering request receiving module, configured to receive a clustering request and collect user data according to the clustering request, wherein the clustering request carries the category of the user data to be collected;

a multi-dimensional data acquisition module, configured to process the collected user data according to a preset rule to obtain the primary attribute and attached attributes of each piece of user data; determine all attached attributes from the primary attribute and attached attributes obtained for each piece of user data; and obtain, according to all attached attributes, the multi-dimensional data corresponding to each primary attribute; wherein the primary attribute comprises a user identifier, the attached attributes comprise information about the user obtained from each piece of user data, and the multi-dimensional data indicates, for each of the attached attributes, whether the primary attribute has that attached attribute;

a correlation acquisition module, configured to obtain the degree of correlation between each primary attribute and all attached attributes according to the multi-dimensional data corresponding to each primary attribute;

a fuzzy clustering module, configured to perform fuzzy clustering according to the degree of correlation between each primary attribute and all attached attributes to obtain a clustering result,

wherein the fuzzy clustering module comprises a distribution determination submodule and a clustering result obtaining submodule;

the distribution determination submodule is specifically configured to: classify all primary attributes according to a preset classification rule to obtain a first distribution of each primary attribute in each category, and determine, according to the degree of correlation between each primary attribute and all attached attributes and the first distribution, a second distribution of each attached attribute of the multi-dimensional data in each category, wherein the classification ensures that each category contains at least one primary attribute;

the clustering result obtaining submodule is specifically configured to: perform iterative computation with a preset fuzzy clustering algorithm according to the second distribution, to obtain the clustering result of the users.

Preferably, when processing the collected user data according to the preset rule to obtain the primary attribute and attached attributes of each piece of user data, the multi-dimensional data acquisition module performs word segmentation, useless-word filtering, and illegal-character handling on the collected user data, and obtains one unique primary attribute and at least one attached attribute for each piece of user data.

Preferably, after determining the distribution of each attached attribute of the multi-dimensional data in each category, the distribution determination submodule further comprises:

Preferably, the clustering result obtaining submodule comprises a membership vector determination submodule, a center vector determination submodule, a comparison submodule, a first decision submodule, and a second decision submodule:

the membership vector determination submodule is configured to determine, according to the weight of each attached attribute of the multi-dimensional data in each category, the membership vector of each primary attribute of the multi-dimensional data with respect to the categories; wherein the membership vector of each primary attribute with respect to the categories is determined by all attached attributes;

the center vector determination submodule is configured to determine, according to the membership vector of each primary attribute of the multi-dimensional data with respect to the categories, the center vector of the current cluster center of each category; the center vector of the current cluster center of a category is the mean of the membership degrees in that category of all primary attributes present in that category, and the membership vector comprises the membership degree of each primary attribute of the multi-dimensional data in each category;

the comparison submodule is configured to compare, against a set threshold, the norm of the difference between the center vector of the current cluster center of each category and the center vector of the previous cluster center of that category, to trigger the first decision submodule if the norm is less than or equal to the set threshold, and to trigger the second decision submodule if the norm is greater than the set threshold;

the first decision submodule is configured to determine that the clustering result has converged and end the clustering process;

the second decision submodule is configured to determine that the clustering result has not converged and continue the clustering process: take the membership vector of each primary attribute of the multi-dimensional data with respect to the categories as the new first distribution of each primary attribute in each category for the new round of clustering; determine, according to the degree of correlation between each primary attribute and all attached attributes and the new first distribution, the second distribution of each attached attribute of the multi-dimensional data in each category; determine, according to that second distribution, the weight of each attached attribute of the multi-dimensional data in each category; and trigger the membership vector determination submodule.

Preferably, the apparatus further comprises a sorting module:

the sorting module is specifically configured to: take the membership vector of each primary attribute with respect to the categories in the current clustering process as the probability that the primary attribute belongs to each category, and sort the primary attributes within each category according to these probabilities.

As can be seen from the above technical solutions, the embodiments of the present invention disclose a user clustering method and apparatus. A clustering request is received, and user data is collected according to the clustering request, which carries the category of the user data to be collected. The collected user data is processed according to a preset rule to obtain the primary attribute and attached attributes of each piece of user data; all attached attributes are determined from the primary attribute and attached attributes obtained for each piece of user data; and the multi-dimensional data corresponding to each primary attribute is obtained according to all attached attributes. The primary attribute comprises a user identifier, the attached attributes comprise information about the user obtained from each piece of user data, and the multi-dimensional data indicates, for each of the attached attributes, whether the primary attribute has that attached attribute. The degree of correlation between each primary attribute and all attached attributes is obtained according to the multi-dimensional data corresponding to each primary attribute, and fuzzy clustering is performed according to this degree of correlation to obtain a clustering result. The fuzzy clustering comprises: classifying all primary attributes according to a preset classification rule to obtain a first distribution of each primary attribute in each category; determining, according to the degree of correlation between each primary attribute and all attached attributes and the first distribution, a second distribution of each attached attribute of the multi-dimensional data in each category, wherein the classification ensures that each category contains at least one primary attribute; and performing iterative computation with a preset fuzzy clustering algorithm according to the second distribution, to obtain the clustering result of the users.

It can be seen that, in the embodiments of the present invention, the primary attribute and attached attributes of each piece of user data are collected, so that users can be analyzed and described from multiple dimensions and angles. Users are then clustered from multiple dimensions and angles through iteration of the preset fuzzy clustering algorithm, which avoids clustering users from a single dimension and yields a clustering result that conforms to the actual situation. Of course, any product or method implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a user clustering method provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a user clustering apparatus provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The embodiments of the present invention provide a user clustering method and apparatus that analyze user data from multiple dimensions and angles, so that the analysis of the user data is more comprehensive and the clustering result conforms to the actual situation.

The present invention is described in detail below through specific embodiments.

The user clustering method provided by an embodiment of the present invention, as shown in Fig. 1, is applied to a clustering server and may comprise the following steps:

S101: receive a clustering request and collect user data according to the clustering request, wherein the clustering request carries the category of the user data to be collected.

It should be noted that this application places no limitation on the tool used for data collection; any tool capable of collecting data can be used.

When user data is collected according to the clustering request, a crawler tool may be used, an open application programming interface (API) may be used to collect the user data online, or a crawler tool may be combined with an open API.

As a specific example: the clustering server receives a clustering request in which the category of the user data to be collected is social media users, specifically microblog users. According to this request, the clustering server collects microblog user data with a crawler tool, collects it online through an open API, or combines a crawler tool with an open API. The microblog user data may comprise: the interest tags selected when the microblog user registered, the microblog posts the user published, the comments the user took part in, and information about the friends the user interacted with.

Alternatively, suppose the category of the user data to be collected that is carried in the clustering request is commodity trading users, specifically shopping users. According to this request, the clustering server collects shopping user data with a crawler tool, collects it online through an open API, or combines a crawler tool with an open API. The shopping user data may comprise: the commodities the user has bought and their categories; the commodities the user has browsed, favorited, or followed and their categories; and information about the merchants the user has followed or favorited.

Alternatively, suppose the category of the user data to be collected that is carried in the clustering request is expert users in the science and technology field. According to this request, the clustering server collects the expert user data with a crawler tool, collects it online through an open API, or combines a crawler tool with an open API. The expert user data may comprise: the papers the expert user has published, information about the research conferences or scientific research projects the expert user has taken part in, and information about the experts the expert user has collaborated with.

It should be noted that this application places no limitation on the user data; any possible user data can be used.

S102: process the collected user data according to a preset rule to obtain the primary attribute and attached attributes of each piece of user data; determine all attached attributes from the primary attribute and attached attributes obtained for each piece of user data; and obtain, according to all attached attributes, the multi-dimensional data corresponding to each primary attribute; wherein the primary attribute comprises a user identifier, the attached attributes comprise information about the user obtained from each piece of user data, and the multi-dimensional data indicates, for each of the attached attributes, whether the primary attribute has that attached attribute.

Specifically, processing the collected user data according to the preset rule to obtain the primary attribute and attached attributes of each piece of user data may comprise:

Taking microblog users as a specific example: when registering an account, microblog user A can select interest tags according to his or her own interests. These interest tags are collected together with the microblog posts published by user A, the comments user A took part in, and information about the friends user A interacted with. The name A of the microblog user can be determined as the primary attribute, and the interests determined from the collected data can be determined as attached attributes. After word segmentation, useless-word filtering, and illegal-character handling are performed on the collected data, the primary attributes of the user data are A, B, C, D, and E, and the attached attributes of the user data are "first", "second", "third", "fourth", and "fifth". The first piece of user data has primary attribute A and attached attributes "first" and "second"; the second piece has primary attribute B and attached attributes "second" and "third"; the third piece has primary attribute C and attached attributes "first" and "third"; the fourth piece has primary attribute D and attached attribute "fourth"; the fifth piece has primary attribute E and attached attribute "fifth". From the processed user data, the multi-dimensional data corresponding to each primary attribute can be determined. The multi-dimensional data indicates, for each of the attached attributes, whether the primary attribute has that attached attribute, and can be expressed as follows:

The multi-dimensional data corresponding to primary attribute A covers attached attributes "first", "second", "third", "fourth", and "fifth"; primary attribute A has "first" and "second" and does not have "third", "fourth", or "fifth";

The multi-dimensional data corresponding to primary attribute B covers attached attributes "first", "second", "third", "fourth", and "fifth"; primary attribute B has "second" and "third" and does not have "first", "fourth", or "fifth";

The multi-dimensional data corresponding to primary attribute C covers attached attributes "first", "second", "third", "fourth", and "fifth"; primary attribute C has "first" and "third" and does not have "second", "fourth", or "fifth";

The multi-dimensional data corresponding to primary attribute D covers attached attributes "first", "second", "third", "fourth", and "fifth"; primary attribute D has "fourth" and does not have "first", "second", "third", or "fifth";

The multi-dimensional data corresponding to primary attribute E covers attached attributes "first", "second", "third", "fourth", and "fifth"; primary attribute E has "fifth" and does not have "first", "second", "third", or "fourth".

S103: obtain the degree of correlation between each primary attribute and all attached attributes according to the multi-dimensional data corresponding to each primary attribute.

Specifically, following step S102, the degree of correlation between each primary attribute and all attached attributes can be represented by an adjacency matrix W:

W = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}

Here, the degree of correlation between a primary attribute and an attached attribute that it has is set to 1, and the degree of correlation between a primary attribute and an attached attribute that it does not have is set to 0. The first row gives the correlation of primary attribute A with attached attributes "first", "second", "third", "fourth", and "fifth", respectively; the second, third, fourth, and fifth rows give the correlation of primary attributes B, C, D, and E, respectively, with the same five attached attributes.
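The construction of W from the multi-dimensional data can be sketched in Python as follows. The names `records` and `build_adjacency` are hypothetical, and the labels Y1 to Y5 are placeholders for the five attached attributes in the order in which they are listed above.

```python
# Attached attributes of each primary attribute (user), as in the example above.
records = {
    "A": ["Y1", "Y2"],
    "B": ["Y2", "Y3"],
    "C": ["Y1", "Y3"],
    "D": ["Y4"],
    "E": ["Y5"],
}


def build_adjacency(records):
    """Return (users, attributes, W), where W[i][j] is 1 if user i has attribute j."""
    users = list(records)
    attributes = []
    # Union of all attached attributes, kept in order of first appearance.
    for attached in records.values():
        for name in attached:
            if name not in attributes:
                attributes.append(name)
    W = [[1 if name in records[user] else 0 for name in attributes]
         for user in users]
    return users, attributes, W


users, attributes, W = build_adjacency(records)
for user, row in zip(users, W):
    print(user, row)
```

Each row of the result is the multi-dimensional data of one primary attribute: a 1 marks an attached attribute the user has, a 0 one the user does not have.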

S104: perform fuzzy clustering according to the degree of correlation between each primary attribute and all attached attributes to obtain a clustering result, comprising:

classifying all primary attributes according to a preset classification rule to obtain a first distribution of each primary attribute in each category, and determining, according to the degree of correlation between each primary attribute and all attached attributes and the first distribution, a second distribution of each attached attribute of the multi-dimensional data in each category, wherein the classification ensures that each category contains at least one primary attribute;

Specifically, after the second distribution of each attached attribute of the multi-dimensional data in each category is determined, the method may further comprise:

As a specific example: following step S103, all primary attributes are classified according to the preset classification rule, which may be random classification or average classification.

The primary attributes A, B, C, D, and E are randomly divided into two categories, d₁ and d₂, where d₁ contains A and B and d₂ contains C, D, and E. This yields the first distribution of each primary attribute in each category, which can be represented by a first distribution matrix:

\vec{X}_{A} = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad \vec{X}_{B} = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad \vec{X}_{C} = \begin{bmatrix} 0 & 1 \end{bmatrix}, \quad \vec{X}_{D} = \begin{bmatrix} 0 & 1 \end{bmatrix}, \quad \vec{X}_{E} = \begin{bmatrix} 0 & 1 \end{bmatrix}

Wherein with

\overset{&RightArrow;}{X_{A}} = [\begin{matrix} 1 & 0 \end{matrix}]

For example, represent that primary attribute A is respectively at d ₁and d ₂in the first distribution situation, A has been assigned randomly to d ₁in, i.e. A and d ₁intersect, with d ₂non-intersect, determine that A belongs to d ₁, be then expressed as

\overset{&RightArrow;}{X_{A}} = [\begin{matrix} 1 & 0 \end{matrix}],

Other primary attribute is at d ₁and d ₂in first distribution situation statement similar, no longer repeat at this.
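The random classification described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `random_first_distribution` and its `seed` parameter are hypothetical, and seeding each category with one attribute is simply one way to satisfy the requirement that every category contains at least one primary attribute.

```python
import random

def random_first_distribution(primary, n_classes, seed=None):
    """Randomly assign each primary attribute to exactly one category.

    Returns a dict mapping each attribute to a one-hot list (its row of
    the first distribution matrix). Every category receives at least one
    attribute, as the classification rule in the text requires.
    """
    if len(primary) < n_classes:
        raise ValueError("need at least one primary attribute per category")
    rng = random.Random(seed)
    shuffled = list(primary)
    rng.shuffle(shuffled)
    # The first n_classes shuffled attributes seed one category each,
    # so no category can end up empty.
    assignment = {attr: k for k, attr in enumerate(shuffled[:n_classes])}
    # The remaining attributes are placed uniformly at random.
    for attr in shuffled[n_classes:]:
        assignment[attr] = rng.randrange(n_classes)
    return {
        attr: [1 if assignment[attr] == k else 0 for k in range(n_classes)]
        for attr in primary
    }

# The fixed split used in the text: A, B in d1 and C, D, E in d2.
text_example = {"A": [1, 0], "B": [1, 0], "C": [0, 1], "D": [0, 1], "E": [0, 1]}

dist = random_first_distribution(["A", "B", "C", "D", "E"], 2, seed=0)
```

Each row contains a single 1, so the matrix records hard membership; the later iterations replace these rows with fractional membership vectors.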

According to the degree of correlation between each primary attribute and all attached attributes and the first distribution, the second distribution of each attached attribute of the multi-dimensional data over the categories is determined: d₁ contains attached attributes "first", "second" and "third", and d₂ contains attached attributes "first", "third", "fourth" and "fifth".

According to the second distribution, the weight of each attached attribute of the multi-dimensional data in each category is determined; the weights can be represented by a second distribution matrix.

Taking attached attribute "first" as an example, its weight vector gives the weights of "first" in d₁ and d₂: the upper "1" is its weight in d₁ and the lower "1" is its weight in d₂.

Then, according to the weight of each attached attribute of the multi-dimensional data in each category, a preset fuzzy clustering algorithm is applied iteratively to obtain the clustering result of the users.

Specifically, applying the preset fuzzy clustering algorithm iteratively, according to the weight of each attached attribute of the multi-dimensional data in each category, to obtain the clustering result of the users may comprise:

As a specific example, the preset fuzzy clustering algorithm is applied and the iteration begins:

Following the steps above, suppose the primary attributes are X₁, X₂, X₃, X₄ and X₅ and the attached attributes are Y₁, Y₂, Y₃, Y₄ and Y₅, where X₁ includes attached attributes Y₁ and Y₂, X₂ includes Y₂ and Y₃, X₃ includes Y₁ and Y₃, X₄ includes Y₄, and X₅ includes Y₅.

X₁ and X₂ are assigned to d₁, and X₃, X₄ and X₅ are assigned to d₂, which gives

\overset{\rightarrow}{X_{1}} = [\begin{matrix} 1 & 0 \end{matrix}], \overset{\rightarrow}{X_{2}} = [\begin{matrix} 1 & 0 \end{matrix}],

\overset{\rightarrow}{X_{3}} = [\begin{matrix} 0 & 1 \end{matrix}], \overset{\rightarrow}{X_{4}} = [\begin{matrix} 0 & 1 \end{matrix}], \overset{\rightarrow}{X_{5}} = [\begin{matrix} 0 & 1 \end{matrix}].

According to the degree of correlation between each primary attribute and all attached attributes and the first distribution, the second distribution of each attached attribute of the multi-dimensional data over the categories is determined: d₁ contains attached attributes Y₁, Y₂ and Y₃, and d₂ contains attached attributes Y₁, Y₃, Y₄ and Y₅. According to this second distribution, the weight of each attached attribute of the multi-dimensional data in each category is determined, which gives

\overset{\rightarrow}{Y_{1}} = [\begin{matrix} 1 & 1 \end{matrix}], \overset{\rightarrow}{Y_{2}} = [\begin{matrix} 2 & 0 \end{matrix}], \overset{\rightarrow}{Y_{3}} = [\begin{matrix} 1 & 1 \end{matrix}], \overset{\rightarrow}{Y_{4}} = [\begin{matrix} 0 & 1 \end{matrix}], \overset{\rightarrow}{Y_{5}} = [\begin{matrix} 0 & 1 \end{matrix}].
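The weight vectors above can be reproduced with a short sketch. It rests on an assumption that the text does not state explicitly: the weight of an attached attribute in a category is the sum of the first-distribution entries of the primary attributes that include it, with inclusion treated as a 0/1 relation. The function name `attached_weights` is hypothetical.

```python
def attached_weights(includes, first_dist, n_classes):
    """Sum, per category, the first-distribution rows of every primary
    attribute that includes a given attached attribute (assumed rule)."""
    weights = {}
    for primary, attached_list in includes.items():
        for attached in attached_list:
            w = weights.setdefault(attached, [0] * n_classes)
            for k in range(n_classes):
                w[k] += first_dist[primary][k]
    return weights

# Inclusion relation and first distribution taken from the example in the text.
includes = {
    "X1": ["Y1", "Y2"],
    "X2": ["Y2", "Y3"],
    "X3": ["Y1", "Y3"],
    "X4": ["Y4"],
    "X5": ["Y5"],
}
first_dist = {
    "X1": [1, 0], "X2": [1, 0],
    "X3": [0, 1], "X4": [0, 1], "X5": [0, 1],
}

weights = attached_weights(includes, first_dist, 2)
```

Under this assumption Y₂ receives weight 2 in d₁ because both X₁ and X₂ include it, while Y₁ and Y₃ are split between the two categories.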

The membership vector of each primary attribute of the multi-dimensional data for each category is then determined.

The membership vector of each primary attribute for each category is then normalized.

According to the normalized membership vectors, the center vector of the current cluster center is determined as \overset{\rightarrow}{P_{1}} = [\begin{matrix} \frac{2}{5} & \frac{3}{5} \end{matrix}]. Because this is the first iteration, the center vector of the previous cluster center is assumed to be \overset{\rightarrow}{P_{0}} = [\begin{matrix} 1 & 0 \end{matrix}]. The norm of the difference between the current and the previous center vectors is:

| \overset{\rightarrow}{P_{1}} - \overset{\rightarrow}{P_{0}} | = \sqrt{{(\frac{2}{5} - 1)}^{2} + {(\frac{3}{5} - 0)}^{2}} = \frac{3 \sqrt{2}}{5}.

Suppose the set threshold is smaller than this norm; the clustering result is then judged not to have converged, and the clustering process continues with another iteration.
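The first iteration can be checked numerically with the sketch below. Two rules in it are assumptions rather than statements from the text: the membership vector of a primary attribute is taken as the sum of the weight vectors of its attached attributes, and the center vector is taken as the mean of the normalized membership vectors. Together they reproduce the norm 3√2/5 given above, which is the only reason they are used here. All function names are hypothetical.

```python
from fractions import Fraction
import math

# Inclusion relation and attached-attribute weights from the example in the text.
includes = {
    "X1": ["Y1", "Y2"],
    "X2": ["Y2", "Y3"],
    "X3": ["Y1", "Y3"],
    "X4": ["Y4"],
    "X5": ["Y5"],
}
weights = {"Y1": (1, 1), "Y2": (2, 0), "Y3": (1, 1), "Y4": (0, 1), "Y5": (0, 1)}

def normalized_memberships(includes, weights):
    """Assumed rule: sum the weight vectors of the attached attributes of
    each primary attribute, then divide by the total so entries sum to 1."""
    result = {}
    for primary, attached_list in includes.items():
        raw = [sum(col) for col in zip(*(weights[y] for y in attached_list))]
        total = sum(raw)
        result[primary] = [Fraction(v, total) for v in raw]
    return result

def center(memberships):
    """Assumed rule: component-wise mean of the normalized membership vectors."""
    n = len(memberships)
    return [sum(col) / n for col in zip(*memberships.values())]

def norm_of_difference(p, q):
    """Euclidean norm of p - q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

members = normalized_memberships(includes, weights)
p1 = center(members)
p0 = [Fraction(1), Fraction(0)]  # previous center assumed in the text
change = norm_of_difference(p1, p0)
```

Exact fractions are used so that the center vector can be compared with 2/5 and 3/5 without rounding error; only the final square root is a floating-point value.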

In the second iteration, the membership vector of each primary attribute for each category is taken as the new first distribution of each primary attribute of the multi-dimensional data over the categories for the new round of clustering. According to the degree of correlation between each primary attribute and all attached attributes and the new first distribution, the second distribution of each attached attribute of the multi-dimensional data over the categories is determined, and from it the weight of each attached attribute in each category.

The membership vector of each primary attribute of the multi-dimensional data for each category is then determined again and normalized.

According to the resulting membership vectors, the center vector of the current cluster center is determined and compared with the center vector of the previous cluster center by calculating the norm of their difference. This norm is smaller than the set threshold, so the clustering result is judged to have converged and the clustering process ends.

The above embodiment is only an example and does not limit the specific implementation of the user clustering method of the present application. In practice, the smaller the threshold, the more accurate the user clustering result; likewise, the more clustering iterations are carried out, the more accurate the result. Taking computing time and cost into account, however, once the norm of the difference between the current and the previous center vectors falls below the set threshold, a preset number of further iterations, for example 3 to 5, may be carried out to confirm that the result stays within the threshold, after which the clustering process can be stopped.
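The stopping rule described above can be sketched as a loop. The update rules inside `iterate_once` are assumptions (weights as sums of the distributions of the including primary attributes, memberships as normalized sums of weights, center as the mean membership), and the threshold value 0.01 is a hypothetical choice, since the text does not give one. The names `iterate_once`, `run`, `extra_rounds` and `max_rounds` are likewise hypothetical.

```python
from fractions import Fraction
import math

def iterate_once(includes, dist, n_classes):
    """One clustering round under the assumed update rules."""
    weights = {}
    for primary, attached_list in includes.items():
        for attached in attached_list:
            w = weights.setdefault(attached, [Fraction(0)] * n_classes)
            for k in range(n_classes):
                w[k] += dist[primary][k]
    new_dist = {}
    for primary, attached_list in includes.items():
        raw = [sum(weights[y][k] for y in attached_list) for k in range(n_classes)]
        total = sum(raw)
        new_dist[primary] = [v / total for v in raw]
    center = [sum(v[k] for v in new_dist.values()) / len(new_dist)
              for k in range(n_classes)]
    return new_dist, center

def run(includes, first_dist, threshold, extra_rounds=3, max_rounds=100):
    """Iterate until the center moves less than `threshold`, then keep going
    for `extra_rounds` further rounds that must also stay below it."""
    n_classes = len(next(iter(first_dist.values())))
    dist = {x: [Fraction(v) for v in vec] for x, vec in first_dist.items()}
    # Previous center for the first round, as assumed in the text: [1, 0, ...].
    prev = [Fraction(1)] + [Fraction(0)] * (n_classes - 1)
    below = 0
    for rounds in range(1, max_rounds + 1):
        dist, cur = iterate_once(includes, dist, n_classes)
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(cur, prev)))
        prev = cur
        # Count consecutive rounds below the threshold; reset otherwise.
        below = below + 1 if change < threshold else 0
        if below > extra_rounds:
            break
    return dist, cur, rounds

includes = {
    "X1": ["Y1", "Y2"], "X2": ["Y2", "Y3"], "X3": ["Y1", "Y3"],
    "X4": ["Y4"], "X5": ["Y5"],
}
first_dist = {"X1": [1, 0], "X2": [1, 0], "X3": [0, 1], "X4": [0, 1], "X5": [0, 1]}

final_dist, final_center, rounds = run(includes, first_dist, threshold=0.01)
```

With this example the center first falls below the threshold in the second round, and the loop stops after three further confirming rounds. Requiring consecutive rounds below the threshold guards against stopping on a single small step.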

It should be emphasized that the advantages of this scheme become more apparent as the amount of user data requested by the user clustering request grows and the types of user data become more complex.

Specifically, after the clustering result is judged to have converged and the clustering process ends, the method may further comprise:

Taking the membership vector of each primary attribute of the multi-dimensional data for each category in the current clustering round as the attribution probability of that primary attribute for each category, and sorting the primary attributes within each category according to their attribution probabilities.

Following the steps above, the membership vector of each primary attribute of the multi-dimensional data for each category can be determined.

For X₁, X₂ and X₃, the two components of the membership vector are the attribution probabilities for d₁ and d₂ respectively;

X₄ belongs to d₁ with probability 0 and to d₂ with probability 1;

X₅ belongs to d₁ with probability 0 and to d₂ with probability 1.

Sorting within each category by attribution probability, the order from high to low in d₁ is X₁, X₂, and the order from high to low in d₂ is X₄, X₅, X₃.
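The sorting step can be sketched as follows. The probabilities for X₁, X₂ and X₃ are hypothetical placeholders chosen only for illustration, because the values are not reproduced in the text; only the values for X₄ and X₅ come from the text. Placing each primary attribute in the category where its probability is highest is also an assumption, and `rank_by_category` is a hypothetical name.

```python
def rank_by_category(probabilities, categories):
    """Place each primary attribute in the category with its highest
    attribution probability (assumed rule), then sort each category
    from high to low probability."""
    buckets = {c: [] for c in categories}
    for attr, probs in probabilities.items():
        best = max(range(len(categories)), key=lambda k: probs[k])
        buckets[categories[best]].append((attr, probs[best]))
    # sorted() is stable, so equal probabilities keep their input order.
    return {
        c: [attr for attr, _ in sorted(items, key=lambda item: -item[1])]
        for c, items in buckets.items()
    }

probabilities = {
    "X1": (0.7, 0.3),  # hypothetical placeholder
    "X2": (0.6, 0.4),  # hypothetical placeholder
    "X3": (0.4, 0.6),  # hypothetical placeholder
    "X4": (0.0, 1.0),  # from the text
    "X5": (0.0, 1.0),  # from the text
}

ranking = rank_by_category(probabilities, ["d1", "d2"])
```

With these placeholder values the ranking has the same shape as the ordering stated in the text: X₁, X₂ in d₁ and X₄, X₅, X₃ in d₂.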

After sorting, the degree of influence of each primary attribute on a category can also be estimated from its attribution probability in that category: the higher the attribution probability, the greater the influence of the primary attribute on that category. Follow-up work can then be carried out according to the degree of influence. For example, in the social media field, friends can be recommended to other users better and more comprehensively; in the commodity transaction field, commodities can be recommended to shoppers better and more comprehensively; in the science and technology field, experts in the field can be recommended to users better and more comprehensively.

By applying the embodiments of the present invention, user data can be analyzed from multiple dimensions and angles, making the analysis more comprehensive; and by applying a fuzzy clustering algorithm, users can be clustered more accurately, with the result less affected by the initial classification.

Corresponding to the above method embodiment, an embodiment of the present invention provides a user clustering device, shown in Figure 2, which is applied to a cluster server. The device may comprise: a cluster request receiving module 201, a multi-dimensional data acquisition module 202, a degree of correlation acquisition module 203 and a fuzzy clustering module 204.

Cluster request receiving module 201: configured to receive a cluster request and collect user data according to the cluster request, wherein the cluster request carries the category of the user data to be collected.

Multi-dimensional data acquisition module 202: configured to process the collected user data according to a preset rule to obtain the primary attribute and attached attributes of each user data item; determine all attached attributes from the primary attributes and attached attributes obtained; and obtain, according to all attached attributes, the multi-dimensional data corresponding to each primary attribute. The primary attribute comprises a user ID, and the attached attributes comprise the information related to that user obtained from each user data item. The multi-dimensional data indicates, for each attached attribute, whether or not the primary attribute has it.

Specifically, when the multi-dimensional data acquisition module processes the collected user data according to the preset rule to obtain the primary attribute and attached attributes of each user data item, it performs word segmentation, filtering of useless words and removal of illegal characters on the collected user data; each user data item yields one unique primary attribute and at least one attached attribute.

Degree of correlation acquisition module 203: configured to obtain the degree of correlation between each primary attribute and all attached attributes according to the multi-dimensional data corresponding to each primary attribute.

Fuzzy clustering module 204: configured to perform fuzzy clustering according to the degree of correlation between each primary attribute and all attached attributes to obtain a clustering result.

The fuzzy clustering module 204 comprises a distribution determination submodule 2041 and a clustering result obtaining submodule 2042 (not shown in the figure).

The distribution determination submodule 2041 is specifically configured to: classify all primary attributes according to a preset classification rule to obtain a first distribution of each primary attribute over the categories; and determine, according to the degree of correlation between each primary attribute and all attached attributes and the first distribution, a second distribution of each attached attribute of the multi-dimensional data over the categories, wherein the classification ensures that each category contains at least one primary attribute;

The clustering result obtaining submodule 2042 is specifically configured to: apply a preset fuzzy clustering algorithm iteratively according to the second distribution to obtain the clustering result of the users.

Specifically, after determining the distribution of each attached attribute of the multi-dimensional data over the categories, the distribution determination submodule 2041 may further comprise:

Applying the preset fuzzy clustering algorithm iteratively according to the second distribution to obtain the clustering result of the users may be:

Specifically, the clustering result obtaining submodule 2042 may comprise: a membership vector determination submodule, a center vector determination submodule, a comparison submodule, a first decision submodule and a second decision submodule (not shown in the figure).

Specifically, a sorting module (not shown in the figure) may also be included:

Because the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant details, see the description of the method embodiment.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device comprising that element.

A person of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be carried out by hardware under the instruction of a program, and the program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A user clustering method, characterized in that it is applied to a cluster server and comprises the steps of:

2. The method according to claim 1, characterized in that processing the collected user data according to the preset rule to obtain the primary attribute and attached attributes of each user data item comprises:

3. The method according to claim 1, characterized in that, after the second distribution of each attached attribute of the multi-dimensional data over the categories is determined, the method further comprises:

4. The method according to claim 3, characterized in that applying the preset fuzzy clustering algorithm iteratively, according to the weight of each attached attribute of the multi-dimensional data in each category, to obtain the clustering result of the users comprises:

5. The method according to claim 4, characterized in that, after the clustering result is judged to have converged and the clustering process ends, the method further comprises:

6. A user clustering device, characterized in that it is applied to a cluster server and comprises:

7. The device according to claim 6, characterized in that, when the multi-dimensional data acquisition module processes the collected user data according to the preset rule to obtain the primary attribute and attached attributes of each user data item, it performs word segmentation, filtering of useless words and removal of illegal characters on the collected user data; each user data item yields one unique primary attribute and at least one attached attribute.

8. The device according to claim 6, characterized in that the distribution determination submodule, after determining the distribution of each attached attribute of the multi-dimensional data over the categories, further comprises:

9. The device according to claim 8, characterized in that the clustering result obtaining submodule comprises: a membership vector determination submodule, a center vector determination submodule, a comparison submodule, a first decision submodule and a second decision submodule,

10. The device according to claim 6, characterized in that it further comprises a sorting module: