CN106055699A - Method and device for feature clustering

Info

Publication number: CN106055699A
Application number: CN201610421683.7A
Authority: CN
Inventors: 陈明星; 陈谦; 万伟
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2016-06-15
Filing date: 2016-06-15
Publication date: 2016-10-26
Anticipated expiration: 2036-06-15
Also published as: CN106055699B

Abstract

The invention discloses a method for feature clustering. The method comprises: obtaining account information and attribute information corresponding to the account information; preprocessing the account information and the corresponding attribute information to obtain model input data; processing the model input data with a topic model algorithm to obtain the probability of each topic contained in the account information, wherein the probability of each topic corresponds to one feature; and clustering the features contained in the account information with a clustering algorithm. The method provided by the embodiments of the invention clusters the account information and its corresponding attribute information by means of topic probabilities, which avoids a lengthy feature exploration process and mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.

Description

Method and device for feature clustering

Technical field

The present invention relates to the field of computer technology, and in particular to a method and device for feature clustering.

Background

With the rapid development of Internet technology, there are more and more kinds of applications on the network. Taking social networking applications as an example, current social networking applications not only provide online communication between users but can also push various types of content to users.

For example, various types of public numbers (official accounts) can be opened in a social networking application, and a user can subscribe by following the public numbers he or she likes. When a new article is published under a public number, the article is pushed to the subscribing user, which helps the user read new articles in time.

Because one public number can be subscribed to by many users, and one user can subscribe to multiple public numbers, the public numbers or the users usually need to be clustered in order to better analyze the user group of each public number, or users' preferences among public numbers.

Clustering methods in the prior art usually set features of different dimensions for each sample. However, setting features of different dimensions generally requires domain knowledge of the corresponding field, feature exploration is a very long process, and there may be many features. This easily leads to the curse of dimensionality and makes feature clustering inefficient.

Summary of the invention

To solve the prior-art problem that performing feature clustering by setting features of different dimensions leads to low clustering efficiency, embodiments of the present invention provide a method for feature clustering that clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering. Embodiments of the present invention also provide a corresponding clustering device.

A first aspect of the present invention provides a method for feature clustering, including:

obtaining account information and attribute information corresponding to the account information;

preprocessing the account information and the attribute information corresponding to the account information to obtain model input data;

processing the model input data with a topic model algorithm to obtain the probability of each topic contained in the account information, the probability of each topic corresponding to one feature;

clustering the features contained in the account information with a clustering algorithm.

A second aspect of the present invention provides a device for feature clustering, including:

an acquiring unit, configured to obtain account information and attribute information corresponding to the account information;

a preprocessing unit, configured to preprocess the account information obtained by the acquiring unit and the attribute information corresponding to the account information to obtain model input data;

a processing unit, configured to process the model input data obtained by the preprocessing unit with a topic model algorithm to obtain the probability of each topic contained in the account information, the probability of each topic corresponding to one feature;

a clustering unit, configured to cluster, with a clustering algorithm, the features contained in the account information obtained by the processing unit.

Compared with the prior art, in which feature clustering is performed by setting features of different dimensions and is therefore inefficient, the method for feature clustering provided by the embodiments of the present invention clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an embodiment of the method for feature clustering in the embodiments of the present invention;

Fig. 2 is a schematic diagram of another embodiment of the method for feature clustering in the embodiments of the present invention;

Fig. 3 is a schematic diagram of another embodiment of the method for feature clustering in the embodiments of the present invention;

Fig. 4 is a schematic diagram of an embodiment of the device for feature clustering in the embodiments of the present invention;

Fig. 5 is a schematic diagram of an embodiment of a server in the embodiments of the present invention.

Detailed description of the invention

Embodiments of the present invention provide a method for feature clustering that clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering. Embodiments of the present invention also provide a corresponding clustering device. Each is described in detail below.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

To make the embodiments of the present invention easier to understand, the terms used in them are briefly introduced below.

Account information: information used to represent an account, which may include network virtual service accounts and user registration accounts on a network virtual service platform.

Network virtual service account: a public number registered on a network virtual service platform.

User registration account: a user's account in a social networking application.

Attribute information corresponding to the account information: information that forms a tree structure with the account information.

For example, in the embodiments of the present invention, when the account information is a network virtual service account, the attribute information corresponding to the network virtual service account is information about the users subscribed to that network virtual service account, including user accounts.

When the account information is a user registration account on the network virtual service platform, the attribute information corresponding to the user registration account is the network virtual service accounts followed by that user account.

Topic model algorithm: for example, Latent Dirichlet Allocation (LDA). A topic model, as the name suggests, is a method for modeling the topics implicit in text, and can be expressed by the formula:

$$p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \times p(\text{topic} \mid \text{document})$$

The above formula is expressed in terms of documents, where p(word | document) is the probability that each word occurs in each document, p(word | topic) is the probability that each word occurs in each topic, and p(topic | document) is the probability that each topic occurs in each document.

In matrix form, the above model can also be written as C = Φ × Θ.

Here C, Φ and Θ are matrices. Taking articles as an example, C holds the probability that each word occurs in each document, namely p(word | document); Φ holds the probability that each word occurs in each topic, p(word | topic); and Θ holds the probability that each topic occurs in each document, p(topic | document).
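The matrix form C = Φ × Θ can be checked on a small example. The sketch below is only an illustration: the matrix sizes (3 words, 2 topics, 2 documents), all probability values, and the helper name `matmul` are assumptions and do not come from the embodiments.

```python
# Hypothetical toy example: 3 words, 2 topics, 2 documents.
# phi[w][t] = p(word w | topic t); each column sums to 1.
phi = [
    [0.5, 0.1],
    [0.3, 0.1],
    [0.2, 0.8],
]
# theta[t][d] = p(topic t | document d); each column sums to 1.
theta = [
    [0.9, 0.25],
    [0.1, 0.75],
]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# c[w][d] = p(word w | document d) = sum over topics of phi[w][t] * theta[t][d]
c = matmul(phi, theta)

# Each column of c is a probability distribution over words for one document.
column_sums = [sum(row[d] for row in c) for d in range(len(c[0]))]
```

Because every column of Φ and Θ sums to 1, every column of C also sums to 1, which is what a per-document word distribution requires.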

A topic is a conditional probability distribution over the words of the vocabulary, and the probability of each topic corresponds to one feature. For example, in one scenario, p(notebook | Baidu) = 0.000001 and p(notebook | Lenovo) = 0.2; the feature corresponding to 0.000001 is then Baidu, and the feature corresponding to 0.2 is Lenovo.

Feature clustering: grouping similar features into one class.

The clustering process may be as follows. First, k objects are arbitrarily selected from n data objects as the initial cluster centers, where k is less than n. Each remaining object is then assigned, according to its similarity (distance) to these cluster centers, to the cluster (represented by its cluster center) that it is most similar to. The mean of all objects in each cluster is then computed to obtain the new cluster center. This process is repeated until the criterion function converges. The mean squared error is generally used as the criterion function.

The k clusters have the following characteristics: each cluster is itself as compact as possible, and the clusters are as separated from one another as possible.

Expressed mathematically:

Step 1: Input: k, data[n];

Step 2: Select k initial center points, for example c[0] = data[0], ..., c[k-1] = data[k-1];

Step 3: Compare each of data[0], ..., data[n-1] with c[0], ..., c[k-1]; if the difference from c[i] is the smallest, label the point i;

Step 4: For all points labeled i, recompute c[i] = (sum of all data[j] labeled i) / (number of points labeled i);

Repeat Steps 3 and 4 until the change in every c[i] is less than a given threshold.
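Steps 1 to 4 above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the embodiments: the function name `kmeans`, the parameters `threshold` and `max_iter`, the sample points, and the use of squared Euclidean distance as the "difference" are all assumptions.

```python
def kmeans(data, k, threshold=1e-6, max_iter=100):
    """Cluster a list of equal-length numeric vectors into k groups."""
    # Step 2: take the first k points as the initial centres.
    centres = [list(p) for p in data[:k]]
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Step 3: label each point with the index of its nearest centre.
        for j, p in enumerate(data):
            labels[j] = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centres[i])),
            )
        # Step 4: move each centre to the mean of the points labelled with it.
        shift = 0.0
        for i in range(k):
            members = [p for p, lab in zip(data, labels) if lab == i]
            if not members:
                continue  # an empty cluster keeps its previous centre
            new = [sum(col) / len(members) for col in zip(*members)]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, centres[i])))
            centres[i] = new
        # Stop once no centre coordinate moved by more than the threshold.
        if shift < threshold:
            break
    return labels, centres


# Step 1: input k and the data points (hypothetical values).
points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
labels, centres = kmeans(points, k=2)
```

Taking the first k points as initial centres follows Step 2 literally; in practice the result can depend on this choice, so other initialisations may be preferable.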

The above introduces the terms used in the embodiments of the present invention. The embodiments of the method for feature clustering are described below with reference to the accompanying drawings.

It should be noted that the device that implements feature clustering in the embodiments of the present invention may be a single physical machine, a cluster formed by multiple physical machines, or multiple virtual machines partitioned from physical resources. A server is one form of physical machine.

Fig. 1 is a schematic diagram of an embodiment of the method for feature clustering in the embodiments of the present invention.

As shown in Fig. 1, an embodiment of the method for feature clustering provided by the embodiments of the present invention includes:

101. Obtain account information and attribute information corresponding to the account information.

When the account information is a network virtual service account, the attribute information corresponding to the account information may be the user registration accounts that follow the network virtual service account.

For example, when the network virtual service account is a public number, the attribute information corresponding to the account information may be the user registration accounts subscribed to the public number. The attribute information is not limited to these user registration accounts and may also include the number of subscribers, the number of active users, the number of interacting fans, and so on.

When the account information is a user registration account on a network virtual service platform, the attribute information corresponding to the account information may be the network virtual service accounts subscribed to by the user registration account.

For example, the attribute information may be the public numbers followed by the user registration account, which can be looked up in the list of public numbers subscribed to by each user, collected from the public number platform. The attribute information is not limited to the public numbers followed by the user and may also include the number of upstream messages the user sends to each WeChat public number, the number of payments, the number of article views, the number of menu clicks, and so on.

102. Preprocess the account information and the attribute information corresponding to the account information to obtain model input data.

Preprocessing may include generating the correspondence table between the account information and the attribute information, and filtering the data.

103. Process the model input data with a topic model algorithm to obtain the probability of each topic contained in the account information, the probability of each topic corresponding to one feature.

Processing the model input data with a topic model algorithm may use the formula

$$p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \times p(\text{topic} \mid \text{document})$$

or the formula C = Φ × Θ to process the model input data and obtain the probability of each topic, thereby determining the feature corresponding to each topic.

104. Cluster the features contained in the account information with a clustering algorithm.

For the clustering process, refer to the description in the terminology section:

Step 1: Input: k, data[n];

Feature clustering is achieved through this process, except that in the embodiments of the present invention the input data is account information.

Optionally, on the basis of the above embodiment, in another embodiment of the method for feature clustering provided by the embodiments of the present invention, the account information is a network virtual service account, and preprocessing the account information and the attribute information corresponding to the account information to obtain model input data may include:

preprocessing the network virtual service account and the attribute information corresponding to the network virtual service account to obtain model input data.

Further, preprocessing the network virtual service account and the attribute information corresponding to the network virtual service account to obtain model input data may include:

generating a correspondence between the network virtual service account and the user registration accounts subscribed to the network virtual service account;

filtering out correspondences in which the user registration account does not satisfy a preset condition.

In the embodiments of the present invention, the correspondence between network virtual service accounts and user registration accounts may be represented in the form of a relation list.

As shown in Table 1, the correspondence table between a public number and the registered users subscribed to it may be:

As shown in Table 1, the correspondence between the public number "knows force of labor" and the registered users who follow it can be represented in the form of Table 1. Table 1 is of course only an example; in practice, most public numbers are followed by a large number of registered users.

In addition, it should be noted that "fans" in the embodiments of the present invention also refers to registered users. Some passages use "fans" and others use "registered users" merely to suit the specific scenario, and the two terms should not be understood differently.

The process of the method for feature clustering provided by the embodiments of the present invention when the account information is a network virtual service account is described below with reference to Fig. 2.

As shown in Fig. 2, taking public numbers as an example, another embodiment of the method for feature clustering provided by the embodiments of the present invention includes:

201. Collect each public number, and the attribute information corresponding to each public number, from the public number platform.

The attribute information corresponding to a public number includes the user registration accounts subscribed to it, and also includes, but is not limited to, data reflecting the scale of the public number, such as the number of subscribers, the number of active users, and the number of interacting users.

202. Preprocess the user data under each public number.

Preprocessing includes generating preprocessed data Data, whose format may be: public number t, followed by the list of user registration accounts corresponding to that public number.
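As a rough illustration of this format, the sketch below groups raw subscription records into one list of user registration accounts per public number. The record layout, the identifiers such as `pub_a` and `user_1`, the function name `build_data`, and the removal of duplicate records are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical (public number, user registration account) subscription records.
subscriptions = [
    ("pub_a", "user_1"),
    ("pub_a", "user_2"),
    ("pub_b", "user_2"),
    ("pub_a", "user_1"),  # duplicate record, dropped below (an assumption)
]


def build_data(pairs):
    """Map each public number to the sorted list of accounts subscribed to it."""
    grouped = defaultdict(set)
    for public_number, user in pairs:
        grouped[public_number].add(user)
    return {pub: sorted(users) for pub, users in grouped.items()}


data = build_data(subscriptions)
```

Each entry of `data` then plays the role of one "document" for the topic model, with the subscribed accounts acting as its "words".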

After the list of user registration accounts for each public number is generated, the data in the list is filtered and cleaned.

Filtering and cleaning is performed from two perspectives: that of the public numbers and that of the users.

From the point of view of statistical distribution, extremely large and extremely small values in a data set are generally unsuitable for statistics, so cleaning requires removing them. The embodiments of the present invention list two schemes for cleaning out extremely large and extremely small data.

Filtering and cleaning from the perspective of the public numbers is introduced first.

Filtering and cleaning from the perspective of the public numbers means filtering out public numbers with especially many users and public numbers with especially few users. The two filtering schemes are:

First: filter out public numbers whose number of registered users is greater than a first threshold U, and filter out public numbers whose number of registered users is less than a second threshold B.

Second: compute the distribution of the number of registered users across public numbers, and filter out public numbers above the 95th percentile (or another percentile) and public numbers below the 5th percentile (or another percentile). A percentile refers to the position of a data value within the statistical distribution.
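The second scheme can be sketched as follows. The embodiments do not say how percentiles are computed, so the nearest-rank definition used here is an assumption, as are the function names and the synthetic data in which `pub_i` has exactly i registered users.

```python
def percentile(sorted_values, q):
    """Nearest-rank percentile (integer q in 1..100) of an ascending list."""
    n = len(sorted_values)
    rank = max(1, (q * n + 99) // 100)  # ceiling of q * n / 100
    return sorted_values[rank - 1]


def filter_by_percentile(data, low=5, high=95):
    """Keep public numbers whose user count lies within the two percentiles."""
    counts = sorted(len(users) for users in data.values())
    lo, hi = percentile(counts, low), percentile(counts, high)
    return {pub: users for pub, users in data.items() if lo <= len(users) <= hi}


# Hypothetical data: pub_1 has 1 user, pub_2 has 2 users, ..., pub_100 has 100.
data = {f"pub_{i}": [f"user_{j}" for j in range(i)] for i in range(1, 101)}
kept = filter_by_percentile(data)
```

With these 100 synthetic public numbers, the four smallest and the five largest fall outside the 5th to 95th percentile range and are removed.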

Filtering and cleaning from the perspective of the users is described next.

Filtering and cleaning from the perspective of the users likewise means filtering out of the data set the users who subscribe to especially few public numbers and the users who subscribe to especially many. The two filtering schemes are:

First: filter out users whose number of subscribed public numbers is less than one threshold (for example, 5) or greater than another threshold (for example, 100000).

Second: compute the distribution of the number of public numbers users subscribe to, and filter out users above the 95th percentile (or another percentile) and users below the 5th percentile (or another percentile).
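The first, threshold-based scheme for users might look like the sketch below. It assumes the data is held as a mapping from public number to a list of distinct user registration accounts; the function name `filter_users` and the sample data are hypothetical, and the default thresholds simply reuse the example values 5 and 100000 given above.

```python
from collections import Counter


def filter_users(data, min_subs=5, max_subs=100000):
    """Drop users whose subscription count lies outside [min_subs, max_subs]."""
    # Count how many public numbers each user appears under.
    subs_per_user = Counter(user for users in data.values() for user in users)
    return {
        pub: [u for u in users if min_subs <= subs_per_user[u] <= max_subs]
        for pub, users in data.items()
    }


# Hypothetical data: u2 subscribes to three public numbers, u1 and u3 to one.
data = {"pub_a": ["u1", "u2"], "pub_b": ["u2"], "pub_c": ["u2", "u3"]}
cleaned = filter_users(data, min_subs=2)
```

A lower threshold of 2 is used in the usage lines only so that the tiny sample shows both kept and removed users.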

203. Perform topic learning with a topic model algorithm to obtain the probability distribution of each public number over the topics.

Topic learning may use a topic model that supports distributed computing, such as LightLDA, or a deep learning model.

204. Output the topic probability distribution result of each public number from step 203.

After the topic probability distribution results of the public numbers are output, they are evaluated manually, and step 203 is optimized iteratively by continually adjusting the model parameters so that the final result is as close to ideal as possible.

The final data format is: public number t, topic 1: probability value 1, topic 2: probability value 2, ..., topic N: probability value N.
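One way to render a record in this layout is sketched below. The exact separators, the four-decimal precision, and the function name `format_topic_line` are assumptions; the embodiments only fix the order of the fields.

```python
def format_topic_line(account, probabilities):
    """Render one account's topic distribution as 'topic i: p' fields."""
    parts = [f"topic {i}: {p:.4f}" for i, p in enumerate(probabilities, start=1)]
    return f"{account} " + " ".join(parts)


# Hypothetical public number with a distribution over three topics.
line = format_topic_line("pub_a", [0.7, 0.2, 0.1])
```

The list of probabilities in such a record is the feature vector that the subsequent clustering step operates on.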

205. Based on the distribution over topics of each public number output in step 204, where each topic corresponds to one feature, perform feature clustering on the public numbers with a clustering algorithm.

Steps 201 to 205 above describe the feature clustering process for public numbers. The public numbers in the embodiments of the present invention may be WeChat public numbers or public numbers in other social networking applications.

Optionally, on the basis of the above embodiment, in another embodiment of the method for feature clustering provided by the embodiments of the present invention, the account information is a user registration account on a network virtual service platform, and preprocessing the account information and the attribute information corresponding to the account information to obtain model input data includes:

preprocessing the user registration account and the attribute information corresponding to the user registration account to obtain model input data.

Further, preprocessing the user registration account and the attribute information corresponding to the user registration account to obtain model input data may include:

generating a correspondence between the user registration account and the network virtual service accounts subscribed to by the user registration account;

filtering out correspondences in which the network virtual service account does not satisfy a preset condition.

In the embodiments of the present invention, the correspondence between a user registration account and the subscribed network virtual service accounts may be represented in the form of a relation list.

As shown in Table 2, the correspondence table between a user registration account and the subscribed public numbers may be:

As shown in Table 2, the correspondence between the user registration account 13415666333 and the subscribed public numbers can be represented in the form of Table 2. Table 2 is of course only an example; in practice, this user may have subscribed to more public numbers.

During feature clustering, attention is paid to the similarity between the public numbers.

The process of the method for feature clustering provided by the embodiments of the present invention when the account information is a user registration account on a network virtual service platform is described below with reference to Fig. 3.

As shown in Fig. 3, another embodiment of the method for feature clustering provided by the embodiments of the present invention includes:

301. Collect the list of public numbers subscribed to by each user from the public number platform.

In addition to the public number lists, statistical indicators for the public numbers that each registered user subscribes to may also be collected monthly, including the number of upstream messages the user sends to each WeChat public number, the number of payments, the number of article views, the number of menu clicks, and so on.

302. Preprocess the data of each user registration account.

Preprocessing includes generating data Data in the format: user registration account t, followed by the list of public numbers it subscribes to.

After the public number list of each user is generated, the process of filtering and cleaning the data in the list may be:

Filtering and cleaning from the perspective of the users is described below.

303. Perform topic learning with a topic model algorithm to obtain the probability distribution of each public number over the topics.

304. Output the topic probability distribution result of each public number from step 303.

During model optimization, in addition to subscription relationships, clustering may also be based on interaction relationships between registered users and public numbers. An interaction relationship is defined as some indicators, such as the number of upstream messages, the number of payments, the number of article views, or the number of menu clicks, reaching a certain value. The latent semantic topic distribution corresponding to each user is output in the format: registered user t, topic 1: probability value 1, topic 2: probability value 2, ..., topic N: probability value N.
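The interaction relationship described above could be tested with a simple predicate such as the one below. The embodiments give no concrete values, so every threshold here is hypothetical, as are the indicator key names and the function name; treating the relationship as present when any single indicator reaches its threshold is likewise only one possible reading of "some indicators".

```python
# Hypothetical per-indicator minimums; the source gives no concrete values.
THRESHOLDS = {
    "upstream_messages": 3,
    "payments": 1,
    "article_views": 10,
    "menu_clicks": 5,
}


def has_interaction(stats, thresholds=THRESHOLDS):
    """Return True if at least one indicator reaches its minimum value."""
    return any(stats.get(name, 0) >= minimum for name, minimum in thresholds.items())


# Hypothetical monthly statistics for two (user, public number) pairs.
active = has_interaction({"upstream_messages": 0, "article_views": 12})
inactive = has_interaction({"upstream_messages": 1, "menu_clicks": 2})
```

Pairs for which the predicate holds could then replace, or be combined with, plain subscription pairs when building the model input data.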

305. Based on the distribution over topics of each public number output in step 304, where each topic corresponds to one feature, perform feature clustering on the public numbers with a clustering algorithm.

Steps 301 to 305 above describe the feature clustering process for public numbers. The public numbers in the embodiments of the present invention may be WeChat public numbers or public numbers in other social networking applications.

In the clustering method provided by the embodiments of the present invention, the text data involved in clustering includes, but is not limited to, text-related feature data built from information such as nicknames, profiles, signatures, and articles.

The topic model algorithm used includes, but is not limited to, latent semantic models such as deep learning models and topic models, and may also include various algorithms, such as singular value decomposition (SVD), that perform clustering according to latent semantic information.

In addition, in the embodiments described with reference to Fig. 2 and Fig. 3, the relationship between public numbers and registered users may be replaced by other relationships, such as but not limited to the click relationship between a WeChat public number and its articles, the forwarding relationship of WeChat public number articles, and the relationships among users of a WeChat public number.

In summary, the beneficial effects of the method for feature clustering provided by the embodiments of the present invention may include:

One: a lengthy feature exploration process is avoided, and the problem of excessive feature dimensions is mitigated.

Two: a distributed topic model effectively supports large-scale clustering needs.

Three: by clustering WeChat public numbers or fan users, subsequent mining work can use a single topic to stand in for individual WeChat public numbers or user data, which effectively alleviates data sparsity in the long tail.

Four: WeChat public number clustering results have many potential uses, including recommending similar WeChat public numbers, recommending WeChat public number articles, and WeChat public number advertising.

The above describes the method for feature clustering. The device 40 for feature clustering in the embodiments of the present invention is described below.

Fig. 4 is a schematic diagram of an embodiment of the device 40 for feature clustering in the embodiments of the present invention.

Referring to Fig. 4, an embodiment of the device 40 for feature clustering provided by the embodiments of the present invention includes:

an acquiring unit 401, configured to obtain account information and attribute information corresponding to the account information;

a preprocessing unit 402, configured to preprocess the account information obtained by the acquiring unit 401 and the attribute information corresponding to the account information to obtain model input data;

a processing unit 403, configured to process the model input data obtained by the preprocessing unit 402 with a topic model algorithm to obtain the probability of each topic contained in the account information, the probability of each topic corresponding to one feature;

a clustering unit 404, configured to cluster, with a clustering algorithm, the features contained in the account information obtained by the processing unit 403.

In the embodiments of the present invention, the acquiring unit 401 obtains account information and attribute information corresponding to the account information; the preprocessing unit 402 preprocesses the account information obtained by the acquiring unit 401 and the corresponding attribute information to obtain model input data; the processing unit 403 processes the model input data obtained by the preprocessing unit 402 with a topic model algorithm to obtain the probability of each topic contained in the account information, the probability of each topic corresponding to one feature; and the clustering unit 404 clusters, with a clustering algorithm, the features contained in the account information obtained by the processing unit 403. Compared with the prior art, in which feature clustering is performed by setting features of different dimensions and is therefore inefficient, the device for feature clustering provided by the embodiments of the present invention clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.

Optionally, on the basis of the above embodiment of the feature clustering device 40, in another embodiment of the feature clustering device 40 provided by the embodiment of the present invention,

The preprocessing unit is configured to, when the account information is a network virtual service account, preprocess the network virtual service account and the attribute information corresponding to the network virtual service account, to obtain model input data.

Further, the preprocessing unit is configured to:

The preprocessing unit is configured to, when the account information is a user registered account in a network virtual service platform, preprocess the user registered account and the attribute information corresponding to the user registered account, to obtain model input data.
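As a minimal sketch of such preprocessing, the correspondence between user registered accounts and the network virtual service accounts they subscribe to (as recited in the claims) could be built from subscription records. The record layout as `(user_account, service_account)` pairs, the function name `build_model_input`, and all identifiers below are assumptions made for illustration.

```python
from collections import defaultdict

def build_model_input(subscriptions):
    """Group (key_account, related_account) pairs into one list per key account."""
    documents = defaultdict(list)
    for key_account, related_account in subscriptions:
        documents[key_account].append(related_account)
    return dict(documents)

# Hypothetical subscription records; all identifiers are made up.
subscriptions = [
    ("user_a", "svc_sports"),
    ("user_a", "svc_news"),
    ("user_b", "svc_sports"),
    ("user_c", "svc_cooking"),
]

# One entry per user registered account, listing its subscribed service accounts.
user_documents = build_model_input(subscriptions)

# Swapping each pair gives the service-account view: one entry per
# network virtual service account, listing the users that subscribe to it.
service_documents = build_model_input((s, u) for u, s in subscriptions)
```

Either mapping can then serve as model input data, with the key account playing the role of a document and the related accounts playing the role of its words.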

Further, the preprocessing unit is configured to:

The above feature clustering device may be implemented by a server. The process by which a server implementing the feature clustering device performs clustering is described below with reference to Fig. 5.

Fig. 5 is a schematic structural diagram of the server 50 provided by the embodiment of the present invention. The server 50 includes a processor 510, a memory 550 and a transceiver 530. The memory 550 may include read-only memory and random access memory, and provides operation instructions and data to the processor 510. A part of the memory 550 may also include non-volatile random access memory (NVRAM).

In some embodiments, the memory 550 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof:

In the embodiment of the present invention, by calling the operation instructions stored in the memory 550 (the operation instructions may be stored in an operating system),

Compared with the prior art, in which feature clustering is performed by setting features of different dimensions and is therefore inefficient, the server provided by the embodiment of the present invention can cluster the account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.
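To illustrate clustering by topic probability, the following sketch groups accounts whose topic probability vectors are close to one another. The text only specifies "a clustering algorithm"; k-means with a deterministic farthest-point initialisation is assumed here, and the function names and sample probabilities are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Cluster named vectors with plain k-means; returns {name: cluster_index}."""
    names = sorted(vectors)
    # Deterministic farthest-point initialisation: start from the first
    # vector, then repeatedly add the vector farthest from the chosen centroids.
    centroids = [list(vectors[names[0]])]
    while len(centroids) < k:
        farthest = max(
            names,
            key=lambda n: min(squared_distance(vectors[n], c) for c in centroids),
        )
        centroids.append(list(vectors[farthest]))

    labels = {}
    for _ in range(iterations):
        # Assignment step: nearest centroid for each vector.
        labels = {
            n: min(range(k), key=lambda j: squared_distance(vectors[n], centroids[j]))
            for n in names
        }
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [vectors[n] for n in names if labels[n] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical per-account topic probabilities (two topics).
topic_probs = {
    "user_a": [0.90, 0.10],
    "user_b": [0.85, 0.15],
    "user_c": [0.10, 0.90],
    "user_d": [0.20, 0.80],
}
clusters = kmeans(topic_probs, k=2)
```

Because each account is represented by only as many features as there are topics, the distance computations stay small even when the underlying attribute vocabulary is large.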

The processor 510 controls the operation of the server 50; the processor 510 may also be referred to as a central processing unit (CPU). The memory 550 may include read-only memory and random access memory, and provides instructions and data to the processor 510. A part of the memory 550 may also include non-volatile random access memory (NVRAM). In a specific application, the components of the server 50 are coupled together by a bus system 520, where the bus system 520 may include, in addition to a data bus, a power bus, a control bus, a status signal bus and the like. For clarity of description, however, the various buses are all denoted as the bus system 520 in the figure.

The method disclosed in the above embodiment of the present invention may be applied to the processor 510, or implemented by the processor 510. The processor 510 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 510 or by instructions in the form of software. The processor 510 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps and logic block diagrams disclosed in the embodiment of the present invention. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in the embodiment of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 550; the processor 510 reads the information in the memory 550 and completes the steps of the above method in combination with its hardware.

Optionally, the processor 510 is configured to:

When the account information is a user registered account in a network virtual service platform, preprocess the user registered account and the attribute information corresponding to the user registered account, to obtain model input data.

Further, the processor 510 is configured to:

Optionally, the processor 510 is configured to:

Further, the processor 510 is configured to:

The above server 50 can be understood with reference to the description of Fig. 1 to Fig. 3, and details are not repeated here.

A person of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, a magnetic disk, an optical disc, or the like.

The method and device for feature clustering provided by the embodiment of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, a person of ordinary skill in the art may, according to the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for feature clustering, characterized by comprising:

Preprocessing the account information and the attribute information corresponding to the account information, to obtain model input data;

Processing the model input data by using a topic model algorithm, to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature;

2. The method according to claim 1, characterized in that the account information is a network virtual service account, and the preprocessing of the account information and the attribute information corresponding to the account information to obtain model input data comprises:

Preprocessing the network virtual service account and the attribute information corresponding to the network virtual service account, to obtain model input data.

3. The method according to claim 1, characterized in that the account information is a user registered account in a network virtual service platform, and the preprocessing of the account information and the attribute information corresponding to the account information to obtain model input data comprises:

Preprocessing the user registered account and the attribute information corresponding to the user registered account, to obtain model input data.

4. The method according to claim 2, characterized in that the preprocessing of the network virtual service account and the attribute information corresponding to the network virtual service account to obtain model input data comprises:

Generating a correspondence between the network virtual service account and the user registered accounts that subscribe to the network virtual service account;

5. The method according to claim 3, characterized in that the preprocessing of the user registered account and the attribute information corresponding to the user registered account to obtain model input data comprises:

Generating a correspondence between the user registered account and the network virtual service accounts subscribed to by the user registered account;

6. A device for feature clustering, characterized by comprising:

A preprocessing unit, configured to preprocess the account information obtained by the acquiring unit and the attribute information corresponding to the account information, to obtain model input data;

A processing unit, configured to process, by using a topic model algorithm, the model input data obtained by the preprocessing unit, to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature;

A clustering unit, configured to cluster, by using a clustering algorithm, the features contained in the account information that are obtained by the processing unit.

7. The device according to claim 6, characterized in that

The preprocessing unit is configured to, when the account information is a network virtual service account, preprocess the network virtual service account and the attribute information corresponding to the network virtual service account, to obtain model input data.

8. The device according to claim 6, characterized in that

The preprocessing unit is configured to, when the account information is a user registered account in a network virtual service platform, preprocess the user registered account and the attribute information corresponding to the user registered account, to obtain model input data.

9. The device according to claim 7, characterized in that

The preprocessing unit is configured to:

10. The device according to claim 8, characterized in that

The preprocessing unit is configured to: