CN106055699A - Method and device for feature clustering - Google Patents
Method and device for feature clustering
- Publication number
- CN106055699A CN201610421683.7A
- Authority
- CN
- China
- Prior art keywords
- account
- user
- virtual service
- network virtual
- attribute information
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention discloses a method for feature clustering. The method comprises the following steps: obtaining account information and attribute information corresponding to the account information; preprocessing the account information and the corresponding attribute information to obtain model input data; processing the model input data with a topic model algorithm to obtain the probability of each topic contained in the account information, wherein the probability of each topic corresponds to one feature; and clustering the features contained in the account information with a clustering algorithm. The method provided by the embodiment of the invention clusters the account information and its corresponding attribute information by means of topic probabilities, which avoids a lengthy feature exploration process and mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.
Description
Technical field
The present invention relates to the field of computer technology, and specifically to a method and device for feature clustering.
Background art
With the rapid development of Internet technology, the variety of applications on the network keeps growing. Taking social networking applications as an example, current social networking applications not only provide online communication between users but can also push various types of content to users. For example, various types of public accounts (also rendered "public numbers") can be opened in a social networking application, and a user can subscribe to the public accounts he or she likes by following them. When a new article is published under a public account, the article is pushed to the subscribed user, which helps the user read new articles promptly.
Because one public account can be subscribed to by many users, and one user can subscribe to multiple public accounts, it is usually necessary to cluster the public accounts or the users in order to better analyze the user group of each public account or the users' preferences among public accounts.
Clustering methods in the prior art typically define features of different dimensions for each sample. However, defining these features usually requires extensive knowledge of the relevant domain, feature exploration is a very lengthy process, and the number of features may be large. This easily leads to the curse of dimensionality and makes feature clustering inefficient.
Summary of the invention
To solve the problem in the prior art that performing feature clustering by defining features of different dimensions makes feature clustering inefficient, the embodiment of the present invention provides a method for feature clustering that clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering. The embodiment of the present invention also provides a corresponding clustering device.
A first aspect of the present invention provides a method for feature clustering, including:
Obtaining account information and attribute information corresponding to the account information;
Preprocessing the account information and the corresponding attribute information to obtain model input data;
Processing the model input data with a topic model algorithm to obtain the probability of each topic contained in the account information, the probability of each topic corresponding to one feature;
Clustering the features contained in the account information with a clustering algorithm.
A second aspect of the present invention provides a device for feature clustering, including:
An acquiring unit, configured to obtain account information and attribute information corresponding to the account information;
A preprocessing unit, configured to preprocess the account information obtained by the acquiring unit and the corresponding attribute information to obtain model input data;
A processing unit, configured to process the model input data obtained by the preprocessing unit with a topic model algorithm to obtain the probability of each topic contained in the account information, the probability of each topic corresponding to one feature;
A clustering unit, configured to cluster, with a clustering algorithm, the features contained in the account information obtained by the processing unit.
Compared with the prior art, in which performing feature clustering by defining features of different dimensions makes feature clustering inefficient, the method for feature clustering provided by the embodiment of the present invention clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an embodiment of the method for feature clustering in the embodiment of the present invention;
Fig. 2 is a schematic diagram of another embodiment of the method for feature clustering in the embodiment of the present invention;
Fig. 3 is a schematic diagram of another embodiment of the method for feature clustering in the embodiment of the present invention;
Fig. 4 is a schematic diagram of an embodiment of the device for feature clustering in the embodiment of the present invention;
Fig. 5 is a schematic diagram of an embodiment of the server in the embodiment of the present invention.
Detailed description of the invention
The embodiment of the present invention provides a method for feature clustering that clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering. The embodiment of the present invention also provides a corresponding clustering device. Each is described in detail below.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
To make the embodiments of the present invention easier to understand, the terms involved are briefly introduced below.
Account information: information that represents an account. It may include a network virtual service account, a user-registered account on the network virtual service platform, and the like.
Network virtual service account: a public account (also rendered "public number") registered on a network virtual service platform.
User-registered account: the account of a user of the social networking application.
Attribute information corresponding to account information: information that stands in a tree-structured relationship with the account information.
For example, in the embodiment of the present invention, when the account information is a network virtual service account, the attribute information corresponding to the network virtual service account is the information of the users subscribed to it, including their user accounts.
When the account information is a user-registered account on the network virtual service platform, the attribute information corresponding to the user-registered account is the network virtual service accounts followed by that user account.
Topic model algorithm (Latent Dirichlet Allocation, abbreviated "LDA"): as the name suggests, a topic model is a method for modeling the topics implicit in text. A topic model can be expressed by the formula $p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \times p(\text{topic} \mid \text{document})$.
The above formula is expressed in terms of documents, where p(word | document) is the probability that each word occurs in each document, p(word | topic) is the probability that each word occurs in each topic, and p(topic | document) is the probability that each topic occurs in each document.
In matrix form, the model can also be written as C = Φ * Θ.
Here C, Φ and Θ are matrices. Taking articles as an example, C holds the probability of each word occurring in each document, i.e. p(word | document); Φ holds the probability p(word | topic) of each word occurring in each topic; and Θ holds the probability p(topic | document) of each topic occurring in each document.
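The matrix form C = Φ * Θ can be checked on toy numbers. The following is a minimal sketch, assuming Φ is laid out as words × topics and Θ as topics × documents; the function name `mix_topics` and all numeric values are hypothetical and do not come from the patent.

```python
def mix_topics(phi, theta):
    """Return C with C[w][d] = sum over topics t of phi[w][t] * theta[t][d]."""
    n_topics = len(theta)
    n_docs = len(theta[0])
    return [
        [sum(row[t] * theta[t][d] for t in range(n_topics)) for d in range(n_docs)]
        for row in phi
    ]

# Hypothetical toy model: 3 words, 2 topics, 2 documents.
phi = [
    [0.7, 0.1],   # p(word 0 | topic 0), p(word 0 | topic 1)
    [0.2, 0.3],
    [0.1, 0.6],
]                 # each column sums to 1 (a distribution over words)
theta = [
    [0.9, 0.25],  # p(topic 0 | doc 0), p(topic 0 | doc 1)
    [0.1, 0.75],
]                 # each column sums to 1 (a distribution over topics)

c = mix_topics(phi, theta)  # c[w][d] approximates p(word w | document d)
```

Because every column of Φ and Θ is a probability distribution, every column of C is one as well, which is a quick sanity check on the layout chosen here.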
A topic is the conditional probability distribution of words over the vocabulary, and the probability of each topic corresponds to one feature. For example, in one scenario, p(notebook | Baidu) = 0.000001 and p(notebook | Lenovo) = 0.2; the feature corresponding to 0.000001 is Baidu, and the feature corresponding to 0.2 is Lenovo.
Feature clustering: grouping similar features into one class.
The clustering process may be as follows: first, k objects are arbitrarily selected from n data objects as the initial cluster centers, where k is less than n; each of the remaining objects is then assigned to the cluster (represented by its cluster center) that it is most similar to, according to its similarity (distance) to the cluster centers; next, the mean of all objects in each cluster is computed to obtain the new cluster center. This process is repeated until the criterion function converges. The mean squared error is generally used as the criterion function.
The k clusters have the following characteristics: each cluster is itself as compact as possible, and the clusters are separated from each other as much as possible.
Expressed mathematically, this can be:
Step 1: input k and data[n];
Step 2: select k initial center points, for example c[0] = data[0], ..., c[k-1] = data[k-1];
Step 3: compare each of data[0], ..., data[n-1] with c[0], ..., c[k-1]; if the difference from c[i] is the smallest, label the point as i;
Step 4: for all points labeled i, recompute c[i] = {sum of all data[j] labeled i} / {number of points labeled i};
Repeat steps 3 and 4 until the change of every c value is less than a given threshold.
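The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the patent's implementation: the function names, the squared Euclidean distance, the `max_iter` safeguard and the sample points are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, threshold=1e-6, max_iter=100):
    """Cluster `data` (a list of equal-length numeric tuples) into k groups."""
    centers = [list(p) for p in data[:k]]  # Step 2: first k points as initial centers
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Step 3: label each point with the index of its nearest center.
        for j, p in enumerate(data):
            labels[j] = min(range(k), key=lambda i: sq_dist(p, centers[i]))
        # Step 4: recompute each center as the mean of the points labeled i.
        shift = 0.0
        for i in range(k):
            members = [data[j] for j in range(len(data)) if labels[j] == i]
            if not members:
                continue  # an empty cluster keeps its previous center
            new_center = [sum(col) / len(members) for col in zip(*members)]
            shift = max(shift, sq_dist(new_center, centers[i]))
            centers[i] = new_center
        # Stop once no center moved by more than the threshold (squared distance).
        if shift < threshold:
            break
    return labels, centers

# Hypothetical input: four accounts described by two topic probabilities each.
points = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels, centers = kmeans(points, k=2)
```

Taking the first k points as initial centers mirrors Step 2; in practice a randomized initialization is often preferred because the result depends on the starting centers.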
The above introduces the terms involved in the embodiment of the present invention. Embodiments of the method for feature clustering are described below with reference to the accompanying drawings.
It should be noted that the device implementing feature clustering in the embodiment of the present invention may be a single physical machine, a cluster formed by multiple physical machines, or multiple virtual machines partitioned from physical resources. A server is one form of physical machine.
Fig. 1 is a schematic diagram of an embodiment of the method for feature clustering in the embodiment of the present invention.
As shown in Fig. 1, an embodiment of the method for feature clustering provided by the embodiment of the present invention includes:
101. Obtain account information and attribute information corresponding to the account information.
When the account information is a network virtual service account, the corresponding attribute information may be the user-registered accounts that follow the network virtual service account.
For example, when the network virtual service account is a public account, the corresponding attribute information may be the user-registered accounts that subscribe to the public account. The attribute information is of course not limited to these accounts; it may also include the number of subscribers, the number of active users, the number of interacting fans, and the like.
When the account information is a user-registered account on the network virtual service platform, the corresponding attribute information may be the network virtual service accounts subscribed to by the user-registered account.
For example, it may be the public accounts followed by the user-registered account, which can be looked up in the list of public accounts subscribed to by each user, collected from the public account platform. The attribute information is of course not limited to the public accounts followed by the user; it may also include the number of upstream messages the user sends to each WeChat public account, the number of payments, the number of article views, the number of menu clicks, and the like.
102. Preprocess the account information and the corresponding attribute information to obtain model input data.
Preprocessing may consist of generating the format that relates the account information to the attribute information, and filtering the data.
103. Process the model input data with a topic model algorithm to obtain the probability of each topic contained in the account information, the probability of each topic corresponding to one feature.
The model input data may be processed using the formula $p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \times p(\text{topic} \mid \text{document})$ or the formula C = Φ * Θ to obtain the probability of each topic and thereby determine the feature corresponding to each topic.
104. Cluster the features contained in the account information with a clustering algorithm.
The clustering process follows the description in the terminology section:
Step 1: input k and data[n];
Step 2: select k initial center points, for example c[0] = data[0], ..., c[k-1] = data[k-1];
Step 3: compare each of data[0], ..., data[n-1] with c[0], ..., c[k-1]; if the difference from c[i] is the smallest, label the point as i;
Step 4: for all points labeled i, recompute c[i] = {sum of all data[j] labeled i} / {number of points labeled i};
Repeat steps 3 and 4 until the change of every c value is less than a given threshold.
Feature clustering is achieved through this process; in the embodiment of the present invention, the input data is the account information.
Compared with the prior art, in which performing feature clustering by defining features of different dimensions makes feature clustering inefficient, the method for feature clustering provided by the embodiment of the present invention clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.
Optionally, on the basis of the above embodiment, in another embodiment of the method for feature clustering provided by the embodiment of the present invention, the account information is a network virtual service account, and preprocessing the account information and the corresponding attribute information to obtain model input data may include:
Preprocessing the network virtual service account and the attribute information corresponding to the network virtual service account to obtain model input data.
Further, preprocessing the network virtual service account and its corresponding attribute information to obtain model input data may include:
Generating the correspondence between the network virtual service account and the user-registered accounts that subscribe to it;
Filtering out the correspondences in which the user-registered account does not satisfy a preset condition.
In the embodiment of the present invention, the correspondence between a network virtual service account and user-registered accounts may be represented in the form of a relation list.
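One way to picture such a relation list is a mapping from each account to the list of users that subscribe to it. The sketch below is only an illustration under assumptions: the input is taken to be a flat list of (user, account) subscription pairs, and `build_relations` and the sample identifiers are hypothetical names, not part of the patent.

```python
from collections import defaultdict

def build_relations(subscriptions):
    """Group (user, account) pairs into account -> list of subscribing users."""
    account_to_users = defaultdict(list)
    for user, account in subscriptions:
        account_to_users[account].append(user)
    return dict(account_to_users)

# Hypothetical subscription records.
subscriptions = [
    ("user_a", "account_x"),
    ("user_b", "account_x"),
    ("user_a", "account_y"),
]
relations = build_relations(subscriptions)
```

Swapping the two elements of each pair before grouping would give the opposite view, from each user to the accounts that user subscribes to.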
As shown in Table 1, the correspondence table between a public account and the registered users who subscribe to it may be:
As shown in Table 1, the correspondence between the public account "knows force of labor" and the registered users who follow it may be represented in a form such as Table 1. Table 1 is of course only an example; in practice, most public accounts are followed by a large number of registered users.
In addition, it should be noted that "fans" in the embodiment of the present invention also refers to registered users. Some passages use "fans" and others use "registered users"; this is merely colloquial wording suited to the specific scenario, and registered users and fans should not be understood as different things.
The process of the method for feature clustering provided by the embodiment of the present invention when the account information is a network virtual service account is described below with reference to Fig. 2.
As shown in Fig. 2, taking public accounts as an example, another embodiment of the method for feature clustering provided by the embodiment of the present invention includes:
201. Collect each public account and the attribute information corresponding to each public account from the public account platform.
The attribute information corresponding to a public account includes the user-registered accounts subscribing to it, and also includes, but is not limited to, data reflecting the scale of the public account, such as the number of subscribers, the number of active users and the number of interacting users.
202. Preprocess the user data under each public account.
Preprocessing includes generating the preprocessed data Data, whose format may be: public account t, followed by the list of user-registered accounts under that public account.
After the list of user-registered accounts of each public account is generated, the data in the list is filtered and cleaned:
Filtering and cleaning need to be done from two angles: on the one hand from the angle of public accounts, and on the other hand from the angle of users.
From the viewpoint of statistical distribution, extremely large and extremely small values in a data set are unsuitable for statistics, so cleaning needs to remove them from the data set. The embodiment of the present invention gives two schemes for removing extremely large and extremely small values:
Filtering and cleaning from the angle of public accounts is introduced first.
This means filtering out public accounts with extremely many users and public accounts with extremely few users. The two filtering schemes are:
The first: filter out public accounts whose number of registered users is greater than a first threshold U, and public accounts whose number of registered users is less than a second threshold B.
The second: compute the distribution of the number of registered users across public accounts, and filter out public accounts above the 95th percentile (or another percentile) and below the 5th percentile (or another percentile). A percentile is the position of a value within the statistical distribution of the data.
Filtering and cleaning from the angle of users is described next.
This likewise means filtering out of the data set the users who subscribe to extremely few public accounts and the users who subscribe to extremely many. The two filtering schemes are:
The first: filter out users whose number of subscribed public accounts is less than a certain threshold (for example, 5) or greater than a certain threshold (for example, 100000).
The second: compute the distribution of the number of public accounts subscribed to by users, and filter out users above the 95th percentile (or another percentile) and below the 5th percentile (or another percentile).
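Both filtering schemes can be sketched as functions over a mapping from each key (a public account or a user) to its list of related items. This is a hedged illustration: the function names, the nearest-rank percentile definition and the sample data are assumptions, since the patent does not specify how the percentile is computed.

```python
def filter_by_threshold(relations, lower, upper):
    """Scheme 1: keep entries whose list size lies within [lower, upper]."""
    return {key: items for key, items in relations.items()
            if lower <= len(items) <= upper}

def nearest_rank(sorted_counts, q):
    """q-th percentile by the nearest-rank method (an assumed definition)."""
    rank = max(1, -(-q * len(sorted_counts) // 100))  # integer ceil(q * n / 100)
    return sorted_counts[rank - 1]

def filter_by_percentile(relations, low_q=5, high_q=95):
    """Scheme 2: drop entries below the low_q or above the high_q percentile."""
    counts = sorted(len(items) for items in relations.values())
    low = nearest_rank(counts, low_q)
    high = nearest_rank(counts, high_q)
    return {key: items for key, items in relations.items()
            if low <= len(items) <= high}

# Hypothetical data: 20 accounts with 1, 2, ..., 20 subscribers.
relations = {f"acc{n}": [f"u{i}" for i in range(n)] for n in range(1, 21)}
by_threshold = filter_by_threshold(relations, lower=5, upper=15)
by_percentile = filter_by_percentile(relations)
```

The same two functions apply unchanged to a user-to-accounts mapping, which covers filtering from the angle of users.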
203. Perform topic learning with a topic model algorithm to obtain the probability distribution of each public account over the topics.
Topic learning may use lightLDA, a topic model supporting distributed computation, or a deep learning model.
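To make the idea of topic learning concrete, the following is a minimal single-machine sketch of LDA trained by collapsed Gibbs sampling. It is not lightLDA and not the patent's implementation. It assumes each public account is treated as a "document" whose "words" are integer ids of its subscribers; the function name, the hyperparameters `alpha` and `beta`, and the toy data are all hypothetical.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns p(topic | document) per document."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # tokens of doc d in topic t
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # times word w is in topic t
    topic_total = [0] * n_topics                               # tokens assigned to topic t
    assign = []
    for d, doc in enumerate(docs):                             # random initial assignment
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]                               # remove the current assignment
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Conditional probability of each topic for this token (unnormalized).
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = t                               # record the new assignment
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    # Smoothed estimate of p(topic | document) for every document.
    return [
        [(row[k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for row, doc in zip(doc_topic, docs)
    ]

# Hypothetical input: 4 accounts, subscribers encoded as ids 0..5.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 1, 2, 0], [4, 3, 5, 4, 3]]
theta = lda_gibbs(docs, n_topics=2, vocab_size=6)
```

Each row of `theta` is a probability vector over topics for one account and can serve directly as that account's feature vector in the clustering step.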
204. Output the topic probability distribution of each public account obtained in step 203.
After the topic probability distribution of each public account is output, it is evaluated manually, and step 203 is iteratively optimized by continually adjusting the model parameters so that the final result is as close to ideal as possible.
The final data format is: public account t, topic 1: probability 1, topic 2: probability 2, ..., topic N: probability N.
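A record in this format could be produced as sketched below. The tab separator, the four-decimal precision and the function name `format_topic_line` are assumptions for illustration; the patent only fixes the order of the fields.

```python
def format_topic_line(account, probs):
    """Render one account's topic distribution as a single text record."""
    parts = [f"topic {i}: {p:.4f}" for i, p in enumerate(probs, start=1)]
    return f"{account}\t" + " ".join(parts)

# Hypothetical account with a two-topic distribution.
line = format_topic_line("account_x", [0.9, 0.1])
```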
205. For the topic distribution of each public account output in step 204, each topic corresponds to one feature; a clustering algorithm is then used to perform feature clustering on the public accounts.
Steps 201-205 above describe the feature clustering process with reference to public accounts. The public account in the embodiment of the present invention may be a WeChat public account or a public account in another social networking application.
Optionally, on the basis of the above embodiment, in another embodiment of the method for feature clustering provided by the embodiment of the present invention, the account information is a user-registered account on the network virtual service platform, and preprocessing the account information and the corresponding attribute information to obtain model input data includes:
Preprocessing the user-registered account and the attribute information corresponding to the user-registered account to obtain model input data.
Further, preprocessing the user-registered account and its corresponding attribute information to obtain model input data may include:
Generating the correspondence between the user-registered account and the network virtual service accounts it subscribes to;
Filtering out the correspondences in which the network virtual service account does not satisfy a preset condition.
In the embodiment of the present invention, the correspondence between a user-registered account and the network virtual service accounts it subscribes to may be represented in the form of a relation list.
As shown in Table 2, the correspondence table between a user-registered account and the public accounts it subscribes to may be:
As shown in Table 2, the correspondence between the user-registered account 13415666333 and the public accounts it subscribes to may be represented in a form such as Table 2. Table 2 is of course only an example; in practice, this user may have subscribed to more public accounts.
During feature clustering, attention is paid to the similarity between the public accounts.
The process of the method for feature clustering provided by the embodiment of the present invention when the account information is a user-registered account on the network virtual service platform is described below with reference to Fig. 3.
As shown in Fig. 3, another embodiment of the method for feature clustering provided by the embodiment of the present invention includes:
301. Collect the list of public accounts subscribed to by each user from the public account platform.
In addition to the public account list, monthly statistical indicators for the public accounts that each registered user subscribes to may also be collected, including the number of upstream messages the user sends to each WeChat public account, the number of payments, the number of article views, the number of menu clicks, and the like.
302. Preprocess the data of each user-registered account.
Preprocessing includes generating the data Data, whose format is: user-registered account t, followed by the list of public accounts it subscribes to.
After the public account list of each user is generated, the data in the list may be filtered and cleaned as follows:
Filtering and cleaning from the angle of public accounts is introduced first.
This means filtering out public accounts with extremely many users and public accounts with extremely few users. The two filtering schemes are:
The first: filter out public accounts whose number of registered users is greater than a first threshold U, and public accounts whose number of registered users is less than a second threshold B.
The second: compute the distribution of the number of registered users across public accounts, and filter out public accounts above the 95th percentile (or another percentile) and below the 5th percentile (or another percentile). A percentile is the position of a value within the statistical distribution of the data.
Filtering and cleaning from the angle of users is described next.
This likewise means filtering out of the data set the users who subscribe to extremely few public accounts and the users who subscribe to extremely many. The two filtering schemes are:
The first: filter out users whose number of subscribed public accounts is less than a certain threshold (for example, 5) or greater than a certain threshold (for example, 100000).
The second: compute the distribution of the number of public accounts subscribed to by users, and filter out users above the 95th percentile (or another percentile) and below the 5th percentile (or another percentile).
303. Perform topic learning with a topic model algorithm to obtain the probability distribution of each public account over the topics.
Topic learning may use lightLDA, a topic model supporting distributed computation, or a deep learning model.
304. Output the topic probability distribution of each public account obtained in step 303.
During optimization of the model's effect, clustering may be based not only on the subscription relation but also on the interaction relation between registered users and public accounts. An interaction relation is defined as some of the indicator counts, such as the number of upstream messages, the number of payments, the number of article views and the number of menu clicks, reaching a certain value. The latent semantic topic distribution corresponding to each user is output in the format: registered user t, topic 1: probability 1, topic 2: probability 2, ..., topic N: probability N.
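The interaction relation defined above can be pictured as a simple predicate over per-user counters. The sketch below is an assumption-laden illustration: the field names, the single shared cutoff `min_count` and the rule that any one indicator reaching it is enough are all hypothetical, since the patent does not state the value or how the indicators are combined.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """Counters for one (registered user, public account) pair."""
    upstream_messages: int = 0
    payments: int = 0
    article_views: int = 0
    menu_clicks: int = 0

def has_interaction(stats, min_count=3):
    """True if any indicator reaches the (assumed) cutoff."""
    counts = (stats.upstream_messages, stats.payments,
              stats.article_views, stats.menu_clicks)
    return any(c >= min_count for c in counts)

# Hypothetical pairs: one active reader, one near-silent subscriber.
active = has_interaction(Interaction(article_views=5))
silent = has_interaction(Interaction(menu_clicks=1))
```

Pairs for which the predicate holds could then replace, or be added to, the subscription pairs used as model input.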
305. For the topic distribution of each public account output in step 304, each topic corresponds to one feature; a clustering algorithm is then used to perform feature clustering on the public accounts.
Steps 301-305 above describe the feature clustering process with reference to public accounts. The public account in the embodiment of the present invention may be a WeChat public account or a public account in another social networking application.
In the clustering method provided by the embodiment of the present invention, the text data involved in clustering includes, but is not limited to, text-structure-related feature data such as nicknames, profiles, signatures and articles.
The topic model algorithm used includes, but is not limited to, latent semantic models of various kinds, such as deep learning models and topic models; it may also include algorithms such as singular value decomposition (abbreviated "SVD") that perform identification and clustering according to latent semantic information.
In addition, in the embodiments described with reference to Fig. 2 and Fig. 3, the relation between public accounts and registered users may be replaced by other relations, for example but not limited to the relation between a WeChat public account and its articles that are clicked, the forwarding relation of WeChat public account articles, the relation among WeChat public account users, and the like.
Above, the method for the feature clustering that the embodiment of the present invention is provided, produced beneficial effect may include that
One, very long Exploration on Characteristics process can be effectively prevent, moreover it is possible to effectively reduce the problem that characteristic dimension is too much.
Two: utilize distributed topic model effectively to support large-scale clustered demand.
Three: by wechat public number or vermicelli user are clustered, can use same in follow-up excacation
The individual wechat public number of individual theme agency or user data, the most effectively solve long-tail part Sparse Problem.
Four: wechat public number cluster result has the place of a lot of potential use, including the recommendation of similar wechat public number, wechat
The fields such as the recommendation of public number article, wechat public number advertisement broadcasting.
The above describes the feature clustering method. The feature clustering device 20 in the embodiment of the present invention is described below.
Fig. 4 is a schematic diagram of an embodiment of the feature clustering device 20 in the embodiment of the present invention.
Referring to Fig. 4, an embodiment of the feature clustering device 40 provided by the embodiment of the present invention includes:
an acquiring unit 401, configured to acquire account information and attribute information corresponding to the account information;
a preprocessing unit 402, configured to preprocess the account information acquired by the acquiring unit 401 and the attribute information corresponding to the account information, to obtain model input data;
a processing unit 403, configured to process, using a topic model algorithm, the model input data obtained by the preprocessing unit 402, to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature;
a clustering unit 404, configured to cluster, using a clustering algorithm, the features contained in the account information obtained by the processing unit 403.
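The division into four units can be sketched structurally as a small pipeline. This is a hedged illustration, not the patented implementation: the class name `FeatureClusteringDevice` and its field names are hypothetical, and the stand-in callables in the usage part are deliberately trivial placeholders (count normalisation instead of a real topic model, dominant-topic grouping instead of a real clustering algorithm).

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FeatureClusteringDevice:
    """Hypothetical wiring of the four units; each field is a pluggable callable."""
    acquire: Callable[[], Any]          # role of acquiring unit 401
    preprocess: Callable[[Any], Any]    # role of preprocessing unit 402
    topic_model: Callable[[Any], Any]   # role of processing unit 403
    cluster: Callable[[Any], Any]       # role of clustering unit 404

    def run(self):
        raw = self.acquire()
        model_input = self.preprocess(raw)
        topic_probs = self.topic_model(model_input)
        return self.cluster(topic_probs)


# Stand-ins for demonstration only; none of them is a real topic model
# or clustering algorithm, and the counts are invented toy values.
device = FeatureClusteringDevice(
    acquire=lambda: {"acct_1": [3, 1], "acct_2": [0, 4], "acct_3": [5, 0]},
    preprocess=lambda raw: {a: v for a, v in raw.items() if sum(v) > 0},
    topic_model=lambda data: {a: [x / sum(v) for x in v] for a, v in data.items()},
    cluster=lambda probs: {
        a: max(range(len(p)), key=p.__getitem__) for a, p in probs.items()
    },
)
result = device.run()

if __name__ == "__main__":
    print(result)
```

Keeping each unit behind a callable means the topic model or the clustering algorithm can be exchanged without touching the other units, which matches the text's statement that several algorithms may be used.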
In the embodiment of the present invention, the acquiring unit 401 acquires account information and attribute information corresponding to the account information; the preprocessing unit 402 preprocesses the account information acquired by the acquiring unit 401 and the corresponding attribute information to obtain model input data; the processing unit 403 uses a topic model algorithm to process the model input data obtained by the preprocessing unit 402, obtaining the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature; and the clustering unit 404 uses a clustering algorithm to cluster the features contained in the account information obtained by the processing unit 403. In the prior art, feature clustering is performed by setting features along different dimensions, which makes feature clustering inefficient. By comparison, the feature clustering device provided by the embodiment of the present invention clusters the account information and its corresponding attribute information by way of topic probabilities, which not only effectively avoids a lengthy feature exploration process but also effectively reduces the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.
Optionally, on the basis of the above embodiment of the feature clustering device 40, in another embodiment of the feature clustering device 40 provided by the embodiment of the present invention,
the preprocessing unit is configured to, when the account information is a network virtual service account, preprocess the network virtual service account and the attribute information corresponding to the network virtual service account, to obtain model input data.
Further, the preprocessing unit is configured to:
generate a correspondence between the network virtual service account and the user registered accounts that subscribe to the network virtual service account;
filter out the correspondences in which the user registered account does not satisfy a preset condition.
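The two preprocessing operations above can be sketched as follows. This passage does not say what the filtering condition is, so the rule used here (keep only users who subscribe to at least `min_subscriptions` accounts) is a hypothetical stand-in, as are the function names and the toy subscription pairs.

```python
from collections import defaultdict


def build_account_to_users(subscriptions):
    """Map each service account to the set of user accounts subscribing to it.

    `subscriptions` is an iterable of (service_account, user_account) pairs.
    """
    mapping = defaultdict(set)
    for account, user in subscriptions:
        mapping[account].add(user)
    return dict(mapping)


def filter_users(mapping, min_subscriptions=2):
    """Drop correspondences whose user fails the (assumed) condition."""
    # Count how many service accounts each user subscribes to.
    counts = defaultdict(int)
    for users in mapping.values():
        for user in users:
            counts[user] += 1
    filtered = {}
    for account, users in mapping.items():
        kept = {u for u in users if counts[u] >= min_subscriptions}
        if kept:
            filtered[account] = kept
    return filtered


# Invented toy data for illustration.
pairs = [
    ("news", "u1"), ("news", "u2"),
    ("sports", "u1"), ("sports", "u3"),
    ("music", "u1"),
]
mapping = build_account_to_users(pairs)
filtered = filter_users(mapping, min_subscriptions=2)

if __name__ == "__main__":
    print(filtered)
```

The mirrored case, where the user registered account is the clustered entity and service accounts are filtered, follows by swapping the two elements of each pair before calling the same functions.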
Optionally, on the basis of the above embodiment of the feature clustering device 40, in another embodiment of the feature clustering device 40 provided by the embodiment of the present invention,
the preprocessing unit is configured to, when the account information is a user registered account in a network virtual service platform, preprocess the user registered account and the attribute information corresponding to the user registered account, to obtain model input data.
Further, the preprocessing unit is configured to:
generate a correspondence between the user registered account and the network virtual service accounts subscribed to by the user registered account;
filter out the correspondences in which the network virtual service account does not satisfy a preset condition.
The above feature clustering device may be implemented by a server. The process by which the feature clustering device, as implemented by a server, performs clustering is described below with reference to Fig. 5.
Fig. 5 is a schematic structural diagram of the server 50 provided by the embodiment of the present invention. The server 50 includes a processor 510, a memory 550 and a transceiver 530. The memory 550 may include read-only memory and random access memory, and provides operation instructions and data to the processor 510. A portion of the memory 550 may also include non-volatile random access memory (NVRAM).
In some embodiments, the memory 550 stores the following elements, executable modules or data structures, or a subset or superset thereof:
In the embodiment of the present invention, by calling the operation instructions stored in the memory 550 (the operation instructions may be stored in an operating system), the following operations are performed:
acquiring account information and attribute information corresponding to the account information;
preprocessing the account information and the attribute information corresponding to the account information, to obtain model input data;
processing the model input data using a topic model algorithm, to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature;
clustering, using a clustering algorithm, the features contained in the account information.
In the prior art, feature clustering is performed by setting features along different dimensions, which makes feature clustering inefficient. By comparison, the server provided by the embodiment of the present invention clusters the account information and its corresponding attribute information by way of topic probabilities, which not only effectively avoids a lengthy feature exploration process but also effectively reduces the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.
The processor 510 controls the operation of the server 50; the processor 510 may also be referred to as a central processing unit (CPU). The memory 550 may include read-only memory and random access memory, and provides instructions and data to the processor 510. A portion of the memory 550 may also include non-volatile random access memory (NVRAM). In a specific application, the components of the server 50 are coupled together by a bus system 520, which, in addition to a data bus, may also include a power bus, a control bus, a status signal bus and so on. For clarity of description, however, the various buses are all labeled as the bus system 520 in the figure.
The method disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 510. The processor 510 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 510 or by instructions in the form of software. The processor 510 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 550; the processor 510 reads the information in the memory 550 and completes the steps of the above method in combination with its hardware.
Optionally, the processor 510 is configured to:
when the account information is a user registered account in a network virtual service platform, preprocess the user registered account and the attribute information corresponding to the user registered account, to obtain model input data.
The processor 510 is further configured to:
generate a correspondence between the network virtual service account and the user registered accounts that subscribe to the network virtual service account;
filter out the correspondences in which the user registered account does not satisfy a preset condition.
Optionally, the processor 510 is configured to:
when the account information is a user registered account in a network virtual service platform, preprocess the user registered account and the attribute information corresponding to the user registered account, to obtain model input data.
The processor 510 is further configured to:
generate a correspondence between the user registered account and the network virtual service accounts subscribed to by the user registered account;
filter out the correspondences in which the network virtual service account does not satisfy a preset condition.
The above server 50 can be understood with reference to the description of Fig. 1 to Fig. 3, and is not described in further detail here.
A person of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, a magnetic disk, an optical disc, or the like.
The method and device for feature clustering provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, a person of ordinary skill in the art may, according to the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
1. A method for feature clustering, characterized by comprising:
acquiring account information and attribute information corresponding to the account information;
preprocessing the account information and the attribute information corresponding to the account information, to obtain model input data;
processing the model input data using a topic model algorithm, to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature;
clustering, using a clustering algorithm, the features contained in the account information.
2. The method according to claim 1, characterized in that the account information is a network virtual service account, and the preprocessing of the account information and the attribute information corresponding to the account information to obtain model input data comprises:
preprocessing the network virtual service account and the attribute information corresponding to the network virtual service account, to obtain model input data.
3. The method according to claim 1, characterized in that the account information is a user registered account in a network virtual service platform, and the preprocessing of the account information and the attribute information corresponding to the account information to obtain model input data comprises:
preprocessing the user registered account and the attribute information corresponding to the user registered account, to obtain model input data.
4. The method according to claim 2, characterized in that the preprocessing of the network virtual service account and the attribute information corresponding to the network virtual service account to obtain model input data comprises:
generating a correspondence between the network virtual service account and the user registered accounts that subscribe to the network virtual service account;
filtering out the correspondences in which the user registered account does not satisfy a preset condition.
5. The method according to claim 3, characterized in that the preprocessing of the user registered account and the attribute information corresponding to the user registered account to obtain model input data comprises:
generating a correspondence between the user registered account and the network virtual service accounts subscribed to by the user registered account;
filtering out the correspondences in which the network virtual service account does not satisfy a preset condition.
6. A device for feature clustering, characterized by comprising:
an acquiring unit, configured to acquire account information and attribute information corresponding to the account information;
a preprocessing unit, configured to preprocess the account information acquired by the acquiring unit and the attribute information corresponding to the account information, to obtain model input data;
a processing unit, configured to process, using a topic model algorithm, the model input data obtained by the preprocessing unit, to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature;
a clustering unit, configured to cluster, using a clustering algorithm, the features contained in the account information obtained by the processing unit.
7. The device according to claim 6, characterized in that
the preprocessing unit is configured to, when the account information is a network virtual service account, preprocess the network virtual service account and the attribute information corresponding to the network virtual service account, to obtain model input data.
8. The device according to claim 6, characterized in that
the preprocessing unit is configured to, when the account information is a user registered account in a network virtual service platform, preprocess the user registered account and the attribute information corresponding to the user registered account, to obtain model input data.
9. The device according to claim 7, characterized in that
the preprocessing unit is configured to:
generate a correspondence between the network virtual service account and the user registered accounts that subscribe to the network virtual service account;
filter out the correspondences in which the user registered account does not satisfy a preset condition.
10. The device according to claim 8, characterized in that
the preprocessing unit is configured to:
generate a correspondence between the user registered account and the network virtual service accounts subscribed to by the user registered account;
filter out the correspondences in which the network virtual service account does not satisfy a preset condition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610421683.7A CN106055699B (en) | 2016-06-15 | 2016-06-15 | Method and device for feature clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610421683.7A CN106055699B (en) | 2016-06-15 | 2016-06-15 | Method and device for feature clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106055699A (en) | 2016-10-26 |
CN106055699B (en) | 2018-07-06 |
Family
ID=57167761
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610421683.7A Active CN106055699B (en) | Method and device for feature clustering | 2016-06-15 | 2016-06-15 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106055699B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107403311A (en) * | 2017-06-27 | 2017-11-28 | 阿里巴巴集团控股有限公司 | The recognition methods of account purposes and device |
CN108287909A (en) * | 2018-01-31 | 2018-07-17 | 北京仁和汇智信息技术有限公司 | A kind of paper method for pushing and device |
TWI752485B (en) * | 2019-11-14 | 2022-01-11 | 大陸商支付寶(杭州)信息技術有限公司 | User clustering and feature learning method, device, and computer-readable medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101408901A (en) * | 2008-11-26 | 2009-04-15 | 东北大学 | Probability clustering method of cross-categorical data based on key word |
CN101770454A (en) * | 2010-02-13 | 2010-07-07 | 武汉理工大学 | Method for expanding feature space of short text |
CN102289487A (en) * | 2011-08-09 | 2011-12-21 | 浙江大学 | Network burst hotspot event detection method based on topic model |
CN104657375A (en) * | 2013-11-20 | 2015-05-27 | 中国科学院深圳先进技术研究院 | Image-text theme description method, device and system |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107403311A (en) * | 2017-06-27 | 2017-11-28 | 阿里巴巴集团控股有限公司 | The recognition methods of account purposes and device |
CN107403311B (en) * | 2017-06-27 | 2020-04-21 | 阿里巴巴集团控股有限公司 | Account use identification method and device |
CN108287909A (en) * | 2018-01-31 | 2018-07-17 | 北京仁和汇智信息技术有限公司 | A kind of paper method for pushing and device |
TWI752485B (en) * | 2019-11-14 | 2022-01-11 | 大陸商支付寶(杭州)信息技術有限公司 | User clustering and feature learning method, device, and computer-readable medium |
Also Published As
Publication number | Publication date |
---|---|
CN106055699B (en) | 2018-07-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||