CN106055699A - Method and device for feature clustering - Google Patents

Method and device for feature clustering

Info

Publication number
CN106055699A
Authority
CN
China
Prior art keywords
account
user
virtual service
network virtual
attribute information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610421683.7A
Other languages
Chinese (zh)
Other versions
CN106055699B (en
Inventor
陈明星
陈谦
万伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201610421683.7A priority Critical patent/CN106055699B/en
Publication of CN106055699A publication Critical patent/CN106055699A/en
Application granted granted Critical
Publication of CN106055699B publication Critical patent/CN106055699B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses a method for feature clustering. The method comprises: obtaining account information and attribute information corresponding to the account information; preprocessing the account information and the attribute information corresponding to the account information to obtain model input data; processing the model input data with a topic model algorithm to obtain the probability of each topic contained in the account information, wherein the probability of each topic corresponds to one feature; and clustering the features contained in the account information with a clustering algorithm. The method for feature clustering provided by the embodiments of the invention clusters the account information and its corresponding attribute information by means of topic probabilities, which avoids a lengthy feature exploration process and mitigates the problem of excessive feature dimensions, thereby improving feature clustering efficiency.

Description

Method and device for feature clustering
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for feature clustering.
Background
With the rapid development of Internet technology, the variety of applications on the network keeps growing. Taking social networking applications as an example, a current social networking application can not only provide online communication between users but also push various types of content to users.
For example, various types of public numbers can be opened in a social networking application, and a user can subscribe by following the public numbers he or she likes. When a new article is published under a public number, the article is pushed to that user, which helps the user read new articles in time.
Because one public number can be subscribed to by many users, and one user can subscribe to multiple public numbers, it is usually necessary to cluster public numbers or users in order to better analyze the user group of each public number, or users' preferences among public numbers.
Prior-art clustering methods typically set features of different dimensions for each sample. Setting these features usually requires considerable knowledge of the corresponding field, feature exploration is a very lengthy process, and the features may be numerous. This easily leads to the curse of dimensionality and makes feature clustering inefficient.
Summary of the invention
To solve the prior-art problem that performing feature clustering by setting features of different dimensions makes feature clustering inefficient, embodiments of the present invention provide a method for feature clustering that clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering. Embodiments of the present invention also provide a corresponding clustering device.
A first aspect of the present invention provides a method for feature clustering, including:
obtaining account information and attribute information corresponding to the account information;
preprocessing the account information and the attribute information corresponding to the account information to obtain model input data;
processing the model input data using a topic model algorithm to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature; and
clustering the features contained in the account information using a clustering algorithm.
A second aspect of the present invention provides a device for feature clustering, including:
an acquiring unit, configured to obtain account information and attribute information corresponding to the account information;
a preprocessing unit, configured to preprocess the account information obtained by the acquiring unit and the attribute information corresponding to the account information to obtain model input data;
a processing unit, configured to process, using a topic model algorithm, the model input data obtained by the preprocessing unit to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature; and
a clustering unit, configured to cluster, using a clustering algorithm, the features contained in the account information obtained by the processing unit.
Compared with the prior art, in which performing feature clustering by setting features of different dimensions makes feature clustering inefficient, the method for feature clustering provided by the embodiments of the present invention clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings described below show only some embodiments of the present invention, and a person skilled in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an embodiment of the method for feature clustering in the embodiments of the present invention;
Fig. 2 is a schematic diagram of another embodiment of the method for feature clustering in the embodiments of the present invention;
Fig. 3 is a schematic diagram of another embodiment of the method for feature clustering in the embodiments of the present invention;
Fig. 4 is a schematic diagram of an embodiment of the device for feature clustering in the embodiments of the present invention;
Fig. 5 is a schematic diagram of an embodiment of a server in the embodiments of the present invention.
Detailed description
Embodiments of the present invention provide a method for feature clustering that clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering. Embodiments of the present invention also provide a corresponding clustering device. Each is described in detail below.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To facilitate understanding of the embodiments of the present invention, the terms involved are briefly introduced below.
Account information: information used to represent an account. It may include a network virtual service account, a user registered account on a network virtual service platform, and so on.
Network virtual service account: a public number (official account) registered on a network virtual service platform.
User registered account: a user's account in a social networking application.
Attribute information corresponding to account information: information that has a tree-structured relationship with the account information.
For example, in the embodiments of the present invention, when the account information is a network virtual service account, the attribute information corresponding to the network virtual service account is information about the users who subscribe to that network virtual service account, including their user accounts.
When the account information is a user registered account on a network virtual service platform, the attribute information corresponding to the user registered account is the network virtual service accounts followed by that user account.
Topic model algorithm (Latent Dirichlet Allocation, "LDA"): as the name suggests, a topic model is a method for modeling the topics implicit in text. The topic model can be expressed by the formula $p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \times p(\text{topic} \mid \text{document})$.
The formula above is expressed in terms of documents, where p(word | document) is the probability that each word occurs in each document, p(word | topic) is the probability that each word occurs in each topic, and p(topic | document) is the probability that each topic occurs in each document.
In matrix form, the model formula can also be written as C = Φ × Θ.
Here C, Φ and Θ are matrices. Taking articles as an example, C represents the probability that each word occurs in each document, namely p(word | document); Φ represents the probability p(word | topic) that each word occurs in each topic; and Θ represents the probability p(topic | document) that each topic occurs in each document.
A topic is a conditional probability distribution over the words of the vocabulary, and the probability of each topic corresponds to one feature. For example, in one scenario, p(notebook | Baidu) = 0.000001 and p(notebook | Lenovo) = 0.2; the feature corresponding to 0.000001 is then Baidu, and the feature corresponding to 0.2 is Lenovo.
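As a small illustration of the matrix form C = Φ × Θ, the following sketch multiplies a toy word-by-topic matrix with a toy topic-by-document matrix. All numbers, and the helper name matmul, are made up for illustration and do not come from the patent; the only point is that each column of C is again a probability distribution over words.

```python
# Toy illustration of C = Phi x Theta; every number here is an assumption.
# phi[w][t] = p(word w | topic t): rows are words, columns are topics.
phi = [
    [0.7, 0.1],   # word 0
    [0.2, 0.3],   # word 1
    [0.1, 0.6],   # word 2
]
# theta[t][d] = p(topic t | document d): rows are topics, columns are documents.
theta = [
    [0.9, 0.2],   # topic 0
    [0.1, 0.8],   # topic 1
]


def matmul(a, b):
    """Plain matrix product: result[i][j] = sum over k of a[i][k] * b[k][j]."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# c[w][d] = p(word w | document d), summed over the topics.
c = matmul(phi, theta)
```

Because every column of phi and every column of theta sums to 1, every column of c sums to 1 as well.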
Feature clustering: grouping similar features into one class.
The clustering process may be as follows. First, k objects are arbitrarily selected from n data objects as initial cluster centers, where k is less than n. Each remaining object is assigned, according to its similarity (distance) to these cluster centers, to the cluster (represented by a cluster center) that is most similar to it. The mean of all objects in each cluster is then computed to obtain the new cluster center. This process is repeated until the criterion function converges. The mean squared error is generally used as the criterion function.
The k clusters have the following characteristics: each cluster is as compact as possible, and the clusters are as separate from one another as possible.
Expressed mathematically, this may be:
Step 1: input k and data[n];
Step 2: select k initial center points, for example c[0] = data[0], ..., c[k-1] = data[k-1];
Step 3: compare each of data[0], ..., data[n-1] with c[0], ..., c[k-1]; if the difference from c[i] is the smallest, label the point i;
Step 4: for all points labeled i, recompute c[i] = (sum of all data[j] labeled i) / (number of points labeled i);
Repeat Steps 3 and 4 until the change in every c value is less than a given threshold.
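The steps above can be sketched in a few lines of Python. This is a minimal sketch rather than the patent's implementation: the function name kmeans, the squared Euclidean distance, the tol and max_iter parameters, and the handling of empty clusters are all assumptions made for illustration.

```python
def kmeans(data, k, tol=1e-6, max_iter=100):
    """Cluster equal-length numeric vectors following Steps 1-4."""
    # Step 2: take the first k points as the initial centres.
    centres = [list(p) for p in data[:k]]
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Step 3: label each point with the index of its nearest centre
        # (squared Euclidean distance is an assumed choice of "difference").
        for j, point in enumerate(data):
            dists = [sum((x - y) ** 2 for x, y in zip(point, c)) for c in centres]
            labels[j] = dists.index(min(dists))
        # Step 4: recompute each centre as the mean of its labelled points.
        shift = 0.0
        for i in range(k):
            members = [data[j] for j in range(len(data)) if labels[j] == i]
            if not members:  # assumption: an empty cluster keeps its old centre
                continue
            new_c = [sum(col) / len(members) for col in zip(*members)]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_c, centres[i])))
            centres[i] = new_c
        # Stop once no centre coordinate moved by more than the threshold.
        if shift < tol:
            break
    return labels, centres


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
labels, centres = kmeans(points, k=2)
```

With these four toy points the two centres settle after one update, and the points near the origin end up in one cluster while the points near (10, 10) end up in the other.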
The above introduces the terms involved in the embodiments of the present invention. Embodiments of the method for feature clustering are described below with reference to the accompanying drawings.
It should be noted that the device implementing feature clustering in the embodiments of the present invention may be an independent physical machine, a physical machine cluster formed by multiple physical machines, or multiple virtual machines partitioned from physical resources. A server is one form of physical machine.
Fig. 1 is a schematic diagram of an embodiment of the method for feature clustering in the embodiments of the present invention.
As shown in Fig. 1, an embodiment of the method for feature clustering provided by the embodiments of the present invention includes:
101. Obtain account information and attribute information corresponding to the account information.
When the account information is a network virtual service account, the attribute information corresponding to the account information may be the user registered accounts that follow this network virtual service account.
For example, when the network virtual service account is a public number, the attribute information corresponding to the account information may be the user registered accounts that subscribe to this public number. The attribute information is of course not limited to these user registered accounts; it may also include the number of subscribers, the number of active users, the number of interacting fans, and so on.
When the account information is a user registered account on a network virtual service platform, the attribute information corresponding to the account information may be the network virtual service accounts subscribed to by this user registered account.
For example, the attribute information may be the public numbers followed by this user registered account, which can be looked up in the list of public numbers subscribed to by each user, collected from the public number platform. The attribute information is of course not limited to the public numbers followed by the user; it may also include the number of upstream messages the user sends to each WeChat public number, the number of payments, the number of articles viewed, the number of menu clicks, and so on.
102. Preprocess the account information and the attribute information corresponding to the account information to obtain model input data.
Preprocessing may include generating the relation format between the account information and the attribute information, and filtering the data.
103. Process the model input data using a topic model algorithm to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature.
Processing the model input data using a topic model algorithm may use the formula $p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \times p(\text{topic} \mid \text{document})$ or the formula C = Φ × Θ to obtain the probability of each topic and thereby determine the feature corresponding to each topic.
104. Cluster the features contained in the account information using a clustering algorithm.
For the clustering process, refer to the description in the terminology section:
Step 1: input k and data[n];
Step 2: select k initial center points, for example c[0] = data[0], ..., c[k-1] = data[k-1];
Step 3: compare each of data[0], ..., data[n-1] with c[0], ..., c[k-1]; if the difference from c[i] is the smallest, label the point i;
Step 4: for all points labeled i, recompute c[i] = (sum of all data[j] labeled i) / (number of points labeled i);
Repeat Steps 3 and 4 until the change in every c value is less than a given threshold.
Feature clustering is realized through this process; in the embodiments of the present invention, the input data is the account information.
Compared with the prior art, in which performing feature clustering by setting features of different dimensions makes feature clustering inefficient, the method for feature clustering provided by the embodiments of the present invention clusters account information and the attribute information corresponding to the account information by means of topic probabilities. This not only avoids a lengthy feature exploration process but also mitigates the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.
Optionally, on the basis of the embodiment described above, in another embodiment of the method for feature clustering provided by the embodiments of the present invention, the account information is a network virtual service account, and preprocessing the account information and the attribute information corresponding to the account information to obtain model input data may include:
preprocessing the network virtual service account and the attribute information corresponding to the network virtual service account to obtain model input data.
Further, preprocessing the network virtual service account and the attribute information corresponding to the network virtual service account to obtain model input data may include:
generating a correspondence between the network virtual service account and the user registered accounts that subscribe to the network virtual service account; and
filtering out the correspondences in which the user registered account does not satisfy a preset condition.
In the embodiments of the present invention, the correspondence between a network virtual service account and user registered accounts may be represented in the form of a relation table.
As shown in Table 1, the correspondence table between a public number and the registered users who subscribe to it may be:
As shown in Table 1, the correspondence between the public number "knows force of labor" and the registered users who follow it can be represented in the form of Table 1. Table 1 is of course only an example; in practice, most public numbers are followed by a large number of registered users.
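A relation of this kind can be built from raw subscription records by grouping subscriber accounts under each public number. The sketch below is illustrative only: the record layout, the identifiers such as pn_A and user_1, and the function name build_relation are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

# Hypothetical raw subscription records: (public number, user registered account).
records = [
    ("pn_A", "user_1"),
    ("pn_A", "user_2"),
    ("pn_B", "user_2"),
    ("pn_A", "user_1"),  # duplicate record, collapsed by the set below
]


def build_relation(pairs):
    """Group subscriber accounts under the public number they subscribe to."""
    table = defaultdict(set)
    for public_number, user in pairs:
        table[public_number].add(user)
    # Sort each subscriber list so the listing is stable and readable.
    return {pn: sorted(users) for pn, users in table.items()}


relation = build_relation(records)
```

Each key of relation plays the role of one row of the relation table, and its value is the list of user registered accounts subscribed to that public number.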
In addition, it should be noted that the fans in the embodiments of the present invention also refer to registered users. Some passages use "fans" and others use "registered users"; this is merely a colloquial expression suited to the specific scenario, and registered users and fans should not be understood differently.
With reference to Fig. 2, the following describes the process of the method for feature clustering provided by the embodiments of the present invention when the account information is a network virtual service account.
As shown in Fig. 2, taking public numbers as an example, another embodiment of the method for feature clustering provided by the embodiments of the present invention includes:
201. Collect each public number, and the attribute information corresponding to each public number, from the public number platform.
The attribute information corresponding to a public number includes the user registered accounts that subscribe to it, and also includes, but is not limited to, data reflecting the scale of the public number, such as the number of subscribers, the number of active users, and the number of interacting users.
202. Preprocess the user data under each public number.
Preprocessing includes generating the preprocessed data Data, whose format may be: public number t, followed by the list of user registered accounts of that public number.
After the list of user registered accounts of each public number is generated, the data in the list is filtered and cleaned:
Filtering and cleaning are carried out from two angles: from the angle of public numbers, and from the angle of users.
From the standpoint of statistical distribution, extremely large and extremely small values in a data set are unsuitable for statistics, so cleaning needs to remove them. The embodiments of the present invention list two schemes for removing extremely large and extremely small values:
Filtering and cleaning from the angle of public numbers is introduced first.
Filtering and cleaning from the angle of public numbers means filtering out public numbers with extremely many users and public numbers with extremely few users. The two filtering schemes are:
Scheme 1: filter out public numbers whose number of registered users is greater than a first threshold U, and filter out public numbers whose number of registered users is less than a second threshold B.
Scheme 2: compute the distribution of the number of registered users across public numbers, and filter out public numbers above the 95th percentile (or another percentile) and public numbers below the 5th percentile (or another percentile). A percentile is the position of a value within the statistical distribution of the data.
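The two public-number-side schemes can be sketched as follows. This is an illustrative sketch under stated assumptions: the relation is assumed to be a dict from public number to subscriber list, the function names are hypothetical, and the nearest-rank rule is only one of several common percentile definitions; the patent does not say which one is intended.

```python
import math


def filter_by_threshold(relation, upper, lower):
    """Scheme 1: keep public numbers whose subscriber count is in [lower, upper]."""
    return {pn: users for pn, users in relation.items()
            if lower <= len(users) <= upper}


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list (an assumed definition)."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def filter_by_percentile(relation, low_q=5, high_q=95):
    """Scheme 2: drop public numbers below the low or above the high percentile."""
    if not relation:
        return {}
    counts = sorted(len(users) for users in relation.values())
    low, high = percentile(counts, low_q), percentile(counts, high_q)
    return filter_by_threshold(relation, upper=high, lower=low)


# Toy data: 20 hypothetical public numbers with 1 to 20 subscribers each.
relation = {f"pn_{i}": [f"u{j}" for j in range(i)] for i in range(1, 21)}
kept = filter_by_percentile(relation)
```

Scheme 2 derives its two cut-offs from the data and then reuses the Scheme 1 filter, so the thresholds U and B do not have to be chosen by hand.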
Filtering and cleaning from the angle of users is described next.
Filtering and cleaning from the angle of users likewise means filtering out, from the data set, users who subscribe to extremely few public numbers and users who subscribe to extremely many public numbers. The two filtering schemes are:
Scheme 1: filter out users whose number of subscribed public numbers is less than a certain threshold (for example, 5) or greater than a certain threshold (for example, 100000).
Scheme 2: compute the distribution of the number of public numbers users subscribe to, and filter out users above the 95th percentile (or another percentile) and users below the 5th percentile (or another percentile).
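Filtering from the user side differs in that a user's subscription count has to be aggregated across all public numbers before any row can be cleaned. The sketch below illustrates the threshold scheme; the function name filter_users, the dict layout, and the decision to drop public numbers left without subscribers are assumptions, while the default bounds 5 and 100000 are the example values given above.

```python
from collections import Counter


def filter_users(relation, min_subs=5, max_subs=100000):
    """Drop users whose number of subscribed public numbers is outside
    [min_subs, max_subs], then drop public numbers left without subscribers."""
    # Count, for every user, how many public numbers list that user.
    subs_per_user = Counter(u for users in relation.values() for u in set(users))
    keep = {u for u, n in subs_per_user.items() if min_subs <= n <= max_subs}
    cleaned = {pn: [u for u in users if u in keep]
               for pn, users in relation.items()}
    # Assumption: a public number with no remaining subscribers carries no signal.
    return {pn: users for pn, users in cleaned.items() if users}


# Toy data with hypothetical identifiers; u1 subscribes to three public numbers.
relation = {"a": ["u1", "u2"], "b": ["u1"], "c": ["u1", "u3"]}
cleaned = filter_users(relation, min_subs=2, max_subs=10)
```

The percentile scheme can be obtained the same way by computing the bounds from the distribution of subs_per_user instead of passing fixed values.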
203. Perform topic learning using a topic model algorithm to obtain the probability distribution of each public number over the topics.
Topic learning may use a model that supports distributed computing, such as the topic model lightLDA or a deep learning model.
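To make the topic-learning step concrete, here is a toy single-machine collapsed Gibbs sampler for LDA. It is not lightLDA and not the patent's implementation; it only illustrates the kind of output the step produces. Treating each public number as a "document" whose "words" are subscriber accounts is an assumed reading of the model input, and the function name lda_gibbs, the hyperparameters alpha and beta, the iteration count, and the user identifiers are all hypothetical.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns p(topic | document) rows."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics
    doc_topic = [[0] * K for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]   # times word w is assigned to topic k
    topic_total = [0] * K                      # tokens assigned to topic k
    assign = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [(doc_topic[d][t] + alpha)
                           * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    # Smoothed estimate: one probability per topic, i.e. one feature per topic.
    return [[(doc_topic[d][t] + alpha) / (len(docs[d]) + K * alpha)
             for t in range(K)]
            for d in range(len(docs))]


# Hypothetical model input: one list of subscriber IDs per public number.
docs = [["u1", "u2", "u3"], ["u1", "u2"], ["u7", "u8", "u9"], ["u8", "u9"]]
theta = lda_gibbs(docs, num_topics=2)
```

Each row of theta is a probability distribution over the topics for one public number, which is the per-topic feature vector that the later clustering step consumes. A production system would use a distributed implementation instead of this sampler.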
204. Output the topic probability distribution result of each public number from step 203.
After the topic probability distribution results are output, they are evaluated manually, and step 203 is iteratively optimized by continually adjusting the model parameters so that the final result is as close to ideal as possible.
The final data format is: public number t topic 1: probability 1 topic 2: probability 2 ... topic N: probability N
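A row in this format could be rendered as follows. The sketch is illustrative only: the function name format_topic_line, the single-space separators, and the four-decimal precision are assumptions, since the source does not specify them.

```python
def format_topic_line(account, probs):
    """Render one output row: the account, then 'topic i: probability' pairs."""
    parts = [f"topic {i}: {p:.4f}" for i, p in enumerate(probs, start=1)]
    return f"{account} " + " ".join(parts)


# Hypothetical public number identifier and topic probabilities.
line = format_topic_line("pn_A", [0.7, 0.3])
```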
205. For the distribution over topics of each public number output in step 204, each topic corresponds to one feature; a clustering algorithm is then used to perform feature clustering on the public numbers.
Steps 201-205 above describe the feature clustering process with public numbers as the example. The public numbers in the embodiments of the present invention may be WeChat public numbers or public numbers in other social networking applications.
Optionally, on the basis of the embodiment described above, in another embodiment of the method for feature clustering provided by the embodiments of the present invention, the account information is a user registered account on a network virtual service platform, and preprocessing the account information and the attribute information corresponding to the account information to obtain model input data includes:
preprocessing the user registered account and the attribute information corresponding to the user registered account to obtain model input data.
Further, preprocessing the user registered account and the attribute information corresponding to the user registered account to obtain model input data may include:
generating a correspondence between the user registered account and the network virtual service accounts subscribed to by the user registered account; and
filtering out the correspondences in which the network virtual service account does not satisfy a preset condition.
In the embodiments of the present invention, the correspondence between a user registered account and the subscribed network virtual service accounts may be represented in the form of a relation list.
As shown in Table 2, the correspondence table between a user registered account and the subscribed public numbers may be:
As shown in Table 2, the correspondence between the user registered account 13415666333 and the subscribed public numbers can be represented in the form of Table 2. Table 2 is of course only an example; in practice, this user may have subscribed to more public numbers.
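When subscription data is already held per public number, the per-user relation can be obtained by inverting it. The sketch below is illustrative; the function name invert_relation and the identifiers pn_A, pn_B, u1 and u2 are hypothetical.

```python
from collections import defaultdict


def invert_relation(pn_to_users):
    """Turn public number -> users into user -> subscribed public numbers."""
    user_to_pns = defaultdict(list)
    # Visit public numbers in sorted order so each user's list is stable.
    for pn in sorted(pn_to_users):
        for user in pn_to_users[pn]:
            user_to_pns[user].append(pn)
    return dict(user_to_pns)


by_user = invert_relation({"pn_A": ["u1", "u2"], "pn_B": ["u2"]})
```

Each key of by_user then corresponds to one user registered account, and its value to the list of public numbers that account subscribes to.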
During feature clustering, attention is paid to the similarity between public numbers.
With reference to Fig. 3, the following describes the process of the method for feature clustering provided by the embodiments of the present invention when the account information is a user registered account on a network virtual service platform.
As shown in Fig. 3, another embodiment of the method for feature clustering provided by the embodiments of the present invention includes:
301. Collect the list of public numbers subscribed to by each user from the public number platform.
In addition to the public number list, some monthly statistical indicators for the public numbers each registered user subscribes to may also be collected, including the number of upstream messages the user sends to each WeChat public number, the number of payments, the number of articles viewed, the number of menu clicks, and so on.
302. Preprocess the data of each user registered account.
Preprocessing includes generating the data Data, whose format is: user registered account t, followed by the list of public numbers it subscribes to.
After the public number list of each user is generated, the data in the list may be filtered and cleaned as follows:
Filtering and cleaning from the angle of public numbers is introduced first.
Filtering and cleaning from the angle of public numbers means filtering out public numbers with extremely many users and public numbers with extremely few users. The two filtering schemes are:
Scheme 1: filter out public numbers whose number of registered users is greater than a first threshold U, and filter out public numbers whose number of registered users is less than a second threshold B.
Scheme 2: compute the distribution of the number of registered users across public numbers, and filter out public numbers above the 95th percentile (or another percentile) and public numbers below the 5th percentile (or another percentile). A percentile is the position of a value within the statistical distribution of the data.
Filtering and cleaning from the angle of users is described next.
Filtering and cleaning from the angle of users likewise means filtering out, from the data set, users who subscribe to extremely few public numbers and users who subscribe to extremely many public numbers. The two filtering schemes are:
Scheme 1: filter out users whose number of subscribed public numbers is less than a certain threshold (for example, 5) or greater than a certain threshold (for example, 100000).
Scheme 2: compute the distribution of the number of public numbers users subscribe to, and filter out users above the 95th percentile (or another percentile) and users below the 5th percentile (or another percentile).
303. Perform topic learning using a topic model algorithm to obtain the probability distribution of each public number over the topics.
Topic learning may use a model that supports distributed computing, such as the topic model lightLDA or a deep learning model.
304. Output the topic probability distribution result of each public number from step 303.
During model optimization, clustering may be based not only on the subscription relationship but also on the interaction relationship between registered users and public numbers. An interaction relationship is defined as indicators such as the number of upstream messages, the number of payments, the number of articles viewed, and the number of menu clicks reaching a certain value. The latent semantic topic distribution corresponding to each user is output in the format: registered user t topic 1: probability 1 topic 2: probability 2 ... topic N: probability N.
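The interaction relationship can be sketched as a simple threshold test on the collected indicators. Everything specific here is an assumption: the indicator names, the threshold values, and the choice that reaching any one threshold is enough (the source does not say whether one or all indicators must reach their values).

```python
# Hypothetical indicator names and minimum values for one (user, public number) pair.
THRESHOLDS = {
    "upstream_messages": 3,
    "payments": 1,
    "article_views": 10,
    "menu_clicks": 5,
}


def has_interaction(stats, thresholds=THRESHOLDS):
    """True if at least one indicator reaches its threshold (one possible reading)."""
    return any(stats.get(name, 0) >= limit for name, limit in thresholds.items())


active = has_interaction({"article_views": 12})
inactive = has_interaction({"article_views": 2, "payments": 0})
```

Pairs for which the test holds could then be used in place of, or alongside, the plain subscription pairs when building the model input.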
305, for the distribution situation of each theme corresponding to public number of output in step 304, each theme correspondence one Individual feature, then utilizes cluster that public number is carried out feature clustering.
Steps 301 to 305 above describe the feature clustering process using public accounts as the example. The public account in the embodiments of the present invention may be a WeChat public account, or a public account in another social networking application.
In the clustering method provided by the embodiments of the present invention, the text data involved in clustering includes, but is not limited to, feature data constructed from text information such as nicknames, profiles, signatures, and articles.
The topic model algorithm used includes, but is not limited to, various latent semantic models such as deep learning models and topic models; it may also include various clustering algorithms, such as singular value decomposition (SVD), that perform identification based on latent semantic information.
In addition, in the embodiments described with reference to Fig. 2 and Fig. 3, the relationship between public accounts and registered users may be replaced by, for example but not limited to, the click relationship between a WeChat public account and its corresponding articles, the forwarding relationship of WeChat public account articles, the WeChat public account user relationship, and so on.
Above, the method for the feature clustering that the embodiment of the present invention is provided, produced beneficial effect may include that
One, very long Exploration on Characteristics process can be effectively prevent, moreover it is possible to effectively reduce the problem that characteristic dimension is too much.
Two: utilize distributed topic model effectively to support large-scale clustered demand.
Three: by wechat public number or vermicelli user are clustered, can use same in follow-up excacation The individual wechat public number of individual theme agency or user data, the most effectively solve long-tail part Sparse Problem.
Four: wechat public number cluster result has the place of a lot of potential use, including the recommendation of similar wechat public number, wechat The fields such as the recommendation of public number article, wechat public number advertisement broadcasting.
The above describes the feature clustering method; the feature clustering device 20 in the embodiments of the present invention is described below.
Fig. 4 is a schematic diagram of an embodiment of the feature clustering device 20 in the embodiments of the present invention.
Referring to Fig. 4, an embodiment of the feature clustering device 40 provided by the embodiments of the present invention includes:
An acquiring unit 401, configured to obtain account information and attribute information corresponding to the account information;
A preprocessing unit 402, configured to preprocess the account information obtained by the acquiring unit 401 and the attribute information corresponding to the account information, to obtain model input data;
A processing unit 403, configured to process, using a topic model algorithm, the model input data obtained by the preprocessing unit 402, to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature;
A clustering unit 404, configured to cluster, using a clustering algorithm, the features contained in the account information obtained by the processing unit 403.
In the embodiments of the present invention, the acquiring unit 401 obtains account information and the attribute information corresponding to the account information; the preprocessing unit 402 preprocesses the account information obtained by the acquiring unit 401 and the corresponding attribute information, to obtain model input data; the processing unit 403 processes the model input data obtained by the preprocessing unit 402 using a topic model algorithm, to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature; and the clustering unit 404 clusters, using a clustering algorithm, the features contained in the account information obtained by the processing unit 403. Compared with the prior art, in which feature clustering is performed by setting features along different dimensions and is therefore inefficient, the feature clustering device provided by the embodiments of the present invention can cluster account information and its corresponding attribute information by means of topic probabilities. This not only effectively avoids a lengthy feature exploration process but also effectively reduces the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.
Optionally, on the basis of the above embodiment of the feature clustering device 40, in another embodiment of the feature clustering device 40 provided by the embodiments of the present invention,
The preprocessing unit is configured to, when the account information is a network virtual service account, preprocess the network virtual service account and the attribute information corresponding to the network virtual service account, to obtain model input data.
Further, the preprocessing unit is configured to:
Generate a correspondence between a network virtual service account and the registered user accounts that subscribe to the network virtual service account;
Filter out the correspondences in which the registered user account does not satisfy a preset condition.
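The two preprocessing operations above can be sketched as follows. The description does not specify the preset condition, so the example assumes a hypothetical one (a user must subscribe to at least two service accounts); the function names and the decision to drop a service account once none of its users remain are likewise assumptions.

```python
from collections import Counter, defaultdict

def build_correspondence(subscriptions):
    """Map each service account to the set of registered user accounts subscribing to it."""
    by_account = defaultdict(set)
    for user, account in subscriptions:
        by_account[account].add(user)
    return dict(by_account)

def drop_unqualified_users(by_account, user_ok):
    """Remove correspondences whose user fails the preset condition `user_ok`."""
    filtered = {}
    for account, users in by_account.items():
        kept = {u for u in users if user_ok(u)}
        if kept:  # assumption: an account with no remaining users is dropped
            filtered[account] = kept
    return filtered

# Toy (registered user account, service account) pairs, for illustration only.
subscriptions = [("u1", "a1"), ("u1", "a2"), ("u2", "a1"), ("u3", "a3")]
per_user = Counter(user for user, _ in subscriptions)
correspondence = build_correspondence(subscriptions)
# Hypothetical preset condition: the user subscribes to at least 2 accounts.
model_input = drop_unqualified_users(correspondence, lambda u: per_user[u] >= 2)
```

The mirrored case, in which the correspondence is keyed by registered user account and the preset condition is applied to the service accounts, follows by swapping the roles of the two identifiers.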
Optionally, on the basis of the above embodiment of the feature clustering device 40, in another embodiment of the feature clustering device 40 provided by the embodiments of the present invention,
The preprocessing unit is configured to, when the account information is a registered user account on a network virtual service platform, preprocess the registered user account and the attribute information corresponding to the registered user account, to obtain model input data.
Further, the preprocessing unit is configured to:
Generate a correspondence between a registered user account and the network virtual service accounts subscribed to by the registered user account;
Filter out the correspondences in which the network virtual service account does not satisfy a preset condition.
The above feature clustering device may be implemented by a server. The process by which a server implementing the feature clustering device performs clustering is described below with reference to Fig. 5.
Fig. 5 is a schematic structural diagram of the server 50 provided by the embodiments of the present invention. The server 50 includes a processor 510, a memory 550, and a transceiver 530. The memory 550 may include read-only memory and random access memory, and provides operating instructions and data to the processor 510. A portion of the memory 550 may also include non-volatile random access memory (NVRAM).
In some embodiments, the memory 550 stores the following elements: executable modules or data structures, or a subset thereof, or a superset thereof:
In the embodiments of the present invention, by calling the operating instructions stored in the memory 550 (the operating instructions may be stored in the operating system), the following is performed:
Obtain account information and attribute information corresponding to the account information;
Preprocess the account information and the attribute information corresponding to the account information, to obtain model input data;
Process the model input data using a topic model algorithm, to obtain the probability of each topic contained in the account information, where the probability of each topic corresponds to one feature;
Cluster, using a clustering algorithm, the features contained in the account information.
Compared with the prior art, in which feature clustering is performed by setting features along different dimensions and is therefore inefficient, the server provided by the embodiments of the present invention can cluster account information and its corresponding attribute information by means of topic probabilities. This not only effectively avoids a lengthy feature exploration process but also effectively reduces the problem of excessive feature dimensions, thereby improving the efficiency of feature clustering.
The processor 510 controls the operation of the server 50; the processor 510 may also be referred to as a central processing unit (CPU). The memory 550 may include read-only memory and random access memory, and provides instructions and data to the processor 510. A portion of the memory 550 may also include non-volatile random access memory (NVRAM). In a specific application, the components of the server 50 are coupled together by a bus system 520, which, in addition to a data bus, may also include a power bus, a control bus, a status signal bus, and the like. For clarity of description, however, the various buses are all labeled as the bus system 520 in the figure.
The methods disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 510. The processor 510 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above methods may be completed by an integrated logic circuit of hardware in the processor 510 or by instructions in the form of software. The processor 510 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in the embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 550; the processor 510 reads the information in the memory 550 and completes the steps of the above methods in combination with its hardware.
Optionally, the processor 510 is configured to:
When the account information is a registered user account on a network virtual service platform, preprocess the registered user account and the attribute information corresponding to the registered user account, to obtain model input data.
The processor 510 is further configured to:
Generate a correspondence between a network virtual service account and the registered user accounts that subscribe to the network virtual service account;
Filter out the correspondences in which the registered user account does not satisfy a preset condition.
Optionally, the processor 510 is configured to:
When the account information is a registered user account on a network virtual service platform, preprocess the registered user account and the attribute information corresponding to the registered user account, to obtain model input data.
The processor 510 is further configured to:
Generate a correspondence between a registered user account and the network virtual service accounts subscribed to by the registered user account;
Filter out the correspondences in which the network virtual service account does not satisfy a preset condition.
The server 50 above can be understood with reference to the descriptions of Fig. 1 to Fig. 3, and is not described again here.
A person of ordinary skill in the art will understand that all or some of the steps in the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, a magnetic disk, an optical disc, or the like.
The feature clustering method and device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, a person of ordinary skill in the art may, based on the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. the method for a feature clustering, it is characterised in that including:
Obtain account, and the attribute information corresponding with described account;
To described account, and the attribute information corresponding with described account carries out pretreatment, obtains mode input number According to;
Utilize topic model algorithm, described mode input data are processed, obtain each master that described account is comprised The probability of topic, the corresponding feature of the probability of each theme;
The feature utilizing clustering algorithm to be comprised described account clusters.
Method the most according to claim 1, it is characterised in that described account is network virtual service account numbers, then institute State described account, and the attribute information corresponding with described account carry out pretreatment, obtains mode input data, Including:
To described network virtual service account numbers, and the attribute information corresponding with described network virtual service account numbers carries out pre-place Reason, obtains mode input data.
Method the most according to claim 1, it is characterised in that described account is in network virtual service platform User registers account number, the most described to described account, and the attribute information corresponding with described account carries out pretreatment, Obtain mode input data, including:
Described user is registered account number, and registers attribute information corresponding to account number with described user and carry out pretreatment, obtain mould Type input data.
Method the most according to claim 2, it is characterised in that described to described network virtual service account numbers, and with institute State attribute information corresponding to network virtual service account numbers and carry out pretreatment, obtain mode input data, including:
Generate the corresponding pass that network virtual service account numbers is registered between account number with the user subscribing to described network virtual service account numbers System;
Filter out user to register account number and be unsatisfactory for the described corresponding relation of prerequisite.
Method the most according to claim 3, it is characterised in that described account number that described user is registered, and with described use The attribute information that account number is registered corresponding in family carries out pretreatment, obtains mode input data, including:
Generation user registers account number and described user registers the corresponding relation between the network virtual service account numbers ordered by account number;
Filter out network virtual service account numbers and be unsatisfactory for the described corresponding relation of prerequisite.
6. A feature clustering device, characterized by comprising:
An acquiring unit, configured to obtain account information and attribute information corresponding to the account information;
A preprocessing unit, configured to preprocess the account information obtained by the acquiring unit and the attribute information corresponding to the account information, to obtain model input data;
A processing unit, configured to process, using a topic model algorithm, the model input data obtained by the preprocessing unit, to obtain the probability of each topic contained in the account information, the probability of each topic corresponding to one feature;
A clustering unit, configured to cluster, using a clustering algorithm, the features contained in the account information obtained by the processing unit.
7. The device according to claim 6, characterized in that
The preprocessing unit is configured to, when the account information is a network virtual service account, preprocess the network virtual service account and the attribute information corresponding to the network virtual service account, to obtain model input data.
8. The device according to claim 6, characterized in that
The preprocessing unit is configured to, when the account information is a registered user account on a network virtual service platform, preprocess the registered user account and the attribute information corresponding to the registered user account, to obtain model input data.
9. The device according to claim 7, characterized in that
The preprocessing unit is configured to:
Generate a correspondence between a network virtual service account and the registered user accounts that subscribe to the network virtual service account;
Filter out the correspondences in which the registered user account does not satisfy a preset condition.
10. The device according to claim 8, characterized in that
The preprocessing unit is configured to:
Generate a correspondence between a registered user account and the network virtual service accounts subscribed to by the registered user account;
Filter out the correspondences in which the network virtual service account does not satisfy a preset condition.
CN201610421683.7A 2016-06-15 2016-06-15 A kind of method and device of feature clustering Active CN106055699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610421683.7A CN106055699B (en) 2016-06-15 2016-06-15 A kind of method and device of feature clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610421683.7A CN106055699B (en) 2016-06-15 2016-06-15 A kind of method and device of feature clustering

Publications (2)

Publication Number Publication Date
CN106055699A true CN106055699A (en) 2016-10-26
CN106055699B CN106055699B (en) 2018-07-06

Family

ID=57167761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610421683.7A Active CN106055699B (en) 2016-06-15 2016-06-15 A kind of method and device of feature clustering

Country Status (1)

Country Link
CN (1) CN106055699B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403311A (en) * 2017-06-27 2017-11-28 阿里巴巴集团控股有限公司 The recognition methods of account purposes and device
CN108287909A (en) * 2018-01-31 2018-07-17 北京仁和汇智信息技术有限公司 A kind of paper method for pushing and device
TWI752485B (en) * 2019-11-14 2022-01-11 大陸商支付寶(杭州)信息技術有限公司 User clustering and feature learning method, device, and computer-readable medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101408901A (en) * 2008-11-26 2009-04-15 东北大学 Probability clustering method of cross-categorical data based on key word
CN101770454A (en) * 2010-02-13 2010-07-07 武汉理工大学 Method for expanding feature space of short text
CN102289487A (en) * 2011-08-09 2011-12-21 浙江大学 Network burst hotspot event detection method based on topic model
CN104657375A (en) * 2013-11-20 2015-05-27 中国科学院深圳先进技术研究院 Image-text theme description method, device and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403311A (en) * 2017-06-27 2017-11-28 阿里巴巴集团控股有限公司 The recognition methods of account purposes and device
CN107403311B (en) * 2017-06-27 2020-04-21 阿里巴巴集团控股有限公司 Account use identification method and device
CN108287909A (en) * 2018-01-31 2018-07-17 北京仁和汇智信息技术有限公司 A kind of paper method for pushing and device
TWI752485B (en) * 2019-11-14 2022-01-11 大陸商支付寶(杭州)信息技術有限公司 User clustering and feature learning method, device, and computer-readable medium

Also Published As

Publication number Publication date
CN106055699B (en) 2018-07-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant