CN104598601B - Method, apparatus, and computing device for classifying users and content - Google Patents

Method, apparatus, and computing device for classifying users and content

Info

Publication number
CN104598601B
Authority
CN
China
Prior art keywords
user
content
type
visit capacity
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510041042.4A
Other languages
Chinese (zh)
Other versions
CN104598601A (en)
Inventor
胡勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING QIERBULAITE TECHNOLOGY Co Ltd
Original Assignee
BEIJING QIERBULAITE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING QIERBULAITE TECHNOLOGY Co Ltd
Priority to CN201510041042.4A
Publication of CN104598601A
Application granted
Publication of CN104598601B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23211 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters

Abstract

The invention discloses a method, an apparatus, and a computing device for classifying users and content. The apparatus includes: an initialization module, adapted to assign a user to each user type and a content item to each content type; an access volume calculation module, adapted to calculate a first access volume of each user type to each content item, a second access volume of each user to each content type, and a third access volume of each user type to each content type; a similarity calculation module, adapted to calculate the similarity between each user and each user type from the second and third access volumes, and the similarity between each content item and each content type from the first and third access volumes; and a classification module, adapted to select, for each user, the user type with the highest similarity to that user as the user's type and, for each content item, the content type with the highest similarity to that content item as its type.

Description

Method, apparatus, and computing device for classifying users and content
Technical field
The present invention relates to the field of computers and the Internet, and in particular to a method, an apparatus, and a computing device for classifying users and content.
Background
Analyzing how users access a website's content can inform the construction and operation of that content. For content construction, the site can pursue further cooperation or business opportunities with merchants based on the goods whose user access is growing fastest. For user service, the site can make targeted recommendations based on the goods each user is interested in. For operations management, the site can convert the revenue that each user type brings to the site owner into a cost-benefit level for each content type on the site. Here, website content can be anything the site shows to users, such as pages, posts by the site's users, site categories, product names, and product categories.
For this purpose, website users and content need to be classified. The usual approach is to first divide the website content manually into several types according to how the site is built and then, when users visit the site, to divide the users into several types through extensive computation on each user's access volume to each content item. In practice, however, some website content is difficult to classify manually, such as posts published by users and link addresses.
Common algorithms for automatically classifying website users and content include K-means, probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA). These algorithms typically classify the website content first, reduce dimensionality, and then classify the users. To classify content with these algorithms, however, many attributes of the content must first be known, and the iterative computation they require is very large.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method, an apparatus, and a computing device for classifying users and content that overcome the above problems or at least partially solve them.
According to one aspect of the invention, there is provided an apparatus for classifying users and content, which resides in a computing device and is adapted to cluster the users in a user set into a first predetermined number of user types and the content items in a content set into a second predetermined number of content types. The apparatus includes: an initialization module, adapted to assign one or more users from the user set to each of the first predetermined number of user types, and one or more content items from the content set to each of the second predetermined number of content types; an access volume calculation module, adapted to calculate, from the users' access volumes to content, a first access volume of each user type to each content item, a second access volume of each user to each content type, and a third access volume of each user type to each content type; a similarity calculation module, adapted to calculate the similarity between each user and each user type from the second and third access volumes, and the similarity between each content item and each content type from the first and third access volumes; and a classification module, adapted to select, for each user, the user type with the highest similarity to that user as the user's type and, for each content item, the content type with the highest similarity to that content item as its type, to trigger the access volume calculation module to recalculate the access volumes and the similarity calculation module to recalculate the similarities and then repeat the selection, and to stop triggering once a predetermined condition is met.
Optionally, in the apparatus according to the invention, the initialization module is further adapted to: according to existing mappings between users and user types, assign to each user type that already has one or more users those users, and randomly assign a user that has no user type to each user type that has no user; and, according to existing mappings between content items and content types, assign to each content type that already has one or more content items those content items, and randomly assign a content item that has no content type to each content type that has no content.
Optionally, in the apparatus according to the invention, for a user with an existing mapping to a user type, the similarity calculation module does not calculate the similarity between that user and the user types, and the classification module does not change that user's type; for a content item with an existing mapping to a content type, the similarity calculation module does not calculate the similarity between that content item and the content types, and the classification module does not change that content item's type.
Optionally, in the apparatus according to the invention, the access volume calculation module calculates the access volume of a user type to a content item as follows: obtain all users that the user type includes; obtain the access volume of each of those users to the content item; sum all the access volumes to obtain the user type's access volume to the content item. The module calculates the access volume of a user to a content type as follows: obtain all content items that the content type includes; obtain the user's access volume to each of those content items; sum all the access volumes to obtain the user's access volume to the content type. The module calculates the access volume of a user type to a content type as follows: obtain all users that the user type includes and all content items that the content type includes; obtain the access volume of each of those users to each of those content items; sum all the access volumes to obtain the user type's access volume to the content type.
Optionally, in the apparatus according to the invention, the similarity is a minimum-based similarity coefficient, the Bhattacharyya coefficient, or the cosine similarity coefficient.
Optionally, in the apparatus according to the invention, before calculating the similarity of two vectors, the similarity calculation module first takes the intersection or the union of the two vectors' domains and then calculates their similarity.
Optionally, in the apparatus according to the invention, the predetermined condition is: the number of times the access volume calculation module and the similarity calculation module have been triggered reaches a preset number; or, comparing the current classification result with the previous one, the proportion of users whose user type changed is below a preset first threshold and the proportion of content items whose content type changed is below a preset second threshold.
According to another aspect of the invention, there is provided a method for classifying users and content, which is executed in a computing device and is adapted to cluster the users in a user set into a first predetermined number of user types and the content items in a content set into a second predetermined number of content types. The method includes: an initialization step: assign one or more users from the user set to each of the first predetermined number of user types, and one or more content items from the content set to each of the second predetermined number of content types; an access volume calculation step: from the users' access volumes to content, calculate a first access volume of each user type to each content item, a second access volume of each user to each content type, and a third access volume of each user type to each content type; a similarity calculation step: calculate the similarity between each user and each user type from the second and third access volumes, and the similarity between each content item and each content type from the first and third access volumes; and a classification step: for each user, select the user type with the highest similarity to that user as the user's type and, for each content item, select the content type with the highest similarity to that content item as its type; trigger the access volume calculation step to recalculate the access volumes and the similarity calculation step to recalculate the similarities and then repeat the selection; stop triggering once a predetermined condition is met.
Optionally, in the method according to the invention, in the initialization step, according to existing mappings between users and user types, each user type that already has one or more users is assigned those users, and each user type that has no user is randomly assigned a user that has no user type; according to existing mappings between content items and content types, each content type that already has one or more content items is assigned those content items, and each content type that has no content is randomly assigned a content item that has no content type.
Optionally, in the method according to the invention, for a user with an existing mapping to a user type, the similarity between that user and the user types is not calculated in the similarity calculation step, and that user's type is not changed in the classification step; for a content item with an existing mapping to a content type, the similarity between that content item and the content types is not calculated in the similarity calculation step, and that content item's type is not changed in the classification step.
Optionally, in the method according to the invention, in the access volume calculation step, the access volume of a user type to a content item is calculated as follows: obtain all users that the user type includes; obtain the access volume of each of those users to the content item; sum all the access volumes to obtain the user type's access volume to the content item. The access volume of a user to a content type is calculated as follows: obtain all content items that the content type includes; obtain the user's access volume to each of those content items; sum all the access volumes to obtain the user's access volume to the content type. The access volume of a user type to a content type is calculated as follows: obtain all users that the user type includes and all content items that the content type includes; obtain the access volume of each of those users to each of those content items; sum all the access volumes to obtain the user type's access volume to the content type.
Optionally, in the method according to the invention, the similarity is a minimum-based similarity coefficient, the Bhattacharyya coefficient, or the cosine similarity coefficient.
Optionally, in the method according to the invention, in the similarity calculation step, before the similarity of two vectors is calculated, the intersection or the union of the two vectors' domains is taken first, and the similarity of the two vectors is then calculated.
Optionally, in the method according to the invention, the predetermined condition is: the number of times the access volume calculation step and the similarity calculation step have been triggered reaches a preset number; or, comparing the current classification result with the previous one, the proportion of users whose user type changed is below a preset first threshold and the proportion of content items whose content type changed is below a preset second threshold.
According to yet another aspect of the invention, there is provided a computing device in which the apparatus for classifying users and content according to the invention resides.
Compared with the prior art, the scheme according to the invention performs biclustering analysis on website users and content. It does not need to know many attributes of the content; the access volume of each user to each content item suffices to classify users and content simultaneously in one pass, assigning users to user types and content to content types. Moreover, each iteration of the scheme does not need to traverse (number of users × number of content items), so its iterative computation is much smaller than that of existing algorithms such as PLSA and LDA.
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a schematic diagram of the user and content biclustering method used in an embodiment of the invention;
Fig. 2 shows a flow chart of a method for classifying users and content according to an embodiment of the invention;
Fig. 3 shows a structure diagram of an apparatus for classifying users and content according to an embodiment of the invention;
Fig. 4 shows a comparison of the computation time of the biclustering algorithm used in an embodiment of the invention and the PLSA algorithm; and
Fig. 5 is a block diagram of an example computing device arranged to implement the method for classifying users and content according to the invention.
Detailed description of embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
The scheme used in embodiments of the invention performs biclustering analysis on website users and content. Its principle is as follows:
Users and content are regarded as the two independent vertex sets of a bipartite graph, with users as the left vertex set L and content as the right vertex set R, and with each user's access volume to a content item as the weight of the corresponding edge. The goal is to cluster all users into Nl classes and all content into Nr classes.
As shown in Fig. 1, the left diagram shows the state before clustering: users A, B, C, and D access content X, Y, and Z, and each edge has a weight (1 in this example). The right diagram shows the state after clustering: users A and B form user class L, users C and D form user class M, content X and Y form content class R, and content Z alone forms content class S. After clustering, a user's access to content is attributed to an access by a user type to a content type.
For ease of understanding, the notation used below is explained here:
pickone(S) takes one element out of set S; the element may be chosen at random.
D(F) denotes the domain of mapping F, i.e., the set of F's keys; R(F) denotes the range of mapping F, i.e., the set of F's values.
F(x) denotes the value to which mapping F maps x, i.e., the value stored under key x.
F(x, ) or F(, x) denotes the sub-mapping that remains after one dimension of F's domain is fixed to x, i.e., a partial function: a mapping from the remaining, unspecified part of the domain to the range.
argmax(F) denotes, for mapping F, the element of the domain that corresponds to the maximum value in the range.
Similarity(X, Y) denotes the similarity between two vectors X and Y.
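The notation above maps naturally onto dictionary operations. The following Python sketch is one possible reading, not part of the patent text: the helper names `domain`, `value_range`, and `partial`, and the `axis` parameter, are illustrative assumptions, and mappings are assumed to be dictionaries (two-key mappings use tuple keys).

```python
import random

def domain(F):
    """D(F): the set of keys of mapping F."""
    return set(F.keys())

def value_range(F):
    """R(F): the set of values of mapping F."""
    return set(F.values())

def pickone(S, rng=random):
    """pickone(S): take one element from a non-empty set S, here at random."""
    return rng.choice(sorted(S))

def partial(F, x, axis):
    """F(x, ) for axis=0 or F(, x) for axis=1: fix one key component to x."""
    return {k[1 - axis]: v for k, v in F.items() if k[axis] == x}

def argmax(F):
    """argmax(F): the key whose value is largest."""
    return max(F, key=F.get)

# Three entries taken from the access volume example in this description.
F = {("u1", "a1"): 10, ("u1", "a2"): 2, ("u3", "a3"): 3}
row_u1 = partial(F, "u1", axis=0)  # access volumes of u1, keyed by content
col_a3 = partial(F, "a3", axis=1)  # access volumes to a3, keyed by user
best = argmax(row_u1)              # content that u1 accesses most
```

Sorting the set inside `pickone` is only there so that a seeded random generator gives repeatable choices.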
Fig. 2 shows a flow chart of a method for classifying users and content according to an embodiment of the invention. The method is executed in a computing device and is adapted to cluster the users in a user set into a first predetermined number of user types and the content items in a content set into a second predetermined number of content types.
Referring to Fig. 2, the method starts at step S202 (the initialization step). In step S202, one or more users from the user set are assigned to each of the first predetermined number of user types, and one or more content items from the content set are assigned to each of the second predetermined number of content types.
According to existing mappings between users and user types, each user type that already has one or more users may be assigned those users, and each user type that has no user may be randomly assigned a user that has no user type. According to existing mappings between content items and content types, each content type that already has one or more content items may be assigned those content items, and each content type that has no content may be randomly assigned a content item that has no content type.
Let U be the user set containing all users to be clustered, and A the content set containing all content items to be clustered. Let F_UA be the mapping of access volumes from each user in U to each content item in A, with F_UA = {(u, a) -> f_ua | u ∈ U, a ∈ A, f_ua > 0}, where (u, a) -> f_ua means that user u's access volume to content a is f_ua.
For example, U = {u1, u2, u3, u4, u5, u6, u7, u8, u9, u10};
A = {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10};
F_UA = {(u6,a3)->4, (u5,a5)->8, (u9,a1)->8, (u7,a5)->7, (u7,a3)->2, (u8,a1)->3, (u9,a6)->8, (u4,a2)->8, (u8,a4)->10, (u1,a2)->2, (u8,a9)->2, (u10,a10)->4, (u4,a9)->10, (u1,a1)->10, (u2,a3)->5, (u10,a3)->8, (u5,a7)->9, (u3,a3)->3, (u4,a6)->6, (u7,a2)->4, (u4,a5)->10, (u7,a8)->3, (u9,a7)->3, (u1,a6)->2, (u3,a8)->9, (u4,a6)->3, (u7,a1)->1, (u7,a9)->9, (u5,a9)->6, (u3,a4)->8}.
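The access volume mapping F_UA corresponds to the two-dimensional table below, in which rows are users, columns are content items, and each cell is an access volume (a dash marks no recorded access):
       a1   a2   a3   a4   a5   a6   a7   a8   a9   a10
u1     10    2    -    -    -    2    -    -    -    -
u2      -    -    5    -    -    -    -    -    -    -
u3      -    -    3    8    -    -    -    9    -    -
u4      -    8    -    -   10    6    3*   -   10    -
u5      -    -    -    -    8    -    9    -    6    -
u6      -    -    4    -    -    -    -    -    -    -
u7      1    4    2    -    7    -    -    3    9    -
u8      3    -    -   10    -    -    -    -    2    -
u9      8    -    -    -    -    8    3    -    -    -
u10     -    -    8    -    -    -    -    -    -    4
* The extracted source does not show whether this value belongs to column a7 or a8; the listing of the mapping gives it as a second (u4, a6) entry.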
Let G be the user type set containing the first predetermined number of user types, and C the content type set containing the second predetermined number of content types. The first predetermined number is the number of classes into which all users in U are to be clustered, and the second predetermined number is the number of classes into which all content items in A are to be clustered. For example, with both numbers equal to 3, G = {g1, g2, g3} and C = {c1, c2, c3}.
In embodiments of the invention, when users and content are classified, each user belongs to exactly one user type and each content item belongs to exactly one content type.
The initialization step may proceed without any prior conditions, i.e., no user type has a user and no content type has content. Alternatively, there may be some prior conditions, obtained by manually classifying some of the users and content (the user type assigned to a user in this way is called the user's initial user type, and the content type assigned to a content item is called the content item's initial content type). For example, some user types have one or more users, and/or some content types have one or more content items.
Let U_0 be the set of users that have an initial user type, and G_0 the mapping from users in U_0 to user types, with G_0 = {u -> g | u ∈ U_0, g ∈ G}, where u -> g means that user u belongs to user type g. Let A_0 be the set of content items that have an initial content type, and C_0 the mapping from content items in A_0 to content types, with C_0 = {a -> c | a ∈ A_0, c ∈ C}, where a -> c means that content a belongs to content type c. For example, U_0 = {u1, u3}, G_0 = {u1->g1, u3->g2}, A_0 = {a1, a3}, C_0 = {a1->c1, a3->c2}.
Let G_U be the mapping from users to user types, with G_U = {u -> g | u ∈ U, g ∈ G}, initialized as G_U = G_0; let C_A be the mapping from content items to content types, with C_A = {a -> c | a ∈ A, c ∈ C}, initialized as C_A = C_0. In the example above, G_U = {u1->g1, u3->g2} and C_A = {a1->c1, a3->c2}.
Then a user is randomly selected for each user type that has no user, and a content item is randomly selected for each content type that has no content, as follows:
(1) Randomly assign a user that has no user type to each user type that has no user. The pseudocode is as follows:
G_U + {pickone(U - D(G_U)) -> pickone(G - R(G_U))} => G_U
In the example above, g3 has no user; user u5 is assigned to it, so G_U = {u1->g1, u3->g2, u5->g3}.
(2) Randomly assign a content item that has no content type to each content type that has no content. The pseudocode is as follows:
C_A + {pickone(A - D(C_A)) -> pickone(C - R(C_A))} => C_A
In the example above, c3 has no content; content a5 is assigned to it, so C_A = {a1->c1, a3->c2, a5->c3}.
As a result, every user type in G has a user and every content type in C has content. The initialization step does not, however, mean that every user in U has a user type or that every content item in A has a content type.
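The initialization step can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function name `initialize` and its parameters are hypothetical, mappings are plain dictionaries, and there are assumed to be at least as many unassigned users (content items) as empty user types (content types).

```python
import random

def initialize(users, contents, user_types, content_types, G0, C0, rng):
    """Seed every type with at least one member, keeping prior assignments."""
    GU, CA = dict(G0), dict(C0)
    # For each user type with no user, pick a user that has no type yet.
    for g in sorted(set(user_types) - set(GU.values())):
        free_users = sorted(set(users) - set(GU))
        GU[rng.choice(free_users)] = g
    # For each content type with no content, pick an untyped content item.
    for c in sorted(set(content_types) - set(CA.values())):
        free_contents = sorted(set(contents) - set(CA))
        CA[rng.choice(free_contents)] = c
    return GU, CA

users = [f"u{i}" for i in range(1, 11)]
contents = [f"a{i}" for i in range(1, 11)]
G0 = {"u1": "g1", "u3": "g2"}   # prior user assignments from the example
C0 = {"a1": "c1", "a3": "c2"}   # prior content assignments from the example
GU, CA = initialize(users, contents, ["g1", "g2", "g3"],
                    ["c1", "c2", "c3"], G0, C0, random.Random(0))
```

Because the choice is random, the user assigned to g3 need not be u5 as in the worked example; only the prior assignments and the "every type is non-empty" property are guaranteed.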
After the initialization step, the method proceeds to step S204 (the access volume calculation step). In step S204, from the access volume of each user in the user set to each content item in the content set, the following are calculated: the access volume of each user type in the user type set to each content item in the content set (the first access volume), the access volume of each user in the user set to each content type in the content type set (the second access volume), and the access volume of each user type in the user type set to each content type in the content type set (the third access volume).
The access volume of a user type to a content item can be calculated as follows: first, obtain all users that the user type includes; then, obtain the access volume of each of those users to the content item; finally, sum all the access volumes obtained to get the user type's access volume to the content item. The pseudocode is as follows:
F_GA = {(g, a) -> f_ga | f_ga = ∑ F_UA(u, a), G_U(u) = g}
Here F_GA denotes the mapping of access volumes between each user type in G and each content item in A, where (g, a) -> f_ga means that user type g's access volume to content a is f_ga.
The access volume of a user to a content type can be calculated as follows: first, obtain all content items that the content type includes; then, obtain the user's access volume to each of those content items; finally, sum all the access volumes obtained to get the user's access volume to the content type. The pseudocode is as follows:
F_UC = {(u, c) -> f_uc | f_uc = ∑ F_UA(u, a), C_A(a) = c}
Here F_UC denotes the mapping of access volumes between each user in U and each content type in C, where (u, c) -> f_uc means that user u's access volume to content type c is f_uc.
The access volume of a user type to a content type can be calculated as follows: first, obtain all users that the user type includes and all content items that the content type includes; then, obtain the access volume of each of those users to each of those content items; finally, sum all the access volumes obtained to get the user type's access volume to the content type. The pseudocode is as follows:
F_GC = {(g, c) -> f_gc | f_gc = ∑ F_UA(u, a) + α, G_U(u) = g, C_A(a) = c}
Here F_GC denotes the mapping of access volumes between each user type in G and each content type in C, where (g, c) -> f_gc means that user type g's access volume to content type c is f_gc.
In addition, to make the iterative model more stable, the calculated access volume of a user type to a content type may be increased by α, 0 ≤ α ≤ 1, and the increased value is used as the user type's access volume to the content type. In the description below, α = 1.
In the example above, user type g1 includes user u1, so g1's access volume to content a1 is u1's access volume to a1, namely 10, and g1's access volume to content a2 is u1's access volume to a2, namely 2. Proceeding in the same way gives: F_GA = {(g1,a1)->10, (g1,a2)->2, (g1,a6)->2, (g2,a3)->3, (g2,a4)->8, (g2,a8)->9, (g3,a5)->8, (g3,a7)->9, (g3,a9)->6}.
Content type c1 includes content a1, so user u1's access volume to content type c1 is u1's access volume to a1, namely 10. User u2's access volume to c1 is u2's access volume to a1; since u2 has not accessed a1, nothing is recorded. Proceeding in the same way gives: F_UC = {(u1,c1)->10, (u2,c2)->5, (u3,c2)->3, (u4,c3)->10, (u5,c3)->8, (u6,c2)->4, (u7,c1)->1, (u7,c2)->2, (u7,c3)->7, (u8,c1)->3, (u9,c1)->8, (u10,c2)->8}.
User type g1 includes user u1 and content type c1 includes content a1, so g1's access volume to c1 is u1's access volume to a1, 10, plus 1, giving 11. g1's access volume to content type c2 is u1's access volume to a3; there is no access, so the value is 0 plus 1, giving 1. Proceeding in the same way gives: F_GC = {(g1,c1)->11, (g1,c2)->1, (g1,c3)->1, (g2,c1)->1, (g2,c2)->4, (g2,c3)->1, (g3,c1)->1, (g3,c2)->1, (g3,c3)->9}.
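The three access volumes of the worked example can be reproduced with the Python sketch below. It is an illustration under assumptions, not the patent's own code: the function name `access_volumes` is hypothetical, and the second (u4, a6) entry of the source listing is left out because its column is ambiguous (it does not affect these results).

```python
from collections import defaultdict

F_UA = {
    ("u6", "a3"): 4, ("u5", "a5"): 8, ("u9", "a1"): 8, ("u7", "a5"): 7,
    ("u7", "a3"): 2, ("u8", "a1"): 3, ("u9", "a6"): 8, ("u4", "a2"): 8,
    ("u8", "a4"): 10, ("u1", "a2"): 2, ("u8", "a9"): 2, ("u10", "a10"): 4,
    ("u4", "a9"): 10, ("u1", "a1"): 10, ("u2", "a3"): 5, ("u10", "a3"): 8,
    ("u5", "a7"): 9, ("u3", "a3"): 3, ("u4", "a6"): 6, ("u7", "a2"): 4,
    ("u4", "a5"): 10, ("u7", "a8"): 3, ("u9", "a7"): 3, ("u1", "a6"): 2,
    ("u3", "a8"): 9, ("u7", "a1"): 1, ("u7", "a9"): 9, ("u5", "a9"): 6,
    ("u3", "a4"): 8,
}
GU = {"u1": "g1", "u3": "g2", "u5": "g3"}   # users -> user types
CA = {"a1": "c1", "a3": "c2", "a5": "c3"}   # content -> content types

def access_volumes(F_UA, GU, CA, user_types, content_types, alpha=1):
    """Return (F_GA, F_UC, F_GC) in one pass over the recorded accesses."""
    F_GA, F_UC = defaultdict(int), defaultdict(int)
    # Every (user type, content type) pair starts at the smoothing value alpha.
    F_GC = {(g, c): alpha for g in user_types for c in content_types}
    for (u, a), f in F_UA.items():
        g, c = GU.get(u), CA.get(a)   # None if not yet classified
        if g is not None:
            F_GA[g, a] += f
        if c is not None:
            F_UC[u, c] += f
        if g is not None and c is not None:
            F_GC[g, c] += f
    return dict(F_GA), dict(F_UC), F_GC

F_GA, F_UC, F_GC = access_volumes(
    F_UA, GU, CA, ["g1", "g2", "g3"], ["c1", "c2", "c3"])
```

The loop visits only the recorded (user, content) pairs rather than every user-content combination, which is one way to realize the low per-iteration cost described in the summary.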
After visit capacity calculation procedure, method enters step S206 (Similarity Measure step).In step S206, According to second visit capacity and the 3rd visit capacity, each user and each user type in user type set in user's set are calculated Between similarity, according to first visit capacity and the 3rd visit capacity, calculate each content and content type collection in properties collection Similarity in conjunction between each content type.
In one implementation (hereinafter mode 1), for a given user in the user set, the user's visit capacity to each content type in the content type set is obtained, yielding one visit capacity vector; for a given user type in the user type set, the user type's visit capacity to each content type in the content type set is obtained, yielding another visit capacity vector. The similarity of these two visit capacity vectors is then calculated and taken as the similarity between the user and the user type.
For a given content in the content set, the visit capacity of each user type in the user type set to that content is obtained, yielding one visit capacity vector; for a given content type in the content type set, the visit capacity of each user type in the user type set to that content type is obtained, yielding another visit capacity vector. The similarity of these two visit capacity vectors is then calculated and taken as the similarity between the content and the content type.
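As a minimal sketch of mode 1, the snippet below builds the two visit capacity vectors for user u7 and each user type from the example values of F_UC and F_GC and compares them. Cosine similarity is used here only as one of the admissible measures; the function and variable names are illustrative assumptions, not part of the described method.

```python
import math

# Visit capacities taken from the worked example (F_UC row for u7, F_GC rows).
F_UC_u7 = {"c1": 1, "c2": 2, "c3": 7}
F_GC = {
    "g1": {"c1": 11, "c2": 1, "c3": 1},
    "g2": {"c1": 1, "c2": 4, "c3": 1},
    "g3": {"c1": 1, "c2": 1, "c3": 9},
}

def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * y.get(k, 0) for k, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

# Similarity of u7 to every user type, computed directly on raw visit capacities.
mode1_sims = {g: cosine(F_UC_u7, row) for g, row in F_GC.items()}
mode1_best = max(mode1_sims, key=mode1_sims.get)
```

With these inputs, g3 is the closest user type for u7, because both vectors are dominated by content type c3.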
In mode 1, similarity is calculated directly from the visit capacities. In another implementation (hereinafter mode 2), to improve the accuracy of the algorithm, various visit capacity ratios are additionally calculated from the visit capacities, and the similarity is calculated from these ratios, including:
a) calculating the similarity between each user and each user type from, for each user, the user's visit capacity ratio to each content type and, for each user type, the user type's visit capacity ratio to each content type;
b) calculating the similarity between each content and each content type from, for each content, the visit capacity ratio of each user type that has accessed the content and, for each user type, the user type's visit capacity ratio to each content type.
These visit capacity ratios must therefore be calculated before the similarity is calculated.
The pseudocode for calculating, for each content, the visit capacity ratio of each user type that has accessed the content is as follows:
$$P_{GA} = \{(g, a) \to p_{ga} \mid p_{ga} = F_{GA}(g, a) / \sum F_{GA}(\cdot, a)\}$$
where P_GA denotes the mapping from each user type in the user type set G and each content in the content set A to a visit capacity ratio; in this mapping, (g, a) -> p_ga means that, among all user types that have accessed content a, user type g accounts for the proportion p_ga of the visit capacity to content a.
The pseudocode for calculating, for each user, the user's visit capacity ratio to each content type is as follows:
$$P_{UC} = \{(u, c) \to p_{uc} \mid p_{uc} = F_{UC}(u, c) / \sum F_{UC}(u, \cdot)\}$$
where P_UC denotes the mapping from each user in the user set U and each content type in the content type set C to a visit capacity ratio; in this mapping, (u, c) -> p_uc means that, among all content types accessed by user u, content type c accounts for the proportion p_uc of user u's visit capacity.
The pseudocode for calculating, for each user type, the user type's visit capacity ratio to each content type is as follows:
$$P_{GC} = \{(g, c) \to p_{gc} \mid p_{gc} = F_{GC}(g, c) / \sum F_{GC}(g, \cdot)\}$$
where P_GC denotes one mapping from each user type in the user type set G and each content type in the content type set C to a visit capacity ratio; in this mapping, (g, c) -> p_gc means that, among all content types accessed by user type g, content type c accounts for the proportion p_gc of user type g's visit capacity.
The pseudocode for calculating, for each content type, the visit capacity ratio of each user type that has accessed the content type is as follows:
$$Q_{GC} = \{(g, c) \to q_{gc} \mid q_{gc} = F_{GC}(g, c) / \sum F_{GC}(\cdot, c)\}$$
where Q_GC denotes another mapping from each user type in the user type set G and each content type in the content type set C to a visit capacity ratio; in this mapping, (g, c) -> q_gc means that, among all user types that have accessed content type c, user type g accounts for the proportion q_gc of the visit capacity to content type c.
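The ratio mappings above differ only in which component of the key is held fixed while summing. The following sketch, assuming a dictionary keyed by pairs and a hypothetical helper named `ratios`, derives P_GC and Q_GC from the F_GC values of the worked example.

```python
from collections import defaultdict

# F_GC from the worked example: (user type, content type) -> visit capacity.
F_GC = {
    ("g1", "c1"): 11, ("g1", "c2"): 1, ("g1", "c3"): 1,
    ("g2", "c1"): 1, ("g2", "c2"): 4, ("g2", "c3"): 1,
    ("g3", "c1"): 1, ("g3", "c2"): 1, ("g3", "c3"): 9,
}

def ratios(f, axis):
    """Divide each entry by the total over the key component that `axis` fixes.

    axis=0 fixes the first key component (row-wise, as in P_UC and P_GC);
    axis=1 fixes the second key component (column-wise, as in P_GA and Q_GC).
    """
    totals = defaultdict(float)
    for key, value in f.items():
        totals[key[axis]] += value
    return {key: value / totals[key[axis]] for key, value in f.items()}

P_GC = ratios(F_GC, axis=0)  # share of each content type within a user type
Q_GC = ratios(F_GC, axis=1)  # share of each user type within a content type
```

Rounded to two significant figures, the results agree with the example values listed for P_GC and Q_GC, for instance 11/13 ≈ 0.85 and 1/13 ≈ 0.077.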
Following the example above, the calculation gives:
P_GA = {(g1,a1)->1, (g1,a2)->1, (g2,a3)->1, (g2,a4)->1, (g3,a5)->1, (g1,a6)->1, (g3,a7)->1, (g2,a8)->1, (g3,a9)->1};
P_UC = {(u1,c1)->1, (u2,c2)->1, (u3,c2)->1, (u4,c3)->1, (u5,c3)->1, (u6,c2)->1, (u7,c1)->0.1, (u7,c2)->0.2, (u7,c3)->0.7, (u8,c1)->1, (u9,c1)->1, (u10,c2)->1};
P_GC = {(g1,c1)->0.85, (g1,c2)->0.077, (g1,c3)->0.077, (g2,c1)->0.17, (g2,c2)->0.67, (g2,c3)->0.17, (g3,c1)->0.091, (g3,c2)->0.091, (g3,c3)->0.82};
Q_GC = {(g1,c1)->0.85, (g2,c1)->0.077, (g3,c1)->0.077, (g1,c2)->0.17, (g2,c2)->0.67, (g3,c2)->0.17, (g1,c3)->0.091, (g2,c3)->0.091, (g3,c3)->0.82}.
In addition, in both mode 1 and mode 2, for the existing mapping relations G0 between users and user types, the similarity between such a user (a user with an initial user type) and each user type is not calculated in the similarity calculation step, and the user type of that user is not changed in the subsequent classification step; for the existing mapping relations C0 between content and content types, the similarity between such a content (a content with an initial content type) and each content type is not calculated in the similarity calculation step, and the content type of that content is not changed in the subsequent classification step.
After the similarity calculation step, the method proceeds to step S208 (the classification step). In step S208, for each user in the user set (except users with an initial user type), the user type with the highest similarity to the user is selected as that user's user type; for each content in the content set (except contents with an initial content type), the content type with the highest similarity to the content is selected as that content's content type.
In the above steps, the pseudocode for calculating the similarity between each user in the user set and each user type in the user type set, and for selecting a user type for the user according to the similarity, is as follows:
where S_G denotes the set of similarities between the current user u and the user types g ∈ G.
Following the example above, using mode 2, the calculation gives: G_U = {u1->g1, u2->g2, u3->g2, u4->g3, u5->g3, u6->g2, u7->g3, u8->g1, u9->g1, u10->g2}.
In the above steps, the pseudocode for calculating the similarity between each content in the content set and each content type in the content type set, and for selecting a content type for the content according to the similarity, is as follows:
where S_C denotes the set of similarities between the current content a and the content types c ∈ C.
Following the example above, using mode 2, the calculation gives: C_A = {a1->c1, a2->c1, a3->c2, a4->c2, a5->c3, a6->c1, a7->c3, a8->c2, a9->c3}.
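The selection step amounts to taking, for each user or content, the type with the highest similarity. The sketch below is not the original pseudocode; it is an illustration under two assumptions: the minimum-based coefficient is used as the similarity, and the per-content-type vector is taken from Q_GC. The ratio values are those of the worked example, and the helper names are hypothetical.

```python
def min_sim(p, q):
    """Minimum-based similarity of two sparse ratio vectors (dicts)."""
    return sum(min(v, q.get(k, 0.0)) for k, v in p.items())

def classify(item_vec, type_vecs):
    """Return the type whose vector is most similar to item_vec."""
    return max(type_vecs, key=lambda t: min_sim(item_vec, type_vecs[t]))

# Rows of P_GC from the worked example (user type -> content-type ratios).
P_GC = {
    "g1": {"c1": 0.85, "c2": 0.077, "c3": 0.077},
    "g2": {"c1": 0.17, "c2": 0.67, "c3": 0.17},
    "g3": {"c1": 0.091, "c2": 0.091, "c3": 0.82},
}
# Columns of Q_GC from the worked example (content type -> user-type ratios).
Q_GC = {
    "c1": {"g1": 0.85, "g2": 0.077, "g3": 0.077},
    "c2": {"g1": 0.17, "g2": 0.67, "g3": 0.17},
    "c3": {"g1": 0.091, "g2": 0.091, "g3": 0.82},
}

# P_UC row of user u7 and P_GA column of content a4, both from the example.
user_type_u7 = classify({"c1": 0.1, "c2": 0.2, "c3": 0.7}, P_GC)
content_type_a4 = classify({"g2": 1.0}, Q_GC)
```

Under these assumptions the sketch assigns u7 to g3 and a4 to c2, which agrees with the G_U and C_A mappings listed for the example.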
There are many algorithms for calculating the similarity between two vectors, and those skilled in the art may choose among them as needed. In addition, in embodiments of the present invention, each visit capacity vector is a sparse vector. To save computation and storage overhead during the calculation, in the similarity calculation step the domains of the two vectors (each vector being a mapping) may first be merged, by taking either their intersection or their union, before the similarity of the two vectors is calculated.
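As an illustration of merging the domains of two sparse vectors, the following sketch stores each vector as a dictionary holding only its non-zero entries and aligns the two over either the union or the intersection of their keys. The function name `align` and the sample vectors are hypothetical.

```python
def align(x, y, mode="union"):
    """Merge the key sets (domains) of two sparse vectors and densify both.

    mode="union" keeps every key present in either vector (missing -> 0);
    mode="intersection" keeps only keys present in both vectors.
    """
    if mode == "union":
        keys = sorted(set(x) | set(y))
    elif mode == "intersection":
        keys = sorted(set(x) & set(y))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return keys, [x.get(k, 0) for k in keys], [y.get(k, 0) for k in keys]

# Hypothetical sparse vectors: only non-zero visit capacities are stored.
x = {"c1": 1, "c3": 7}
y = {"c1": 11, "c2": 1}
union_keys, ux, uy = align(x, y, "union")
inter_keys, ix, iy = align(x, y, "intersection")
```

The intersection is sufficient for measures in which a zero in either vector contributes nothing (for example a dot product), while the union is needed whenever a norm or sum over all non-zero components of each vector is involved.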
Three algorithms for calculating the similarity between two vectors are given below. Let x and y be n-dimensional vectors (n > 0) whose components are x_1, x_2, …, x_n and y_1, y_2, …, y_n respectively.
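The sum of the components of each vector is:
$$\sum x = \sum_i x_i = x_1 + x_2 + \dots + x_n$$
$$\sum y = \sum_i y_i = y_1 + y_2 + \dots + y_n$$
The vectors are normalized. The normalized vectors p and q have components p_1, p_2, …, p_n and q_1, q_2, …, q_n respectively, where $p_i = x_i / \sum x$ and $q_i = y_i / \sum y$.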
(1) The similarity between vectors x and y uses the minimum-based similarity coefficient, with the following formula:
where minsim is the calculated similarity.
(2) The similarity between vectors x and y uses the Bhattacharyya coefficient, with the following formula:
where BC is the calculated similarity.
(3) The similarity between vectors x and y uses the cosine similarity coefficient, with the following formula:
where cossim is the calculated similarity.
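The three measures named above can be sketched as follows. The formulas used are the commonly cited definitions (sum of component-wise minima of the normalized vectors, sum of square roots of component-wise products of the normalized vectors, and the normalized dot product); whether they match the original formulas term for term is an assumption, and the function names are illustrative.

```python
import math

def normalize(v):
    """Scale a non-negative vector so that its components sum to 1."""
    total = sum(v)
    return [a / total for a in v] if total else list(v)

def min_sim(x, y):
    """Minimum-based coefficient: sum of component-wise minima of p and q."""
    p, q = normalize(x), normalize(y)
    return sum(min(a, b) for a, b in zip(p, q))

def bhattacharyya(x, y):
    """Bhattacharyya coefficient: sum of sqrt(p_i * q_i)."""
    p, q = normalize(x), normalize(y)
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def cos_sim(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0
```

All three return 1 for identical non-zero vectors and 0 for vectors with no overlapping non-zero components, so any of them can be used to rank candidate types.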
After the classification step, the method proceeds to step S210. In step S210, it is determined whether a predetermined condition (the iteration termination condition) is met. If the predetermined condition is not met, the method returns to step S204 (entering the next iteration); that is, the visit capacity calculation step is triggered to recalculate the visit capacities and the similarity calculation step is triggered to recalculate the similarities, after which the selection and classification are performed again in the classification step. If the predetermined condition is met, the triggering is no longer performed (the algorithm ends and iteration stops), and the classification result of the classification step is output as the final result. The predetermined condition may be: the number of times the visit capacity calculation step and the similarity calculation step have been triggered reaches a preset number (e.g., 30); or, comparing the current classification result with the previous one, the proportion of users whose user type has changed is below a preset first threshold (e.g., 90%) and the proportion of contents whose content type has changed is below a preset second threshold (e.g., 90%).
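The termination test can be sketched as a small predicate. The default values below are the example figures quoted in the text (30 iterations, 90% thresholds); the function and parameter names are hypothetical.

```python
def changed_ratio(previous, current):
    """Fraction of keys whose assigned type differs between two iterations."""
    if not current:
        return 0.0
    changed = sum(1 for k, t in current.items() if previous.get(k) != t)
    return changed / len(current)

def should_stop(iteration, prev_users, cur_users, prev_contents, cur_contents,
                max_iterations=30, user_threshold=0.9, content_threshold=0.9):
    """Stop when the iteration cap is reached or both change ratios are small."""
    if iteration >= max_iterations:
        return True
    return (changed_ratio(prev_users, cur_users) < user_threshold
            and changed_ratio(prev_contents, cur_contents) < content_threshold)
```

A caller would evaluate `should_stop` once per iteration, after the classification step, passing the previous and current user and content assignments.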
Following the example above, the result of the 3rd iteration is identical to that of the 2nd iteration, so the calculation can end, with the result G_U = {u1->g1, u2->g2, u3->g2, u4->g3, u5->g3, u6->g2, u7->g3, u8->g2, u9->g1, u10->g2}; C_A = {a1->c1, a2->c3, a3->c2, a4->c2, a5->c3, a6->c1, a7->c3, a8->c2, a9->c3, a10->c2}.
Fig. 3 shows a structural diagram of a device for classifying users and content according to an embodiment of the invention. The device resides in a computing device and is adapted to cluster the users in a user set into a first predetermined number of user types and to cluster the contents in a content set into a second predetermined number of content types.
Referring to Fig. 3, the device includes an initialization module 310, a visit capacity computing module 320, a similarity calculation module 330, and a classification module 340.
The initialization module 310 is adapted to assign one or more users in the user set to each of the first predetermined number of user types, and to assign one or more contents in the content set to each of the second predetermined number of content types. When the initialization is performed, there may be no prior conditions at all, i.e., no user type has any user and no content type has any content. Alternatively, there may be some prior conditions, obtained by manually classifying some of the users and contents (the user type so assigned to a user is called that user's initial user type, and the content type so assigned to a content is called that content's initial content type); for example, some user types already have one or more users, and/or some content types already have one or more contents.
Accordingly, the initialization module 310 may, according to the existing mapping relations between users and user types, assign the corresponding one or more users to each user type that already has one or more users, and randomly assign to each user type that has no user one user that has no user type; and may, according to the existing mapping relations between contents and content types, assign the corresponding one or more contents to each content type that already has one or more contents, and randomly assign to each content type that has no content one content that has no content type.
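The initialization described above can be sketched as follows, assuming assignments are kept in a dictionary from item to type. The function name `initialize`, the `known` parameter for prior mappings, and the fixed random seed are illustrative assumptions.

```python
import random

def initialize(items, types, known=None, seed=0):
    """Seed each type with at least one item.

    `known` maps item -> type for items that already have an initial type.
    Every type that is still empty receives one randomly chosen untyped item.
    """
    rng = random.Random(seed)
    assignment = dict(known or {})
    unassigned = [i for i in items if i not in assignment]
    rng.shuffle(unassigned)
    occupied = set(assignment.values())
    for t in types:
        if t not in occupied and unassigned:
            assignment[unassigned.pop()] = t
    return assignment

users = [f"u{i}" for i in range(1, 11)]
seeded = initialize(users, ["g1", "g2", "g3"], known={"u1": "g1"})
```

The same function can seed content types by passing the content identifiers and content type names instead.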
The visit capacity computing module 320 is adapted to calculate, from users' visit capacities to contents, a first visit capacity of each user type to each content, a second visit capacity of each user to each content type, and a third visit capacity of each user type to each content type.
The visit capacity computing module 320 may calculate the visit capacity of a given user type to a given content as follows: first, obtain all users included in the user type; then, obtain the visit capacity of each user in the user type to the content; finally, sum all the visit capacities obtained to give the visit capacity of the user type to the content.
The visit capacity computing module 320 may calculate the visit capacity of a given user to a given content type as follows: first, obtain all contents included in the content type; then, obtain the user's visit capacity to each content in the content type; finally, sum all the visit capacities obtained to give the visit capacity of the user to the content type.
The visit capacity computing module 320 may calculate the visit capacity of a given user type to a given content type as follows: first, obtain all users included in the user type and all contents included in the content type; then, obtain the visit capacity of each user in the user type to each content in the content type; finally, sum all the visit capacities obtained (or, after summing, increase the result by α, 0≤α≤1) to give the visit capacity of the user type to the content type.
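The three aggregations above can be computed in a single pass over the raw access records. The sketch below assumes the records are stored as a dictionary keyed by (user, content) and that α is added to every (user type, content type) pair, including pairs with no access, as in the worked example. The function name and toy data are hypothetical.

```python
from collections import defaultdict

def aggregate(access, user_type, content_type, alpha=1.0):
    """Build F_GA, F_UC and F_GC from raw (user, content) visit capacities.

    Users or contents without a type are skipped for the maps that need it.
    `alpha` is added to every (user type, content type) pair.
    """
    f_ga, f_uc = defaultdict(float), defaultdict(float)
    f_gc = {(g, c): alpha
            for g in set(user_type.values())
            for c in set(content_type.values())}
    for (u, a), n in access.items():
        g, c = user_type.get(u), content_type.get(a)
        if g is not None:
            f_ga[(g, a)] += n
        if c is not None:
            f_uc[(u, c)] += n
        if g is not None and c is not None:
            f_gc[(g, c)] += n
    return dict(f_ga), dict(f_uc), f_gc

# Hypothetical toy data; u1's counts mirror the worked example.
access = {("u1", "a1"): 10, ("u1", "a2"): 2, ("u2", "a3"): 5}
f_ga, f_uc, f_gc = aggregate(access, {"u1": "g1"}, {"a1": "c1", "a3": "c2"})
```

Iterating over access records rather than over all user and content pairs keeps the cost proportional to the number of recorded accesses.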
The similarity calculation module 330 is adapted to calculate the similarity between each user and each user type from the second visit capacity and the third visit capacity, and the similarity between each content and each content type from the first visit capacity and the third visit capacity. For existing mapping relations between users and user types, the similarity calculation module 330 does not calculate the similarity between such a user and the user types; for existing mapping relations between contents and content types, the similarity calculation module 330 does not calculate the similarity between such a content and the content types.
In one implementation, for a given user in the user set, the user's visit capacity to each content type in the content type set is obtained, yielding one visit capacity vector; for a given user type in the user type set, the user type's visit capacity to each content type in the content type set is obtained, yielding another visit capacity vector. The similarity of these two visit capacity vectors is then calculated and taken as the similarity between the user and the user type.
For a given content in the content set, the visit capacity of each user type in the user type set to that content is obtained, yielding one visit capacity vector; for a given content type in the content type set, the visit capacity of each user type in the user type set to that content type is obtained, yielding another visit capacity vector. The similarity of these two visit capacity vectors is then calculated and taken as the similarity between the content and the content type.
In another implementation, to improve the accuracy of the algorithm, various visit capacity ratios are additionally calculated from the visit capacities, and the similarity is calculated from these ratios, including:
a) calculating the similarity between each user and each user type from, for each user, the user's visit capacity ratio to each content type and, for each user type, the user type's visit capacity ratio to each content type;
b) calculating the similarity between each content and each content type from, for each content, the visit capacity ratio of each user type that has accessed the content and, for each user type, the user type's visit capacity ratio to each content type.
There are many algorithms for calculating the similarity between two vectors, and those skilled in the art may choose among them as needed. For example, the similarity may be the minimum-based similarity coefficient, the Bhattacharyya coefficient, or the cosine similarity coefficient. In addition, in embodiments of the present invention, each visit capacity vector is a sparse vector. To save computation and storage overhead during the calculation, the similarity calculation module 330 may first merge the domains of the two vectors (each vector being a mapping), by taking either their intersection or their union, before calculating the similarity of the two vectors.
The classification module 340 is adapted to select, for each user, the user type with the highest similarity to the user as that user's user type and, for each content, the content type with the highest similarity to the content as that content's content type; to trigger the visit capacity computing module 320 to recalculate the visit capacities and the similarity calculation module 330 to recalculate the similarities, and then to perform the selection again; and to stop the triggering once the predetermined condition is met. For existing mapping relations between users and user types, the classification module 340 does not change the user type of such a user; for existing mapping relations between contents and content types, the classification module 340 does not change the content type of such a content.
The predetermined condition may be: the number of times the visit capacity computing module 320 and the similarity calculation module 330 have been triggered reaches a preset number; or, comparing the current classification result with the previous one, the proportion of users whose user type has changed is below a preset first threshold and the proportion of contents whose content type has changed is below a preset second threshold.
The scheme for classifying users and content according to the present invention performs a biclustering analysis of website users and content. It does not need to know many attributes of the content; from each user's visit capacity to each content alone, it can classify users and content simultaneously in one pass, grouping users into user types and contents into content types. Moreover, the per-iteration computation of the scheme is much smaller than that of existing algorithms such as PLSA and LDA.
The computational complexity of the scheme for classifying users and content according to embodiments of the present invention (the biclustering algorithm) is analyzed as follows:
Let the number of connections between users and contents be L, the number of user types be S, and the number of content types be T; the computation per iteration is then O(L*(S+T)). For a typical website, the number of users M and the number of contents N are both large, while the connections between users and contents are sparse. The biclustering algorithm does not need to traverse all user-content pairs (number of users × number of contents), so its computation is modest. By contrast, the per-iteration computation of PLSA and LDA is O(M*N*S); since PLSA and LDA have only one intermediate layer, S = T may be assumed. When L is far smaller than M*N, the biclustering algorithm shows a clear advantage.
The effects achievable by embodiments of the present invention are illustrated as follows:
The biclustering method of the embodiment was used to cluster 672069 users, 722618 contents, and 259255531 user-content accesses. Clustering into 500 classes took 22 minutes, and clustering into 10000 classes took 4451 minutes. Using PLSA on the same data, computing 500-dimensional vectors took 485 minutes, and memory overflowed when computing 10000 dimensions; by linear estimation, the latter would take 9704 minutes. When clustering into 500 classes, the computational efficiency of the invention is approximately 22 times that of PLSA, and when clustering into 10000 classes it is 2.2 times that of PLSA; the efficiency gain is more pronounced when the number of clusters differs greatly in magnitude from the size of the original data. The comparison is shown in the bar chart of Fig. 4.
Note: the method herein was implemented on the Spark computing platform, while PLSA was implemented as a C program with MPI. In terms of code execution efficiency the C program should be faster, so the actual gain in algorithmic efficiency should be larger than the figures given here.
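The reported figures can be checked with simple arithmetic. The sketch below uses only the numbers quoted in the text; the operation-count ratio ignores constant factors and assumes S = T, so it is an order-of-magnitude indication rather than a prediction of running time.

```python
# Figures quoted in the text.
M, N, L = 672_069, 722_618, 259_255_531   # users, contents, user-content accesses

def cost_ratio(S, T):
    """Ratio of O(M*N*S) to O(L*(S+T)) operation counts, constants ignored."""
    return (M * N * S) / (L * (S + T))

ratio_500 = cost_ratio(500, 500)

# Wall-clock speedups implied by the reported running times (minutes).
speedup_500 = 485 / 22
speedup_10000 = 9704 / 4451
```

The per-iteration operation-count ratio comes out in the high hundreds, while the measured speedups are about 22 and about 2.2; the two kinds of figure are not directly comparable, since total running time also depends on the number of iterations and on the implementation platform.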
Fig. 5 is a block diagram of an example computing device 900 arranged to implement the method of classifying users and content according to the present invention.
In a basic configuration 902, the computing device 900 typically includes a system memory 906 and one or more processors 904. A memory bus 908 may be used for communication between the processors 904 and the system memory 906.
Depending on the desired configuration, the processor 904 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 904 may include one or more levels of cache, such as a level-one cache 910 and a level-two cache 912, a processor core 914, and registers 916. An example processor core 914 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 918 may be used together with the processor 904, or in some implementations the memory controller 918 may be an internal part of the processor 904.
Depending on the desired configuration, the system memory 906 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 906 may include an operating system 920, one or more applications 922, and program data 924. The application 922 may include a device 926 for classifying users and content, arranged to carry out the method of classifying users and content. The program data 924 may include the visit capacities 928 of users to content as described herein. In some embodiments, the application 922 may be arranged to operate on the operating system using the program data 924.
The computing device 900 may also include an interface bus 940 that facilitates communication from various interface devices (for example, output devices 942, peripheral interfaces 944, and communication devices 946) to the basic configuration 902 via a bus/interface controller 930. Example output devices 942 include a graphics processing unit 948 and an audio processing unit 950, which may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 952. Example peripheral interfaces 944 may include a serial interface controller 954 and a parallel interface controller 956, which may be configured to facilitate communication, via one or more I/O ports 958, with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 946 may include a network controller 960, which may be arranged to facilitate communication with one or more other computing devices 962 over a network communication link via one or more communication ports 964.
A network communication link may be one example of a communication medium. Communication media may typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used herein may include both storage media and communication media.
The computing device 900 may be implemented as part of a small-size portable (or mobile) electronic device, such as a cell phone, a personal digital assistant (PDA), a personal media player device, a wireless web-browsing device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. The computing device 900 may also be implemented as a personal computer, including both desktop and notebook computer configurations.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.
In the specification provided herein, numerous specific details are set forth. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the devices of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for classifying users and content according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims (15)

1. A device for classifying users and content, residing in a computing device and adapted to cluster the users in a user set into a first predetermined number of user types and to cluster the contents in a content set into a second predetermined number of content types, the device comprising:
an initialization module, adapted to assign one or more users in the user set to each of the first predetermined number of user types, and to assign one or more contents in the content set to each of the second predetermined number of content types;
a visit capacity computing module, adapted to calculate, from users' visit capacities to contents, a first visit capacity of each user type to each content, a second visit capacity of each user to each content type, and a third visit capacity of each user type to each content type;
a similarity calculation module, adapted to calculate the similarity between each user and each user type from the second visit capacity and the third visit capacity, and the similarity between each content and each content type from the first visit capacity and the third visit capacity; and
a classification module, adapted to select, for each user, the user type with the highest similarity to the user as that user's user type and, for each content, the content type with the highest similarity to the content as that content's content type; to trigger the visit capacity computing module to recalculate the visit capacities and the similarity calculation module to recalculate the similarities, and then to perform the selection again; and to stop the triggering once a predetermined condition is met.
2. The device of claim 1, wherein the initialization module is further adapted to: according to existing mapping relations between users and user types, assign the corresponding one or more users to each user type that already has one or more users, and randomly assign to each user type that has no user one user that has no user type; and, according to existing mapping relations between contents and content types, assign the corresponding one or more contents to each content type that already has one or more contents, and randomly assign to each content type that has no content one content that has no content type.
3. The device of claim 2, wherein, for existing mapping relations between users and user types, the similarity calculation module does not calculate the similarity between such a user and the user types, and the classification module does not change the user type of such a user;
and, for existing mapping relations between contents and content types, the similarity calculation module does not calculate the similarity between such a content and the content types, and the classification module does not change the content type of such a content.
4. The device as claimed in claim 2, wherein the visit capacity computing module calculates the visit capacity of a given user type to a given content as follows: obtain all users included in the user type; obtain the visit capacity of each of those users to the content; sum all of these visit capacities to obtain the visit capacity of the user type to the content;
The visit capacity computing module calculates the visit capacity of a given user to a given content type as follows: obtain all contents included in the content type; obtain the visit capacity of the user to each of those contents; sum all of these visit capacities to obtain the visit capacity of the user to the content type;
The visit capacity computing module calculates the visit capacity of a given user type to a given content type as follows: obtain all users included in the user type and all contents included in the content type; obtain the visit capacity of each of those users to each of those contents; sum all of these visit capacities to obtain the visit capacity of the user type to the content type.
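The three sums in this claim can be computed in a single pass over the raw visit records. The sketch below assumes the visits are held in a dictionary keyed by `(user, content)`; the function name `aggregate_visits` and the data layout are illustrative assumptions, not taken from the patent.

```python
from collections import defaultdict


def aggregate_visits(visits, user_type, content_type):
    """Aggregate raw visit counts into the first, second and third visit capacities.

    visits: dict (user, content) -> visit count.
    user_type / content_type: dict id -> type label for ids that currently have a type.
    Returns three dicts keyed by (user type, content), (user, content type)
    and (user type, content type) respectively.
    """
    first = defaultdict(float)    # user type -> content
    second = defaultdict(float)   # user -> content type
    third = defaultdict(float)    # user type -> content type
    for (user, content), count in visits.items():
        ut = user_type.get(user)
        ct = content_type.get(content)
        if ut is not None:
            first[(ut, content)] += count
        if ct is not None:
            second[(user, ct)] += count
        if ut is not None and ct is not None:
            third[(ut, ct)] += count
    return dict(first), dict(second), dict(third)
```

Users or contents that do not yet have a type simply contribute nothing to the sums that need their type.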
5. The device as claimed in claim 4, wherein the similarity is a similarity coefficient based on the minimum value, the Bhattacharyya similarity coefficient, or the cosine similarity coefficient.
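For orientation, commonly used forms of the three coefficients are sketched below. The claim does not fix the exact formulas, so the normalization of each vector to sum to 1 for the minimum-based and Bhattacharyya coefficients is an assumption, as are the function names.

```python
import math


def _normalize(v):
    """Scale a non-negative vector so its entries sum to 1 (assumed normalization)."""
    total = sum(v)
    return [x / total for x in v] if total > 0 else [0.0] * len(v)


def min_similarity(a, b):
    """Minimum-based coefficient: sum of element-wise minima of the normalized vectors."""
    p, q = _normalize(a), _normalize(b)
    return sum(min(x, y) for x, y in zip(p, q))


def bhattacharyya_similarity(a, b):
    """Bhattacharyya coefficient: sum of sqrt(p_i * q_i) over the normalized vectors."""
    p, q = _normalize(a), _normalize(b)
    return sum(math.sqrt(x * y) for x, y in zip(p, q))


def cosine_similarity(a, b):
    """Cosine coefficient: dot product divided by the product of Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0
```

All three return values near 1 for proportional visit capacity vectors and 0 for vectors with no overlapping non-zero entries.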
6. The device as claimed in claim 5, wherein the similarity calculation module calculates the similarity between a given user and a given user type as follows: for a given user in the user set, obtain the visit capacity of the user to each content type in the content type set, yielding one visit capacity vector; for a given user type in the user type set, obtain the visit capacity of the user type to each content type in the content type set, yielding another visit capacity vector; calculate the similarity of the two visit capacity vectors, and take the calculated similarity as the similarity between the user and the user type;
The similarity calculation module calculates the similarity between a given content and a given content type as follows: for a given content in the content set, obtain the visit capacity of each user type in the user type set to the content, yielding one visit capacity vector; for a given content type in the content type set, obtain the visit capacity of each user type in the user type set to the content type, yielding another visit capacity vector; calculate the similarity of the two visit capacity vectors, and take the calculated similarity as the similarity between the content and the content type;
Wherein the user type set is the set formed by the first predetermined number of user types, and the content type set is the set formed by the second predetermined number of content types;
Wherein, before calculating the similarity of two vectors, the similarity calculation module first takes the intersection or the union of the domains of the two vectors, and then calculates the similarity of the two vectors.
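The user-to-user-type similarity, including the intersection or union of the two vectors' domains, can be sketched as below. The sparse dictionary representation, the choice of cosine as the coefficient, and the names `align`, `cosine` and `user_type_similarity` are assumptions made for illustration.

```python
import math


def align(vec_a, vec_b, mode="union"):
    """Restrict two sparse vectors (dict key -> value) to a common domain.

    mode="intersection" keeps keys present in both vectors;
    mode="union" keeps keys present in either, filling gaps with 0.
    """
    if mode not in ("union", "intersection"):
        raise ValueError("mode must be 'union' or 'intersection'")
    keys_a, keys_b = set(vec_a), set(vec_b)
    keys = sorted(keys_a & keys_b if mode == "intersection" else keys_a | keys_b)
    return [vec_a.get(k, 0.0) for k in keys], [vec_b.get(k, 0.0) for k in keys]


def cosine(a, b):
    """Cosine similarity of two equally long dense vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def user_type_similarity(user, utype, second, third, mode="union"):
    """Similarity between one user and one user type.

    second: dict (user, content type) -> visit capacity.
    third: dict (user type, content type) -> visit capacity.
    """
    user_vec = {ct: n for (u, ct), n in second.items() if u == user}
    type_vec = {ct: n for (ut, ct), n in third.items() if ut == utype}
    a, b = align(user_vec, type_vec, mode)
    return cosine(a, b)
```

The content-to-content-type similarity follows the same pattern, with the first and third visit capacities indexed by user type instead.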
7. The device as claimed in claim 1, wherein the predetermined condition is: the number of times the visit capacity computing module and the similarity calculation module have been triggered reaches a preset number; or, when the current classification results are compared with the previous classification results, the proportion of users whose user type has changed is less than a preset first threshold and the proportion of contents whose content type has changed is less than a preset second threshold.
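The two alternative stopping conditions can be sketched as a single check. The function names and the argument layout are hypothetical; the thresholds and the iteration limit are the preset values the claim refers to.

```python
def changed_ratio(previous, current):
    """Fraction of ids in `current` whose type differs from `previous`."""
    if not current:
        return 0.0
    changed = sum(1 for key, label in current.items() if previous.get(key) != label)
    return changed / len(current)


def should_stop(iteration, max_iterations,
                prev_users, cur_users, prev_contents, cur_contents,
                user_threshold, content_threshold):
    """Return True when either stopping condition of the claim holds."""
    if iteration >= max_iterations:
        return True  # the preset number of triggerings has been reached
    return (changed_ratio(prev_users, cur_users) < user_threshold
            and changed_ratio(prev_contents, cur_contents) < content_threshold)
```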
8. A method of classifying users and contents, performed in a computing device, adapted to cluster the users in a user set into a first predetermined number of user types and to cluster the contents in a content set into a second predetermined number of content types, the method comprising:
Initialization step: assigning one or more users in the user set to each user type of the first predetermined number of user types, and assigning one or more contents in the content set to each content type of the second predetermined number of content types;
Visit capacity calculation step: according to the visit capacity of each user to each content, calculating a first visit capacity of each user type to each content, a second visit capacity of each user to each content type, and a third visit capacity of each user type to each content type;
Similarity calculation step: according to the second visit capacity and the third visit capacity, calculating the similarity between each user and each user type, and, according to the first visit capacity and the third visit capacity, calculating the similarity between each content and each content type; and
Classifying step: for each user, selecting the user type with the highest similarity to that user as the user type of the user, and, for each content, selecting the content type with the highest similarity to that content as the content type of the content; triggering the visit capacity calculation step to recalculate the visit capacities and the similarity calculation step to recalculate the similarities, and then repeating the selection; and, once a predetermined condition is met, no longer performing the triggering.
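Taken together, the four steps of this claim form an iterative co-clustering loop. The following is a compact sketch under stated assumptions: cosine is used as the similarity, the loop stops when no assignment changes or after `max_rounds` rounds (a special case of the predetermined condition), and all names (`co_cluster`, `user_seed`, `content_seed`) are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def co_cluster(visits, user_seed, content_seed, max_rounds=10):
    """Alternately re-aggregate visit capacities and reassign types.

    visits: dict (user, content) -> visit count.
    user_seed / content_seed: dict id -> type, the initial assignment;
    every type is assumed to have at least one seeded member.
    """
    users = sorted({u for u, _ in visits})
    contents = sorted({c for _, c in visits})
    user_types = sorted(set(user_seed.values()))
    content_types = sorted(set(content_seed.values()))
    user_type, content_type = dict(user_seed), dict(content_seed)
    for _ in range(max_rounds):
        # Visit capacity calculation step.
        first, second, third = defaultdict(float), defaultdict(float), defaultdict(float)
        for (u, c), n in visits.items():
            ut, ct = user_type.get(u), content_type.get(c)
            if ut is not None:
                first[(ut, c)] += n
            if ct is not None:
                second[(u, ct)] += n
            if ut is not None and ct is not None:
                third[(ut, ct)] += n
        # Similarity calculation and classifying steps.
        new_user_type = {
            u: max(user_types, key=lambda t: cosine(
                [second[(u, ct)] for ct in content_types],
                [third[(t, ct)] for ct in content_types]))
            for u in users}
        new_content_type = {
            c: max(content_types, key=lambda t: cosine(
                [first[(ut, c)] for ut in user_types],
                [third[(ut, t)] for ut in user_types]))
            for c in contents}
        if new_user_type == user_type and new_content_type == content_type:
            break  # nothing changed: stop triggering
        user_type, content_type = new_user_type, new_content_type
    return user_type, content_type
```

Ties between equally similar types are broken here by taking the first type in sorted order, which is a design choice of this sketch rather than something the claim specifies.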
9. The method as claimed in claim 8, wherein, in the initialization step: according to existing mapping relations between users and user types, each user type that already has one or more users is assigned those one or more users, and each user type that has no user is randomly assigned one user that has no user type; and, according to existing mapping relations between contents and content types, each content type that already has one or more contents is assigned those one or more contents, and each content type that has no content is randomly assigned one content that has no content type.
10. The method as claimed in claim 9, wherein, for a user covered by an existing mapping relation between users and user types, the similarity between that user and each user type is not calculated in the similarity calculation step, and the user type of that user is not changed in the classifying step;
For a content covered by an existing mapping relation between contents and content types, the similarity between that content and each content type is not calculated in the similarity calculation step, and the content type of that content is not changed in the classifying step.
11. The method as claimed in claim 9, wherein, in the visit capacity calculation step, the visit capacity of a given user type to a given content is calculated as follows: obtain all users included in the user type; obtain the visit capacity of each of those users to the content; sum all of these visit capacities to obtain the visit capacity of the user type to the content;
The visit capacity of a given user to a given content type is calculated as follows: obtain all contents included in the content type; obtain the visit capacity of the user to each of those contents; sum all of these visit capacities to obtain the visit capacity of the user to the content type;
The visit capacity of a given user type to a given content type is calculated as follows: obtain all users included in the user type and all contents included in the content type; obtain the visit capacity of each of those users to each of those contents; sum all of these visit capacities to obtain the visit capacity of the user type to the content type.
12. The method as claimed in claim 11, wherein the similarity is a similarity coefficient based on the minimum value, the Bhattacharyya similarity coefficient, or the cosine similarity coefficient.
13. The method as claimed in claim 12, wherein, in the similarity calculation step,
The similarity between a given user and a given user type is calculated as follows: for a given user in the user set, obtain the visit capacity of the user to each content type in the content type set, yielding one visit capacity vector; for a given user type in the user type set, obtain the visit capacity of the user type to each content type in the content type set, yielding another visit capacity vector; calculate the similarity of the two visit capacity vectors, and take the calculated similarity as the similarity between the user and the user type; and
The similarity between a given content and a given content type is calculated as follows: for a given content in the content set, obtain the visit capacity of each user type in the user type set to the content, yielding one visit capacity vector; for a given content type in the content type set, obtain the visit capacity of each user type in the user type set to the content type, yielding another visit capacity vector; calculate the similarity of the two visit capacity vectors, and take the calculated similarity as the similarity between the content and the content type;
Wherein the user type set is the set formed by the first predetermined number of user types, and the content type set is the set formed by the second predetermined number of content types;
Wherein, before the similarity of two vectors is calculated, the intersection or the union of the domains of the two vectors is first taken, and then the similarity of the two vectors is calculated.
14. The method as claimed in claim 8, wherein the predetermined condition is: the number of times the visit capacity calculation step and the similarity calculation step have been triggered reaches a preset number; or, when the current classification results are compared with the previous classification results, the proportion of users whose user type has changed is less than a preset first threshold and the proportion of contents whose content type has changed is less than a preset second threshold.
15. A computing device, comprising the device for classifying users and contents as claimed in any one of claims 1 to 7.
CN201510041042.4A 2015-01-27 2015-01-27 A kind of method, apparatus classified to user and content and computing device Active CN104598601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510041042.4A CN104598601B (en) 2015-01-27 2015-01-27 A kind of method, apparatus classified to user and content and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510041042.4A CN104598601B (en) 2015-01-27 2015-01-27 A kind of method, apparatus classified to user and content and computing device

Publications (2)

Publication Number Publication Date
CN104598601A CN104598601A (en) 2015-05-06
CN104598601B CN104598601B (en) 2017-12-12

Family

ID=53124386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510041042.4A Active CN104598601B (en) 2015-01-27 2015-01-27 A kind of method, apparatus classified to user and content and computing device

Country Status (1)

Country Link
CN (1) CN104598601B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170017583A (en) * 2015-08-07 2017-02-15 주식회사 더아이콘티비 Apparatus for providing contents
CN106021329A (en) * 2016-05-06 2016-10-12 西安电子科技大学 A user similarity-based sparse data collaborative filtering recommendation method
CN106101839A (en) * 2016-06-20 2016-11-09 徐汕 A kind of method identifying that television user gathers
CN107451170B (en) * 2017-03-10 2020-04-10 中山大学 Parallel PLSA method based on MPI computing framework
CN109409949A (en) * 2018-10-17 2019-03-01 北京字节跳动网络技术有限公司 Determination method, apparatus, electronic equipment and the storage medium of user group's classification
CN109933788B (en) * 2019-02-14 2023-05-23 北京百度网讯科技有限公司 Type determining method, device, equipment and medium
CN111176800A (en) * 2019-07-05 2020-05-19 腾讯科技(深圳)有限公司 Training method and device of document theme generation model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101685521A (en) * 2008-09-23 2010-03-31 北京搜狗科技发展有限公司 Method for showing advertisements in webpage and system
CN103198418A (en) * 2013-03-15 2013-07-10 北京亿赞普网络技术有限公司 Application recommendation method and application recommendation system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009163496A (en) * 2008-01-07 2009-07-23 Funai Electric Co Ltd Content reproduction system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
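Collaborative filtering recommendation algorithm based on dual clustering of items and users; Shi Hua (施华); China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》); 20090601; pp. 20-26 *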

Also Published As

Publication number Publication date
CN104598601A (en) 2015-05-06

Similar Documents

Publication Publication Date Title
CN104598601B (en) A kind of method, apparatus classified to user and content and computing device
Bai et al. A neural collaborative filtering model with interaction-based neighborhood
Kim et al. Simultaneous discovery of common and discriminative topics via joint nonnegative matrix factorization
CN110866181B (en) Resource recommendation method, device and storage medium
Tsiotas Detecting different topologies immanent in scale-free networks with the same degree distribution
Bickel et al. A nonparametric view of network models and Newman–Girvan and other modularities
US9536201B2 (en) Identifying associations in data and performing data analysis using a normalized highest mutual information score
US9208257B2 (en) Partitioning a graph by iteratively excluding edges
WO2021143267A1 (en) Image detection-based fine-grained classification model processing method, and related devices
CN107786943A (en) A kind of tenant group method and computing device
CN112085565B (en) Deep learning-based information recommendation method, device, equipment and storage medium
CN108021708B (en) Content recommendation method and device and computer readable storage medium
WO2017171826A1 (en) Entropic classification of objects
JP7083375B2 (en) Real-time graph-based embedding construction methods and systems for personalized content recommendations
Hare et al. Derivative-free optimization methods for finite minimax problems
CN107341233A (en) A kind of position recommends method and computing device
CN110647696A (en) Business object sorting method and device
Zhang et al. Advertisement click-through rate prediction based on the weighted-ELM and adaboost algorithm
CN112131261A (en) Community query method and device based on community network and computer equipment
Kagan et al. Probabilistic Search for Tracking Targets: Theory and Modern Application
CN110598123B (en) Information retrieval recommendation method, device and storage medium based on image similarity
CN112995414B (en) Behavior quality inspection method, device, equipment and storage medium based on voice call
CN113343713B (en) Intention recognition method and device, computer equipment and storage medium
Liu et al. A new robust model-free feature screening method for ultra-high dimensional right censored data
CN113408579A (en) Internal threat early warning method based on user portrait

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant