CN108415913A

CN108415913A - Crowd targeting method based on uncertain neighbours

Info

Publication number: CN108415913A
Application number: CN201710072222.8A
Authority: CN
Inventors: 周孟; 朱福喜
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-02-09
Filing date: 2017-02-09
Publication date: 2018-08-17

Abstract

The present invention is a crowd targeting method based on uncertain neighbours. It belongs to the research area of crowd targeting in Internet advertising and relates to technical fields such as user-based recommendation, profile-attack prevention, and similarity calculation. It primarily addresses the poor quality of crowd targeting in advertisement delivery, which results from the influence of many factors. A user feature prediction model is established from users' access behavior. According to user behavior, the behavior-similar crowd of the seed crowd is selected; then, based on user behavior and user features, the neighbours of each seed user are selected from the behavior-similar crowd, and the neighbours of all seed users form the candidate crowd. Finally, the crowd targeting method dynamically selects the users with higher similarity as the potential target crowd. The method can be widely applied to user recommendation in e-commerce systems, crowd targeting in advertisement delivery systems, and similar tasks, and improves the quality of crowd recommendation to a certain extent.

Description

Crowd targeting method based on uncertain neighbours

Technical field

The invention belongs to the research area of crowd targeting in Internet advertising and relates to technical fields such as user-based recommendation, profile-attack prevention, and similarity calculation. In particular, it proposes a crowd targeting method based on uncertain neighbours.

Background technology

Recommender systems: This work belongs to the research area of recommendation technology. In recent years, recommender systems have attracted growing attention from scholars, and many recommendation techniques have been proposed, such as content-based recommendation and collaborative-filtering-based recommendation. Popescul et al. extended Hofmann's aspect model to integrate three kinds of data (users, products, and product content) and used these data to target products to consumers. Arora et al. studied users' personalized content, namely aspects such as a user's interests, history, and location, and used it to recommend films to other, similar users. Ekstrand et al. studied issues such as the specific tasks, information needs, and item domains of recommender systems, carefully analysed potential users and their goals, and selected among multiple methods to make recommendations. Linden et al. used clustering and search algorithms to generate recommended users and products, scaled these recommendations to massive data sets, and produced high-quality recommendations in real time through online computation. Huang Zhenhua et al. incorporated learning to rank into recommender systems and built a user preference model by integrating a large number of user and item features, improving the performance of the recommendation algorithm and user satisfaction. Guo Lei et al. proposed a recommendation algorithm that incorporates the association relationships between recommended objects, taking into account not only the social relationships between users but also the associations between recommended objects. Rong Huigui et al. proposed a collaborative filtering recommendation method based on user similarity, in which the similarity between users is calculated from their different social relationships. Chen Kehan et al. proposed a two-stage clustering recommendation algorithm that combines a graph summarization method with a content-similarity-based algorithm to realise recommendation based on user interest. Wang et al. first classified users and assigned different weights to different behaviors, then calculated the similarity between users and generated the corresponding user recommendation set from the similar behavior between users. Koren proposed a recommendation algorithm based on matrix factorization. These results in recommendation technology provide the theoretical basis for this work.

Profile-attack prevention: In view of the influence of user profile attacks, Mobasher et al. proposed a recommendation algorithm based on the PLSA model, which clusters users using PLSA. Mehta et al. proposed a recommendation algorithm based on singular value decomposition that weakens the influence of attack profiles on recommendations and thus improves the system's resistance to attack. Sandvig et al. proposed a collaborative filtering algorithm based on association rules that enhances the stability of the recommender system. Jamali et al. introduced trust relationships between users and proposed a random walk model. Ma et al. proposed a matrix-factorization-based recommendation method that incorporates social information. Jia Dongyan et al. used user trust to propose a collaborative filtering algorithm based on a dual neighbour selection strategy, which completes the recommendation for the target user. Building on this existing work, the present work uses user features and user behavior, through the similarity between users, to find the neighbours of seed users, and takes all neighbours as the candidate crowd.

Similarity calculation: There is a large body of research on methods for calculating similarity. Recent work includes the following. Zhong Zhaoman et al. studied the relationships between users in microblogs and calculated the similarity between users from the accounts they follow and their followers. Liu Ming et al. proposed a similarity calculation method based on feature weight quantization, which solves the problem of inconsistent data. Li Hailin et al. proposed two similarity calculation methods for normal cloud models, which describe the overall characteristics of a normal cloud model through its expectation curve and maximum boundary curve. Xu Zhiming et al., using relationships in social networks, gave a user similarity calculation method based on users' various attribute information (background information, microblog text, and social information). Wu Yitao et al. fuzzified discrete values into trapezoidal fuzzy numbers and calculated user similarity from them. The present work proposes different similarity calculation methods suited to the characteristics of user behavior and of user features, and merges the user behavior similarity and the user feature similarity by weighting to obtain the similarity between users.

Summary of the invention

The quality of the users recommended by crowd targeting is often poor because targeting is influenced by many factors, and current related techniques handle this problem weakly. The present invention therefore aims to design a crowd targeting method based on uncertain neighbours. User features are predicted from the web pages browsed and the online media sites accessed, and the behavior-similar crowd of the seed crowd is selected according to the similarity of user behavior. Then, based on user behavior and user features, the neighbours of each seed user are selected from the behavior-similar crowd, and the neighbours of all seed users are taken as the candidate crowd. Finally, the crowd targeting method dynamically selects the users with higher similarity as the target crowd.

To achieve the above aim, the present invention proposes a crowd targeting method based on uncertain neighbours, comprising the following steps:

A: obtaining the features of users (demographic attributes and interest tags);

B: selecting the behavior-similar crowd, wherein, for a given seed crowd, the behavior similarity between users is obtained from the online media sites the users access, a corresponding threshold is set, and the users whose similarity is not less than the threshold are selected, the selected user set serving as the behavior-similar crowd;

C: selecting the candidate crowd, wherein, according to user features and user behavior, the neighbour users of each seed are selected from the behavior-similar crowd by the user similarity acquisition method, and the neighbour users of all seeds are taken as the candidate crowd; and

D: for the candidate crowd of step C, dynamically selecting, by the crowd targeting method, the set of users with higher similarity, and taking these users as the potential target crowd.

Step A further comprises the following sub-steps:

A1: predicting the demographic attributes of the user according to the online media sites accessed; and

A2: predicting the interest tags of the user according to the web pages the user browses.

In step A1, the demographic attributes are divided into 7 sub-features: gender, age, marital status, personal income, education, occupation, and industry. Each sub-feature is predicted mainly by the following method:

where M₁, M₂, ..., M_n denote the n media, and the other quantities in the formula are the event that the user belongs to category j of the k-th sub-feature, the number of users who accessed media M_i and belong to category j of the k-th sub-feature, and the probability that the user's k-th sub-feature takes category j.

In step B, the behavior similarity of user u and user v is obtained by the following method:

where D_KL(P_u||P_v) denotes the divergence of P_u from P_v, D_KL(P_v||P_u) denotes the divergence of P_v from P_u, P_u denotes the media density of user u, and P_v denotes the media density of user v. Because the divergence is asymmetric, D_KL(P_u||P_v) and D_KL(P_v||P_u) may differ.

In addition, the divergence is obtained as follows:

Assuming that P_u and P_v are the kernel density distributions of user u and user v respectively, the divergence of P_u and P_v is:

D_KL(P_u||P_v) = Σ_{i∈M} P_u(i) log( P_u(i) / P_v(i) )

where M denotes the set of media accessed, and P_u(i) and P_v(i) denote the densities with which user u and user v, respectively, access media M_i.

The density with which a user accesses media is estimated by the following method:

and

where M(u) denotes the set of media accessed by user u, h denotes the window width, the count term denotes the number of times user u accessed media M_j, the distance term denotes the distance between media M_i and media M_j, U_i denotes the set of users who accessed media M_i, and U_j denotes the set of users who accessed media M_j.

In step C, the user feature similarity is used in obtaining the user similarity, and it is obtained by the following method:

where sim_P(u, v) is the demographic-attribute similarity of the two users and sim_I(u, v) is the interest similarity of the two users.

The values of the demographic attributes fall mainly into two types, numeric and nominal. The similarity of the demographic attributes is obtained from the distances between the two users on the numeric features and on the nominal features, mainly by the following method:

where D_number denotes the distance between user u and user v over all numeric features, and D_nominal denotes the distance between user u and user v over all nominal features.

The distance on the numeric features is obtained by the following method:

where d_j denotes the distance between the two users on sub-feature j; if all d_j are 0, the default value of D_number is 1.

The distance on the nominal features is obtained by the following method:

Assuming that a nominal attribute has N possible values, all values are manually rated in order, giving the ratings r₁, r₂, ..., r_N. If the ratings of two users on the attribute are r_i and r_j respectively, the distance between the two users on the attribute is |r_i-r_j|, and the distance between the two users over all nominal features is:

where d'_j denotes the distance between the two users on sub-feature j; if all d'_j are 0, the default value of D_nominal is 1.

To obtain the interest similarity, the interest fingerprint of each user is first generated from the user's interests, and the interest similarity between users is then obtained from the interest fingerprints. The interest fingerprint is generated as follows:

1. Hashing: all interests are hashed to obtain K-bit hash values.

2. Weighting: all of the user's interests and the probability weight of each interest are extracted, and each weight is applied to the corresponding hash value bit by bit: if a bit of the hash value is 1, that bit contributes the probability weight; if the bit is 0, it contributes -1 times the probability weight.

3. Accumulation: the weighted hash values above are summed bit by bit, producing a single sequence of numbers.

4. Dimensionality reduction: the sequence of numbers obtained in the accumulation step is converted into a string of 0s and 1s, which is the final interest fingerprint.

If the value at a position is greater than 0, that position is set to 1; if it is less than 0, that position is set to 0. Finally, the K digits are concatenated in order to form the interest fingerprint.

Assuming that the interest fingerprints of user u and user v are f_u and f_v respectively, the interest similarity is obtained by the following method:

where f_ui and f_vi denote the i-th bit of the interest fingerprints of user u and user v, respectively.

The similarity between users is obtained from the user feature similarity and the user behavior similarity:

sim(u, v) = α·sim_B(u, v) + (1-α)·sim_F(u, v)

where α is the weight of the behavior similarity and 1-α is the weight of the feature similarity.

Compared with the prior art, the present invention has the following advantages:

1) The present invention can automatically identify the potential target crowd and can effectively improve the quality of the recommended crowd.

2) The present invention filters out large-scale user profile attacks, which saves a certain amount of manual effort.

3) The method of the present invention can be widely applied to user recommendation in e-commerce systems, crowd targeting in advertisement delivery systems, and similar tasks, and improves the quality of crowd recommendation to a certain extent.

Description of the drawings

Fig. 1 is a schematic diagram of the crowd targeting method based on uncertain neighbours according to a preferred embodiment of the present invention.

Fig. 2 is the interest classification schematic diagram according to the above preferred embodiment of the present invention.

Fig. 3 is the interest fingerprint generating principle figure according to the above preferred embodiment of the present invention.

Detailed description of the embodiments

The following description discloses the present invention so that those skilled in the art can implement it. The preferred embodiments described below serve only as examples, and other obvious modifications will occur to those skilled in the art. The basic principles of the invention defined in the following description can be applied to other embodiments, variations, improvements, equivalents, and other technical solutions that do not depart from the spirit and scope of the invention.

In a specific implementation, those skilled in the art can use computer software technology to run the technical solution provided by the invention as an automated process. The technical solution of the invention is described in detail below with reference to the drawings and embodiments.

Fig. 1 shows the implementation scheme of the crowd targeting method based on uncertain neighbours according to a preferred embodiment of the present invention, which is divided into the following procedures. First, the features of users, namely their demographic attributes and interests, are obtained: a user feature prediction model is established mainly from user behavior (the URLs accessed); it is divided into a demographic attribute prediction model and an interest classification model, and the features of users are predicted by these models. Then, according to user behavior, the crowd whose behavior is similar to that of the seed crowd is selected, and, according to user features and behavior, the neighbours of the seed users are selected from the behavior-similar crowd, all neighbours being taken as the candidate crowd. Finally, target users are selected automatically from the candidate crowd by the crowd targeting method.

The specific implementation steps are as follows:

Step 1, establishing the user feature prediction model: from the URLs that users access, a demographic attribute prediction model and an interest classification model are established, and the demographic attributes and interest preferences of users are then predicted.

Step 1.1, predicting the demographic attributes of users: the online media sites that a user accesses are extracted from the URLs the user accesses, and the prediction model of the demographic attributes is established from these sites.

The demographic attributes describe a user's inherent attributes, namely the 7 sub-features gender, age, personal income, marital status, education level, occupation, and industry. Taking gender as an example, users who often browse car-buying sites (www.haomaiche.com) and online game sites (www.youxi.com) are generally mostly male, whereas the overwhelming majority of users who often visit entertainment and variety-show sites are female. Therefore, when predicting the demographic attributes of a user, the domain names (i.e. websites) of the URLs the user accesses are used to establish the prediction model. Taking the prediction of sub-feature k as an example, the specific prediction model is as follows:

Assume that a user has accessed n different media M₁, M₂, ..., M_n. Given the number of users who accessed media M_i and belong to category j of the k-th sub-feature, and the event that the user belongs to category j of the k-th sub-feature, the probability that the user's sub-feature k takes category j is:

From the above model, when predicting sub-feature k, the category j with the highest class probability is selected as the class label of sub-feature k. For example, when predicting the gender sub-feature of a user, if the probability of male is greater than the probability of female, the user's gender sub-feature is male.
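The prediction formula itself is not reproduced in the text above, so the following is only an illustrative sketch. It assumes a naive Bayes style model with add-one smoothing built from the per-media, per-category user counts that the text describes. The function names, the smoothing, and the `variety.example` domain are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
import math

def train_counts(records):
    """records: iterable of (media_list, category) for users with a known label.

    Returns (media_cat, cat_total), where media_cat[m][c] is the number of
    users of category c who accessed media m, and cat_total[c] is the number
    of labelled users in category c.
    """
    media_cat = defaultdict(lambda: defaultdict(int))
    cat_total = defaultdict(int)
    for media, cat in records:
        cat_total[cat] += 1
        for m in set(media):
            media_cat[m][cat] += 1
    return media_cat, cat_total

def predict_category(visited, media_cat, cat_total, smoothing=1.0):
    """Return (best_category, posterior) under the ASSUMED naive Bayes model."""
    n_users = sum(cat_total.values())
    log_score = {}
    for cat, total in cat_total.items():
        score = math.log(total / n_users)  # prior P(category)
        for m in set(visited):
            # smoothed estimate of P(user accessed m | category)
            score += math.log((media_cat[m][cat] + smoothing) / (total + 2 * smoothing))
        log_score[cat] = score
    # normalise the log scores into probabilities
    top = max(log_score.values())
    unnorm = {c: math.exp(s - top) for c, s in log_score.items()}
    z = sum(unnorm.values())
    posterior = {c: v / z for c, v in unnorm.items()}
    return max(posterior, key=posterior.get), posterior

# Toy labelled data for the gender sub-feature.
records = [
    (["www.haomaiche.com", "www.youxi.com"], "male"),
    (["www.haomaiche.com"], "male"),
    (["variety.example"], "female"),
    (["variety.example", "www.youxi.com"], "female"),
]
media_cat, cat_total = train_counts(records)
best, posterior = predict_category(["www.haomaiche.com"], media_cat, cat_total)
```

As in the text, the category with the highest probability is taken as the label. Any other model that maps the same counts to class probabilities could be substituted.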

Step 1.2: The URLs that a user accesses reflect not only the user's demographic attributes but also the user's interest categories. This is because the content of different URL pages reflects different interest topics; for example, the page content of Haomaiche (www.haomaiche.com) is about automobiles, while the page content of game sites leans towards entertainment. Therefore, topic labels are established from the content of URL pages, the interest categories of unlabelled URL pages are predicted by the interest classification model, and, once interest prediction is complete, the predicted interest category is assigned to the users who accessed the URL.

According to this preferred embodiment of the invention, interests can be divided into 10 categories: entertainment, finance and economics, sports, digital products, tourism, automobiles, literature and art, current affairs, health care, and military. As shown in Fig. 2, interest classification mainly comprises 2 stages, model training and interest prediction: an LR classifier is first trained on a sample set, and the trained LR classifier is then used to classify the interests of the accessed pages. In the model training stage, the text of the sample URL pages is first fetched by a crawler, and the crawled sample text is pre-processed by word segmentation, stop-word filtering, and similar operations to form the segmented training samples; the LR classifier model is then trained on the processed samples. In the interest prediction stage, the web page content of the URL to be classified is first fetched and pre-processed by word segmentation, stop-word filtering, and similar operations; the interest category of the URL page is then predicted by the LR classifier model, and this interest category is taken as an interest of the users who accessed the URL.

Step 2: According to the user behavior similarity calculation method, the behavior-similar crowd of the seed crowd is selected from the whole population.

Target users essentially all share similar user behavior with the seed crowd, so when choosing the recommended crowd, the behavior-similar crowd of the seed crowd is first selected according to user behavior. User behavior consists of the series of media (or websites) that a user has accessed. Traditional similarity measures, such as cosine similarity and the Pearson correlation coefficient, consider only the media that both users have accessed and ignore the influence of other media. If the density distribution of a user over the entire media space can be estimated, then calculating the behavior similarity of two users from their media-space densities better reflects reality.

According to this preferred embodiment of the invention, the idea of kernel density estimation is used to estimate a user's density in the media space. Common kernel functions include the uniform kernel, the triangular kernel, and the Gaussian kernel, but the shape of the kernel influences the result much less than the window width does, so the embodiment uses the Gaussian kernel function to estimate a user's density in the media space.

Definition in the embodiment: let M(u) denote the set of media accessed by user u, h the window width, the count term the number of times user u accessed media M_j, and the distance term the distance between media M_i and media M_j; let U_i denote the set of users who accessed media M_i and U_j the set of users who accessed media M_j. Then the density with which a user accesses media is:

and

Kernel density estimation by the above method yields the density distribution over the entire media space. The behavior similarity between two users is then calculated from the kernel density distributions over media. According to this preferred embodiment of the invention, the KL divergence is used to calculate the behavior similarity of two users.
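The density and distance formulas are not reproduced in the text, so the sketch below is one possible reading rather than the patent's definition. It uses a Gaussian kernel with window width h, as the embodiment states. It assumes that the distance between two media is the Jaccard distance between their user sets U_i and U_j, and that the density is normalised to sum to 1 over the media space. Both assumptions, and all function names, are hypothetical.

```python
import math

def jaccard_distance(users_i, users_j):
    """ASSUMED media distance: 1 - |U_i & U_j| / |U_i | U_j|."""
    union = users_i | users_j
    if not union:
        return 0.0
    return 1.0 - len(users_i & users_j) / len(union)

def media_density(visit_counts, media_users, h=0.5):
    """Estimate one user's density over every media in the media space.

    visit_counts: {media: access count} for user u, so M(u) is its key set.
    media_users:  {media: set of users who accessed it} for the whole space.
    h:            window width of the Gaussian kernel.
    """
    if not visit_counts:
        raise ValueError("the user has not accessed any media")
    raw = {}
    for m_i, users_i in media_users.items():
        total = 0.0
        for m_j, count in visit_counts.items():
            d = jaccard_distance(users_i, media_users[m_j])
            # Gaussian kernel weighted by how often u accessed M_j
            total += count * math.exp(-(d * d) / (2.0 * h * h))
        raw[m_i] = total
    z = sum(raw.values())
    return {m: v / z for m, v in raw.items()}

media_users = {"a": {"u1", "u2"}, "b": {"u2", "u3"}, "c": {"u4"}}
p_u1 = media_density({"a": 3}, media_users, h=0.5)
```

Because the Gaussian kernel is strictly positive, every media receives a non-zero density, including media the user never accessed. This is what allows two users with no media in common to be compared.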

Definition in the embodiment: assuming that P_u and P_v are the kernel density distributions of user u and user v respectively, and M is the set of media accessed, the divergence of P_u and P_v is:

D_KL(P_u||P_v) = Σ_{i∈M} P_u(i) log( P_u(i) / P_v(i) )

Because the KL divergence is asymmetric, according to this preferred embodiment of the invention the behavior similarity of two users is calculated by the following formula:

According to this preferred embodiment of the invention, when selecting the behavior-similar crowd, each user's density in the media space is first estimated from the media the user accesses; the behavior similarity between the seed users and the other users is then calculated; finally, a corresponding threshold is set, and the set of users whose behavior similarity is not less than the threshold is selected as the behavior-similar crowd of the seed crowd.
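The selection procedure above can be sketched as follows. The KL divergence is the standard definition. The way the two directions are combined is not given in the text, so the sketch assumes the average of the two divergences mapped into (0, 1] by 1 / (1 + x). It also assumes that a user qualifies when the similarity to at least one seed user reaches the threshold. These choices and all function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) over a shared media set; q must be positive wherever p is."""
    return sum(p[i] * math.log(p[i] / q[i]) for i in p if p[i] > 0)

def behavior_similarity(p_u, p_v):
    """ASSUMED symmetric similarity built from the two KL directions."""
    sym = 0.5 * (kl_divergence(p_u, p_v) + kl_divergence(p_v, p_u))
    return 1.0 / (1.0 + sym)

def behavior_similar_crowd(seed_densities, other_densities, threshold):
    """Keep users whose best similarity to any seed is not less than threshold."""
    selected = []
    for user, p_user in other_densities.items():
        best = max(behavior_similarity(p_seed, p_user)
                   for p_seed in seed_densities.values())
        if best >= threshold:
            selected.append(user)
    return selected

p = {"a": 0.5, "b": 0.5}
r = {"a": 0.9, "b": 0.1}
crowd = behavior_similar_crowd({"s": p}, {"x": dict(p), "y": r}, threshold=0.8)
```

In this toy example `kl_divergence(p, r)` and `kl_divergence(r, p)` differ, which illustrates the asymmetry that the symmetric combination is meant to remove.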

Step 3: When selecting the candidate crowd, the method based on user similarity is first used to calculate, within the behavior-similar crowd, the similarity between each seed user and the other users. A threshold is then set, and the users whose similarity is not less than the threshold are selected as the neighbours of that seed user. Finally, the neighbour sets of all seed users are taken as the candidate crowd.

According to this preferred embodiment of the invention, potential target users are not chosen directly from the behavior-similar crowd. One reason is that, when a user has accessed few media, the behavior similarity method cannot accurately measure how similar the behavior of two users is. Another reason is that the selection process is highly susceptible to profile attacks by other users. Therefore, according to this preferred embodiment of the invention, the neighbours of the seed users are selected from the behavior-similar crowd, which filters out the users with lower similarity, and all neighbours are taken as the candidate crowd, so as to improve the quality of the recommended crowd. The candidate crowd is chosen using user similarity, which measures the degree of similarity between users through both user behavior and user features: the feature similarity and the behavior similarity of the users are calculated from their features and behavior, a corresponding weight is set for each, and the similarity between users is then calculated by weighting.

Because user features mainly comprise demographic attributes and interest categories, the similarity measures for the different features must be considered when calculating the feature similarity of users. Therefore, according to this preferred embodiment of the invention, the demographic-attribute similarity and the interest similarity of users are calculated separately, according to the kind of feature.

When calculating the similarity of the demographic attributes, the value types of the attributes must be considered. The values of the demographic attributes fall mainly into two types, numeric and nominal; therefore, according to this preferred embodiment of the invention, the similarity of user attributes is calculated from the distances between users on the numeric and nominal features.

The distance D_number on the numeric features is measured by the following method:

The distance D_nominal on the nominal features is measured by the following method:

Assuming that a nominal attribute has N possible values, all values are manually rated in order, giving the ratings r₁, r₂, ..., r_N. If the ratings of two users on the attribute are r_i and r_j respectively, the distance between the two users on the attribute is |r_i-r_j|, so the distance between the two users over all nominal features is:

Definition in the embodiment: for user u and user v, let D_number be the distance between the two users over all numeric features and D_nominal the distance between the two users over all nominal features. Then the demographic-attribute similarity of user u and user v is:

To measure interest similarity, the preferred embodiment of the invention uses a similarity calculation method based on interest fingerprints. As shown in Fig. 3, an interest fingerprint is generated from the interests of each user. The interest fingerprint is generated as follows:

1. Hashing: all interests are hashed to obtain K-bit hash values.

2. Weighting: all of the user's interests and the probability weight of each interest are extracted, and each weight is applied to the corresponding hash value bit by bit: if a bit of the hash value is 1, that bit contributes the probability weight; if the bit is 0, it contributes -1 times the probability weight.

3. Accumulation: the weighted hash values above are summed bit by bit, producing a single sequence of numbers.

4. Dimensionality reduction: the sequence of numbers obtained in the accumulation step is converted into a string of 0s and 1s, forming the final interest fingerprint. If the value at a position is greater than 0, that position is set to 1; if it is less than 0, that position is set to 0. Finally, the K digits are concatenated in order to form the interest fingerprint.
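The four steps above follow the same pattern as a SimHash-style fingerprint and can be sketched as below. The hash function (the leading K bits of MD5), the value K = 64, and the treatment of an accumulated value of exactly 0 as bit 0 are assumptions made for illustration. The comparison function, which scores two fingerprints by the fraction of matching bits, is likewise only one plausible choice, since the patent's similarity formula is not reproduced in the text.

```python
import hashlib

def interest_hash(interest, k=64):
    """Step 1 (hashing): K-bit hash of an interest label (ASSUMED: MD5 prefix, k <= 128)."""
    digest = hashlib.md5(interest.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - k)

def interest_fingerprint(weighted_interests, k=64):
    """weighted_interests: {interest: probability weight} for one user."""
    acc = [0.0] * k
    for interest, weight in weighted_interests.items():
        h = interest_hash(interest, k)
        for bit in range(k):
            # Step 2 (weighting): bit 1 adds the weight, bit 0 subtracts it.
            # Step 3 (accumulation): contributions are summed per bit position.
            acc[bit] += weight if (h >> bit) & 1 else -weight
    # Step 4 (dimensionality reduction): positive -> 1, otherwise 0.
    return [1 if a > 0 else 0 for a in acc]

def fingerprint_similarity(f_u, f_v):
    """ASSUMED measure: share of bit positions where the fingerprints agree."""
    same = sum(1 for a, b in zip(f_u, f_v) if a == b)
    return same / len(f_u)

f_u = interest_fingerprint({"automobile": 0.7, "sports": 0.3})
f_v = interest_fingerprint({"automobile": 1.0})
```

In this example the weight 0.7 outweighs 0.3 at every bit position, so the fingerprint of the first user is determined entirely by the dominant interest and equals `f_v`.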

Definition in the embodiment: assuming that the interest fingerprints of user u and user v are f_u and f_v respectively, the interest similarity of the two users is:

User features are the inherent attributes of a user and comprise two aspects, demographic attributes and interests, so the user feature similarity has two parts: the similarity of the demographic attributes and the similarity of the interests. The demographic attributes and the interests describe the user from different aspects and belong to different dimensional spaces; the demographic-attribute similarity and the interest similarity between users differ, and both affect the user feature similarity. The preferred embodiment of the invention therefore uses the harmonic mean to calculate the feature similarity of users.

Definition in the embodiment: for user u and user v, let sim_P(u, v) be the demographic-attribute similarity of the two users and sim_I(u, v) their interest similarity. Then the feature similarity of user u and user v is:

A user has not only inherent user features but also dynamic user behavior. User similarity measures the degree of similarity between users along two dimensions, user feature similarity and user behavior similarity. Because the two dimensions carry different weight, the similarity between users is calculated by weighting: a corresponding weight is set for each of the two similarities, and the two results are then combined.

Definition in the embodiment: for user u and user v, let sim_B(u, v) be the behavior similarity of the two users and sim_F(u, v) their feature similarity. Then the similarity of user u and user v is:

sim(u, v) = α·sim_B(u, v) + (1-α)·sim_F(u, v)
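The two combination steps can be illustrated with a short sketch. The weighted sum follows the formula above. The feature similarity is assumed here to be the plain (unweighted) harmonic mean of sim_P and sim_I, since the text names the harmonic mean but the formula itself is not reproduced. The function names and the default α = 0.5 are hypothetical.

```python
def feature_similarity(sim_p, sim_i):
    """ASSUMED unweighted harmonic mean of demographic and interest similarity."""
    if sim_p + sim_i == 0:
        return 0.0
    return 2.0 * sim_p * sim_i / (sim_p + sim_i)

def user_similarity(sim_b, sim_f, alpha=0.5):
    """sim(u, v) = alpha * sim_B(u, v) + (1 - alpha) * sim_F(u, v)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_b + (1.0 - alpha) * sim_f

sim_f = feature_similarity(0.5, 0.5)
sim_uv = user_similarity(0.8, 0.4, alpha=0.5)
```

The harmonic mean is dominated by the smaller of its two inputs, so a pair of users scores a high feature similarity only when both the demographic and the interest similarity are high.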

The aim of candidate crowd selection is to find, based on user behavior and user features, the neighbours that are most similar to the seed users. The process mainly comprises the following two stages:

1. For each seed user, the similarity to each user in the behavior-similar crowd is calculated.

2. A corresponding threshold is set, and the candidate crowd of the seed crowd is selected. In this stage, a similarity threshold is first set, and, for each seed user, the set of users whose similarity is not less than the threshold is selected as the neighbours of that seed user. Finally, all of the selected neighbour sets together are taken as the candidate crowd of the seed crowd.

Step 4 is not since the user in candidate crowd is the neighborhood selected for each seed user, but not Be each user has higher similitude with all seed users, then, according to this preferred embodiment of the invention, from The whole angle of seed crowd is set out, the method oriented by crowd, this method with user characteristics and user behavior be choose according to According to the similarity of each user and seed crowd in the candidate crowd of calculating, the higher user of dynamic select similarity is as latent Target group.

The crowd orientation method dynamically selects the potential target crowd from the candidate crowd and mainly comprises three stages:

1. First, calculate the similarity between each user in the candidate crowd and each seed user; then average the user's similarities to all seed users and take this average as the similarity between the user and the seed crowd.

2. From the similarities of all candidate users to the seed crowd, calculate the average similarity and use this average as the similarity threshold.

3. From the candidate crowd, select the set of users whose similarity to the seed crowd is not less than the threshold, and take these users as the potential target crowd.
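The three stages above can be sketched as follows. The names `orient_crowd` and `sim` are hypothetical, and returning an empty set when there are no candidates or no seed users is an assumption made so that the sketch is total.

```python
from statistics import mean
from typing import Callable, Hashable, Iterable, Set


def orient_crowd(
    candidates: Iterable[Hashable],
    seeds: Iterable[Hashable],
    sim: Callable[[Hashable, Hashable], float],
) -> Set[Hashable]:
    """Select candidates whose similarity to the seed crowd reaches the mean."""
    candidate_list = list(candidates)
    seed_list = list(seeds)
    if not candidate_list or not seed_list:
        return set()  # assumption: nothing to select from or compare against
    # Stage 1: similarity of a user to the seed crowd is the average of the
    # user's similarities to all seed users.
    crowd_sim = {u: mean(sim(u, s) for s in seed_list) for u in candidate_list}
    # Stage 2: the threshold is the average of those crowd similarities.
    threshold = mean(crowd_sim.values())
    # Stage 3: keep users whose crowd similarity is not less than the threshold.
    return {u for u, value in crowd_sim.items() if value >= threshold}
```

The threshold is derived from the candidates themselves rather than fixed in advance, which is what makes the selection dynamic: the same candidate can pass or fail depending on the rest of the candidate crowd.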

To ensure the performance of crowd orientation, the model can be evaluated as follows:

(1) Performance evaluation

The system performance is evaluated by indices, which include precision, recall, and resistance to attack. In addition to studying the precision and recall of the system, users carrying out profile ("general picture") attacks are added to the system, and the quality of the system's recommendations is studied through its resistance to attack.
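For illustration, precision and recall can be computed with their standard set-based definitions. The document does not state how the indices are computed, so the function name `precision_recall`, the existence of a ground-truth set `relevant`, and the convention of returning 0.0 for an empty set are all assumptions of this sketch.

```python
from typing import Hashable, Set, Tuple


def precision_recall(
    predicted: Set[Hashable], relevant: Set[Hashable]
) -> Tuple[float, float]:
    """Standard precision and recall of a predicted target crowd.

    predicted -- users selected as the potential target crowd
    relevant  -- users known to be true targets (assumed ground truth)
    """
    hits = len(predicted & relevant)
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Under these definitions, resistance to attack could be examined by injecting attack users into the data and comparing precision and recall before and after the injection.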

(2) Computability and complexity analysis

Computability analysis mainly examines whether the method is computable and realizable when complexity is not considered. For any NP-complete problems that arise, approximate computation methods are proposed. Complexity analysis, on the premise of computability, mainly analyzes the time complexity of the model during computation; the efficiency of the model is measured by estimating the complexity of the model.

The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art to which the invention belongs may make various modifications or additions to the described embodiments, or substitute them in a similar manner, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Those skilled in the art should understand that the embodiments of the invention shown in the foregoing description and the accompanying drawings are given only as examples and do not limit the invention. The objects of the invention have been fully and effectively achieved. The functional and structural principles of the invention have been shown and explained in the embodiments, and the embodiments may be varied or modified in any way without departing from those principles.

Claims

1. A crowd orientation method based on uncertain neighbours, characterized by comprising the following steps:

A: obtaining the features of users, which include demographic attributes and interest tags;

B: selecting a behavior-similar crowd, wherein, for a given seed crowd, the behavior similarity between users is obtained from the online media sites the users visit, a corresponding threshold is set, and the users whose similarity is not less than the threshold are selected, the selected set of users serving as the behavior-similar crowd;

C: selecting a candidate crowd, wherein, based on user features and user behavior, the neighbour users of each seed user are selected from the behavior-similar crowd by the user similarity acquisition method, and the neighbour users of all seed users are taken as the candidate crowd; and

D: for the candidate crowd of step C, dynamically selecting the set of users with higher similarity by the crowd orientation method, and taking these users as the potential target crowd;

wherein step A includes the following steps:

2. The crowd orientation method based on uncertain neighbours according to claim 1, characterized in that the demographic attribute features include seven sub-features: gender, age, marital status, personal income, education, occupation, and industry, wherein the sub-features of the demographic attribute features are predicted by the following method:

wherein M₁, M₂, ..., M_n denote n media; C_j^k denotes the users whose k-th sub-feature belongs to category j; C_j^k(M_i) denotes the count of users who have visited media M_i and whose k-th sub-feature belongs to category j; and p(C_j^k | M₁M₂…M_n) is the probability that the k-th sub-feature of the user belongs to category j.

3. The crowd orientation method based on uncertain neighbours according to claim 1, characterized in that, in step B, the behavior similarity of user u and user v is obtained by the following method:

wherein D_KL(P_u || P_v) denotes the divergence between P_u and P_v, D_KL(P_v || P_u) denotes the divergence between P_v and P_u, P_u denotes the media access density of user u, and P_v denotes the media access density of user v;

wherein the divergence is obtained by the following method:

wherein M denotes the set of visited media, and P_u(i) and P_v(i) denote the densities with which user u and user v, respectively, visit media M_i;

wherein the density with which a user visits media is estimated by the following method:

and

wherein M(u) denotes the set of media visited by user u; h denotes the window width; the count term denotes the number of visits of user u to media M_j; the distance term denotes the distance between media M_i and media M_j; U_i denotes the set of users who have visited media M_i; and U_j denotes the set of users who have visited media M_j.
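As an illustration of the divergence used in claim 3, the following sketch assumes that D_KL is the standard Kullback-Leibler divergence, as the notation suggests. The function names, the floor `eps` that avoids division by zero, and the use of the plain sum of the two directed divergences are assumptions; the way the claim converts the two divergences into a behavior similarity, and the density estimate itself, are not reproduced in the text and are therefore not modelled here.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)) over a shared media index.

    p and q are access densities over the same ordered media set M.
    eps is an assumed floor that keeps the ratio finite when Q(i) is 0.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with P(i) = 0 contribute nothing
    )


def symmetric_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of both directed divergences (an assumed way to symmetrize)."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```

The directed divergence is not symmetric in its arguments, which is presumably why the claim refers to both D_KL(P_u || P_v) and D_KL(P_v || P_u).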

4. The crowd orientation method based on uncertain neighbours according to claim 1, characterized in that, in step C, the user similarity is obtained using the user feature similarity, wherein the user feature similarity is obtained by the following formula:

wherein sim_P(u, v) is the demographic attribute similarity of the two users and sim_I(u, v) is the interest similarity of the two users;

wherein the values of the demographic attributes fall mainly into two types, numeric and nominal, and the similarity of the demographic attributes is obtained from the distances between the two users over the numeric and nominal features, as follows:

wherein D_number denotes the distance between user u and user v over all numeric features, and D_nominal denotes the distance between user u and user v over all nominal features;

wherein the distance over the numeric features is obtained by the following formula:

wherein d_j denotes the distance between the two users on sub-feature j, and if all d_j are 0, D_number takes the default value 1;

wherein the distance over the nominal features is obtained by the following method:

suppose a nominal attribute takes N possible values; all values are then manually rated in order, i.e., assigned the ratings r₁, r₂, ..., r_N; if the ratings of two users on the attribute are r_i and r_j respectively, the distance between the two users on the attribute is |r_i - r_j|, wherein the distance between the two users over all nominal features is:

wherein d'_j denotes the distance between the two users on sub-feature j, and if all d'_j are 0, D_nominal takes the default value 1;

wherein, to obtain the interest similarity, an interest fingerprint of each user is first generated from the interests of the user, and the interest similarity between users is then obtained from the interest fingerprints, the interest fingerprint being generated as follows:

1. hashing, wherein all interests are hashed to obtain hash values of K bits;

2. weighting, wherein all interests of the user are extracted together with the probability weight of each interest, and each weight is multiplied with the corresponding hash value: if a bit of the hash value is 1, that bit becomes 1 times the probability weight; if the bit is 0, that bit becomes the product of -1 and the probability weight;

3. accumulating, wherein the weighted hash values above are summed position by position, producing a single sequence of numbers; and

4. dimensionality reduction, wherein the numeric sequence obtained in the accumulation step is converted into a string of 0s and 1s, i.e., the final interest fingerprint: each position greater than 0 is recorded as 1 and each position less than 0 is recorded as 0, and the K digits are finally concatenated in order to form the interest fingerprint;

wherein, assuming the interest fingerprints of user u and user v are f_u and f_v respectively, the interest similarity is obtained by the following formula:
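The four fingerprint steps of claim 4 (hashing, weighting, accumulating, dimensionality reduction) follow the pattern commonly known as SimHash and can be sketched as follows. Several points are assumptions of this sketch: the hash function (SHA-256, truncated to K bits), the default K = 64, the treatment of a position that sums to exactly 0 (recorded as 0), and the use of the Hamming distance to compare two fingerprints, since the similarity formula of the claim is not reproduced in the text.

```python
import hashlib
from typing import Dict


def interest_fingerprint(interests: Dict[str, float], k: int = 64) -> str:
    """Build a K-bit interest fingerprint from {interest: probability weight}."""
    if not 1 <= k <= 256:
        raise ValueError("k must be between 1 and 256 for SHA-256")
    totals = [0.0] * k
    for interest, weight in interests.items():
        # Step 1 (hashing): map the interest to an integer, read as K bits.
        digest = hashlib.sha256(interest.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        for pos in range(k):
            bit = (value >> pos) & 1
            # Step 2 (weighting): +weight for a 1 bit, -weight for a 0 bit.
            # Step 3 (accumulating): sum position by position.
            totals[pos] += weight if bit else -weight
    # Step 4 (dimensionality reduction): positive sums become 1, others 0.
    return "".join("1" if total > 0 else "0" for total in totals)


def hamming_distance(f_u: str, f_v: str) -> int:
    """Number of positions at which two equal-length fingerprints differ."""
    if len(f_u) != len(f_v):
        raise ValueError("fingerprints must have the same length")
    return sum(a != b for a, b in zip(f_u, f_v))
```

Users with largely overlapping, similarly weighted interests tend to produce fingerprints that differ in few positions, so a small Hamming distance can be read as high interest similarity.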

5. The crowd orientation method based on uncertain neighbours according to claim 1, characterized in that the similarity between users is obtained from the user feature similarity and the user behavior similarity:

sim(u, v) = α·sim_B(u, v) + (1 - α)·sim_F(u, v)

wherein α is the weight of the behavior similarity and 1 - α is the weight of the feature similarity.

6. A crowd orientation method based on uncertain neighbours, characterized by comprising the following steps:

D: for the candidate crowd of step C, dynamically selecting the set of users with higher similarity by the crowd orientation method, and taking these users as the potential target crowd.

7. The crowd orientation method based on uncertain neighbours according to claim 6, characterized in that step A includes the following steps:

A2: predicting the interest tags of the user according to the web pages the user browses;

wherein the demographic attribute features include the following sub-features: gender, age, marital status, personal income, education, occupation, and industry.

8. The crowd orientation method based on uncertain neighbours according to claim 7, characterized in that the sub-features of the demographic attribute features are predicted by the following method:

wherein M₁, M₂, ..., M_n denote n media; C_j^k denotes the users whose k-th sub-feature belongs to category j; C_j^k(M_i) denotes the count of users who have visited media M_i and whose k-th sub-feature belongs to category j; and p(C_j^k | M₁M₂…M_n) is the probability that the k-th sub-feature of the user belongs to category j;

wherein, in step B, the behavior similarity of user u and user v is obtained by the following method:

wherein the divergence is obtained by the following method:

and

9. The crowd orientation method based on uncertain neighbours according to claim 8, characterized in that, in step C, the user similarity is obtained using the user feature similarity, wherein the user feature similarity is obtained by the following formula:

10. The crowd orientation method based on uncertain neighbours according to claim 9, characterized in that the similarity between users is obtained from the user feature similarity and the user behavior similarity:

sim(u, v) = α·sim_B(u, v) + (1 - α)·sim_F(u, v)