Embodiment
Fig. 1 is a flowchart of the user behavior similarity calculation method provided by an embodiment of the present invention. The prior art lacks a method that uses the behavior features of a large number of users to analyze the similarity of different users' behavior, so the behavior features collected from a large number of users are poorly utilized. To address this, the embodiment of the present invention provides a user behavior similarity calculation method, whose specific steps are as follows:
Step S101: collect the behavior feature value corresponding to each of a plurality of behavior features of first-type users, and the behavior feature value corresponding to each of the plurality of behavior features of second-type users;
In the embodiment of the present invention, a user's behavior features are collected according to a plurality of preset behavior features while the user browses and clicks web page information. For example, the plurality of behavior features may specifically include: whether a certain web page was browsed, whether a title in a certain web page was clicked, the time at which the user browsed a certain web page, the time at which a certain title was clicked, the content of the clicked title, and the number of times a certain title was clicked in one day. The embodiment of the present invention does not limit the number of preset behavior features to 6; it may be any number.
In addition, the embodiment of the present invention assigns a numeric code to each of the plurality of behavior features in advance. For example: a user who has browsed a certain web page is coded as 1, and a user who has not is coded as 0; a user who has clicked a title in a certain web page is coded as 1, and a user who has not is coded as 0; the time at which the user browsed a certain web page is coded as 1 for morning, 2 for noon, 3 for afternoon, and 4 for evening; the time at which a certain title was clicked is coded as 1 for morning, 2 for noon, 3 for afternoon, and 4 for evening; the content of the clicked title is coded as 1 for healthy diet, 2 for leisure and entertainment, 3 for financial investment, 4 for science and technology information, and so on; and the number of times a certain title was clicked in one day is taken as the actual number of clicks on that title.
For example, suppose a user has browsed a certain web page and clicked a title in it, browsed the web page in the morning, clicked the title at noon, the clicked title content belongs to healthy diet, and the title was clicked 3 times in one day. The browser on the user side then collects the behavior feature values 1, 1, 1, 2, 1, 3 for the respective behavior features of this user, and these values can form a behavior feature vector [1,1,1,2,1,3].
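The coding scheme above can be sketched in a few lines of Python. This is only an illustration: the function name `encode_behavior`, the dictionary names, and the English category labels are hypothetical and do not appear in the original text; only the numeric codes are taken from the example.

```python
# Numeric codes from the example coding scheme in the text.
# The dictionary keys (English labels) are illustrative assumptions.
TIME_CODES = {"morning": 1, "noon": 2, "afternoon": 3, "evening": 4}
TOPIC_CODES = {
    "healthy diet": 1,
    "leisure and entertainment": 2,
    "financial investment": 3,
    "science and technology information": 4,
}


def encode_behavior(browsed, clicked, browse_time, click_time, topic, clicks_per_day):
    """Return the six-element behavior feature vector for one user."""
    return [
        1 if browsed else 0,      # whether the web page was browsed
        1 if clicked else 0,      # whether a title in the page was clicked
        TIME_CODES[browse_time],  # time at which the page was browsed
        TIME_CODES[click_time],   # time at which the title was clicked
        TOPIC_CODES[topic],       # content category of the clicked title
        clicks_per_day,           # actual number of clicks in one day
    ]


# The user described in the example: browsed in the morning, clicked at noon,
# healthy-diet title, three clicks in one day.
example_vector = encode_behavior(True, True, "morning", "noon", "healthy diet", 3)
print(example_vector)  # [1, 1, 1, 2, 1, 3]
```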
The first-type users are users who satisfy a first target behavior feature, the second-type users are users who satisfy a second target behavior feature, and the first target behavior feature and the second target behavior feature have some behavior features in common.
The embodiment of the present invention separately collects the behavior feature value corresponding to each of the plurality of behavior features for the first-type users and for the second-type users. The first-type users are specifically seed users and the second-type users are specifically comparison users; seed users and comparison users share some behavior features and differ in others. For example, a seed user is a user who browsed an advertisement for a certain brand of milk and clicked it, whereas a comparison user is a user who browsed the same advertisement but did not click it. First-type users carry the identifier 1 and second-type users carry the identifier 0. For example, the embodiment of the present invention collects the behavior feature values of 100 first-type users and 100 second-type users, so that the 100 first-type users correspond to 100 behavior feature vectors and the 100 second-type users correspond to 100 behavior feature vectors.
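As a minimal sketch of how the two user groups could be combined into one labeled data set, the hypothetical helper below stacks the seed-user vectors and the comparison-user vectors and attaches the identifiers 1 and 0 described above. The function name and the list-of-lists layout are assumptions made for illustration.

```python
def build_training_set(seed_vectors, comparison_vectors):
    """Stack both groups and label seed users 1 and comparison users 0."""
    features = [list(v) for v in seed_vectors] + [list(v) for v in comparison_vectors]
    labels = [1] * len(seed_vectors) + [0] * len(comparison_vectors)
    return features, labels


# Two illustrative users: one seed user and one comparison user.
features, labels = build_training_set([[1, 1, 1, 2, 1, 3]], [[1, 0, 1, 2, 1, 0]])
print(features, labels)
```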
Step S102: screen the plurality of behavior features according to the plurality of behavior feature values corresponding to the first-type users and the plurality of behavior feature values corresponding to the second-type users, to obtain a target behavior feature set;
The number of preset behavior features may be arbitrary, but some of them are redundant for the user behavior similarity calculation method provided by the embodiment of the present invention. The plurality of behavior features therefore needs to be screened to obtain the target behavior feature set.
Step S103: establish a first generalized linear model according to the target behavior feature set, calculate a first maximum likelihood estimate of the first generalized linear model using an optimization method, and obtain the estimated parameters corresponding to the first maximum likelihood estimate;
The first generalized linear model is established according to the target behavior feature set; any method known in the prior art may be used to establish the generalized linear model. An optimization method is used to calculate the first maximum likelihood estimate of the first generalized linear model, and the corresponding estimated parameters are obtained from this first maximum likelihood estimate. The number of estimated parameters equals the number of behavior features in the target behavior feature set.
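The text does not fix a particular generalized linear model or optimization method. The sketch below therefore makes two assumptions that should be read as illustrative only: a binomial model with a logit link (logistic regression on the 1/0 user identifiers) and plain gradient ascent on the log-likelihood. No intercept is fitted, so the number of estimated parameters equals the number of behavior features, as stated above. The names `fit_glm`, `log_likelihood`, `learning_rate`, and `iterations` are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(weights, features, labels):
    """Bernoulli log-likelihood under the logit link: sum of y*z - log(1 + e^z)."""
    total = 0.0
    for x, y in zip(features, labels):
        z = sum(w * v for w, v in zip(weights, x))
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


def fit_glm(features, labels, learning_rate=0.1, iterations=2000):
    """Approximate the maximum likelihood estimate by gradient ascent."""
    n_features = len(features[0])
    weights = [0.0] * n_features  # one estimated parameter per behavior feature
    for _ in range(iterations):
        gradient = [0.0] * n_features
        for x, y in zip(features, labels):
            error = y - sigmoid(sum(w * v for w, v in zip(weights, x)))
            for j, v in enumerate(x):
                gradient[j] += error * v
        weights = [w + learning_rate * g / len(features)
                   for w, g in zip(weights, gradient)]
    return weights


# Toy data: two seed users (label 1) and two comparison users (label 0).
toy_features = [[1, 0], [1, 0], [0, 1], [0, 1]]
toy_labels = [1, 1, 0, 0]
estimated_parameters = fit_glm(toy_features, toy_labels)
print(estimated_parameters)
```

In practice a library routine or a second-order method such as iteratively reweighted least squares would usually replace the hand-written loop; gradient ascent is used here only to keep the sketch self-contained.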
Step S104: calculate the behavior similarity between a user under test and the first-type users using the estimated parameters and the behavior feature value corresponding to each behavior feature in the target behavior feature set for the user under test.
Calculating the behavior similarity between the user under test and the first-type users using the estimated parameters and the behavior feature value corresponding to each behavior feature in the target behavior feature set for the user under test comprises: forming a first vector from the estimated parameters, and forming a second vector from the behavior feature values corresponding to the behavior features in the target behavior feature set for the user under test; and calculating the inner product of the first vector and the second vector to obtain the behavior similarity.
Specifically, the behavior feature value corresponding to each behavior feature in the target behavior feature set is collected for the user under test, and these behavior feature values form a behavior feature value vector. The inner product of this behavior feature value vector and the one-dimensional vector formed by the estimated parameters is then computed, and the resulting inner product value is the behavior similarity between the user under test and the first-type users.
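The inner product described above can be sketched as follows. The function name `behavior_similarity` and the numeric values in the usage lines are hypothetical and serve only to illustrate the calculation.

```python
def behavior_similarity(estimated_parameters, feature_values):
    """Inner product of the parameter vector and the feature value vector."""
    if len(estimated_parameters) != len(feature_values):
        raise ValueError("both vectors must have one entry per target behavior feature")
    return sum(p * v for p, v in zip(estimated_parameters, feature_values))


# Illustrative values only: three estimated parameters and the three
# behavior feature values of one user under test.
similarity = behavior_similarity([0.8, 1.5, -0.2], [1, 1, 2])
print(similarity)
```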
The embodiment of the present invention screens the plurality of behavior features, based on the behavior feature values that users of different types have for each of those features, to obtain a target behavior feature set; establishes a generalized linear model according to the target behavior feature set; calculates the maximum likelihood estimate of the generalized linear model using an optimization method and obtains the corresponding estimated parameters; and calculates the behavior similarity between a user under test and users of a particular type from the estimated parameters and the behavior feature values of the user under test. The behavior features of a large number of users are thereby fully used to analyze the similarity of different users' behavior, which improves the utilization of the behavior features collected from a large number of users.
On the basis of the above embodiment, screening the plurality of behavior features according to the plurality of behavior feature values corresponding to the first-type users and the plurality of behavior feature values corresponding to the second-type users to obtain the target behavior feature set comprises:
calculating, according to the plurality of behavior feature values corresponding to the first-type users and the plurality of behavior feature values corresponding to the second-type users, the coverage rate, chi-square statistic, and information entropy corresponding to each of the plurality of behavior features;
deleting, from the plurality of behavior features, behavior features whose coverage rate is less than a first threshold, behavior features whose chi-square statistic is less than a second threshold, and behavior features whose information entropy is less than a third threshold, to obtain a first behavior feature set;
for any two behavior features in the first behavior feature set whose degree of correlation is greater than a fourth threshold, deleting either one of the two, to obtain a second behavior feature set;
establishing a second generalized linear model according to the second behavior feature set, calculating a second maximum likelihood estimate of the second generalized linear model using an optimization method, and deleting from the second behavior feature set the behavior features that have no influence on the second maximum likelihood estimate, to obtain the target behavior feature set.
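The first screening step can be sketched as below. The text does not define how coverage rate, chi-square statistic, or information entropy are computed, so the definitions here are assumptions: coverage is the fraction of users for whom a value was observed (`None` marks a missing value), the chi-square statistic is the Pearson statistic of the feature-value by user-type contingency table, and the entropy is the Shannon entropy of the observed values. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def coverage(values):
    """Fraction of users with an observed value (None means missing)."""
    return sum(v is not None for v in values) / len(values)


def entropy(values):
    """Shannon entropy (in bits) of the observed values."""
    observed = [v for v in values if v is not None]
    n = len(observed)
    return -sum((c / n) * math.log2(c / n) for c in Counter(observed).values())


def chi_square(values, labels):
    """Pearson chi-square statistic of feature value versus user type."""
    pairs = [(v, y) for v, y in zip(values, labels) if v is not None]
    n = len(pairs)
    value_counts = Counter(v for v, _ in pairs)
    label_counts = Counter(y for _, y in pairs)
    cell_counts = Counter(pairs)
    statistic = 0.0
    for v, v_count in value_counts.items():
        for y, y_count in label_counts.items():
            expected = v_count * y_count / n
            statistic += (cell_counts[(v, y)] - expected) ** 2 / expected
    return statistic


def screen_features(columns, labels, min_coverage, min_chi_square, min_entropy):
    """Keep only the features that reach all three thresholds."""
    return [
        name for name, values in columns.items()
        if coverage(values) >= min_coverage
        and chi_square(values, labels) >= min_chi_square
        and entropy(values) >= min_entropy
    ]


# Illustrative data: "clicked" separates the two user types, "constant" does not.
labels = [1, 1, 0, 0]
columns = {"clicked": [1, 1, 0, 0], "constant": [1, 1, 1, 1]}
first_feature_set = screen_features(columns, labels, 0.5, 1.0, 0.5)
print(first_feature_set)  # ['clicked']
```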
Continuing the example above, the 100 first-type users correspond to 100 behavior feature vectors and the 100 second-type users correspond to 100 behavior feature vectors. From these 200 behavior feature vectors, the coverage rate, chi-square statistic, and information entropy are calculated for each behavior feature, namely "whether a certain web page was browsed", "whether a title in a certain web page was clicked", "the time at which the user browsed a certain web page", "the time at which a certain title was clicked", "the content of the clicked title", and "the number of times a certain title was clicked in one day".
The 6 behavior features ("whether a certain web page was browsed", "whether a title in a certain web page was clicked", "the time at which the user browsed a certain web page", "the time at which a certain title was clicked", "the content of the clicked title", and "the number of times a certain title was clicked in one day") are sorted in descending order of coverage rate; suppose, for example, that the last behavior feature after sorting is "the content of the clicked title". The 6 behavior features are likewise sorted in descending order of chi-square statistic; suppose that the last behavior feature after sorting is "the number of times a certain title was clicked in one day". Finally, the 6 behavior features are sorted in descending order of information entropy; suppose that the last behavior feature after sorting is again "the number of times a certain title was clicked in one day". Deleting from the plurality of behavior features those whose coverage rate is less than the first threshold, those whose chi-square statistic is less than the second threshold, and those whose information entropy is less than the third threshold may then specifically be done by deleting the last behavior feature of each of the three sorted lists, that is, "the content of the clicked title" and "the number of times a certain title was clicked in one day". The remaining features, "whether a certain web page was browsed", "whether a title in a certain web page was clicked", "the time at which the user browsed a certain web page", and "the time at which a certain title was clicked", form the first behavior feature set.
The first behavior feature set contains 4 behavior features: "whether a certain web page was browsed", "whether a title in a certain web page was clicked", "the time at which the user browsed a certain web page", and "the time at which a certain title was clicked". Of these, "the time at which the user browsed a certain web page" and "the time at which a certain title was clicked" both concern time and are highly correlated, so either one of them is deleted. For example, deleting "the time at which the user browsed a certain web page" and retaining "the time at which a certain title was clicked" yields the second behavior feature set.
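The correlation-based step can be sketched as follows. The text does not say how the degree of correlation is measured; the absolute Pearson correlation coefficient is used here purely as an assumption. The sketch keeps the first feature of each highly correlated pair, whereas the text allows either one to be deleted. The names `pearson` and `drop_correlated` and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold):
    """Keep a feature only if it is not too correlated with any feature already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[other])) <= threshold for other in kept):
            kept.append(name)
    return kept


# Illustrative values: the two time features move together, so one is dropped.
columns = {
    "click_time": [1, 2, 3, 4],
    "browse_time": [1, 2, 3, 4],
    "browsed": [1, 0, 1, 0],
}
second_feature_set = drop_correlated(columns, threshold=0.9)
print(second_feature_set)  # ['click_time', 'browsed']
```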
The second behavior feature set contains: "whether a certain web page was browsed", "whether a title in a certain web page was clicked", and "the time at which a certain title was clicked". A generalized linear model is established again from the values of these three behavior features for the 100 first-type users and for the 100 second-type users, the maximum likelihood estimate of this generalized linear model is calculated using an optimization method, and this maximum likelihood estimate is recorded. Any one behavior feature is then removed from the second behavior feature set, and the maximum likelihood estimate of the generalized linear model is calculated again using the optimization method. If the maximum likelihood estimate does not change, the removed behavior feature has no influence on the maximum likelihood estimate; if the maximum likelihood estimate changes, the removed behavior feature has an influence on it. The behavior features in the second behavior feature set that influence the maximum likelihood estimate are retained and those that do not are removed, thereby further screening the behavior features. Assuming, reasonably, that every behavior feature in the second behavior feature set influences the maximum likelihood estimate, the second behavior feature set is the target behavior feature set.
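The remove-one-feature check can be sketched as follows. Several points are assumptions rather than part of the text: the model is a logit-link GLM fitted by gradient ascent, "the maximum likelihood estimate changes" is interpreted as a change in the maximized log-likelihood, and a small numerical tolerance decides whether a change counts. The names `max_log_likelihood`, `influential_features`, and `tolerance` are hypothetical.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def max_log_likelihood(rows, labels, learning_rate=0.1, iterations=500):
    """Fit a logit-link GLM by gradient ascent and return the log-likelihood reached."""
    weights = [0.0] * len(rows[0])
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = y - _sigmoid(sum(w * v for w, v in zip(weights, x)))
            for j, v in enumerate(x):
                gradient[j] += error * v
        weights = [w + learning_rate * g / len(rows) for w, g in zip(weights, gradient)]
    total = 0.0
    for x, y in zip(rows, labels):
        z = sum(w * v for w, v in zip(weights, x))
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


def influential_features(rows, labels, names, tolerance=1e-6):
    """Keep the features whose removal changes the maximized log-likelihood."""
    full = max_log_likelihood(rows, labels)
    kept = []
    for j, name in enumerate(names):
        reduced = [[v for k, v in enumerate(x) if k != j] for x in rows]
        if abs(full - max_log_likelihood(reduced, labels)) > tolerance:
            kept.append(name)
    return kept


# Illustrative data: the second column is always 0 and cannot affect the fit.
rows = [[1, 0], [1, 0], [0, 0], [0, 0]]
labels = [1, 1, 0, 0]
target_feature_set = influential_features(rows, labels, ["clicked_title", "unused"])
print(target_feature_set)  # ['clicked_title']
```

A likelihood-ratio test would be a more conventional way to decide whether a feature matters; the fixed tolerance is used here only to mirror the "changed or did not change" wording of the text.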
By screening the plurality of behavior features to delete redundant ones, establishing the generalized linear model from the screened behavior features, and calculating the maximum likelihood estimate of this generalized linear model using an optimization method, the embodiment of the present invention improves computational efficiency.
On the basis of the above embodiment, after calculating the behavior similarity between the user under test and the first-type users using the estimated parameters and the behavior feature value corresponding to each behavior feature in the target behavior feature set for the user under test, the method further comprises: determining whether the behavior similarity is greater than a fifth threshold; if the behavior similarity is greater than the fifth threshold, determining that the behavior of the user under test is similar to that of the first-type users; and counting the proportion, among all users under test, of users under test whose behavior is similar to that of the first-type users.
The embodiment of the present invention performs behavior analysis on a large number of users under test. The behavior feature values of each user under test are collected according to the behavior features in the target behavior feature set obtained in the above embodiment, that is, the values of "whether a certain web page was browsed", "whether a title in a certain web page was clicked", and "the time at which a certain title was clicked" for each user under test. The behavior similarity between each user under test and the first-type users is then calculated using the method of the above embodiment, and it is determined whether the behavior similarity is greater than a preset threshold; if so, the behavior of that user under test is similar to that of the first-type users. The proportion, among all users under test, of users under test whose behavior is similar to that of the first-type users can also be counted at the same time.
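The threshold comparison and the proportion count can be sketched as follows. The function name `similar_user_ratio`, the similarity values, and the threshold value are hypothetical.

```python
def similar_user_ratio(similarities, threshold):
    """Flag each user under test as similar or not, and return the similar share."""
    if not similarities:
        return [], 0.0
    flags = [s > threshold for s in similarities]
    return flags, sum(flags) / len(flags)


# Illustrative behavior similarities of four users under test.
flags, ratio = similar_user_ratio([1.9, 0.2, 0.7, -0.3], threshold=0.5)
print(flags, ratio)  # [True, False, True, False] 0.5
```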
By determining that the behavior similarity between a user under test and the first-type users is greater than a certain threshold, the embodiment of the present invention determines that the behavior of the user under test is similar to that of the first-type users, and it can also obtain the proportion, among all users under test, of users under test whose behavior is similar to that of the first-type users.
Fig. 2 is a structural diagram of the user behavior similarity calculation apparatus provided by an embodiment of the present invention. The user behavior similarity calculation apparatus provided by the embodiment of the present invention can execute the processing flow provided by the embodiment of the user behavior similarity calculation method. As shown in Fig. 2, the user behavior similarity calculation apparatus 20 comprises an acquisition module 21, a screening module 22, a modeling module 23, and a computing module 24. The acquisition module 21 is configured to collect the behavior feature value corresponding to each of a plurality of behavior features of first-type users, and the behavior feature value corresponding to each of the plurality of behavior features of second-type users. The screening module 22 is configured to screen the plurality of behavior features according to the plurality of behavior feature values corresponding to the first-type users and the plurality of behavior feature values corresponding to the second-type users, to obtain a target behavior feature set. The modeling module 23 is configured to establish a first generalized linear model according to the target behavior feature set, calculate a first maximum likelihood estimate of the first generalized linear model using an optimization method, and obtain the estimated parameters corresponding to the first maximum likelihood estimate. The computing module 24 is configured to calculate the behavior similarity between a user under test and the first-type users using the estimated parameters and the behavior feature value corresponding to each behavior feature in the target behavior feature set for the user under test.
The embodiment of the present invention screens the plurality of behavior features, based on the behavior feature values that users of different types have for each of those features, to obtain a target behavior feature set; establishes a generalized linear model according to the target behavior feature set; calculates the maximum likelihood estimate of the generalized linear model using an optimization method and obtains the corresponding estimated parameters; and calculates the behavior similarity between a user under test and users of a particular type from the estimated parameters and the behavior feature values of the user under test. The behavior features of a large number of users are thereby fully used to analyze the similarity of different users' behavior, which improves the utilization of the behavior features collected from a large number of users.
Fig. 3 is a structural diagram of the user behavior similarity calculation apparatus provided by another embodiment of the present invention. On the basis of the above embodiment, the screening module 22 is specifically configured to: calculate, according to the plurality of behavior feature values corresponding to the first-type users and the plurality of behavior feature values corresponding to the second-type users, the coverage rate, chi-square statistic, and information entropy corresponding to each of the plurality of behavior features; delete, from the plurality of behavior features, behavior features whose coverage rate is less than a first threshold, behavior features whose chi-square statistic is less than a second threshold, and behavior features whose information entropy is less than a third threshold, to obtain a first behavior feature set; for any two behavior features in the first behavior feature set whose degree of correlation is greater than a fourth threshold, delete either one of the two, to obtain a second behavior feature set; and establish a second generalized linear model according to the second behavior feature set, calculate a second maximum likelihood estimate of the second generalized linear model using an optimization method, and delete from the second behavior feature set the behavior features that have no influence on the second maximum likelihood estimate, to obtain the target behavior feature set.
The first-type users are users who satisfy a first target behavior feature, the second-type users are users who satisfy a second target behavior feature, and the first target behavior feature and the second target behavior feature have some behavior features in common.
The computing module 24 is specifically configured to form a first vector from the estimated parameters, and to form a second vector from the behavior feature values corresponding to the behavior features in the target behavior feature set for the user under test; and to calculate the inner product of the first vector and the second vector to obtain the behavior similarity.
The user behavior similarity calculation apparatus 20 further comprises a judging module 25 and a statistics module 26. The judging module 25 is configured to determine whether the behavior similarity is greater than a fifth threshold and, if the behavior similarity is greater than the fifth threshold, to determine that the behavior of the user under test is similar to that of the first-type users. The statistics module 26 is configured to count the proportion, among all users under test, of users under test whose behavior is similar to that of the first-type users.
The user behavior similarity calculation apparatus provided by the embodiment of the present invention may specifically be used to execute the method embodiment provided in Fig. 1 above; its specific functions are not repeated here.
By screening the plurality of behavior features to delete redundant ones, establishing the generalized linear model from the screened behavior features, and calculating the maximum likelihood estimate of this generalized linear model using an optimization method, the embodiment of the present invention improves computational efficiency. By determining that the behavior similarity between a user under test and the first-type users is greater than a certain threshold, it determines that the behavior of the user under test is similar to that of the first-type users, and it can also obtain the proportion, among all users under test, of users under test whose behavior is similar to that of the first-type users.
In summary, the embodiment of the present invention screens the plurality of behavior features, based on the behavior feature values that users of different types have for each of those features, to obtain a target behavior feature set; establishes a generalized linear model according to the target behavior feature set; calculates the maximum likelihood estimate of the generalized linear model using an optimization method and obtains the corresponding estimated parameters; and calculates the behavior similarity between a user under test and users of a particular type from the estimated parameters and the behavior feature values of the user under test. The behavior features of a large number of users are thereby fully used to analyze the similarity of different users' behavior, which improves the utilization of the behavior features collected from a large number of users. By screening the plurality of behavior features to delete redundant ones, establishing the generalized linear model from the screened behavior features, and calculating the maximum likelihood estimate of this model using an optimization method, computational efficiency is improved. By determining that the behavior similarity between a user under test and the first-type users is greater than a certain threshold, the behavior of the user under test is determined to be similar to that of the first-type users, and the proportion, among all users under test, of users under test whose behavior is similar to that of the first-type users can also be obtained.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiment described above is merely illustrative. For instance, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the division into the functional modules described above is used as an example. In practical applications, the above functions may be assigned to different functional modules as required; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. For the specific working process of the apparatus described above, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.