Summary of the invention
To at least partly overcome the above problems, the present invention proposes a method and a device for estimating the distribution of a user's recessive features, so that the estimate of the user's recessive features is more accurate.
To solve the above technical problem, one technical solution adopted by the present invention is a method for estimating the distribution of a user's recessive features, comprising: obtaining the users of a website and the dominant features of those users; obtaining the characteristic information of the whole population from a demographic database, wherein the characteristic information comprises dominant features and recessive features; and calculating the distribution of the users' recessive features with a Bayesian algorithm, according to the characteristic information of the whole population, the users of the website and the users' dominant features.
Wherein the step of calculating the distribution of the users' recessive features with a Bayesian algorithm, according to the characteristic information of the whole population, the users of the website and the users' dominant features, is specifically: if, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, calculating the user's recessive features according to the following formula:
Wherein L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes that the user uses the website.
Wherein the method further comprises judging whether, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds. The judging specifically comprises: calculating the value $P_1$ for any user according to the characteristic information of the whole population, the users of the website and the users' dominant features, wherein $P_1$ is calculated as follows:
$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$
Calculating the value $P_2$ for any user according to the characteristic information of the whole population, wherein $P_2$ is calculated as follows:
$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$
If $P_1$ and $P_2$ are equal for every user, the probability independence condition holds.
Wherein the method further comprises: analyzing the user's behavioral habits according to the user's dominant features and recessive features.
To solve the above technical problem, another technical solution adopted by the present invention is a device for estimating the distribution of a user's recessive features, comprising: a first acquisition module for obtaining the users of a website and the users' dominant features; a second acquisition module for obtaining the characteristic information of the whole population from a national demographic database, wherein the characteristic information comprises dominant features and recessive features; and a computing module for calculating the distribution of the users' recessive features with a Bayesian algorithm, according to the characteristic information of the whole population, the users of the website and the users' dominant features.
Wherein, if, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, the computing module calculates the user's recessive features according to the following formula:
Wherein L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes that the user uses the website.
Wherein the device further comprises a judging module. The judging module is configured to calculate the value $P_1$ for any user according to the characteristic information of all users, the users of the website and the users' dominant features, wherein $P_1$ is calculated as follows:
$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$
and to calculate the value $P_2$ for any user according to the characteristic information of all users, the users of the website and the users' dominant features, wherein $P_2$ is calculated as follows:
$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$
and to judge whether $P_1$ and $P_2$ are equal for any user; if they are equal, the probability independence condition holds.
Wherein the device further comprises an analysis module. The analysis module is configured to analyze the user's behavioral habits according to the user's dominant features and recessive features.
To solve the above technical problem, a further technical solution adopted by the present invention is a device for estimating the distribution of a user's recessive features, the device comprising a processor. The processor is configured to obtain the users of a website and the users' dominant features; to obtain the characteristic information of the whole population from a demographic database, wherein the characteristic information comprises dominant features and recessive features; and to calculate the distribution of the users' recessive features with a Bayesian algorithm, according to the characteristic information of the whole population, the users of the website and the users' dominant features.
Wherein the step in which the processor calculates the distribution of the users' recessive features with a Bayesian algorithm, according to the characteristic information of the whole population, the users of the website and the users' dominant features, is specifically: if, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, the processor calculates the user's recessive features according to the following formula:
Wherein L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes that the user uses the website.
Wherein the processor is further configured to judge whether, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds. The judging specifically comprises:
Calculating the value $P_1$ for any user according to the characteristic information of the whole population, the users of the website and the users' dominant features, wherein $P_1$ is calculated as follows:
$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$
Calculating the value $P_2$ for any user according to the characteristic information of the whole population, wherein $P_2$ is calculated as follows:
$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$
If $P_1$ and $P_2$ are equal for every user, the probability independence condition holds.
Wherein the processor is further configured to analyze the user's behavioral habits according to the user's dominant features and recessive features.
The beneficial effects of the present invention are as follows. Unlike the prior art, the present invention adds the data of the users of the website when calculating the users' recessive features. As a result, when calculating the probability that a person among the website's users who has the dominant feature also has a given recessive feature, the sample space is the website's user group rather than the national demographic data. The discrepancy between sample spaces is thereby removed, the error it introduces into the calculation is eliminated, and the result is more accurate.
Embodiment
The present invention is described in detail below with reference to the drawings and embodiments.
Referring to Fig. 1, the method comprises:
Step S201: obtaining the users of a website and the users' dominant features.
The website records information about its users, for example registration information and visit information. This information is generally kept in the website's back-end statistics, from which it can be determined who has used the website. For example, if the statistics record Zhang San and Li Si as registered users, it is known that Zhang San and Li Si have used the website. The user information is required to be genuine, for example a real name and a real age.
A user's dominant feature is a feature that can be obtained directly. For example, the statistics record a registered user's real name, so the name is a dominant feature of the user.
A user's recessive feature is a feature that cannot be obtained directly. For example, the statistics do not record a registered user's race, so the race cannot be obtained directly from the statistics and is a recessive feature of the user.
Step S202: obtaining the characteristic information of the whole population from a demographic database, wherein the characteristic information comprises dominant features and recessive features.
The demographic database records the characteristic information of the whole population in detail, for example name, sex and age. It should be noted that the characteristic information in the demographic database comprises dominant features and recessive features, where the dominant features correspond to the users' dominant features and the recessive features correspond to the users' recessive features. For example, if a user's name is a dominant feature, the name in the demographic database is a dominant feature; if a user's race is a recessive feature, the race in the demographic database is a recessive feature.
In this embodiment, the demographic database may be a demographic database published by a national authority, which can be obtained through public channels.
Step S203: calculating the distribution of the users' recessive features with a Bayesian algorithm, according to the characteristic information of the whole population, the users of the website and the users' dominant features.
Before the distribution of the users' recessive features is calculated with the Bayesian algorithm, it must be verified whether, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds. Step S203 may therefore be specifically: if, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, calculating the user's recessive features according to the following formula:
(Formula 1)
Wherein L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes that the user uses the website.
The origin of Formula 1 is explained below for the case L = 1. As noted in the background, the composition of a website's users differs from that of the national population, so applying Bayes' equation directly causes errors in the result. To avoid such errors, the sample space must be corrected by adding the event that the user uses the website to Bayes' equation. The corrected Bayes' equation is:
(Formula 2)
Wherein, if the probability independence condition holds, $P(t \cap f \mid x_1) = P(t \mid x_1)\,P(f \mid x_1)$, then:
(Formula 3)
As Formula 3 shows, a probability problem involving three conditions jointly is reduced to probability problems involving the conditions pairwise, which simplifies the data requirements.
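Formulas 2 and 3 are not reproduced in this text. The following is only a plausible reconstruction for L = 1, under the assumption that Formula 2 is Bayes' theorem with the website-use event f added to the conditioning, and that Formula 3 follows from it by substituting the independence condition stated above. The exact form used in the original, in particular how the denominator is written, may differ.

```latex
% Assumed form of Formula 2: Bayes' theorem conditioned on both t and f
P(x_1 \mid t \cap f) = \frac{P(t \cap f \mid x_1)\, P(x_1)}{P(t \cap f)}

% Assumed form of Formula 3: substitute P(t \cap f \mid x_1) = P(t \mid x_1) P(f \mid x_1)
P(x_1 \mid t \cap f) = \frac{P(t \mid x_1)\, P(f \mid x_1)\, P(x_1)}{P(t \cap f)}
```

In this form each factor in the numerator involves only two of the three quantities t, f and x, which is consistent with the statement that the data requirements are simplified.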
Further, comparing Formula 3 with Formula 2 shows that the simplified Bayes' equation requires the probability independence condition to hold. The reason is illustrated by the following example.
As shown in Fig. 2, suppose that the recessive feature x of the website's users can take only two values, A and B, shown in the figure as the two regions A and B, and let a and b be the areas of A and B respectively. Suppose that the observable dominant feature t is represented by the small rectangle in the middle, whose intersections with the two value regions of the recessive feature are TA and TB, with areas ta and tb respectively. The problem to be solved is to obtain the ratio of the areas of TA and TB; normalizing so that they sum to 1 then gives the probability of each.
If A and B together cover the whole demographic sample space, the simple Bayes' equation is:
Expressed as ratios of areas in the figure:
If the sample spaces on both sides of the area-ratio equation are the same, namely A+B, the equation necessarily holds.
If the sample spaces on the two sides differ, the area-ratio equation breaks down. As shown in Fig. 3, suppose that only some of the people in B use website F; denote them B′, with area b′, and let the intersection of the dominant feature t with B′ be TB′, with area tb′. The value we are actually interested in then becomes:
The sample space on the left side of the equation is now A+B′. If we simply keep applying Bayes' equation, the right side remains:
and the sample space on the right side of the equation is still A+B.
Expressed in areas, the left side of Bayes' equation is:
the right side of the equation is:
Clearly,
the left side of the equation does not equal the right side. That is, the two sides of Bayes' equation are unequal, and simply applying Bayes' equation causes errors in the result.
The cause of the error is evidently that the sample spaces on the two sides of the equation are inconsistent. The sample space must therefore be corrected so that both sides of the equation share the same sample space.
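The mismatch described above can be illustrated numerically. The following is a minimal sketch and not part of the original disclosure: all area values are hypothetical, chosen only to make the arithmetic easy to follow, and it assumes (as Fig. 3 is described) that everyone in A uses the website while only the part B′ of B does.

```python
# Hypothetical areas for the two-region example; the numbers are illustrative assumptions.
a, b = 50.0, 50.0      # areas of regions A and B (whole population)
ta, tb = 10.0, 20.0    # areas of TA and TB (people with dominant feature t)
b_prime = 25.0         # area of B': the part of B that uses website F
tb_prime = 10.0        # area of TB': intersection of t with B'

# Quantity of interest: share of A among website users who have feature t.
# Its sample space is A + B'.
target = ta / (ta + tb_prime)

# Naive Bayes' equation evaluated over the whole population A + B.
prior_a = a / (a + b)
prior_b = b / (a + b)
naive = (ta / a) * prior_a / ((ta / a) * prior_a + (tb / b) * prior_b)

print(f"target (sample space A+B'): {target:.4f}")
print(f"naive  (sample space A+B):  {naive:.4f}")
```

With these numbers the two values differ (one half against one third), which is the discrepancy the correction of the sample space is meant to remove.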
As shown in Fig. 4, when the composition of the sample space of TA is the same as that of A, the composition of TB is the same as that of B, and the people in B who use website F are B′, then:
Wherein, when the sample space is corrected so that the composition of the sample space of TA is the same as that of A and the composition of TB is the same as that of B,
Bayes' equation is:
Expressed in areas:
The corrected Bayes' equation may then be:
Population distribution data can now be obtained from the demographic database, for example how many people under each recessive feature value $\{x_1, \dots, x_l\}$ also have the observed dominant feature value t, and what proportion of the total population they represent. A sufficiently detailed database (such as the United States Census data) allows each person and their corresponding dominant and recessive features to be determined. Suppose the database contains data for w people in total, the data of the v-th person being $(t_v, x_v)$, and let $\Pi\{\cdot\}$ be the event indicator function. The probabilities in the bias-corrected Bayes' equation can then be calculated as follows:
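The calculation itself is not reproduced in this text. As a minimal sketch, assuming the usual empirical-frequency estimates built from indicator counts over the w records, P(x) and P(t|x) could be computed as below. The function name, the record layout and the sample data are hypothetical.

```python
from collections import Counter

def estimate_from_database(records, t_obs):
    """Estimate P(x) and P(t_obs | x) from (t_v, x_v) records, one per person.

    Assumption: probabilities are plain relative frequencies, i.e. sums of
    indicator functions divided by the relevant totals.
    """
    w = len(records)
    # Sum over v of indicator{x_v = x}
    count_x = Counter(x for _, x in records)
    # Sum over v of indicator{t_v = t_obs and x_v = x}
    count_tx = Counter(x for t, x in records if t == t_obs)
    p_x = {x: c / w for x, c in count_x.items()}
    p_t_given_x = {x: count_tx[x] / c for x, c in count_x.items()}
    return p_x, p_t_given_x

# Hypothetical database: (dominant value, recessive value) per person.
records = [("Li", "A"), ("Li", "B"), ("Wang", "A"), ("Li", "B"), ("Wang", "B")]
p_x, p_t_given_x = estimate_from_database(records, "Li")
print(p_x)
print(p_t_given_x)
```

A `Counter` returns 0 for a recessive value that never co-occurs with the observed dominant value, so such values receive a conditional probability of 0 rather than raising an error.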
We also need $P(f \mid x)$, that is, how many people among those whose recessive feature is x use website F (for example, how many people aged 12 to 19 use the website). Normally the website's back-end statistics record the relevant user data, so the required data can be obtained from those statistics.
Further, when the recessive feature has n possible values in total,
Therefore,
The above description takes a single recessive feature as an example. Extended in the same way to multiple recessive features, the corrected Bayes' equation is:
Wherein L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes that the user uses the website.
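The corrected equation is not reproduced in this text. The following is a minimal sketch of how the corrected posterior could be evaluated for a single recessive feature, assuming that the independence condition holds and that the normalizing term P(t ∩ f) is obtained by summing the numerator over all values of the recessive feature (law of total probability). The function name and the input values are hypothetical.

```python
def corrected_posterior(p_x, p_t_given_x, p_f_given_x):
    """Return P(x | t, f) for every recessive value x.

    Assumption: t and f are conditionally independent given x, so
    P(t and f | x) = P(t | x) * P(f | x).
    """
    unnormalized = {
        x: p_t_given_x[x] * p_f_given_x[x] * p_x[x] for x in p_x
    }
    total = sum(unnormalized.values())  # P(t and f), by total probability
    if total == 0:
        raise ValueError("the observed combination of t and f has zero probability")
    return {x: value / total for x, value in unnormalized.items()}

# Hypothetical inputs for a recessive feature with two values.
p_x = {"A": 0.5, "B": 0.5}            # from the demographic database
p_t_given_x = {"A": 0.2, "B": 0.4}    # from the demographic database
p_f_given_x = {"A": 1.0, "B": 0.5}    # from the website's statistics
posterior = corrected_posterior(p_x, p_t_given_x, p_f_given_x)
print(posterior)
```

For several recessive features the same function applies if each key is a tuple of values (x_1, ..., x_L) rather than a single value.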
It should be noted that correcting the sample space so that the composition of the sample space of TA is the same as that of A, and the composition of TB is the same as that of B, requires the probability independence condition to hold. Conversely, when the probability independence condition holds, the composition of the sample space of TA is the same as that of A and the composition of TB is also the same as that of B. Therefore, before the corrected Bayes' equation is used, it can first be verified whether the probability independence condition holds. The method further comprises:
Judging whether, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds. The judging specifically comprises:
Calculating the value $P_1$ for any user according to the characteristic information of the whole population, the users of the website and the users' dominant features, wherein $P_1$ is calculated as follows:
$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$
Calculating the value $P_2$ for any user according to the characteristic information of the whole population, wherein $P_2$ is calculated as follows:
$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$
If $P_1$ and $P_2$ are equal for every user, the probability independence condition holds.
L is an integer greater than or equal to 1; when L is 1 there is a single recessive feature. x is a recessive feature of the user, t is a dominant feature of the user, and f denotes that the user uses the website.
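The judging step can be sketched as follows. This is only an illustration under stated assumptions: it supposes that joint records of (dominant value, website use, recessive value) are available for estimating the probabilities, it uses relative frequencies, and it compares P1 and P2 with a small numerical tolerance, whereas the text only requires that they be equal. The function name, record layout and data are hypothetical.

```python
import math

def independence_holds(records, t_obs, rel_tol=1e-9):
    """Check P(t and f | x) == P(t | x) * P(f | x) for every recessive value x.

    records: list of (t_v, f_v, x_v), where f_v is True if the person uses
    the website. The tolerance is an assumption added for floating point.
    """
    for x in {r[2] for r in records}:
        group = [r for r in records if r[2] == x]
        n = len(group)
        p_t = sum(1 for t, f, _ in group if t == t_obs) / n
        p_f = sum(1 for t, f, _ in group if f) / n
        p1 = sum(1 for t, f, _ in group if t == t_obs and f) / n  # P(t and f | x)
        p2 = p_t * p_f                                            # P(t | x) P(f | x)
        if not math.isclose(p1, p2, rel_tol=rel_tol, abs_tol=1e-12):
            return False
    return True

# Hypothetical data in which website use is unrelated to the name within group "A".
independent = [("Li", True, "A"), ("Li", False, "A"),
               ("Wang", True, "A"), ("Wang", False, "A")]
# Hypothetical data in which only people named "Li" use the website.
dependent = [("Li", True, "A"), ("Li", True, "A"),
             ("Wang", False, "A"), ("Wang", False, "A")]
print(independence_holds(independent, "Li"))
print(independence_holds(dependent, "Li"))
```

With real data, sampling noise means P1 and P2 will rarely match exactly, so a looser tolerance or a statistical test may be more practical; that choice is outside what the text specifies.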
Further, after the users' dominant and recessive features are obtained, the users' behavioral habits on the website can be analyzed from them, so that an advertising strategy can be formulated according to those habits or suitable value-added services can be pushed to users. Knowing both the dominant and the recessive features allows the users' behavioral habits to be determined more accurately, which makes the advertising strategy or the pushed value-added services more appropriate and improves the success rate.
The present invention corrects the sample-space bias that arises in recessive-feature estimation, bringing the estimate closer to the correct theoretical value; the stronger the bias of the sample space, the greater the need for the correction. The user bases of many currently popular services are strongly biased. For example, 2012 data for the foreign social networking site Facebook show that 83% of people aged 18-29 used it, whereas only 40% of people aged 65 and over did. Without correction, ordinary maximum-likelihood and Bayesian algorithms would, relative to the probability of the 18-29 group, overstate the probability that a given user is aged 65 or over by more than a factor of 2. This materially affects every subsequent calculation and analysis based on it and may seriously bias the final result.
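The factor of more than 2 can be checked with a short calculation. The sketch below assumes the overstatement is the ratio of the two usage rates quoted above, which is what results when the factor P(f|x) is omitted from the odds between the two age groups; the variable names are illustrative.

```python
# Usage rates quoted in the text (2012 Facebook data).
p_f_young = 0.83   # P(f | age 18-29)
p_f_old = 0.40     # P(f | age 65 and over)

# Corrected odds (65+ versus 18-29) equal the uncorrected odds multiplied by
# p_f_old / p_f_young, so omitting the correction overstates the 65+ group by:
overstatement = p_f_young / p_f_old
print(f"overstatement factor: {overstatement:.3f}")
```

The ratio is a little above 2, in line with the statement in the text.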
In this embodiment of the present invention, the data of the users of the website are added when the users' recessive features are calculated. As a result, when calculating the probability that a person among the website's users who has the dominant feature also has a given recessive feature, the sample space is the website's user group rather than the national demographic data. The discrepancy between sample spaces is thereby removed, the error it introduces is eliminated, and the result is corrected.
The present invention also provides a first embodiment of a device for estimating the distribution of a user's recessive features. As shown in Fig. 5, the device comprises a first acquisition module 301, a second acquisition module 302 and a computing module 304.
The first acquisition module 301 obtains the users of a website and the users' dominant features. The second acquisition module 302 obtains the characteristic information of the whole population from a national demographic database, wherein the characteristic information comprises dominant features and recessive features.
The computing module 304 calculates the distribution of the users' recessive features with a Bayesian algorithm, according to the characteristic information of the whole population, the users of the website and the users' dominant features. Specifically, if, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, the computing module 304 calculates the user's recessive features according to the following formula:
Wherein L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes that the user uses the website. For the origin of this formula, refer to the method embodiment for estimating the distribution of a user's recessive features described above; it is not repeated here.
The device may further comprise a judging module 303 and an analysis module 305. The judging module 303 is configured to calculate the value $P_1$ for any user according to the characteristic information of all users, the users of the website and the users' dominant features, wherein $P_1$ is calculated as follows:
$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$
and to calculate the value $P_2$ for any user according to the characteristic information of all users, the users of the website and the users' dominant features, wherein $P_2$ is calculated as follows:
$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$
and to judge whether $P_1$ and $P_2$ are equal for any user; if they are equal, the probability independence condition holds.
The analysis module 305 analyzes the users' behavioral habits according to the users' dominant and recessive features, so that an advertising strategy can be formulated according to those habits or suitable value-added services can be pushed to users. Knowing both the dominant and the recessive features allows the users' behavioral habits to be determined more accurately, which makes the advertising strategy or the pushed value-added services more appropriate and improves the success rate.
In this embodiment of the present invention, the computing module 304 adds the data of the users of the website when calculating the users' recessive features. As a result, when calculating the probability that a person among the website's users who has the dominant feature also has a given recessive feature, the sample space is the website's user group rather than the national demographic data. The discrepancy between sample spaces is thereby removed, the error it introduces is eliminated, and the result is corrected.
The present invention also provides a second embodiment of a device for estimating the distribution of a user's recessive features. As shown in Fig. 6, the device comprises a processor 401, a memory 402 and a bus 403. The processor 401 and the memory 402 are both connected to the bus 403.
The processor 401 is configured to obtain the users of a website and the users' dominant features; to obtain the characteristic information of the whole population from a demographic database, wherein the characteristic information comprises dominant features and recessive features; and to calculate the distribution of the users' recessive features with a Bayesian algorithm, according to the characteristic information of the whole population, the users of the website and the users' dominant features.
Further, the step in which the processor 401 calculates the distribution of the users' recessive features with a Bayesian algorithm, according to the characteristic information of the whole population, the users of the website and the users' dominant features, is specifically: if, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, calculating the user's recessive features according to the following formula:
Wherein L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes that the user uses the website. The processor 401 also judges whether, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds. The judging specifically comprises:
Calculating the value $P_1$ for any user according to the characteristic information of the whole population, the users of the website and the users' dominant features, wherein $P_1$ is calculated as follows:
$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$
Calculating the value $P_2$ for any user according to the characteristic information of the whole population, wherein $P_2$ is calculated as follows:
$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$
If $P_1$ and $P_2$ are equal for every user, the probability independence condition holds.
The processor 401 is further configured to analyze the users' behavioral habits according to the users' dominant and recessive features.
It should be noted that the users of the website and the users' dominant features can be obtained from the website's back-end statistics and stored in the memory 402, from which the processor 401 retrieves them. The content of the demographic database can likewise be obtained in advance from public channels by the website's back end and stored in the memory 402, to be retrieved from the memory 402 when the demographic database is needed; alternatively, it can be obtained from public channels at the time it is needed.
In this embodiment of the present invention, the processor 401 adds the data of the users of the website when calculating the users' recessive features. As a result, when calculating the probability that a person among the website's users who has the dominant feature also has a given recessive feature, the sample space is the website's user group rather than the national demographic data. The discrepancy between sample spaces is thereby removed, the error it introduces is eliminated, and the result is corrected.
The foregoing describes only embodiments of the present invention and does not thereby limit the scope of the claims of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.