CN101477552A - Website user rank division method - Google Patents


Info

Publication number
CN101477552A
Authority
CN
China
Prior art keywords
user
access
overbar
classification
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2009100102926A
Other languages
Chinese (zh)
Inventor
刘峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BANRUO NETWORK SCIENCE & TECHNOLOGY Co Ltd LIAONING
Original Assignee
BANRUO NETWORK SCIENCE & TECHNOLOGY Co Ltd LIAONING
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BANRUO NETWORK SCIENCE & TECHNOLOGY Co Ltd LIAONING filed Critical BANRUO NETWORK SCIENCE & TECHNOLOGY Co Ltd LIAONING
Priority to CNA2009100102926A priority Critical patent/CN101477552A/en
Publication of CN101477552A publication Critical patent/CN101477552A/en
Pending legal-status Critical Current

Abstract

The invention relates to a method for classifying and grading the users of an Internet website, comprising the following steps: (1) access records are first filtered to remove the influence of crawler (spider) access and of malicious, program-driven access; (2) for each user, the number of page views P, the number of visits S and the total visit duration T within a given period are computed, forming a three-dimensional clustering space P, S, T; and (3) the classification grade of each user is determined by a clustering calculation over the users. Unlike classical clustering algorithms, the method needs no class-center updates and completes the clustering in a single pass. The division method is concise, greatly reduces the amount of computation, increases calculation speed, and is easy to interpret and apply.

Description

Website user rank division method
Technical field:
The present invention relates to a method for classifying and grading the users of an Internet website.
Background art:
At present, websites can understand how users access them only through simple web-browsing indicators such as page views (PV) and the number of unique visiting IP addresses. There is still no systematic method for classifying website users.
A large website is visited by a large number of users every day. How should these users be divided into grades? Which indicators and which division method can distinguish users accurately? How can the characteristics of different user groups, and the boundaries between them, be found? Providing targeted service and management for different users on this basis is an important part of website management, and its core is the classified management of users.
Summary of the invention:
To solve the above problems, the invention provides a systematic and effective method for dividing website users into grades. The method quickly finds the "centers" of the different user groups and the boundaries between them, and divides users into grades accordingly.
The object of the invention is achieved through the following technical solution:
A website user rank division method, characterized in that: access records are first filtered; then, for each user, the number of page views P, the number of visits S and the total visit duration T within a given period are counted, forming a three-dimensional space P, S, T; the users are clustered in this space, and user grades are divided accordingly.
The "access record filtering" comprises:
(1) Browser filtering:
The browser field in each access record is examined; records that come not from a conventional browser such as IE or Firefox but carry a dedicated search engine crawler marker are excluded from the statistics;
(2) Access duration and access density filtering:
Access duration is the length of a single visit by a user to the website;
Access density is the number of pages viewed per unit time;
"Thresholds" are set for access duration and access density, and access records that exceed the thresholds are excluded from the statistics.
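The two filters can be sketched as a single predicate over visit records. This is a minimal illustration, not the patent's implementation: the `Visit` record layout, the `is_crawler` flag (assumed to be set upstream from the browser field) and the function name are hypothetical, and the default thresholds of 2 hours and 10 pages per minute are the example values given later in the embodiment. Treating visits shorter than one minute as lasting one minute is a further assumption made to keep the density well defined.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    """Hypothetical record layout; the patent does not define one."""
    user_id: str
    is_crawler: bool    # assumed to be derived from the browser field upstream
    duration_s: float   # length of this single visit, in seconds
    page_views: int     # pages viewed during this visit


def keep_visit(visit, max_duration_s=2 * 3600, max_pages_per_min=10.0):
    """Return True if the visit should take part in the statistics."""
    if visit.is_crawler:
        return False                      # browser filtering
    if visit.duration_s > max_duration_s:
        return False                      # access duration threshold
    # Assumption: visits shorter than a minute count as one minute.
    minutes = max(visit.duration_s / 60.0, 1.0)
    return visit.page_views / minutes <= max_pages_per_min  # density threshold
```

A list of records would then be filtered with `[v for v in visits if keep_visit(v)]` before any per-user statistics are computed.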
The "clustering" calculation proceeds as follows:
(1) Sorting in three dimensions
The values $U_i(p_i, s_i, t_i)$ of all users are sorted from largest to smallest in each of the three directions P, S and T;
(2) Computing the mean difference between users in each dimension
For the sorted sequences, the mean difference between adjacent users is computed separately for P, S and T;
Let: $\bar{p}$ be the mean difference of PV;
$\bar{s}$ be the mean difference of the visit count;
$\bar{t}$ be the mean difference of the total dwell time on the website;
$$\bar{p} = \sum_{i=1}^{n} (p_{i+1} - p_i) / (n - 1)$$
$$\bar{s} = \sum_{i=1}^{n} (s_{i+1} - s_i) / (n - 1)$$
$$\bar{t} = \sum_{i=1}^{n} (t_{i+1} - t_i) / (n - 1)$$
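The mean-difference step can be sketched in a few lines. This is an illustration under stated assumptions: the function name is hypothetical, the values are sorted in ascending order (as the embodiment describes) so that the result is non-negative, and the sum runs over the n - 1 adjacent pairs that actually exist in a sequence of n values.

```python
def mean_adjacent_difference(values):
    """Mean of the differences between neighbors in the sorted sequence."""
    ordered = sorted(values)          # assumption: ascending order
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two users")
    # Sum the n - 1 differences between each value and its predecessor.
    total = sum(ordered[i + 1] - ordered[i] for i in range(n - 1))
    return total / (n - 1)
```

The same function would be applied once to the page-view values, once to the visit counts and once to the dwell times.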
(3) Determining the center of each classification grade in the three dimensions P, T and S
$R$ is the preset number of classes;
$j$ is the index of a class;
Let:
$R_j^p$ be the center value of class $j$ in the P direction;
$R_j^s$ be the center value of class $j$ in the S direction;
$R_j^t$ be the center value of class $j$ in the T direction;
$j$ is the class number, an integer;
For the center of the first class, that is, when $j = 1$:
$$R_1^p = (n / (2 \times R)) \times \bar{p}$$
$$R_1^s = (n / (2 \times R)) \times \bar{s}$$
$$R_1^t = (n / (2 \times R)) \times \bar{t}$$
The centers of the other classes are computed as follows:
When $1 < j < R$:
$$R_j^p = R_1^p + (j - 1) \times (n / R) \times \bar{p}$$
$$R_j^s = R_1^s + (j - 1) \times (n / R) \times \bar{s}$$
$$R_j^t = R_1^t + (j - 1) \times (n / R) \times \bar{t}$$
This gives the "center" of each class in the three directions P, S and T:
$$R_j(R_j^p, R_j^s, R_j^t)$$
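The center formulas translate directly into code for one dimension at a time. The sketch below is illustrative only; the function and parameter names are hypothetical. It also assumes the second formula is applied to every class after the first, including the last one, since the assignment step compares each user against all R centers.

```python
def class_centers(n, num_classes, mean_diff):
    """Centers of all classes along one dimension (P, S or T).

    First center:  (n / (2 * R)) * mean_diff
    Later centers: first + (j - 1) * (n / R) * mean_diff
    """
    first = n / (2 * num_classes) * mean_diff
    step = n / num_classes * mean_diff
    # Assumption: the recurrence is used for every j from 2 up to R inclusive.
    return [first + (j - 1) * step for j in range(1, num_classes + 1)]
```

For instance, `class_centers(100, 4, 2.0)` returns `[25.0, 75.0, 125.0, 175.0]`: the centers are evenly spaced, each class covering n / R mean differences.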
(4) Determining each user's classification grade
For each user, the sum of absolute differences between the user's coordinates $U_i(p_i, s_i, t_i)$ and each class center $R_j(R_j^p, R_j^s, R_j^t)$ is computed and compared, for $j$ from 1 to $R$, that is, $1 \le j \le R$:
$$|U_i - R_j| = |p_i - R_j^p| + |s_i - R_j^s| + |t_i - R_j^t|$$
The minimum of $|U_i - R_j|$, that is, $\min\{|U_i - R_j|\}$, identifies the class center nearest to the user;
User $U_i$ is assigned to the user group of that $R_j$, which completes the division of user grades.
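The assignment step is a nearest-center search under the sum-of-absolute-differences (Manhattan) distance. A minimal sketch follows; the function name and the tuple layout `(p, s, t)` are hypothetical, and resolving ties in favor of the lowest class index is an assumption the patent does not address.

```python
def assign_class(user, centers):
    """Return the 1-based index j of the center nearest to `user`.

    `user` is a (p, s, t) tuple and `centers` is a list of (p, s, t) tuples,
    one per class, ordered by class index.
    """
    distances = [
        abs(user[0] - c[0]) + abs(user[1] - c[1]) + abs(user[2] - c[2])
        for c in centers
    ]
    # list.index returns the first minimum, so ties go to the lowest j.
    return distances.index(min(distances)) + 1
```

Each user is visited exactly once, so the whole division costs one distance evaluation per user per class.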
Beneficial effects of the invention:
1. Objectivity
The invention classifies visiting users along three dimensions: page views (PV), number of visits (Session) and total dwell time. Together these describe the grade of a website's visitors objectively.
2. Reliability
Website access records contain many search engine crawler records and some abnormal access records. The pre-processing step of the invention removes this noise. On the one hand, it filters out a large amount of junk information, so the data are cleaner and the results more reliable; on the other hand, it substantially reduces pointless computation.
3. Efficiency
Classical clustering algorithms must repeatedly update the class "centers" until they no longer change. After every update, the distance from each cluster member to the new class centers must be recomputed (M × N: the number of members times the number of classes), so the amount of computation is very large. The invention uses a fast clustering method that determines the final class "centers" by sorting. No center updates are needed and the classes are divided in a single pass, which substantially reduces computation and increases calculation speed.
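The efficiency claim can be made concrete with a rough count of distance evaluations. The symbol $I$, the number of center-update iterations a classical algorithm runs before it stops, is introduced here for illustration and does not appear in the patent; the $O(n \log n)$ term is the usual cost of a comparison sort and is assumed for the sorting step. With $n$ users and $R$ classes:

$$C_{\text{classical}} \approx I \times n \times R, \qquad C_{\text{single pass}} \approx n \times R + O(n \log n)$$

Under these assumptions, the saving grows with the number of iterations the classical algorithm would need.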
Description of drawings:
Fig. 1 is the flow chart of the website user rank division method of the invention.
Embodiment:
For convenience, the following notation is used in the description below:
$p_i$ is the number of page views (PV) of each user;
$s_i$ is the number of sessions (Session) of each user;
$t_i$ is the total dwell time of each user;
$n$ is the number of users in the statistical calculation;
$R$ is the number of classes into which user grades are divided;
$i$ denotes the $i$-th user, $1 \le i \le n$;
$j$ denotes the $j$-th class, $1 \le j \le R$;
$U_i\{p_i, s_i, t_i\}$ is the coordinate of the $i$-th user in the three-dimensional space P, S, T;
$R_j(R_j^p, R_j^s, R_j^t)$ is the center coordinate of the $j$-th class in the three-dimensional space P, S, T.
The method must first determine which indicators are used to measure how much a user likes the website. The chosen indicators should reflect the relationship between user and website directly and concisely. Too many indicators do not improve clustering precision; they add pointless computation, and overlapping indicators distort the final clustering result.
The invention compiles statistics on users' access records over a period of time, using three indicators:
Indicator 1: the number of page views, PV (Page View), of each user;
Indicator 2: the number of visits, Session;
Indicator 3: the total dwell time of the access sessions, Time;
The three indicators form the three-dimensional space P, S, T in which clustering is performed.
The number of user grades is set to R, that is, the number of classes into which users are divided.
Before the calculation begins, it must be decided how many classes the users are to be divided into. For example, suppose statistics show that the website has 100,000 users, who are to be divided into 4 classification grades: excellent, good, average and low.
Here: number of users n = 100,000; number of classes R = 4.
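With these example values, the class-center formulas from the summary can be worked through for the P direction. The mean difference $\bar{p}$ is left symbolic because the patent gives no sample data, and applying the second formula up to $j = R$ is an assumption:

$$R_1^p = \frac{100000}{2 \times 4} \times \bar{p} = 12500\,\bar{p}$$
$$R_2^p = R_1^p + 1 \times \frac{100000}{4} \times \bar{p} = 37500\,\bar{p}$$
$$R_3^p = R_1^p + 2 \times 25000\,\bar{p} = 62500\,\bar{p}$$
$$R_4^p = R_1^p + 3 \times 25000\,\bar{p} = 87500\,\bar{p}$$

The S and T directions follow the same pattern with $\bar{s}$ and $\bar{t}$ in place of $\bar{p}$.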
The steps of the website user rank division method of the invention are as follows, as shown in Fig. 1:
(1) Parameter setting step 101 comprises:
Deciding how many classes the users are divided into, that is, setting the value of R;
Setting the access session (Session) duration threshold used for filtering, for example 2 hours;
Setting the access density threshold used for filtering, that is, the number of pages (PV) a user views per unit time, for example 10 per minute;
(2) Filtering access records: filtering module 102 reads access record database 103 and filters it, rejecting abnormal access records.
Abnormal access records include:
A. Search engine crawler interference
The browser marker in each access record is used to judge whether the record comes from a search engine crawler.
For example, the browser marker of Baidu contains baiduspider, that of Microsoft's MSN contains msnbot, that of Google contains googlebot, and so on.
Every search engine has its own browser markers in the access log. A search engine may have several browser markers to distinguish crawlers with different search purposes, such as image, text and music crawlers.
Search engine access records must not take part in clustering, otherwise they distort the result. They are filtered out when the clustering data are prepared.
B. Human interference
A normal person visits a website through a browser such as IE or Firefox, and the visit is a gradual, gentle process. Access by a virus or a hacker is carried out by a program rather than a browser and is a continuous, fast, long-running process. Such access records can be identified by technical means.
The filtering module removes access records whose single-visit duration exceeds the 2-hour limit set in parameter setting step 101, and records whose page views in a single visit exceed the density of 10 per minute set in parameter setting step 101;
After filtering, the data are stored in filtered access record database 104;
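Crawler detection by browser marker can be sketched as a substring lookup in the user-agent string. The marker table below holds only the three crawlers named in the text (Baidu's crawler identifies itself as Baiduspider); the function name and the mapping to engine names are hypothetical, and a real access log would need a much longer list.

```python
# Markers for the three crawlers named in the text; illustrative, not complete.
CRAWLER_MARKERS = {
    "baiduspider": "Baidu",
    "msnbot": "MSN",
    "googlebot": "Google",
}


def crawler_source(user_agent):
    """Return the search engine name if the UA carries a known marker, else None."""
    ua = user_agent.lower()   # markers are matched case-insensitively
    for marker, engine in CRAWLER_MARKERS.items():
        if marker in ua:
            return engine
    return None
```

Records for which `crawler_source` returns a name would be dropped before clustering, while records returning `None` go on to the duration and density checks.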
(3) Statistics and sorting module 105 reads the records in filtered access record database 104 and computes the values $p_i$, $s_i$, $t_i$ of each user
A. Computing $p_i$
Count the total page views (PV) of user $U_i$ within the specified period, that is, the value $p_i$ in the P direction;
B. Computing $s_i$
Count the number of visits (Session) of user $U_i$ within the specified period, that is, the value $s_i$ in the S direction;
C. Computing $t_i$
Count the total visit duration (Time) of user $U_i$ within the specified period, that is, the value $t_i$ in the T direction;
D. Sorting
The sequences $p_i$, $s_i$, $t_i$ obtained above are sorted from smallest to largest, and the sorted result forms $U_i$ table 106;
E. Forming the spatial coordinates
Obtain the coordinates $U_i(p_i, s_i, t_i)$ of all users as points in the three-dimensional space P, S, T;
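The per-user statistics of this step amount to a group-by over the filtered session records. The sketch below assumes, hypothetically, that each record is a tuple `(user_id, page_views, duration_s)` describing one session; neither this layout nor the function name comes from the patent.

```python
from collections import defaultdict


def user_coordinates(visits):
    """Aggregate session records into {user_id: (p, s, t)}.

    p: total page views, s: number of sessions, t: total dwell time.
    """
    totals = defaultdict(lambda: [0, 0, 0.0])
    for user_id, page_views, duration_s in visits:
        acc = totals[user_id]
        acc[0] += page_views   # p: accumulate page views
        acc[1] += 1            # s: one more session for this user
        acc[2] += duration_s   # t: accumulate dwell time
    return {user: tuple(acc) for user, acc in totals.items()}
```

Sorting each of the three value sequences afterwards, for example with `sorted(c[0] for c in coords.values())`, gives the ordered lists used in the mean-difference step.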
(4) Mean difference calculation 107 reads the sorted values $p_i$, $s_i$, $t_i$ from $U_i$ table 106,
Let: $\bar{p}$ be the mean difference of PV;
$\bar{s}$ be the mean difference of the visit count;
$\bar{t}$ be the mean difference of the total dwell time on the website;
$$\bar{p} = \sum_{i=1}^{n} (p_{i+1} - p_i) / (n - 1)$$
The formula above means: in the sorted PV sequence, each user's page views $p_{i+1}$ minus the preceding user's page views $p_i$ are accumulated, and the sum is divided by the number of users minus 1, giving the mean difference $\bar{p}$ in page views between users.
$$\bar{s} = \sum_{i=1}^{n} (s_{i+1} - s_i) / (n - 1)$$
The formula above means: in the sorted visit-count sequence, each user's visit count $s_{i+1}$ minus the preceding user's $s_i$ is accumulated, and the sum is divided by the number of users minus 1, giving the mean difference $\bar{s}$ in visit count between users.
$$\bar{t} = \sum_{i=1}^{n} (t_{i+1} - t_i) / (n - 1)$$
The formula above means: in the sorted dwell-time sequence, each user's dwell time $t_{i+1}$ minus the preceding user's $t_i$ is accumulated, and the sum is divided by the number of users minus 1, giving the mean difference $\bar{t}$ in dwell time between users.
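A short derivation shows what the mean difference measures. Reading the sum as running over the $n - 1$ adjacent pairs of the sorted sequence (an assumption, since a sequence of $n$ values has no element $p_{n+1}$), the intermediate terms cancel:

$$\bar{p} = \frac{1}{n-1} \sum_{i=1}^{n-1} (p_{i+1} - p_i) = \frac{(p_2 - p_1) + (p_3 - p_2) + \cdots + (p_n - p_{n-1})}{n-1} = \frac{p_n - p_1}{n-1}$$

Under this reading, $\bar{p}$ depends only on the largest and smallest values, so it can be obtained from the two ends of the sorted list. The same holds for $\bar{s}$ and $\bar{t}$.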
(5) Step 108 determines the center $R_j$ of each classification grade in the three dimensions P, T and S
Let:
$R_j^p$ be the center value of class $j$ in the P direction;
$R_j^s$ be the center value of class $j$ in the S direction;
$R_j^t$ be the center value of class $j$ in the T direction;
$j$ is the class number, an integer;
When $j = 1$:
$$R_1^p = (n / (2 \times R)) \times \bar{p}$$
$$R_1^s = (n / (2 \times R)) \times \bar{s}$$
$$R_1^t = (n / (2 \times R)) \times \bar{t}$$
When $1 < j < R$:
$$R_j^p = R_1^p + (j - 1) \times (n / R) \times \bar{p}$$
$$R_j^s = R_1^s + (j - 1) \times (n / R) \times \bar{s}$$
$$R_j^t = R_1^t + (j - 1) \times (n / R) \times \bar{t}$$
This gives the "center" of each class in the three directions P, S and T:
$$R_j(R_j^p, R_j^s, R_j^t)$$
(6) Determining each user's classification grade:
Grade classification module 109 reads the data in $U_i$ table 106. For each user, the sum of absolute differences between the user's coordinates $U_i(p_i, s_i, t_i)$ and each class center $R_j(R_j^p, R_j^s, R_j^t)$ is computed and compared:
$$|U_i - R_j| = |p_i - R_j^p| + |s_i - R_j^s| + |t_i - R_j^t|$$
The minimum of $|U_i - R_j|$, that is, $\min\{|U_i - R_j|\}$, identifies the class center nearest to the user;
User $U_i$ is assigned to the user group of that $R_j$, which completes the division of user grades; the classification result is stored in classification database 110.

Claims (3)

1. A website user rank division method, characterized in that: access records are first filtered; then, for each user, the number of page views P, the number of visits S and the total visit duration T within a given period are counted, forming a three-dimensional space P, S, T; the users are clustered in this space, and user grades are divided accordingly.
2. The website user rank division method of claim 1, characterized in that the "access record filtering" comprises:
(1) Browser filtering:
The browser field in each access record is examined; records that come not from a conventional browser such as IE or Firefox but carry a dedicated search engine crawler marker are excluded from the statistics;
(2) Access duration and access density filtering:
Access duration is the length of a single visit by a user to the website;
Access density is the number of pages viewed per unit time;
"Thresholds" are set for access duration and access density, and access records that exceed the thresholds are excluded from the statistics.
3. The website user rank division method of claim 1, characterized in that the "clustering" calculation proceeds as follows:
(1) Sorting in three dimensions
The values $U_i(p_i, s_i, t_i)$ of all users are sorted from largest to smallest in each of the three directions P, S and T;
(2) Computing the mean difference between users in each dimension
For the sorted sequences, the mean difference between adjacent users is computed separately for P, S and T;
Let: $\bar{p}$ be the mean difference of page views (PV);
$\bar{s}$ be the mean difference of the visit count;
$\bar{t}$ be the mean difference of the total dwell time on the website;
$$\bar{p} = \sum_{i=1}^{n} (p_{i+1} - p_i) / (n - 1)$$
$$\bar{s} = \sum_{i=1}^{n} (s_{i+1} - s_i) / (n - 1)$$
$$\bar{t} = \sum_{i=1}^{n} (t_{i+1} - t_i) / (n - 1)$$
(3) Determining the center of each classification grade in the three dimensions P, T and S
$R$ is the preset number of classes;
$j$ is the index of a class;
Let:
$R_j^p$ be the center value of class $j$ in the page view (P) direction;
$R_j^s$ be the center value of class $j$ in the visit count (S) direction;
$R_j^t$ be the center value of class $j$ in the dwell time (T) direction;
$j$ is the class number, an integer;
For the center of the first class, that is, when $j = 1$:
$$R_1^p = (n / (2 \times R)) \times \bar{p}$$
$$R_1^s = (n / (2 \times R)) \times \bar{s}$$
$$R_1^t = (n / (2 \times R)) \times \bar{t}$$
The centers of the other classes are computed as follows:
When $1 < j < R$:
$$R_j^p = R_1^p + (j - 1) \times (n / R) \times \bar{p}$$
$$R_j^s = R_1^s + (j - 1) \times (n / R) \times \bar{s}$$
$$R_j^t = R_1^t + (j - 1) \times (n / R) \times \bar{t}$$
This gives the "center" of each class in the three directions P, S and T:
$$R_j(R_j^p, R_j^s, R_j^t)$$
(4) Determining each user's classification grade
For each user, the sum of absolute differences between the user's coordinates $U_i(p_i, s_i, t_i)$ and each class center $R_j(R_j^p, R_j^s, R_j^t)$ is computed and compared:
$$|U_i - R_j| = |p_i - R_j^p| + |s_i - R_j^s| + |t_i - R_j^t|$$
The minimum of $|U_i - R_j|$, that is, $\min\{|U_i - R_j|\}$, identifies the class center nearest to the user;
User $U_i$ is assigned to the user group of that $R_j$, which completes the division of user grades.
CNA2009100102926A 2009-02-03 2009-02-03 Website user rank division method Pending CN101477552A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2009100102926A CN101477552A (en) 2009-02-03 2009-02-03 Website user rank division method

Publications (1)

Publication Number Publication Date
CN101477552A true CN101477552A (en) 2009-07-08

Family

ID=40838268

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2009100102926A Pending CN101477552A (en) 2009-02-03 2009-02-03 Website user rank division method

Country Status (1)

Country Link
CN (1) CN101477552A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101887390A (en) * 2010-06-23 2010-11-17 宇龙计算机通信科技(深圳)有限公司 Method and device for evaluating rating of application software
CN102929938A (en) * 2012-09-28 2013-02-13 北京奇艺世纪科技有限公司 Playable network resource ordering method and device
CN102929938B (en) * 2012-09-28 2015-09-30 北京奇艺世纪科技有限公司 A kind of sort method and device playing type Internet resources
CN103577535A (en) * 2013-09-02 2014-02-12 西安交通大学 Method for objective evaluation of e-learning user experience quality
CN103605714A (en) * 2013-11-14 2014-02-26 北京国双科技有限公司 Method and device for identifying abnormal data of websites
CN104156466B (en) * 2014-08-22 2017-12-12 北京京东尚科信息技术有限公司 A kind of resource allocation methods and device based on grade
CN104156466A (en) * 2014-08-22 2014-11-19 北京京东尚科信息技术有限公司 Grade-based method and device for allocating resources
CN104765776B (en) * 2015-03-18 2018-06-05 华为技术有限公司 The clustering method and device of a kind of data sample
CN104765776A (en) * 2015-03-18 2015-07-08 华为技术有限公司 Data sample clustering method and device
CN104992182A (en) * 2015-06-29 2015-10-21 北京京东尚科信息技术有限公司 Method and device for determining user level
CN107306252A (en) * 2016-04-21 2017-10-31 中国移动通信集团河北有限公司 A kind of data analysing method and system
WO2018006631A1 (en) * 2016-07-08 2018-01-11 武汉斗鱼网络科技有限公司 User level automatic segmentation method and system
CN106210044A (en) * 2016-07-11 2016-12-07 焦点科技股份有限公司 A kind of any active ues recognition methods based on the behavior of access
CN106210044B (en) * 2016-07-11 2019-06-11 焦点科技股份有限公司 A kind of any active ues recognition methods based on access behavior
CN109145934A (en) * 2017-12-22 2019-01-04 北京数安鑫云信息技术有限公司 User behavior data processing method, medium, equipment and device based on log
CN111966951A (en) * 2020-07-06 2020-11-20 东南数字经济发展研究院 User group hierarchy dividing method based on social e-commerce transaction data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20090708