CN101477552A - Website user rank division method - Google Patents
Website user rank division method
- Publication number
- CN101477552A
- Authority
- CN
- China
- Prior art keywords
- user
- access
- overbar
- classification
- website
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to a method for classifying the users of an Internet website into grades, comprising the following steps: (1) access records are first filtered to eliminate the influence of search-engine spider access and artificial attack access; (2) for each user, the number of pages viewed P, the number of visits S, and the total visit duration T within a given period are calculated, forming a three-dimensional clustering space P, S, T; and (3) the classification grade of each user is determined by clustering the users. Unlike classical clustering algorithms, the method needs no class-center updates and completes the clustering in a single pass; the division method is concise, greatly reduces the amount of computation, increases calculation speed, and is readily interpretable and practical.
Description
Technical field:
The present invention relates to a method for classifying the users of an Internet website into grades.
Background technology:
At present, a website can understand how users access it only by observing simple indicators such as page views (PV) and the number of unique visiting IP addresses; a scientific way of classifying website users is still lacking.
A large website is visited by a large number of users every day. How should the users who visit the website be graded? Which indicators and which division method can distinguish users scientifically and accurately? How can the characteristics of different user groups and the borders between those groups be found? Providing effective, targeted service and management for different users on this basis is an important part of website management work, and its core is the classified management of users.
Summary of the invention:
To solve the above problems, the invention provides a scientific and effective website user rank division method, which can quickly find the "centers" of the different user groups and the borders between the groups, and divides user grades accordingly.
The object of the invention is achieved through the following technical solution:
A website user rank division method, characterized in that: access records are first filtered; for each user, the number of pages viewed P, the number of visits S, and the total visit duration T within a given period are counted, forming a three-dimensional P, S, T space; the users are then clustered, and user grades are divided accordingly.
The "access record filtering" comprises:
(1) Browser filtering:
The browser field of each access record is examined; records that do not come from a conventional browser such as IE or Firefox and that carry a search-engine crawler marker are excluded from the statistics;
(2) Access duration and access density filtering:
Access duration is the length of a single visit by a user to the website;
Access density is the number of pages viewed per unit time;
Thresholds are set for access duration and access density, and access records that exceed the thresholds do not take part in the statistics.
The "clustering" calculation proceeds as follows:
(1) Three-dimensional sorting
The values $U_i(p_i, s_i, t_i)$ of all users are sorted from large to small in each of the three directions P, S, T;
(2) Computing the mean difference between users in each of the three dimensions
For the sorted sequences, the mean difference between users is computed separately for each of the three dimensions P, S, T;
Let: $\bar{p}$ be the mean difference of PV;
$\bar{s}$ be the mean difference of the number of visits;
$\bar{t}$ be the mean difference of the total time spent on the website;
(3) Determining the center of each user classification grade in the three dimensions P, T, S
R is the number of classes to be divided, which is set in advance;
j is the index of a class;
is the center value of the j-th class in the S direction;
j is a class number and is an integer;
When calculating the center of the first class, i.e. when j = 1:
The centers of the other classes are calculated as follows:
When 1 < j < R:
(4) Determining the classification grade of each user
The three-dimensional coordinates $U_i(p_i, s_i, t_i)$ of each user are compared with the coordinates of each class center by the sum of the absolute values of their differences, for j from 1 to R, i.e. 1 ≤ j ≤ R.
Taking the minimum of $|U_i - R_j|$, i.e. $\min\{|U_i - R_j|\}$, determines which class center the user is nearest to;
User $U_i$ is assigned to user group $R_j$, and the user grade division is complete.
Beneficial effect of the present invention:
1. Objectivity
The invention classifies visitors using three dimensions: page views (PV), number of visits (Session), and total time spent on the website, and can therefore describe the grade of a website visitor objectively.
2. Reliability
Website access records contain a large number of search-engine crawler records and some abnormal access records. By denoising the data in preprocessing, the invention on the one hand filters out a large amount of junk information, making the data cleaner and the results reliable; on the other hand, it greatly reduces meaningless computation.
3. Efficiency
A classical clustering algorithm must repeatedly update the class "centers" and stops only when they no longer change; every update requires recomputing the distance from each cluster member to each class center (M×N: the number of members multiplied by the number of classes), so the amount of computation is huge. The invention adopts a high-speed clustering method that determines the final class "centers" by sorting, so no class-center updates are needed and the classes can be divided directly in one pass, which greatly reduces the amount of computation and greatly increases calculation speed.
Description of drawings:
Fig. 1 is a flowchart of the website user rank division method of the invention.
Embodiment:
For convenience, the following notation is used in the description below:
$p_i$ denotes the PV count of each user;
$s_i$ denotes the number of visit sessions (Session) of each user;
$t_i$ denotes the total time each user spends on the website;
n denotes the number of users in the statistical calculation;
R denotes the number of classes into which user grades are to be divided;
i denotes the i-th user, 1 ≤ i ≤ n;
j denotes the j-th class, 1 ≤ j ≤ R;
$U_i(p_i, s_i, t_i)$ denotes the coordinates of the i-th user in the three-dimensional P, S, T space;
The website user rank division method must first determine which technical indicators to use to measure how much a user likes the website. The selected indicators should reflect the relationship between user and website directly and concisely. Too many indicators do not improve clustering precision; they add meaningless computation, and overlap between indicators distorts the final clustering result.
The invention gathers statistics on users' access records over a period of time, using three indicators:
Indicator 1: the number of page views, PV (Page View), of each user;
Indicator 2: the number of visits (Session);
Indicator 3: the total duration of the visit sessions (Time);
The three statistical indicators form a three-dimensional P, S, T space in which clustering is performed.
The number of user grades is set to R, i.e. the number of classes into which the users are divided.
Before the calculation begins, the number of classes into which the users are to be divided must first be determined. For example, suppose statistics show that the website has 100,000 users, who are to be divided into 4 classification grades: excellent, good, average, and low.
Here: number of users n = 100,000; number of classes R = 4.
The steps of the website user rank division method of the invention are as follows, as shown in Fig. 1:
(1) Parameter setting step 101 comprises:
Determining the number of classes into which the users are divided, i.e. the value of R;
Setting the session duration threshold for filtering, for example 2 hours;
Setting the access density threshold for filtering, i.e. the number of PVs by one user per unit time, for example 10 per minute;
(2) Filtering access records: filtering module 102 reads access record database 103 and filters it, rejecting abnormal access records.
Abnormal access records include:
A. Search-engine crawler interference
The browser marker in each access record is used to judge whether the record comes from a search-engine crawler.
For example, the browser marker of Baidu contains baiduspider, that of Microsoft's MSN contains msnbot, that of Google contains googlebot, and so on.
Each search engine has its own browser marker in the access log. A search engine may have several browser markers to distinguish crawlers with different search purposes, e.g. image, text, and music crawlers.
Access records of search engines must not take part in clustering, otherwise they would interfere with the result. When the clustering data are prepared, the access records of search-engine crawlers are filtered out.
B. Artificial interference
A normal person visits a website through a browser such as IE or Firefox, and the visit is a gradual, gentle process; visits by viruses or hackers are carried out by programs rather than browsers and are continuous, rapid, and long-lasting. Such access records can be identified by technical means.
The filtering module filters out access records whose single-visit duration exceeds the 2-hour limit set in parameter setting step 101, and records whose single-visit PV density exceeds the limit of 10 per minute set in parameter setting step 101;
After filtering, the data are stored in filtered access record database 104;
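The filtering in step (2) can be sketched as follows. This is only a minimal illustration under stated assumptions: the `Visit` record layout, the function names, and the rule that treats visits shorter than one minute as one minute are hypothetical and not part of the patent text; the crawler markers and the example thresholds (2 hours, 10 PV per minute) come from the description above.

```python
from dataclasses import dataclass

# Crawler markers named in the description; a real list would be longer.
CRAWLER_MARKERS = ("baiduspider", "msnbot", "googlebot")


@dataclass
class Visit:
    """One access record (hypothetical layout)."""
    user_id: str
    user_agent: str    # browser marker from the access log
    duration_s: float  # length of this single visit, in seconds
    page_views: int    # pages viewed during this visit


def is_normal(visit, max_duration_s=2 * 3600, max_pv_per_min=10.0):
    """Return True if the visit should take part in the statistics."""
    agent = visit.user_agent.lower()
    if any(marker in agent for marker in CRAWLER_MARKERS):
        return False  # search-engine crawler record
    if visit.duration_s > max_duration_s:
        return False  # single visit longer than the duration threshold
    # Assumption: visits shorter than one minute count as one minute,
    # so a single quick page view is not mistaken for a dense burst.
    minutes = max(visit.duration_s / 60.0, 1.0)
    if visit.page_views / minutes > max_pv_per_min:
        return False  # access density above the threshold
    return True


def filter_visits(visits):
    """Keep only the visits that pass all three filters."""
    return [v for v in visits if is_normal(v)]
```

Both thresholds are keyword parameters, so the values chosen in parameter setting step 101 can be passed in without changing the function.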
(3) Statistics and sorting module 105 reads the records in filtered access record database 104 and computes the values $p_i$, $s_i$, $t_i$ of each user
A. Calculating $p_i$
Count the total page views (PV) of user $U_i$ in the specified period, i.e. the value $p_i$ in the P direction;
B. Calculating $s_i$
Count the number of visits (Session) of user $U_i$ in the specified period, i.e. the value $s_i$ in the S direction;
C. Calculating $t_i$
Count the total visit duration (Time) of user $U_i$ in the specified period, i.e. the value $t_i$ in the T direction;
D. Sorting
The $p_i$, $s_i$, $t_i$ sequences obtained above are sorted from small to large, and the sorted $U_i$ table 106 is generated;
E. Forming the spatial point coordinates
The coordinates $U_i(p_i, s_i, t_i)$ of all users as vector points in the three-dimensional P, S, T space are obtained;
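Sub-steps A to E can be sketched as a single aggregation pass followed by a per-dimension sort. The input layout (tuples of user id, page views, and duration per visit) and the function names are assumptions made for illustration only.

```python
from collections import defaultdict


def user_coordinates(visits):
    """Map each user id to (p, s, t).

    `visits` is an iterable of (user_id, page_views, duration_s) tuples,
    one per filtered visit (hypothetical layout).
    """
    totals = defaultdict(lambda: [0, 0, 0.0])
    for user_id, page_views, duration_s in visits:
        acc = totals[user_id]
        acc[0] += page_views  # p: total page views in the period
        acc[1] += 1           # s: number of visits (sessions)
        acc[2] += duration_s  # t: total visit duration
    return {uid: tuple(acc) for uid, acc in totals.items()}


def sorted_columns(coords):
    """Return the P, S, and T value sequences, each sorted from small to large."""
    points = list(coords.values())
    return tuple(sorted(pt[k] for pt in points) for k in range(3))
```

Each dimension is sorted independently, matching sub-step D, while the dictionary returned by `user_coordinates` keeps the per-user points needed for sub-step E.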
(4) Mean difference calculation step 107 reads the sorted $p_i$, $s_i$, $t_i$ values from $U_i$ table 106,
Let: $\bar{p}$ be the mean difference of PV;
$\bar{s}$ be the mean difference of the number of visits;
$\bar{t}$ be the mean difference of the total time spent on the website;
$$\bar{p} = \frac{1}{n-1}\sum_{i=1}^{n-1}\left(p_{i+1} - p_i\right)$$
The formula above means: in the sorted PV sequence, each user's page-view count $p_{i+1}$ minus the preceding user's count $p_i$ is accumulated, and the sum is divided by the number of users minus 1, giving the mean PV difference $\bar{p}$ between users.
$$\bar{s} = \frac{1}{n-1}\sum_{i=1}^{n-1}\left(s_{i+1} - s_i\right)$$
The formula above means: in the sorted sequence of visit counts, each user's visit count $s_{i+1}$ minus the preceding user's $s_i$ is accumulated, and the sum is divided by the number of users minus 1, giving the mean difference $\bar{s}$ of the number of visits between users.
$$\bar{t} = \frac{1}{n-1}\sum_{i=1}^{n-1}\left(t_{i+1} - t_i\right)$$
The formula above means: in the sorted sequence of visit durations, each user's duration $t_{i+1}$ minus the preceding user's $t_i$ is accumulated, and the sum is divided by the number of users minus 1, giving the mean difference $\bar{t}$ of the time spent on the website between users.
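The mean difference described in step (4) can be sketched in a few lines. The function name is hypothetical; the computation follows the verbal description above (accumulate the differences between consecutive sorted values, then divide by the number of users minus 1).

```python
def mean_adjacent_difference(values):
    """Mean of the differences between consecutive values after sorting."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two users are needed")
    total = sum(ordered[i + 1] - ordered[i] for i in range(n - 1))
    return total / (n - 1)
```

Because the sum of consecutive differences telescopes, the result always equals (largest value minus smallest value) divided by (n minus 1), so the same number could be obtained without a full pass over the differences.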
(5) Step 108 determines the center $R_j$ of each user classification grade in the three dimensions P, T, S
j is a class number and is an integer;
When j = 1:
When 1 < j < R:
(6) Determining the classification grade of each user:
Taking the minimum of $|U_i - R_j|$, i.e. $\min\{|U_i - R_j|\}$, determines which class center the user is nearest to;
User $U_i$ is assigned to user group $R_j$; the user grade division is complete, and the classification results are stored in classification database 110.
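The assignment in step (6) can be sketched as a nearest-center search using the sum of absolute coordinate differences as the distance. In this sketch the class centers are simply passed in as a list; how they are computed in step (5) is not shown, and the function names and data layout are assumptions for illustration.

```python
def l1_distance(point, center):
    """Sum of the absolute differences of the P, S, and T coordinates."""
    return sum(abs(a - b) for a, b in zip(point, center))


def assign_grades(users, centers):
    """Assign each user to the class whose center is nearest.

    users:   {user_id: (p, s, t)}
    centers: list of (p, s, t) tuples; index 0 holds the center of class j = 1
             (hypothetical input, supplied by the caller).
    Returns {user_id: j} with 1 <= j <= R.
    """
    grades = {}
    for uid, point in users.items():
        distances = [l1_distance(point, c) for c in centers]
        grades[uid] = distances.index(min(distances)) + 1
    return grades
```

Each user is compared with all R centers exactly once, which reflects the single-pass character of the method: no center is updated after the assignment.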
Claims (3)
1. A website user rank division method, characterized in that: access records are first filtered; for each user, the number of pages viewed P, the number of visits S, and the total visit duration T within a given period are counted, forming a three-dimensional P, S, T space; the users are then clustered, and user grades are divided accordingly.
2. The website user rank division method of claim 1, characterized in that the "access record filtering" comprises:
(1) Browser filtering:
The browser field of each access record is examined; records that do not come from a conventional browser such as IE or Firefox and that carry a search-engine crawler marker are excluded from the statistics;
(2) Access duration and access density filtering:
Access duration is the length of a single visit by a user to the website;
Access density is the number of pages viewed per unit time;
Thresholds are set for access duration and access density, and access records that exceed the thresholds do not take part in the statistics.
3. The website user rank division method of claim 1, characterized in that the "clustering" calculation proceeds as follows:
(1) Three-dimensional sorting
The values $U_i(p_i, s_i, t_i)$ of all users are sorted from large to small in each of the three directions P, S, T;
(2) Computing the mean difference between users in each of the three dimensions
For the sorted sequences, the mean difference between users is computed separately for each of the three dimensions P, S, T;
Let: $\bar{p}$ be the mean difference of page views (PV);
$\bar{s}$ be the mean difference of the number of visits;
$\bar{t}$ be the mean difference of the total time spent on the website;
(3) Determining the center of each user classification grade in the three dimensions P, T, S
R is the number of classes to be divided, which is set in advance;
j is the index of a class;
j is a class number and is an integer;
When calculating the center of the first class, i.e. when j = 1:
The centers of the other classes are calculated as follows:
When 1 < j < R:
(4) Determining the classification grade of each user
The three-dimensional coordinates $U_i(p_i, s_i, t_i)$ of each user are compared with the coordinates of each class center by the sum of the absolute values of their differences,
Taking the minimum of $|U_i - R_j|$, i.e. $\min\{|U_i - R_j|\}$, determines which class center the user is nearest to;
User $U_i$ is assigned to user group $R_j$, and the user grade division is complete.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2009100102926A CN101477552A (en) | 2009-02-03 | 2009-02-03 | Website user rank division method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101477552A true CN101477552A (en) | 2009-07-08 |
Family
ID=40838268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2009100102926A Pending CN101477552A (en) | 2009-02-03 | 2009-02-03 | Website user rank division method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101477552A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101887390A (en) * | 2010-06-23 | 2010-11-17 | 宇龙计算机通信科技(深圳)有限公司 | Method and device for evaluating rating of application software |
CN102929938A (en) * | 2012-09-28 | 2013-02-13 | 北京奇艺世纪科技有限公司 | Playable network resource ordering method and device |
CN103577535A (en) * | 2013-09-02 | 2014-02-12 | 西安交通大学 | Method for objective evaluation of e-learning user experience quality |
CN103605714A (en) * | 2013-11-14 | 2014-02-26 | 北京国双科技有限公司 | Method and device for identifying abnormal data of websites |
CN104156466A (en) * | 2014-08-22 | 2014-11-19 | 北京京东尚科信息技术有限公司 | Grade-based method and device for allocating resources |
CN104765776A (en) * | 2015-03-18 | 2015-07-08 | 华为技术有限公司 | Data sample clustering method and device |
CN104992182A (en) * | 2015-06-29 | 2015-10-21 | 北京京东尚科信息技术有限公司 | Method and device for determining user level |
CN106210044A (en) * | 2016-07-11 | 2016-12-07 | 焦点科技股份有限公司 | A kind of any active ues recognition methods based on the behavior of access |
CN107306252A (en) * | 2016-04-21 | 2017-10-31 | 中国移动通信集团河北有限公司 | A kind of data analysing method and system |
WO2018006631A1 (en) * | 2016-07-08 | 2018-01-11 | 武汉斗鱼网络科技有限公司 | User level automatic segmentation method and system |
CN109145934A (en) * | 2017-12-22 | 2019-01-04 | 北京数安鑫云信息技术有限公司 | User behavior data processing method, medium, equipment and device based on log |
CN111966951A (en) * | 2020-07-06 | 2020-11-20 | 东南数字经济发展研究院 | User group hierarchy dividing method based on social e-commerce transaction data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20090708 |