CN102122291A

CN102122291A - Blog friend recommendation method based on tree log pattern analysis

Info

Publication number: CN102122291A
Application number: CN2011100204787A
Authority: CN
Inventors: 陈刚; 胡天磊; 寿黎但; 陈珂; 周健; 贝毅君
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2011-01-18
Filing date: 2011-01-18
Publication date: 2011-07-13

Abstract

The invention discloses a blog friend recommendation method based on tree-shaped log pattern analysis. Using offline mining, the method analyzes server logs to extract visitors' access records for blog pages, then builds access log trees rooted at the blog to be recommended through grouping, sorting, loop removal and related techniques. Frequent subtree mining is applied to the access log trees to find the frequent subtrees that meet preset requirements; the nodes of these subtrees serve as candidate blog friends, a recommendation degree is computed for each by a set formula, and the highest-scoring candidates are recommended. Unlike traditional algorithms based on frequent item mining or frequent sequence mining, the method targets the parallel link relationships and indirect access characteristics peculiar to blog circles, using frequent tree structure mining to uncover potential access links between blogs and recommend them to visiting users, which improves the user experience. It is an efficient and practical blog recommendation method.

Description

A blog friend recommendation method based on tree log pattern analysis

Technical field

The invention relates to data analysis of blog server logs and to the mining of frequent access patterns, and in particular to a blog friend recommendation method based on tree log pattern analysis.

Background art

With the continuous development of Internet technology, blogs are no longer merely a platform for publishing personal articles and information. With the addition of interactive functions such as comments, following and friends, users gradually form a blog circle. A blog circle includes friends, potential friends (blogs not yet on the friend list, or friends of friends), other like-minded blogs, and so on. In a typical Web 2.0 application such as a blog, establishing social relationships among like-minded users is key to the system's success, so blog-oriented friend recommendation has become a core function of blog systems. A blog friend recommendation application uses visitors' access behavior to discover potential associations between blog users, and suggests that a blog turn people who are likely to share its interests into friends on the basis of those associations.

A blog circle is a complex tree or graph structure. Some blog-oriented friend recommendation systems already exist. They generally base recommendations on established friend relationships between blogs and on visit counts recorded by the server. These methods rely on frequent item mining or frequent sequence mining and have the following shortcomings: 1) they do not consider the parallel link relationships and indirect access characteristics peculiar to blogs; 2) they do not consider the logical relationships between blog pages implied by the order in which users visit pages; 3) they do not fully consider the hierarchy and depth of the website's organizational structure.

Summary of the invention

In view of the rich user behavior information and page organization information implicit in blog server logs, the object of the present invention is to provide a blog friend recommendation method based on tree log pattern analysis, that is, a blog recommendation method that works on blog logs and is based on tree structure mining.

The technical solution adopted by the present invention to solve its technical problem is as follows.

The method comprises the following steps:

1) Parse the raw logs, extract the valid information, and create a session table in the database to record users' access paths;

2) For the blog to be recommended, find in the database the users who have visited it and, from those users' access logs, remove loops and build access log trees rooted at the blog to be recommended;

3) Perform frequent recursive unordered tree mining on the constructed access log trees to find the frequent subtrees that meet preset requirements;

4) Take the nodes of the frequent subtrees as candidate blog friends, compute their recommendation degree by the set formula, and recommend the several with the highest scores.

2. The blog friend recommendation method based on tree log pattern analysis according to claim 1, characterized in that: in step 1), parsing the raw logs and extracting the valid information means using a log parser to read the logs on the server, obtaining the access records within one time slice, removing redundant information from the user requests, and converting each record into an access triple <visitor, access time, visited blog> stored in the session table. The time slice size is chosen according to the blog's traffic and the performance of the computer running the mining algorithm. A registered user is identified as "visitor" by user name; an anonymous user is identified as "visitor" by IP address.

3. The blog friend recommendation method based on tree log pattern analysis according to claim 1, characterized in that: in step 2), finding the users who have visited the blog to be recommended, removing loops and building the access log trees rooted at that blog means the following. Using the website's organizational structure information, look up in the session table the users who have visited the blog to be recommended and the time of each user's first visit to it; for each visitor found, extract the records of the other blogs that visitor accessed after visiting the blog to be recommended. The tree structure generator builds one access log tree per visitor: each blog the visitor accessed corresponds to one node, each node carries the access triple information, and parent-child relationships follow the chronological order of consecutive access requests. Where a loop arises, the edge with the latest access time is deleted. The resulting access log trees have three properties: first, they all share the same root node, namely the blog to be recommended; second, no access log tree contains sibling nodes with the same label; third, the trees are unordered, that is, the children of each node are unordered.

4. The blog friend recommendation method based on tree log pattern analysis according to claim 1, characterized in that: in step 3), performing frequent recursive unordered tree mining on the constructed access log trees to find the frequent subtrees that meet preset requirements means denoting the access log trees t1, t2, …, tn, choosing a suitable minimum support minsup ∈ (0,1), and mining with the frequent subtree miner in the following steps:

Step 1: traverse t1, t2, …, tn, treating nodes whose access triples have the same "visited blog" as the same node, and count the number of occurrences fre1 of each kind of node in the access log trees; nodes with fre1 > minsup*n are recorded as frequent subtrees EQ1;

Step 2: extend EQ1 by joining pairs of nodes in EQ1 into a parent-child relationship, forming two-node trees as candidate subtrees; count the number of occurrences fre2 of each candidate subtree across all access log trees; candidate subtrees with fre2 > minsup*n are recorded as frequent subtrees EQ2;

Step 3: starting from EQ2, enumerate extensions along the rightmost path of each tree, adding one node at a time, to find all possible candidate subtrees; trees whose occurrence count frei > minsup*n are recorded as new frequent subtrees EQi. The same operation is applied recursively, increasing the number of nodes in the mined frequent subtrees, until no candidate subtree qualifies.

5. The blog friend recommendation method based on tree log pattern analysis according to claim 1, characterized in that: in step 4), taking the nodes of the mined frequent subtrees as candidate blog friends, computing the recommendation degree by the set formula and recommending the several with the highest scores means the following. Sort the frequent subtrees with more than 3 nodes in descending order of occurrence frequency fre, then take each frequent subtree in turn and, traversing breadth-first from the second layer of the tree, compute the recommendation degree R of each node by the following formula:

[Formula image not reproduced in the text: the recommendation degree R as a function of fre, T, d, W_k and B_k]

Parameters: fre is the frequency of the frequent subtree; T indicates whether a direct page link exists (T = 1 if it does, T = 0 if not); d is the depth of the node, the root having depth 0; W_k is the weight parameter of each layer, 1 by default; B_k is the number of branches at each layer, that is, the number of sibling nodes under the same parent. Once the recommendation degree of every candidate node has been computed, the required number of highest-scoring nodes are selected and the blogs corresponding to them are recommended as blog friends.

The beneficial effects of the present invention are as follows:

Based on visitors' access behavior and the structural characteristics of blog sites, and combining existing data mining techniques, the method mines tree-structured frequent access patterns from the server's access logs. The blog service provider can use the mined patterns to study users' access behavior and recommend blog friends, improving the user experience; the patterns can also help website architects organize the site structure better and raise the blog's visit rate.

Description of the drawings

Figure 1 is the overall structure diagram of the blog friend recommendation method based on tree log pattern analysis.

Figure 2 shows the access sessions and their index.

Figure 3 is the access log tree built from the session of visitor1 in Figure 2.

Figure 4 is a schematic diagram of the frequent subtree mining method.

Figure 5 is a schematic diagram of the recommendation degree calculation method.

Detailed description of the embodiments

The present invention is further described below with reference to specific examples and the accompanying drawings.

The blog log analysis method provided by the invention extracts frequent access patterns quickly and effectively, and recommends potential blog friends to visiting users through an intelligent screening process. The overall structure is shown in Figure 1, and the implementation steps are as follows:

1) The log parser in Figure 1 parses the server logs for a given time period, removes redundant information, and builds access triples <visitor, access time, visited blog> (triple<visitor, access_time, blog_url>). Taking Apache server logs as an example, the process is as follows:

A log entry recorded by the Apache server can take the following form:

117.24.255.86 - - [01/Jul/2010:18:01:25 +0800] "GET http://B.blog.163.com HTTP/1.0" 200 1231 "117.24.255.230.1277794615926482" 46807 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"

This entry shows that the anonymous user with IP 117.24.255.86 visited the page http://A.blog.163.com at 01/Jul/2010:18:01:25 +0800.

The access triple <117.24.255.86, 2010-7-1 18:01:25, blogA> can therefore be built. If the visitor is a registered user, the registration ID identifies the visitor; if the visitor is anonymous, the IP address distinguishes visitors and a temporary ID is created. To simplify later processing, the URL of the visited page can be shortened using the website's organizational structure information; here http://A.blog.163.com is shortened to blogA. The full and shortened forms must correspond one-to-one.
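The parsing step can be sketched in Python as follows. This is a minimal illustration, not the patent's log parser: the regular expression covers only the leading fields of the entry shown above, the function names `parse_line` and `simplify_url` are hypothetical, the shortening rule (subdomain `A` becomes `blogA`) is assumed from the single example in the text, and treating the third log field as the registered user name is likewise an assumption.

```python
import re
from datetime import datetime

# Leading fields of the log entry: client IP, one placeholder field,
# user field, [timestamp], "METHOD URL PROTOCOL".
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*"'
)

def simplify_url(url):
    # Assumed shortening rule: http://A.blog.163.com -> blogA.
    match = re.match(r"^https?://([^./]+)\.blog\.", url)
    return "blog" + match.group(1) if match else url

def parse_line(line):
    """Return (visitor, access_time, blog), or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    # "-" in the user field is taken to mean an anonymous visitor,
    # in which case the client IP serves as the visitor identifier.
    visitor = m.group("user") if m.group("user") != "-" else m.group("ip")
    access_time = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return visitor, access_time, simplify_url(m.group("url"))

# Shortened entry using the URL named in the description above.
line = ('117.24.255.86 - - [01/Jul/2010:18:01:25 +0800] '
        '"GET http://A.blog.163.com HTTP/1.0" 200 1231')
triple = parse_line(line)
```

Lines that do not match the pattern return `None`, so redundant or malformed requests can be filtered out before the triples are written to the session table.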

2) For the blog to be recommended, the tree structure generator in Figure 1 finds in the database the users who have visited it and, from their access logs, removes loops and builds access log trees rooted at the blog to be recommended. The steps are as follows:

Step 1: using the website's organizational structure information, for a given blog to be recommended, blogA, the grouping and sorting module of the tree structure generator looks up in the session table every user who has visited blogA and the time of that user's first visit (SQL query: select visitor, min(access_time) from triple where access_url = blogA group by visitor). Suppose user visitor0 first visited blogA at access_time0; the module then retrieves all pages visitor0 accessed after access_time0, and does the same for every user returned by the first query (SQL query: select visitor, access_time, access_url from triple where visitor = visitor0 and access_time > access_time0). These operations yield the user session information shown in Figure 2.
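The two lookups can be tried out with Python's built-in `sqlite3` module. The sketch below is illustrative only: the table layout follows the column names used in the queries above, the sample rows are invented for the demonstration, and the function name `sessions_after_first_visit` is hypothetical. The first query uses `MIN(access_time)` with `GROUP BY visitor` to obtain each visitor's first visit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (visitor TEXT, access_time TEXT, access_url TEXT)")
# Invented sample rows; ISO timestamps compare correctly as text.
rows = [
    ("visitor1", "2010-07-01 18:01:25", "blogA"),
    ("visitor1", "2010-07-01 18:02:10", "blogB"),
    ("visitor1", "2010-07-01 18:05:00", "blogA"),
    ("visitor1", "2010-07-01 18:06:30", "blogC"),
    ("visitor2", "2010-07-01 17:50:00", "blogD"),
    ("visitor2", "2010-07-01 19:00:00", "blogA"),
    ("visitor2", "2010-07-01 19:03:00", "blogB"),
]
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", rows)

def sessions_after_first_visit(conn, target):
    """Map each visitor of `target` to the pages accessed after the first visit."""
    first_visits = conn.execute(
        "SELECT visitor, MIN(access_time) FROM triple "
        "WHERE access_url = ? GROUP BY visitor ORDER BY visitor",
        (target,),
    ).fetchall()
    sessions = {}
    for visitor, first_time in first_visits:
        sessions[visitor] = conn.execute(
            "SELECT access_time, access_url FROM triple "
            "WHERE visitor = ? AND access_time > ? ORDER BY access_time",
            (visitor, first_time),
        ).fetchall()
    return sessions

sessions = sessions_after_first_visit(conn, "blogA")
```

In this sample, visitor2's access to blogD is excluded because it precedes that visitor's first visit to blogA.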

Step 2: from the records returned by the queries, build one access log tree per visitor, with parent-child relationships following the chronological order of consecutive access requests. Where a loop arises, the loop removal module of the tree structure generator eliminates it by deleting the edge on the loop with the latest time. The access log tree produced from the session of visitor1 in Figure 2 is shown in Figure 3. The resulting access log trees have three properties: first, they all share the same root node, namely the blog to be recommended; second, no access log tree contains two nodes with the same label; third, the trees are unordered, that is, the order of sibling nodes is ignored.
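One possible reading of this construction is sketched below in Python. It rests on assumptions the text does not spell out: each access is attached as a child of the page visited immediately before it, and an access to a blog already in the tree is treated as the latest edge of a loop and dropped, with only the current position moving to that blog. The function name `build_access_tree` and the sample session are hypothetical.

```python
def build_access_tree(root, visits):
    """Build an unordered tree from one visitor's session.

    visits: blog labels in time order, all accessed after the first visit to root.
    Returns a dict mapping each node label to the set of its child labels.
    """
    children = {root: set()}
    current = root
    for blog in visits:
        if blog not in children:
            # First time this blog is seen: attach it under the previous page.
            children[current].add(blog)
            children[blog] = set()
        # Otherwise the edge current -> blog would close a loop. Being the
        # latest edge on that loop, it is dropped; only the position moves.
        current = blog
    return children

tree = build_access_tree(
    "blogA", ["blogB", "blogA", "blogC", "blogD", "blogC", "blogE"]
)
```

Storing children in sets gives the three properties listed above under this reading: the root is fixed, each label appears once, and siblings carry no order.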

3) The frequent subtree miner shown in Figure 1 performs frequent recursive unordered tree mining on the access log trees built in the previous step to find the frequent subtrees that meet preset requirements. The tree structure generator first numbers the access log trees from the previous step t1, t2, …, tn. The steps are as follows.

Step 1: the candidate subtree generation module in the tree structure generator traverses all access log trees, treating nodes whose access triples have the same "visited blog" as the same node. The subtree frequency counting module records where each kind of node occurs in the access log trees and the total number fre1 of trees containing it (its frequency). Nodes with fre1 > minsup*n are recorded as frequent subtrees EQ1;
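The counting in this step can be sketched as follows. The tree representation (a dict from node label to a set of child labels), the function name `frequent_nodes` and the three sample trees are assumptions made for illustration; the threshold test fre1 > minsup*n is taken from the text.

```python
from collections import Counter

def frequent_nodes(trees, minsup):
    """Return {label: fre1} for labels occurring in more than minsup*n trees.

    trees: list of dicts mapping node label -> set of child labels.
    """
    counts = Counter()
    for tree in trees:
        # Each tree contributes at most once per label.
        counts.update(set(tree))
    threshold = minsup * len(trees)
    return {label: fre for label, fre in counts.items() if fre > threshold}

# Invented sample trees, all rooted at A.
t1 = {"A": {"B", "E"}, "B": {"C", "D"}, "C": set(), "D": set(), "E": set()}
t2 = {"A": {"B"}, "B": {"C"}, "C": set()}
t3 = {"A": {"E"}, "E": set()}

eq1 = frequent_nodes([t1, t2, t3], minsup=0.5)
```

With n = 3 and minsup = 0.5 the threshold is 1.5, so node D, which occurs in only one tree, is discarded.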

Step 2: join the nodes in EQ1 pairwise into parent-child relationships to form candidate frequent subtrees, and count the number of times fre2 each candidate occurs across all logs. As shown in Figure 4, nodes A and B both belong to EQ1; joining them makes A the parent of B, and the position in the original tree of the most recently added node (node B in Figure 4) is recorded. Candidate subtrees with fre2 > minsup*n are recorded as frequent subtrees EQ2.
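A simplified sketch of the two-node step follows. Rather than generating every pair from EQ1 and then testing it, this version counts the parent-child edges that actually occur in the trees and keeps those whose endpoints are both in EQ1, which yields the same frequent pairs under the assumption that a candidate matches only a direct parent-child edge. The function name `frequent_pairs` and the sample data are hypothetical, and the position bookkeeping mentioned in the text is omitted.

```python
from collections import Counter

def frequent_pairs(trees, frequent_labels, minsup):
    """Return {(parent, child): fre2} for edges found in more than minsup*n trees."""
    counts = Counter()
    for tree in trees:
        edges = {
            (parent, child)
            for parent, kids in tree.items()
            for child in kids
            if parent in frequent_labels and child in frequent_labels
        }
        counts.update(edges)  # each tree counts an edge at most once
    threshold = minsup * len(trees)
    return {edge: fre for edge, fre in counts.items() if fre > threshold}

# Invented sample trees and the frequent labels assumed to come from step 1.
t1 = {"A": {"B", "E"}, "B": {"C", "D"}, "C": set(), "D": set(), "E": set()}
t2 = {"A": {"B"}, "B": {"C"}, "C": set()}
t3 = {"A": {"E"}, "E": set()}

eq2 = frequent_pairs([t1, t2, t3], {"A", "B", "C", "E"}, minsup=0.5)
```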

Step 3: starting from EQ2, enumerate extensions along the rightmost path of each tree, adding one node at a time, to find all possible candidate subtrees; trees with occurrence count frei > minsup*n are recorded as new frequent subtrees EQi. As shown in Figure 4, node A is first extended along the rightmost path to add a new node B; the original node B can also be extended, but only one node may be added at a time. The operation is repeated recursively, increasing the number of nodes in the mined frequent subtrees, until no candidate frequent subtree qualifies. To make trees easy to record during mining, each tree is encoded as a string. For example, in Figure 4 tree t1 is encoded as ABC-1BD-1E-1-1B and t2 as ABC-1DE-1-1-1B. The encoding follows depth-first traversal order and inserts a -1 at each backtracking step, so that trees and string encodings correspond one-to-one.
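The string encoding can be sketched as a pair of functions. Two details here are assumptions rather than statements from the text: children are visited in sorted label order so that an unordered tree always yields the same string, and trailing -1 tokens are dropped, as the example encodings above appear to do. The function names `encode` and `decode` and the sample tree are hypothetical, and the decoder relies on node labels being unique within a tree.

```python
def encode(tree, root):
    """Depth-first token list: a label on descent, "-1" on each backtrack."""
    tokens = []

    def visit(node):
        tokens.append(node)
        for child in sorted(tree.get(node, ())):
            visit(child)
            tokens.append("-1")

    visit(root)
    # Trailing backtracks carry no information and are omitted.
    while tokens and tokens[-1] == "-1":
        tokens.pop()
    return tokens

def decode(tokens):
    """Rebuild (root, {label: [children]}) from a token list."""
    root = tokens[0]
    tree = {root: []}
    path = [root]
    for token in tokens[1:]:
        if token == "-1":
            path.pop()
        else:
            tree[path[-1]].append(token)
            tree[token] = []
            path.append(token)
    return root, tree

sample = {"A": {"B", "E"}, "B": {"C", "D"}, "C": set(), "D": set(), "E": set()}
tokens = encode(sample, "A")
code = "".join(tokens)
root, rebuilt = decode(tokens)
```

Keeping the encoding as a token list rather than a raw string avoids ambiguity when blog labels are longer than one character.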

4) After all frequent subtrees have been mined, the candidate node recommender shown in Figure 1 sorts them in descending order of occurrence frequency fre and takes each in turn. Traversing breadth-first from the second layer of the tree, the node recommendation degree calculation module computes the recommendation degree R of each node by the set formula.

Parameters: fre is the frequency of the frequent subtree; T indicates whether a direct page link exists (T = 1 if it does, T = 0 if not); d is the depth of the node, the root having depth 0; W_k is the weight parameter of each layer, 1 by default; B_k is the number of branches at each layer, that is, the number of sibling nodes under the same parent.

As shown in Figure 5, the frequent subtree ABC-1D-1-1B (string encoding) has been mined. It occurs in both t1 and t2, so its frequency fre is 100%. The recommendation degree R for this candidate subtree is computed as follows. Node A is in the first layer of the tree and is skipped. For node B in the second layer, if the website structure has no direct link from A to B, then T = 0 and R_B = 0; if such a link exists, T = 1 and R_B = 1*1*1/2 = 0.5. For node C, if there is no direct link from node B to node C, then T = 0 and R_C = 0; if there is, T = 1 and R_C = 1*1*(1/2)*(1/3) = 0.167. Node D is handled in the same way as C.
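The arithmetic of this worked example can be reproduced with a short Python sketch. Since the formula itself is not reproduced in the text, the form used here, R = fre * T * (W_1/B_1) * … * (W_d/B_d), is an assumption chosen to agree with the parameter list and with the two values computed above; the function name `recommendation_degree` is hypothetical, and the branch counts 2 and 3 are simply the denominators that appear in the worked example.

```python
import math

def recommendation_degree(fre, has_direct_link, branch_counts, weights=None):
    """Assumed form: R = fre * T * product over layers k = 1..d of W_k / B_k.

    branch_counts[k-1] is B_k on the path from the root to the node;
    weights default to W_k = 1 for every layer.
    """
    if weights is None:
        weights = [1.0] * len(branch_counts)
    t = 1 if has_direct_link else 0
    return fre * t * math.prod(w / b for w, b in zip(weights, branch_counts))

# Node B at depth 1, with the denominator 2 from the worked example.
r_b = recommendation_degree(1.0, True, [2])
# Node C at depth 2, with the denominators 2 and 3 from the worked example.
r_c = recommendation_degree(1.0, True, [2, 3])
# Without a direct link, T = 0 and the degree vanishes.
r_none = recommendation_degree(1.0, False, [2, 3])
```

Under this assumed form, deeper nodes and nodes in wider layers score lower, which matches the ordering of B ahead of C and D in the example.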

Once the recommendation degree of every candidate node has been computed, the blogs corresponding to the required number of highest-scoring nodes are recommended as blog friends. For the nodes in Figure 5, assuming direct links exist throughout, nodes B and E both have a recommendation degree of 0.5, so their blogs are recommended first; nodes C and D have a recommendation degree of 0.167, and their blogs are recommended next if more recommendations are needed.

Claims

1. A blog friend recommendation method based on tree log pattern analysis, characterized in that the method comprises the following steps:

1) parsing the raw logs, extracting the valid information, and creating a session table in the database to record users' access paths;

2) for the blog to be recommended, finding in the database the users who have visited it and, from those users' access logs, removing loops and building access log trees rooted at the blog to be recommended;

3) performing frequent recursive unordered tree mining on the constructed access log trees to find the frequent subtrees that meet preset requirements;

4) taking the nodes of the frequent subtrees as candidate blog friends, computing their recommendation degree by the set formula, and recommending the several with the highest scores.

2. The blog friend recommendation method based on tree log pattern analysis according to claim 1, characterized in that: in step 1), parsing the raw logs and extracting the valid information means using a log parser to read the logs on the server, obtaining the access records within one time slice, removing redundant information from the user requests, and converting each record into an access triple <visitor, access time, visited blog> stored in the session table; the time slice size is chosen according to the blog's traffic and the performance of the computer running the mining algorithm; a registered user is identified as "visitor" by user name, and an anonymous user is identified as "visitor" by IP address.

3. The blog friend recommendation method based on tree log pattern analysis according to claim 1, characterized in that: in step 2), finding the users who have visited the blog to be recommended, removing loops and building the access log trees rooted at that blog means, using the website's organizational structure information, looking up in the session table the users who have visited the blog to be recommended and the time of each user's first visit to it, and, for each visitor found, extracting the records of the other blogs that visitor accessed after visiting the blog to be recommended; the tree structure generator builds one access log tree per visitor, each blog the visitor accessed corresponds to one node, each node carries the access triple information, and parent-child relationships follow the chronological order of consecutive access requests; where a loop arises, the edge with the latest access time is deleted; the resulting access log trees have three properties: first, they all share the same root node, namely the blog to be recommended; second, no access log tree contains sibling nodes with the same label; third, the trees are unordered, that is, the children of each node are unordered.

4. The blog friend recommendation method based on tree log pattern analysis according to claim 1, characterized in that: in step 3), performing frequent recursive unordered tree mining on the constructed access log trees to find the frequent subtrees that meet preset requirements means denoting the access log trees t1, t2, …, tn, choosing a suitable minimum support minsup ∈ (0,1), and mining with the frequent subtree miner in the following steps:

Step 1: traverse t1, t2, …, tn, treating nodes whose access triples have the same "visited blog" as the same node, and count the number of occurrences fre1 of each kind of node in the access log trees; nodes with fre1 > minsup*n are recorded as frequent subtrees EQ1;

Step 2: extend EQ1 by joining pairs of nodes in EQ1 into a parent-child relationship, forming two-node trees as candidate subtrees; count the number of occurrences fre2 of each candidate subtree across all access log trees; candidate subtrees with fre2 > minsup*n are recorded as frequent subtrees EQ2;

Step 3: starting from EQ2, enumerate extensions along the rightmost path of each tree, adding one node at a time, to find all possible candidate subtrees; trees whose occurrence count frei > minsup*n are recorded as new frequent subtrees EQi; the same operation is applied recursively, increasing the number of nodes in the mined frequent subtrees, until no candidate subtree qualifies.

5. The blog friend recommendation method based on tree log pattern analysis according to claim 1, characterized in that: in step 4), taking the nodes of the mined frequent subtrees as candidate blog friends, computing the recommendation degree by the set formula and recommending the several with the highest scores means sorting the frequent subtrees with more than 3 nodes in descending order of occurrence frequency fre, then taking each frequent subtree in turn and, traversing breadth-first from the second layer of the tree, computing the recommendation degree R of each node by the following formula:

[Formula image not reproduced in the text: the recommendation degree R as a function of fre, T, d, W_k and B_k]

Parameters: fre is the frequency of the frequent subtree; T indicates whether a direct page link exists (T = 1 if it does, T = 0 if not); d is the depth of the node, the root having depth 0; W_k is the weight parameter of each layer, 1 by default; B_k is the number of branches at each layer, that is, the number of sibling nodes under the same parent. Once the recommendation degree of every candidate node has been computed, the required number of highest-scoring nodes are selected and the blogs corresponding to them are recommended as blog friends.