CN102122291A - Blog friend recommendation method based on tree log pattern analysis - Google Patents
- Publication number
- CN102122291A
- Authority
- CN
- China
- Prior art keywords
- blog
- tree
- node
- frequent
- access
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a blog friend recommendation method based on tree-shaped log pattern analysis. Using offline mining, the method parses server logs to extract visitors' access records for blog pages, then applies grouping, sorting, and loop removal to construct access log trees rooted at the target blog (the blog to be recommended). Frequent-pattern mining is performed on these access log trees to find the frequent subtrees that meet preset requirements. The nodes of the frequent subtrees are taken as candidate blog friends, a recommendation degree is computed for each according to a set formula, and the highest-scoring candidates are recommended. Unlike traditional algorithms based on frequent itemset mining or frequent sequence mining, the algorithm uses frequent tree-structure mining to address the parallel link relationships and indirect access characteristics peculiar to the blogosphere. It thereby uncovers potential access relationships between blogs and recommends them to visiting users, which improves the user experience. It is an efficient and practical blog recommendation method.
Description
Technical Field
The invention relates to data analysis of blog server logs and to the mining of frequent access patterns, and in particular to a blog friend recommendation method based on tree-shaped log pattern analysis.
Background Art
With the continuing development of Internet technology, a blog is no longer merely a platform for publishing personal articles and information. Once interactive features such as comments, following, and friends are added, users gradually form a blogosphere. A blogosphere contains friends, potential friends (blogs not yet on the friend list, or friends of friends), other like-minded blogs, and so on. In a typical Web 2.0 application such as a blog, building social relationships among like-minded users is key to the success of the system, so blog-oriented friend recommendation has become a core function of blog systems. A blog friend recommendation application discovers potential associations between blog users from their blog-visiting behavior and, based on those associations, suggests that a blog turn people likely to share its interests into friends.
The blogosphere is a complex tree or graph structure, and some blog-oriented friend recommendation systems already exist. They generally make recommendations based on friend relationships already established between blogs and on visit counts recorded by the server. These methods rely on frequent itemset mining or frequent sequence mining and have the following shortcomings: 1) they ignore the parallel link relationships and indirect access characteristics peculiar to blogs; 2) they ignore the logical relationships between blog pages that are implied by the order in which users visit pages; 3) they do not fully account for the hierarchy and depth relationships of the site's organizational structure.
Summary of the Invention
Blog server logs implicitly carry rich information about user behavior and page organization. The object of the invention is to provide a blog friend recommendation method based on tree-shaped log pattern analysis, that is, a blog recommendation method that works on blog logs and is based on tree-structure mining.
The technical solution adopted by the invention to solve its technical problem is as follows.
The method comprises the following steps:
1) Parse the raw logs, extract the valid information, and create a session table in the database to record each user's access path;
2) For the target blog (the blog to be recommended), find in the database the users who have visited it, and from each user's access log, with loops removed, build an access log tree rooted at the target blog;
3) Perform frequent recursive unordered tree mining on the constructed access log trees to find the frequent subtrees that meet the preset requirements;
4) Take the nodes of the frequent subtrees as candidate blog friends, compute a recommendation degree for each according to the set formula, and recommend the highest-scoring candidates.
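2. The blog friend recommendation method based on tree-shaped log pattern analysis according to claim 1, characterized in that: in step 1), parsing the raw logs and extracting the valid information means using a log parser to extract the logs on the server, obtaining the access records within one time slice, removing redundant information from the user requests, and converting each record into an access triple <visitor, access time, visited blog> that is stored in the session table. The size of the time slice is chosen according to the blog traffic and the performance of the computer running the mining algorithm. A visitor who is a registered user is identified by user name; a visitor who is an anonymous user is identified by IP address.
3. The blog friend recommendation method based on tree-shaped log pattern analysis according to claim 1, characterized in that: in step 2), finding the users who have visited the target blog and building the access log tree rooted at it means the following. Based on the site's organizational structure information, the session table is searched for the users who have visited the target blog and for the time of each user's first visit to it. For each visitor found, the records of the other blogs that the visitor accessed after visiting the target blog are extracted. The tree structure generator builds one access log tree per visitor. Each blog the visitor accessed corresponds to one node, and each node holds the access triple information. Parent-child relationships are formed according to the chronological order of consecutive access requests. When a loop arises, the edge with the latest access time is deleted. The resulting access log trees have three properties: first, they all share the same root node, namely the target blog; second, no access log tree contains sibling nodes with the same label; third, the access log trees are unordered, i.e. the children of each node are unordered.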
4. The blog friend recommendation method based on tree-shaped log pattern analysis according to claim 1, characterized in that: in step 3), performing frequent recursive unordered tree mining on the constructed access log trees to find the frequent subtrees that meet the preset requirements means denoting the access log trees t1, t2, …, tn, choosing a suitable minimum support minsup ∈ (0,1), and mining with the frequent subtree miner in the following steps:
Step 1: Traverse t1, t2, …, tn, treating nodes whose access triples have the same visited blog as the same node, and count the number of occurrences fre1 of each kind of node across the access log trees. Nodes with fre1 > minsup*n are recorded as frequent subtrees EQ1;
Step 2: Extend EQ1 by joining two nodes from EQ1 in a parent-child relationship to form a two-node tree, which serves as a candidate subtree. Count the occurrences fre2 of each candidate subtree across all access log trees. Candidate subtrees with fre2 > minsup*n are recorded as frequent subtrees EQ2;
Step 3: Starting from EQ2, enumerate extensions along the rightmost path of each tree, adding one node at a time, to find all possible candidate subtrees. Trees with occurrence count frei > minsup*n are recorded as new frequent subtrees EQi. Repeat this recursively, growing the node count of the mined frequent subtrees, until no qualifying candidate subtree remains.
5. The blog friend recommendation method based on tree-shaped log pattern analysis according to claim 1, characterized in that: in step 4), taking the nodes of the mined frequent subtrees as candidate blog friends, computing the recommendation degree, and recommending the highest-scoring candidates means the following. The frequent subtrees with more than 3 nodes are sorted by frequency fre in descending order and processed one at a time: in breadth-first order, starting from layer 2 of the tree, the recommendation degree R of each node is computed by the set formula.
Parameters: fre is the frequency of the frequent subtree; T indicates whether a direct page link exists (T is 1 if it does and 0 if it does not); d is the depth of the node, with the root at depth 0; W_k is the weight parameter of each layer, 1 by default; B_k is the number of branches at each layer, i.e. the number of sibling nodes under the same parent. Once the recommendation degree of every candidate node has been computed, the highest-scoring nodes are selected as needed, and the blogs corresponding to those nodes are recommended as blog friends.
The beneficial effects of the invention are as follows.
Based on visitors' blog-visiting behavior and the structural characteristics of blog sites, and building on existing data mining techniques, the invention mines tree-structured frequent access patterns from server access logs. A blog service provider can use the mined patterns to study users' access behavior, recommend blog friends to users, and improve the user experience. The patterns can also help site architects organize the site better and raise the rate at which users visit blogs.
Brief Description of the Drawings
Figure 1 is the overall structure diagram of the blog friend recommendation method based on tree-shaped log pattern analysis.
Figure 2 shows access sessions and their index.
Figure 3 is the access log tree built from the session of visitor1 in Figure 2.
Figure 4 is a schematic diagram of the frequent subtree mining method.
Figure 5 is a schematic diagram of the recommendation degree calculation.
Detailed Description
The invention is further described below with reference to specific examples and the accompanying drawings.
The blog log analysis method provided by the invention extracts frequent access patterns quickly and effectively and, through an automated screening process, recommends potential blog friends to visiting users. The overall structure is shown in Figure 1. The implementation steps are as follows:
1) The log parser in Figure 1 parses the server logs for a given time period, removes redundant information, and builds access triples <visitor, access time, visited blog> (triple<visitor, access_time, blog_url>). Taking Apache server logs as an example, the process is as follows:
A log entry recorded by the Apache server can take the following form:
117.24.255.86 - - [01/Jul/2010:18:01:25 +0800] "GET http://A.blog.163.com HTTP/1.0" 200 1231 "117.24.255.230.1277794615926482" 46807 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"
This entry shows that an anonymous user with IP 117.24.255.86 visited the page http://A.blog.163.com at 01/Jul/2010:18:01:25 +0800.
The access triple <117.24.255.86, 2010-7-1 18:01:25, blogA> can therefore be built. A visitor who is a registered user is identified by registration ID. An anonymous user is distinguished by IP address, from which a temporary ID is created. To simplify later processing, the URL of the visited page can be shortened using the site's organizational structure information; here http://A.blog.163.com is shortened to blogA. The short name and the URL must correspond one-to-one.
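The parsing step above can be sketched in Python as follows. This is a minimal illustration, not the patent's log parser: the regular expression covers only the fields used here, the function names `parse_line` and `simplify_url` are hypothetical, and treating the third log field as the registered user name is an assumption borrowed from the usual Apache access log layout.

```python
import re
from datetime import datetime

# Matches only the fields needed here: client IP, user field, timestamp, URL.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?:GET|POST|HEAD) (?P<url>\S+)'
)

def simplify_url(url):
    """Map http://A.blog.163.com to blogA (hypothetical one-to-one rule)."""
    match = re.match(r"^https?://([^./]+)\.blog\.163\.com", url)
    return "blog" + match.group(1) if match else url

def parse_line(line):
    """Return (visitor, access_time, blog), or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    # Assumption: "-" in the user field means an anonymous visitor,
    # who is then identified by IP address.
    visitor = m.group("user") if m.group("user") != "-" else m.group("ip")
    access_time = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return visitor, access_time, simplify_url(m.group("url"))

line = ('117.24.255.86 - - [01/Jul/2010:18:01:25 +0800] '
        '"GET http://A.blog.163.com HTTP/1.0" 200 1231')
print(parse_line(line))
```

Lines that do not match the pattern return None, so malformed entries can be dropped before the triples are written to the session table.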
2) For the target blog (the blog to be recommended), the tree structure generator in Figure 1 finds in the database the users who have visited it and, from each user's access log with loops removed, builds an access log tree rooted at the target blog. The steps are as follows:
Step 1: Based on the site's organizational structure information, for a given target blog blogA, the grouping and sorting module of the tree structure generator searches the session table for every user who has visited blogA and for the time of that user's first visit to blogA (SQL query: SELECT visitor, MIN(access_time) FROM triple WHERE blog_url = 'blogA' GROUP BY visitor). Suppose user visitor0 first visited blogA at access_time0. All pages that visitor0 accessed after access_time0 are then retrieved (SQL query: SELECT visitor, access_time, blog_url FROM triple WHERE visitor = 'visitor0' AND access_time > access_time0). The same operation is performed for every user returned by the first query. This yields the user session information shown in Figure 2.
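The two lookups in this step can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table layout follows the triple<visitor, access_time, blog_url> definition, timestamps are stored as ISO-formatted text so that string comparison matches time order, and the sample rows and the helper names `first_visits` and `session_after` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (visitor TEXT, access_time TEXT, blog_url TEXT)")
conn.executemany(
    "INSERT INTO triple VALUES (?, ?, ?)",
    [
        ("visitor0", "2010-07-01 18:00:00", "blogC"),  # before the first visit to blogA
        ("visitor0", "2010-07-01 18:01:25", "blogA"),
        ("visitor0", "2010-07-01 18:02:10", "blogB"),
        ("visitor0", "2010-07-01 18:05:00", "blogA"),
        ("visitor1", "2010-07-01 19:00:00", "blogD"),  # never visits blogA
    ],
)

def first_visits(conn, target):
    """Map each visitor of `target` to the time of their first visit."""
    rows = conn.execute(
        "SELECT visitor, MIN(access_time) FROM triple "
        "WHERE blog_url = ? GROUP BY visitor",
        (target,),
    )
    return dict(rows.fetchall())

def session_after(conn, visitor, since):
    """Return the visitor's accesses after `since`, in time order."""
    rows = conn.execute(
        "SELECT access_time, blog_url FROM triple "
        "WHERE visitor = ? AND access_time > ? ORDER BY access_time",
        (visitor, since),
    )
    return rows.fetchall()

firsts = first_visits(conn, "blogA")
sessions = {v: session_after(conn, v, t) for v, t in firsts.items()}
print(sessions)
```

Using MIN with GROUP BY returns exactly one first-visit time per visitor, and the ORDER BY clause gives the time-sorted session that the tree construction in the next step relies on.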
Step 2: From the records returned by the queries, one access log tree is built per visitor. Parent-child relationships are formed according to the chronological order of consecutive access requests. When a loop arises, the loop removal module of the tree structure generator eliminates it by deleting the edge with the latest time on the loop. The access log tree generated from the session of visitor1 in Figure 2 is shown in Figure 3. The resulting access log trees have three properties: first, they all share the same root node, namely the target blog; second, no access log tree contains two nodes with the same label; third, the access log trees are unordered, i.e. the order of sibling nodes is not considered.
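One possible reading of this construction is sketched below in Python. It assumes that each new blog becomes a child of the blog visited immediately before it, and that a visit to a blog already in the tree would close a loop, so that latest edge is dropped and only the current position moves. The function name `build_access_tree` and the sample session are hypothetical.

```python
def build_access_tree(root, visits):
    """Build {parent: [children]} from a time-ordered list of visited blogs.

    `visits` lists the blogs accessed after the first visit to `root`.
    """
    children = {root: []}
    current = root
    for blog in visits:
        if blog == current:
            continue  # repeated request for the same page, no new edge
        if blog not in children:
            # First time this blog is seen: attach it under the previous page.
            children[blog] = []
            children[current].append(blog)
        # Otherwise the edge current -> blog would close a loop. It is the
        # latest edge on that loop, so it is dropped.
        current = blog
    return children

tree = build_access_tree(
    "blogA", ["blogB", "blogC", "blogB", "blogD", "blogA", "blogE"]
)
print(tree)
```

In this sample the return visits to blogB and blogA add no edges, so every label appears once and the tree stays rooted at blogA, which matches the three properties listed above.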
3) The frequent subtree miner in Figure 1 performs frequent recursive unordered tree mining on the access log trees built in the previous step to find the frequent subtrees that meet the preset requirements. The steps are as follows. The tree structure generator numbers the access log trees from the previous step t1, t2, …, tn.
Step 1: The candidate subtree generation module of the tree structure generator traverses all access log trees and treats nodes whose access triples have the same visited blog as the same node. The subtree frequency statistics module records where each kind of node appears in the access log trees and the total number fre1 (the frequency) of trees that contain it. Nodes with fre1 > minsup*n are recorded as frequent subtrees EQ1;
Step 2: The nodes in EQ1 are joined pairwise in a parent-child relationship to form candidate frequent subtrees, and the number of occurrences fre2 of each candidate across all logs is counted. The procedure is illustrated in Figure 4: nodes A and B both belong to EQ1, and joining them makes A the parent of B. The position in the original tree of the most recently added node (node B in Figure 4) is recorded at the same time. Candidate subtrees with fre2 > minsup*n are recorded as frequent subtrees EQ2.
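The first two mining steps can be sketched in Python as follows. The sketch makes two assumptions: trees are stored as {parent: [children]} dictionaries with unique labels, and a two-node candidate counts as occurring in a tree only when the parent-child edge is present directly. Instead of generating every pair and then counting, it counts the edges that actually occur, which yields the same frequent pairs under that assumption. The names `frequent_nodes` and `frequent_pairs` and the sample trees are hypothetical.

```python
from collections import Counter

def frequent_nodes(trees, minsup):
    """Labels contained in more than minsup * n of the trees (EQ1)."""
    counts = Counter()
    for tree in trees:
        counts.update(set(tree))  # a tree counts each label at most once
    threshold = minsup * len(trees)
    return {label: c for label, c in counts.items() if c > threshold}

def frequent_pairs(trees, eq1, minsup):
    """Parent-child pairs over EQ1 labels with fre2 > minsup * n (EQ2)."""
    counts = Counter()
    for tree in trees:
        edges = {(p, c) for p, kids in tree.items() for c in kids}
        counts.update(e for e in edges if e[0] in eq1 and e[1] in eq1)
    threshold = minsup * len(trees)
    return {edge: c for edge, c in counts.items() if c > threshold}

t1 = {"A": ["B", "E"], "B": ["C", "D"], "C": [], "D": [], "E": []}
t2 = {"A": ["B"], "B": ["C"], "C": []}
t3 = {"A": ["E"], "E": []}
trees = [t1, t2, t3]

eq1 = frequent_nodes(trees, minsup=0.5)
eq2 = frequent_pairs(trees, eq1, minsup=0.5)
print(eq1)
print(eq2)
```

With three trees and minsup = 0.5 the threshold is 1.5, so node D, which occurs in only one tree, is pruned in the first step and never appears in a two-node candidate.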
Step 3: Starting from EQ2, extensions are enumerated along the rightmost path of each tree, adding one node at a time, to find all possible candidate subtrees. Trees with occurrence count frei > minsup*n are recorded as new frequent subtrees EQi. As shown in Figure 4, node A is first extended along the rightmost path, which adds a new node B. The original node B can also be extended, but only one node may be added at a time. This is repeated recursively, growing the node count of the mined frequent subtrees, until no qualifying candidate frequent subtree remains. To make trees easy to record during mining, each tree is given a string encoding. For example, in Figure 4 tree t1 is encoded as ABC-1BD-1E-1-1B and tree t2 as ABC-1DE-1-1-1B. The encoding follows depth-first traversal order and inserts a -1 each time the traversal steps back. With this scheme, trees and string encodings correspond one-to-one.
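The depth-first encoding with -1 backtrack markers can be sketched in Python as below. The sketch assumes trees are {parent: [children]} dictionaries with unique labels, works on token lists so that multi-character labels stay unambiguous, and omits trailing -1 markers. The tree used here is illustrative and is not one of the trees in Figure 4; `encode` and `decode` are hypothetical names.

```python
def encode(tree, root):
    """Depth-first token list with "-1" marking each step back to the parent."""
    tokens = []

    def visit(node):
        tokens.append(node)
        for child in tree.get(node, []):
            visit(child)
            tokens.append("-1")

    visit(root)
    while tokens and tokens[-1] == "-1":
        tokens.pop()  # trailing backtracks carry no information
    return tokens

def decode(tokens):
    """Rebuild (root, {parent: [children]}) from a token list."""
    root = tokens[0]
    tree = {root: []}
    path = [root]  # nodes from the root down to the current position
    for tok in tokens[1:]:
        if tok == "-1":
            path.pop()
        else:
            tree[path[-1]].append(tok)
            tree[tok] = []
            path.append(tok)
    return root, tree

sample = {"A": ["B", "E"], "B": ["C", "D"], "C": [], "D": [], "E": []}
tokens = encode(sample, "A")
print("".join(tokens))
```

Because the access log trees are unordered, a practical miner would also need to fix a child order (for example, sorting children by label) before encoding, so that two equal unordered trees always produce the same string. That normalization is an assumption and is not shown here.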
4) After all frequent subtrees have been mined, the candidate node recommender in Figure 1 sorts them by frequency fre in descending order and processes each frequent subtree in turn as follows: in breadth-first order, starting from layer 2 of the tree, the node recommendation degree module computes the recommendation degree R of each node by the set formula.
Parameters: fre is the frequency of the frequent subtree; T indicates whether a direct page link exists (T is 1 if it does and 0 if it does not); d is the depth of the node, with the root at depth 0; W_k is the weight parameter of each layer, 1 by default; B_k is the number of branches at each layer, i.e. the number of sibling nodes under the same parent.
As shown in Figure 5, the frequent subtree ABC-1D-1-1B (string encoding) has been mined. This tree occurs in both t1 and t2, so its frequency fre is 100%. The recommendation degree R for this candidate subtree is computed as follows. Node A is in layer 1 of the tree and is skipped. For node B in layer 2, if the site structure has no direct link from A to B, then T = 0 and R_B = 0; if it has a direct link from A to B, then T = 1 and R_B = 1 × 1 × 1/2 = 0.5. For node C, if there is no direct link from node B to node C, then T = 0 and R_C = 0; if there is, then T = 1 and R_C = 1 × 1 × (1/2) × (1/3) ≈ 0.167. Node D is handled in the same way as C.
Once the recommendation degree of every candidate node has been computed, the blogs corresponding to the highest-scoring nodes are selected as needed and recommended as blog friends. For the nodes computed in Figure 5, assuming direct links exist in every case, nodes B and E both have a recommendation degree of 0.5, so the blogs corresponding to these two nodes are recommended first. Nodes C and D have a recommendation degree of 0.167, and their blogs are recommended next if more recommendations are needed.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011100204787A CN102122291A (en) | 2011-01-18 | 2011-01-18 | Blog friend recommendation method based on tree log pattern analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011100204787A CN102122291A (en) | 2011-01-18 | 2011-01-18 | Blog friend recommendation method based on tree log pattern analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102122291A (en) | 2011-07-13 |
Family
ID=44250851
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011100204787A Pending CN102122291A (en) | 2011-01-18 | 2011-01-18 | Blog friend recommendation method based on tree log pattern analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102122291A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20110713 |