CN106355095B

CN106355095B - Method for identifying fraud webpages using fuzzy theory

Info

Publication number: CN106355095B
Application number: CN201611046454.8A
Authority: CN
Inventors: 尚靖博; 左祥麟; 左万利; 王英
Original assignee: Jilin University
Current assignee: Jilin University
Priority date: 2016-11-23
Filing date: 2016-11-23
Publication date: 2018-10-19
Anticipated expiration: 2036-11-23
Also published as: CN106355095A

Abstract

The present invention discloses a method for identifying fraud webpages using fuzzy theory, and relates to a fraud webpage identification technique that does not depend on webpage features. The identification problem is solved through collaborative division of labour combined with fuzzy theory: the quality of a webpage is judged by different users, and the data set of user marks is then analysed by computer. This addresses the technical problem that existing fraud webpage identification methods depend heavily on the webpage itself. The technical solution is simple and effective, and has important practical value for future search engines.

Description

Method for identifying fraud webpages using fuzzy theory

Technical field

The present invention discloses a method for identifying fraud webpages using fuzzy theory. It relates to a fraud webpage identification technique that does not depend on webpage features, and belongs to the technical field of Internet security and services.

Background technology

Search engines have become an indispensable tool for Internet users, but, driven by profit, large numbers of fraud webpages are mixed into the Internet. Cheaters exploit search engine ranking strategies and use improper means to interfere with page ranking, obtaining a high rank out of proportion to the page's actual standing. This interferes with users' access to information and can even harm their interests. Such pages are called fraud webpages. The techniques cheaters use fall into four categories: content-based, link-based, cloaking-based, and redirection-based. Previous anti-fraud research identifies these four kinds of deception by relying heavily on the webpage itself, so its results remain valid only for a short time. Finding a fraud webpage identification method that does not depend on webpage features is a pressing problem.

Summary of the invention

The present invention provides a method for identifying fraud webpages using fuzzy theory, a fraud webpage identification method that does not depend on webpage features. It solves the problem that previous identification methods depend heavily on the webpage itself and that their results are valid only briefly.

The technical solution of the method for identifying fraud webpages using fuzzy theory according to the present invention comprises the following steps:

Step 1：

After browsing a webpage, the user evaluates it and assigns a user mark, one of: "non-fraud webpage F", "fraud webpage S", "equivocal B" or "not known U";

Step 2：

At the end of each month, the search engine downloads the data set of all user marks made in that month;

Step 3：

The data set is divided into several matrices M_i according to the number of different users who marked each webpage, where i = 1, 2, ..., n;
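The grouping in Step 3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_by_mark_count` and the numeric encoding in `MARK_VALUE` are assumptions, since the text does not say how the four marks are turned into numbers.

```python
from collections import defaultdict

# Hypothetical numeric encoding of the four marks; the source does not give one.
MARK_VALUE = {"F": 1.0, "B": 0.6, "U": 0.4, "S": 0.0}


def split_by_mark_count(records):
    """Group (page_id, [marks]) records by how many users marked the page.

    Returns {mark_count: (page_ids, matrix)}, where each matrix row holds
    the encoded marks of one page.
    """
    groups = defaultdict(lambda: ([], []))
    for page_id, marks in records:
        page_ids, rows = groups[len(marks)]
        page_ids.append(page_id)
        rows.append([MARK_VALUE[mark] for mark in marks])
    return dict(groups)


# Toy records for illustration only.
records = [(362, ["F", "U"]), (17, ["S", "S"]), (5, ["F", "F", "B"])]
matrices = split_by_mark_count(records)
print(matrices[2])
```

Each value in `matrices` plays the role of one matrix M_i, with the page ids kept alongside so that clusters can later be mapped back to webpages.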

Step 4：

Each matrix M_i, denoted N, is converted into a fuzzy similarity matrix R with elements R_ij, where i, j = 1, 2, ..., n and n ∈ R. The calculation formulas include:

where i, j = 1, 2, ..., n; n is the number of rows of N;

where i, j = 1, 2, ..., n; n is the number of rows of N, and m is the number of columns of N;
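The formulas for Step 4 are not reproduced in this text, so the sketch below substitutes a common stand-in, the mean absolute-difference measure r_ij = 1 - (1/m) Σ_k |x_ik - x_jk|. It is an assumption chosen only because it yields a reflexive, symmetric matrix with entries in [0, 1] when the inputs lie in [0, 1]; the patent's own measure may differ. The function name `fuzzy_similarity` is hypothetical.

```python
def fuzzy_similarity(matrix):
    """Build an n x n fuzzy similarity matrix from an n x m matrix.

    Assumed measure (not taken from the source):
        r_ij = 1 - (1/m) * sum_k |x_ik - x_jk|
    Inputs are expected to lie in [0, 1].
    """
    n = len(matrix)
    m = len(matrix[0])
    return [
        [
            1.0 - sum(abs(a - b) for a, b in zip(matrix[i], matrix[j])) / m
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy 3 x 2 matrix: three pages, two encoded marks each.
N = [[1.0, 0.4], [0.0, 0.0], [1.0, 1.0]]
R = fuzzy_similarity(N)
for row in R:
    print(row)
```

With n rows in N the result is always n x n, matching the 50*2 to 50*50 shape change described in the embodiments.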

Step 5：

The fuzzy similarity matrix is converted into a fuzzy equivalent matrix, with the following formula:

n is a natural number; p is the number of rows of R;

until the condition R^b · R^b = R^b is met, at which point the matrix has converged;
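One way to read Step 5 is as repeated squaring under max-min composition, stopping once squaring no longer changes the matrix. The composition formula is not reproduced in this text, so max-min is an assumption here (it is the usual choice for fuzzy equivalence closures); the names `max_min_compose` and `fuzzy_equivalent` are hypothetical.

```python
def max_min_compose(a, b):
    """Max-min composition of two square fuzzy matrices (assumed operator)."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def fuzzy_equivalent(r):
    """Square a reflexive fuzzy similarity matrix until R^b composed with R^b equals R^b.

    Returns (closure, b). Exact comparison is safe because max and min only
    select values already present in the matrix.
    """
    power = 1
    while True:
        squared = max_min_compose(r, r)
        if squared == r:
            return r, power
        r, power = squared, power * 2


# Toy similarity matrix for illustration only.
R = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.5], [0.2, 0.5, 1.0]]
closure, b = fuzzy_equivalent(R)
print(b, closure)
```

The returned exponent corresponds to the value the embodiments report at convergence (for example 8 or 16), and the matrix keeps its original dimensions throughout.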

Step 6：

For the converged matrix, each confidence value λ in [0, 1] is taken in turn and the corresponding cut matrix (level matrix) is computed;

Step 7：

For each cut matrix (level matrix), clustering produces multiple sets. The first website in each set is selected in turn and judged manually as a fraud webpage or a non-fraud webpage: if it is a fraud webpage, the set is considered to belong to fraud webpages; if it is a non-fraud webpage, the set is considered to belong to non-fraud webpages.
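Step 7 can be sketched as below, assuming the 0/1 matrix comes from a fuzzy equivalent matrix, so that rows belonging to the same set are identical. The `judge` callback stands in for the manual judgement of the first website in each set; it and the function names are hypothetical.

```python
def clusters_from_cut(cut):
    """Group row indices of a 0/1 equivalence matrix by identical rows."""
    groups = {}
    for index, row in enumerate(cut):
        groups.setdefault(tuple(row), []).append(index)
    return list(groups.values())


def label_clusters(clusters, judge):
    """Give every member of a set the verdict reached for its first member.

    judge(index) returns "S" (fraud) or "F" (non-fraud) and represents
    the manual judgement described in the text.
    """
    labels = {}
    for members in clusters:
        verdict = judge(members[0])
        for index in members:
            labels[index] = verdict
    return labels


# Toy cut matrix: pages 0 and 1 form one set, page 2 another.
cut = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
clusters = clusters_from_cut(cut)
manual_verdicts = {0: "S", 2: "F"}  # made-up judgements for the first members
labels = label_clusters(clusters, manual_verdicts.get)
print(clusters, labels)
```

Only one manual judgement per set is needed, which is where the method saves effort compared with judging every page.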

The positive effect of the present invention is as follows: the fraud webpage identification problem is solved through collaborative division of labour combined with fuzzy theory. The quality of a webpage is judged by different users, and the data set of user marks is then analysed by computer, which addresses the technical problem that existing fraud webpage identification methods depend heavily on the webpage itself. The technical solution is simple and effective, and has important practical value for future search engines.

Detailed description of the embodiments

To illustrate the technical solution of the present invention more clearly, three embodiments are given below according to the technical solution described above. A person of ordinary skill in the art can apply the technical solution in practical engineering without creative effort.

Embodiment 1

Step 1: After browsing a webpage, the user evaluates it and gives a choice from the four preset webpage marks (F, S, B, U). For example, "362 F U" indicates that the website with id 362 has marks from two users, F and U respectively.
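A record in the form shown above could be parsed as follows. This is a minimal sketch: the function name `parse_record` is hypothetical, and the record layout (a numeric id followed by whitespace-separated marks) is inferred from the single example in the text.

```python
import re

VALID_MARKS = {"F", "S", "B", "U"}


def parse_record(line):
    """Split a record such as '362 F U' into (page_id, [marks])."""
    match = re.match(r"\s*(\d+)\s*(.*)", line)
    if match is None:
        raise ValueError(f"no page id found in {line!r}")
    page_id = int(match.group(1))
    marks = match.group(2).split()
    unknown = [mark for mark in marks if mark not in VALID_MARKS]
    if unknown:
        raise ValueError(f"unknown marks: {unknown}")
    return page_id, marks


print(parse_record("362 F U"))
```

The number of marks returned for a page is what later decides which matrix the page is assigned to.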

Step 2: To meet the requirements of the embodiment, we use the data set webspam-uk2007 ("WebSpam Collections", http://chato.cl/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/) to verify the recognition rate of the clustering experiment.

Step 3: Select 50 records from the data set, each marked by 2 users, to generate a 50*2 matrix M.

Step 4: Compute the fuzzy similarity matrix of this matrix according to the formulas, obtaining a 50*50 matrix R.

The calculation formulas include:

where i, j = 1, 2, ..., n; n is the number of rows of N;

where i, j = 1, 2, ..., n; n is the number of rows of N, and m is the number of columns of N;

Step 5: For the matrix R produced in Step 4, compute the fuzzy equivalent matrix using the formula. The calculation gives m = 8, i.e. R⁸·R⁸ = R⁸; at this point R is still a 50*50 matrix.

The formula is as follows:

n is a natural number; p is the number of rows of R;

until the condition R^b · R^b = R^b is met, at which point the matrix has converged;

Step 6: The distinct element values in the matrix, sorted in descending order and denoted λ, are 1 > 0.9 > 0.8. Taking λ = 1, 0.9 and 0.8 in turn, the corresponding cut matrix (level matrix) is computed. When λ = 1, every value less than 1 in the matrix is replaced by 0, giving the first cut matrix. When λ = 0.9, every value greater than or equal to 0.9 is replaced by 1 and every value less than 0.9 is replaced by 0, giving the second cut matrix. When λ = 0.8, every value greater than or equal to 0.8 is replaced by 1, giving the third cut matrix.
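The thresholding described in Step 6 can be sketched as below. The helper names `distinct_levels` and `lambda_cut` are hypothetical, and the 3 x 3 matrix is a toy example rather than data from the embodiment.

```python
def distinct_levels(matrix):
    """Return the distinct entries of the matrix in descending order."""
    return sorted({value for row in matrix for value in row}, reverse=True)


def lambda_cut(matrix, lam):
    """Replace entries >= lam with 1 and all other entries with 0."""
    return [[1 if value >= lam else 0 for value in row] for row in matrix]


# Toy fuzzy equivalent matrix for illustration only.
equivalent = [[1.0, 0.9, 0.8], [0.9, 1.0, 0.8], [0.8, 0.8, 1.0]]
for lam in distinct_levels(equivalent):
    print(lam, lambda_cut(equivalent, lam))
```

Lowering λ can only merge sets, never split them, which is why the smallest level in each embodiment ends with a single set.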

Step 7：

When λ = 1, clustering produces 5 sets. The first website in each set is selected in turn and judged manually as a fraud webpage or a non-fraud webpage: if it is a fraud webpage, the set is considered to belong to fraud webpages; if it is a non-fraud webpage, the set is considered to belong to non-fraud webpages. The results of the embodiment are shown in the following table (for each website in each set, the corresponding recognition rate is verified against the judgment provided by the data set):

When λ = 0.9, clustering produces 4 sets. The first website in each set is selected in turn and judged manually as a fraud webpage or a non-fraud webpage: if it is a fraud webpage, the set is considered to belong to fraud webpages; if it is a non-fraud webpage, the set is considered to belong to non-fraud webpages. The results of the embodiment are shown in the following table (for each website in each set, the corresponding recognition rate is verified against the judgment provided by the data set):

When λ = 0.8, clustering produces 1 set, which marks the completion of Embodiment 1.
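The recognition-rate check mentioned in Step 7 of this embodiment could be computed along the following lines. The definition used here (the fraction of pages whose set-level label agrees with the data set's judgment) is an assumption, as the text does not spell out the formula; `recognition_rate` is a hypothetical name and the values are made up for illustration, not results from the embodiment.

```python
def recognition_rate(predicted, truth):
    """Fraction of pages whose propagated label matches the reference label."""
    if not predicted:
        raise ValueError("no predictions to evaluate")
    hits = sum(1 for page, label in predicted.items() if truth[page] == label)
    return hits / len(predicted)


# Made-up labels for three pages: set-level verdicts vs. data set judgments.
predicted = {0: "S", 1: "S", 2: "F"}
truth = {0: "S", 1: "F", 2: "F"}
rate = recognition_rate(predicted, truth)
print(f"{rate:.3f}")
```

Computing the rate separately for each λ shows how coarser clustering trades manual effort against accuracy.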

Embodiment 2

Step 2: To meet the requirements of the embodiment, we use the data set webspam-uk2007 ("WebSpam Collections", http://chato.cl/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/) to verify the recognition rate of the clustering experiment.

Step 3: Select 100 records from the data set, each marked by 2 users, to generate a 100*2 matrix M.

Step 4: Compute the fuzzy similarity matrix of this matrix according to the formulas, obtaining a 100*100 matrix R.

The calculation formulas include:

where i, j = 1, 2, ..., n; n is the number of rows of N;

Step 5: For the matrix R produced in Step 4, compute the fuzzy equivalent matrix using the formula. The calculation gives m = 16, i.e. R¹⁶·R¹⁶ = R¹⁶; at this point R is still a 100*100 matrix.

The formula is as follows:

n is a natural number; p is the number of rows of R;

until the condition R^b · R^b = R^b is met, at which point the matrix has converged;

Step 7：

When λ = 1, clustering produces 8 sets. The first website in each set is selected in turn and judged manually as a fraud webpage or a non-fraud webpage: if it is a fraud webpage, the set is considered to belong to fraud webpages; if it is a non-fraud webpage, the set is considered to belong to non-fraud webpages. The results of the embodiment are shown in the following table (for each website in each set, the corresponding recognition rate is verified against the judgment provided by the data set):

When λ = 0.9, clustering produces 2 sets. The first website in each set is selected in turn and judged manually as a fraud webpage or a non-fraud webpage: if it is a fraud webpage, the set is considered to belong to fraud webpages; if it is a non-fraud webpage, the set is considered to belong to non-fraud webpages. The results of the embodiment are shown in the following table (for each website in each set, the corresponding recognition rate is verified against the judgment provided by the data set):

When λ = 0.8, clustering produces 1 set, which marks the completion of Embodiment 2.

Embodiment 3

Step 3: Select 200 records from the data set, each marked by 2 users, to generate a 200*2 matrix M.

Step 4: Compute the fuzzy similarity matrix of this matrix according to the formulas, obtaining a 200*200 matrix R.

The calculation formulas include:

where i, j = 1, 2, ..., n; n is the number of rows of N;

Step 5: For the matrix R produced in Step 4, compute the fuzzy equivalent matrix using the formula. The calculation gives m = 8, i.e. R⁸·R⁸ = R⁸; at this point R is still a 200*200 matrix.

The formula is as follows:

n is a natural number; p is the number of rows of R;

until the condition R^b · R^b = R^b is met, at which point the matrix has converged;

Step 7：

When λ = 1, clustering produces 9 sets. The first website in each set is selected in turn and judged manually as a fraud webpage or a non-fraud webpage: if it is a fraud webpage, the set is considered to belong to fraud webpages; if it is a non-fraud webpage, the set is considered to belong to non-fraud webpages. The results of the embodiment are shown in the following table (for each website in each set, the corresponding recognition rate is verified against the judgment provided by the data set):

When λ = 0.9, clustering produces 3 sets. The first website in each set is selected in turn and judged manually as a fraud webpage or a non-fraud webpage: if it is a fraud webpage, the set is considered to belong to fraud webpages; if it is a non-fraud webpage, the set is considered to belong to non-fraud webpages. The results of the embodiment are shown in the following table (for each website in each set, the corresponding recognition rate is verified against the judgment provided by the data set):

When λ = 0.8, clustering produces 1 set, which marks the completion of Embodiment 3.

Claims

1. A method for identifying fraud webpages using fuzzy theory, comprising the following steps:

Step 1：

After browsing a webpage, the user evaluates it and assigns a user mark, one of: "non-fraud webpage F", "fraud webpage S", "equivocal B" or "not known U";

Step 2：

Step 3：

Step 4：

where i, j = 1, 2, ..., n; n is the number of rows of N;

Step 5：

b = 1, 2, ..., n; n is a natural number; p is the number of rows of R;

until the condition R^b · R^b = R^b is met, at which point the matrix has converged;

Step 6：

Step 7：

For each cut matrix (level matrix), clustering produces multiple sets. The first website in each set is selected in turn and judged manually as a fraud webpage or a non-fraud webpage: if it is a fraud webpage, the set is considered to belong to fraud webpages; if it is a non-fraud webpage, the set is considered to belong to non-fraud webpages.