CN106355095A - Method for identifying fraud website by utilizing fuzzy theory

Info

Publication number: CN106355095A
Application number: CN201611046454.8A
Authority: CN
Inventors: 尚靖博; 左祥麟; 左万利; 王英
Original assignee: Jilin University
Current assignee: Jilin University
Priority date: 2016-11-23
Filing date: 2016-11-23
Publication date: 2017-01-25
Anticipated expiration: 2036-11-23
Also published as: CN106355095B

Abstract

The invention discloses a method for identifying fraud websites using fuzzy theory and relates to a fraud website identification technique that does not depend on website features. The fraud website identification problem is solved through division and coordination of labor combined with fuzzy theory: website quality is judged by different users, and the data sets labeled by the users are analyzed by computer. This addresses the technical problem that existing fraud website identification methods depend heavily on the websites themselves. The method is simple and effective and has important practical value for future search engines.

Description

Method for identifying fraud webpages using fuzzy theory

Technical field

The present invention discloses a method for identifying fraud webpages using fuzzy theory. It relates to a fraud webpage identification technique that does not depend on webpage features and belongs to the technical field of Internet security and services.

Background art

Search engines have become an indispensable tool for Internet users, but, driven by profit, large numbers of fraud webpages have become mixed into the Internet. Fraudsters use improper means that target search engine ranking strategies to manually interfere with webpage ranking and obtain rankings disproportionately higher than the pages deserve. This interferes with users' access to information and can even harm users' interests. Such webpages are called fraud webpages. The techniques fraudsters use fall into four kinds: content-based, link-based, cloaking-based, and redirection-based. Previous anti-fraud research has targeted identification of these four kinds of deception, depends excessively on the webpage itself, and yields recognition results that remain valid only briefly. Finding a fraud webpage identification method that does not depend on webpage features is an important problem that urgently needs to be solved.

Summary of the invention

The present invention provides a method for identifying fraud webpages using fuzzy theory that does not depend on webpage features. It solves the problems that previous fraud webpage identification methods depend excessively on the webpage itself and that their recognition results remain valid only briefly.

The technical scheme of the method for identifying fraud webpages using fuzzy theory comprises the following steps:

Step 1:

After browsing a webpage, the user evaluates it and assigns a user label, which is one of "non-fraud webpage f", "fraud webpage s", "ambiguous b", or "unknown u";

Step 2:

At the end of each month, the data set of all user labels made during that month is downloaded through the search engine;

Step 3:

The data set is divided into several matrices M_i, where i = 1, 2, ..., n, according to the number of different users who labeled each webpage;

Step 4:

Each matrix M_i, denoted N, is converted into a fuzzy similarity matrix R with elements r_{ij}, where i, j = 1, 2, ..., n (n ∈ R). The formulas are:

r_{ij} = \begin{cases} 1, & i = j \\ 1 - 0.1 \cdot d(N_i, N_j), & i \neq j \end{cases}

where i, j = 1, 2, ..., n, and n is the number of rows of N;

d(N_i, N_j) = \sum_{k=1}^{m} | N_{ik} - N_{jk} |

where i, j = 1, 2, ..., n; n is the number of rows of N, and m is the number of columns of N;
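The two formulas of Step 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's own implementation: the function name `fuzzy_similarity` and the 0/1 encoding used in the sample matrix are assumptions, since the text does not say how the labels f, s, b, u are mapped to numbers.

```python
def fuzzy_similarity(N):
    """Build R with r_ij = 1 if i == j, else 1 - 0.1 * d(N_i, N_j),
    where d is the sum of absolute differences between rows i and j."""
    n = len(N)
    R = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d = sum(abs(a - b) for a, b in zip(N[i], N[j]))
                # round() only removes floating-point noise from 0.1 * d
                R[i][j] = round(1 - 0.1 * d, 10)
    return R


# Hypothetical sample: 4 webpages, 2 users each,
# assumed encoding 1 = labeled fraud, 0 = labeled non-fraud.
N = [[0, 0], [0, 1], [1, 1], [0, 0]]
R = fuzzy_similarity(N)
for row in R:
    print(row)
```

Because the coefficient is fixed at 0.1, the similarity stays within [0, 1] only while d(N_i, N_j) is at most 10.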

Step 5:

The fuzzy similarity matrix is converted into a fuzzy equivalent matrix; the formula is as follows:

where n is a natural number and p is the number of rows of R;

The computation is repeated until the condition R^b · R^b = R^b is met, at which point the matrix has converged;
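The formula for Step 5 is not reproduced in this text. A common way to obtain a fuzzy equivalent matrix is repeated squaring under max-min composition, and the sketch below assumes that method. The names `max_min_compose` and `fuzzy_equivalent` and the sample matrix are hypothetical.

```python
def max_min_compose(A, B):
    """Max-min composition: c_ij = max over k of min(a_ik, b_kj)."""
    p = len(A)
    return [[max(min(A[i][k], B[k][j]) for k in range(p)) for j in range(p)]
            for i in range(p)]


def fuzzy_equivalent(R):
    """Square R repeatedly until R^b composed with R^b equals R^b.

    Returns the converged matrix and the exponent b at which it stopped.
    """
    b = 1
    # For a reflexive matrix with p rows, about log2(p) squarings suffice;
    # the bound used here is a generous safety limit.
    for _ in range(len(R) + 1):
        squared = max_min_compose(R, R)
        if squared == R:
            return R, b
        R, b = squared, 2 * b
    raise ValueError("no convergence; is R reflexive and symmetric?")


# Hypothetical similarity matrix that is not yet transitive:
# pages 0 and 2 are each close to page 1 but far from each other.
R = [[1.0, 0.9, 0.5],
     [0.9, 1.0, 0.9],
     [0.5, 0.9, 1.0]]
T, b = fuzzy_equivalent(R)
print(b, T)
```

The returned exponent plays the role of the value reported in the embodiments, where convergence is stated as R⁸·R⁸ = R⁸ or R¹⁶·R¹⁶ = R¹⁶.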

Step 6:

From the converged matrix, every confidence value λ in [0, 1] that it contains is selected, and the corresponding λ-cut matrix is computed for each;

Step 7:

For each cut matrix, clustering produces multiple sets. The first website in each set is selected in turn and manually judged to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the whole set is considered to belong to fraud webpages; if it is a non-fraud webpage, the whole set is considered to belong to non-fraud webpages.
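Step 7 can be sketched as follows, assuming each cut matrix comes from a fuzzy equivalent matrix, so that webpages in the same set have identical rows. The manual judgment is replaced here by a hypothetical callback `judge`; the function names and the sample data are likewise assumptions made for illustration.

```python
def clusters_from_cut(cut):
    """Group row indices that share an identical row in the 0/1 cut matrix."""
    groups = {}
    for idx, row in enumerate(cut):
        groups.setdefault(tuple(row), []).append(idx)
    return list(groups.values())


def label_sets(sets, judge):
    """Judge only the first member of each set and give every member
    of that set the same verdict."""
    labels = {}
    for members in sets:
        verdict = judge(members[0])
        for page in members:
            labels[page] = verdict
    return labels


# Hypothetical cut matrix for 3 webpages at some threshold.
cut = [[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
sets = clusters_from_cut(cut)


def judge(page):
    # Stand-in for the manual judgment: 's' = fraud, 'f' = non-fraud.
    return "s" if page == 2 else "f"


labels = label_sets(sets, judge)
print(sets, labels)
```

Only one manual judgment per set is needed, which is why coarser thresholds (fewer, larger sets) reduce manual effort at the cost of lumping more pages under a single verdict.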

The positive effect of the present invention is as follows. The fraud webpage identification problem is solved through division and coordination of labor combined with fuzzy theory. Webpage quality is judged by different users, and the data sets labeled by the users are analyzed by computer, which solves the technical problem that existing fraud webpage identification methods depend heavily on the webpage. The technical scheme is simple and effective and has important practical value for future search engines.

Specific embodiment

To illustrate the technical solution of the present invention more clearly, three embodiments are given below in accordance with the technical scheme described above. Those of ordinary skill in the art can apply this technical scheme to practical projects without creative effort.

Embodiment 1

Step 1: After browsing a webpage, the user evaluates it and selects one of the four preset labels (f, s, b, u). For example, "362 f u" indicates that the website with id 362 was labeled by two users, with the labels f and u respectively.

Step 2: To meet the requirements of the embodiment, the data set WEBSPAM-UK2007 ("Web Spam Collections", http://chato.cl/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/) is used to verify the recognition rate of the clustering experiment.

Step 3: 50 records that were each labeled by 2 users are selected from the data set, producing a 50×2 matrix M.
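Steps 1 to 3 can be sketched as follows. The record format (an id followed by whitespace-separated labels), the function name `group_by_user_count`, and in particular the numeric `ENCODING` of the labels are assumptions made for illustration; the text does not specify how labels become matrix entries.

```python
from collections import defaultdict

# Hypothetical numeric encoding of the four labels (not given in the text).
ENCODING = {"f": 0, "s": 1, "b": 2, "u": 3}


def group_by_user_count(records):
    """Split label records into one matrix per number of labeling users.

    Returns two dicts keyed by user count: the webpage ids in row order,
    and the matrix whose rows are the encoded labels of each webpage.
    """
    ids = defaultdict(list)
    matrices = defaultdict(list)
    for line in records:
        page_id, *labels = line.split()
        k = len(labels)
        ids[k].append(page_id)
        matrices[k].append([ENCODING[label] for label in labels])
    return dict(ids), dict(matrices)


# Hypothetical records: three webpages with 2 labels, one with 1 label.
records = ["362 f u", "17 s s", "905 f", "48 f b"]
ids, matrices = group_by_user_count(records)
print(ids[2], matrices[2])
```

Selecting the first 50 rows of `matrices[2]` would correspond to the 50×2 matrix used in this embodiment.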

Step 4: The fuzzy similarity matrix is computed from this matrix according to the formulas, giving a 50×50 matrix R.

The formulas are:

r_{ij} = \begin{cases} 1, & i = j \\ 1 - 0.1 \cdot d(N_i, N_j), & i \neq j \end{cases}

where i, j = 1, 2, ..., n, and n is the number of rows of N;

d(N_i, N_j) = \sum_{k=1}^{m} | N_{ik} - N_{jk} |

where i, j = 1, 2, ..., n; n is the number of rows of N, and m is the number of columns of N;

Step 5: The fuzzy equivalent matrix is computed from the matrix R produced in Step 4 using the formula. The computation converges at exponent 8, i.e. R⁸·R⁸ = R⁸; at this point R is still a 50×50 matrix.

The formula is as follows:

where n is a natural number and p is the number of rows of R;

The computation is repeated until the condition R^b · R^b = R^b is met, at which point the matrix has converged;

Step 6: The distinct elements contained in the matrix are arranged in descending order and denoted λ: 1 > 0.9 > 0.8. Taking λ = 1, 0.9, 0.8 in turn, the corresponding cut matrices are computed. When λ = 1, all values in the matrix less than 1 are replaced with 0, producing the first cut matrix. When λ = 0.9, all values greater than or equal to 0.9 are replaced with 1 and all values less than 0.9 are replaced with 0, producing the second cut matrix. When λ = 0.8, all values greater than or equal to 0.8 are replaced with 1, producing the third cut matrix.
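The thresholding described in Step 6 can be sketched as follows. The function names `cut_levels` and `lambda_cut` and the 3×3 sample matrix are illustrative assumptions, not values taken from the embodiment.

```python
def cut_levels(T):
    """Distinct values of the converged matrix, largest first."""
    return sorted({value for row in T for value in row}, reverse=True)


def lambda_cut(T, lam):
    """Replace every value >= lam with 1 and every other value with 0."""
    return [[1 if value >= lam else 0 for value in row] for row in T]


# Hypothetical converged (fuzzy equivalent) matrix for 3 webpages.
T = [[1.0, 0.9, 0.8],
     [0.9, 1.0, 0.8],
     [0.8, 0.8, 1.0]]

levels = cut_levels(T)
cuts = {lam: lambda_cut(T, lam) for lam in levels}
for lam in levels:
    print(lam, cuts[lam])
```

Lowering λ can only turn more entries into 1, so the sets obtained at a lower threshold are merges of the sets obtained at a higher one.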

Step 7:

When λ = 1, clustering produces 5 sets. The first website in each set is selected in turn and manually judged to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the whole set is considered to belong to fraud webpages; if it is a non-fraud webpage, the whole set is considered to belong to non-fraud webpages. The results of the embodiment are as follows (for each website in each set, the judgment provided by the data set is used to verify the corresponding recognition rate):

When λ = 0.9, clustering produces 4 sets. The first website in each set is selected in turn and manually judged to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the whole set is considered to belong to fraud webpages; if it is a non-fraud webpage, the whole set is considered to belong to non-fraud webpages. The results of the embodiment are as follows (for each website in each set, the judgment provided by the data set is used to verify the corresponding recognition rate):

When λ = 0.8, clustering produces 1 set, which marks the completion of Embodiment 1.

Embodiment 2

Step 2: To meet the requirements of the embodiment, the data set WEBSPAM-UK2007 ("Web Spam Collections", http://chato.cl/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/) is used to verify the recognition rate of the clustering experiment.

Step 3: 100 records that were each labeled by 2 users are selected from the data set, producing a 100×2 matrix M.

Step 4: The fuzzy similarity matrix is computed from this matrix according to the formulas, giving a 100×100 matrix R.

The formulas are:

r_{ij} = \begin{cases} 1, & i = j \\ 1 - 0.1 \cdot d(N_i, N_j), & i \neq j \end{cases}

where i, j = 1, 2, ..., n, and n is the number of rows of N;

d(N_i, N_j) = \sum_{k=1}^{m} | N_{ik} - N_{jk} |

Step 5: The fuzzy equivalent matrix is computed from the matrix R produced in Step 4 using the formula. The computation converges at exponent 16, i.e. R¹⁶·R¹⁶ = R¹⁶; at this point R is still a 100×100 matrix.

The formula is as follows:

where n is a natural number and p is the number of rows of R;

The computation is repeated until the condition R^b · R^b = R^b is met, at which point the matrix has converged;

Step 7:

When λ = 1, clustering produces 8 sets. The first website in each set is selected in turn and manually judged to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the whole set is considered to belong to fraud webpages; if it is a non-fraud webpage, the whole set is considered to belong to non-fraud webpages. The results of the embodiment are as follows (for each website in each set, the judgment provided by the data set is used to verify the corresponding recognition rate):

When λ = 0.9, clustering produces 2 sets. The first website in each set is selected in turn and manually judged to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the whole set is considered to belong to fraud webpages; if it is a non-fraud webpage, the whole set is considered to belong to non-fraud webpages. The results of the embodiment are as follows (for each website in each set, the judgment provided by the data set is used to verify the corresponding recognition rate):

When λ = 0.8, clustering produces 1 set, which marks the completion of Embodiment 2.

Embodiment 3

Step 3: 200 records that were each labeled by 2 users are selected from the data set, producing a 200×2 matrix M.

Step 4: The fuzzy similarity matrix is computed from this matrix according to the formulas, giving a 200×200 matrix R.

The formulas are:

r_{ij} = \begin{cases} 1, & i = j \\ 1 - 0.1 \cdot d(N_i, N_j), & i \neq j \end{cases}

where i, j = 1, 2, ..., n, and n is the number of rows of N;

d(N_i, N_j) = \sum_{k=1}^{m} | N_{ik} - N_{jk} |

Step 5: The fuzzy equivalent matrix is computed from the matrix R produced in Step 4 using the formula. The computation converges at exponent 8, i.e. R⁸·R⁸ = R⁸; at this point R is still a 200×200 matrix.

The formula is as follows:

where n is a natural number and p is the number of rows of R;

The computation is repeated until the condition R^b · R^b = R^b is met, at which point the matrix has converged;

Step 7:

When λ = 1, clustering produces 9 sets. The first website in each set is selected in turn and manually judged to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the whole set is considered to belong to fraud webpages; if it is a non-fraud webpage, the whole set is considered to belong to non-fraud webpages. The results of the embodiment are as follows (for each website in each set, the judgment provided by the data set is used to verify the corresponding recognition rate):

When λ = 0.9, clustering produces 3 sets. The first website in each set is selected in turn and manually judged to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the whole set is considered to belong to fraud webpages; if it is a non-fraud webpage, the whole set is considered to belong to non-fraud webpages. The results of the embodiment are as follows (for each website in each set, the judgment provided by the data set is used to verify the corresponding recognition rate):

When λ = 0.8, clustering produces 1 set, which marks the completion of Embodiment 3.

Claims

1. A method for identifying fraud webpages using fuzzy theory, comprising the following steps:

Step 1:

After browsing a webpage, the user evaluates it and assigns a user label, which is one of "non-fraud webpage f", "fraud webpage s", "ambiguous b", or "unknown u";

Step 2:

Step 3:

Step 4:

r_{ij} = \begin{cases} 1, & i = j \\ 1 - 0.1 \cdot d(N_i, N_j), & i \neq j \end{cases}

where i, j = 1, 2, ..., n, and n is the number of rows of N;

d(N_i, N_j) = \sum_{k=1}^{m} | N_{ik} - N_{jk} |

Step 5:

where b = 1, 2, ..., n; n is a natural number; p is the number of rows of R;

The computation is repeated until the condition R^b · R^b = R^b is met, at which point the matrix has converged;

Step 6:

Step 7:

For each cut matrix, clustering produces multiple sets. The first website in each set is selected in turn and manually judged to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the whole set is considered to belong to fraud webpages; if it is a non-fraud webpage, the whole set is considered to belong to non-fraud webpages.