CN106355095A - Method for identifying fraud website by utilizing fuzzy theory - Google Patents

Method for identifying fraud website by utilizing fuzzy theory

Publication number
CN106355095A
Authority
CN
China
Prior art keywords
webpage
matrix
fraud
website
fraud webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611046454.8A
Other languages
Chinese (zh)
Other versions
CN106355095B (en)
Inventor
尚靖博
左祥麟
左万利
王英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University
Priority to CN201611046454.8A
Publication of CN106355095A
Application granted
Publication of CN106355095B
Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566: Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention discloses a method for identifying fraud websites using fuzzy theory, and relates to a fraud website identification technique that does not depend on website features. The identification problem is solved by combining division of labor and cooperation with fuzzy theory: the quality of a website is judged by different users, and the data sets labeled by those users are analyzed by computer. This solves the technical problem that existing fraud website identification methods depend heavily on the website itself. The method is simple and effective and has important practical value for future search engines.

Description

Method for identifying fraud web pages using fuzzy theory
Technical field
The present invention discloses a method for identifying fraud web pages using fuzzy theory. It relates to a fraud web page identification technique that does not depend on web page features, and belongs to the technical field of Internet security and services.
Background art
Search engines have become an indispensable tool for Internet users, but, driven by profit, large numbers of fraud web pages have become mixed into the Internet. Fraudsters use improper means to exploit search engine ranking strategies and manually interfere with the ordering of web pages, so as to obtain a ranking disproportionately higher than the pages deserve. This disrupts users' access to information and can even harm users' interests. Such pages are called fraud web pages. The techniques fraudsters use fall into four categories: content-based, link-based, cloaking-based, and redirection-based. Previous anti-fraud research has identified fraud pages by targeting these four techniques; it depends too heavily on the web page itself, and its recognition results remain valid only briefly. Finding a fraud web page identification method that does not depend on web page features is therefore a pressing problem.
Summary of the invention
The method of the present invention for identifying fraud web pages using fuzzy theory is a fraud web page identification method that does not depend on web page features. It solves the problem that previous identification methods depend too heavily on the web page itself and yield recognition results that remain valid only briefly.
The technical scheme of the method comprises the following steps:
Step 1:
After browsing a web page, a user evaluates it and assigns a user label, one of: "non-fraud web page" (f), "fraud web page" (s), "ambiguous" (b), or "unknown" (u);
Step 2:
At the end of each month, the data set of all user labels for that month is downloaded through the search engine;
Step 3:
The data set is divided into several matrices $M_i$, where $i = 1, 2, \ldots, n$, according to the number of distinct user labels each web page has received;
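The grouping in Step 3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the numeric encoding in `LABEL_CODE`, the record layout, and the function name `group_by_label_count` are all assumptions, since the text does not say how the labels f, s, b, u are turned into matrix entries.

```python
from collections import defaultdict

# Hypothetical numeric encoding of the four labels; the source does not specify one.
LABEL_CODE = {"f": 0, "b": 1, "u": 2, "s": 3}


def group_by_label_count(records):
    """Group (page_id, labels) records by how many user labels each page has.

    Returns a dict mapping label count -> {"ids": [...], "rows": [...]},
    where "rows" is the matrix M_i for pages with that many labels.
    """
    groups = defaultdict(lambda: {"ids": [], "rows": []})
    for page_id, labels in records:
        group = groups[len(labels)]
        group["ids"].append(page_id)
        group["rows"].append([LABEL_CODE[label] for label in labels])
    return dict(groups)


# Made-up records for illustration only.
records = [(362, ["f", "u"]), (363, ["s", "s"]), (364, ["f", "f", "b"])]
matrices = group_by_label_count(records)
print(matrices[2]["rows"])  # rows of the matrix for pages with two labels
```

Keeping the page ids alongside each row makes it possible to map a cluster found later back to the websites it contains.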
Step 4:
Each matrix $M_i$, denoted $N$, is converted into a fuzzy similarity matrix $R$ with elements $r_{ij}$, where $i, j = 1, 2, \ldots, n$. The formulas are:
$$r_{ij} = \begin{cases} 1, & i = j \\ 1 - 0.1 \cdot d(N_i, N_j), & i \neq j \end{cases}$$
where $i, j = 1, 2, \ldots, n$, and $n$ is the number of rows of $N$;
$$d(N_i, N_j) = \sum_{k=1}^{m} \left| n_{ik} - n_{jk} \right|$$
where $i, j = 1, 2, \ldots, n$; $n$ is the number of rows of $N$, and $m$ is the number of columns of $N$;
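The two formulas of Step 4 translate directly into code: the distance between two rows is the sum of absolute differences of their entries, and the similarity is 1 on the diagonal and 1 − 0.1·d elsewhere. The sketch below assumes the rows are already numeric; the sample rows and the rounding to six decimals (used only to suppress floating-point noise) are illustrative choices, not part of the source.

```python
def fuzzy_similarity(rows, coeff=0.1):
    """Build the fuzzy similarity matrix R from a numeric label matrix N.

    r_ij = 1 when i == j, else 1 - coeff * sum_k |n_ik - n_jk|.
    """
    n = len(rows)
    r = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist = sum(abs(a - b) for a, b in zip(rows[i], rows[j]))
                # Rounding only removes floating-point noise such as 0.30000000000000004.
                r[i][j] = round(1 - coeff * dist, 6)
    return r


# Made-up 3x2 matrix for illustration.
sample = [[0, 2], [3, 3], [0, 1]]
R = fuzzy_similarity(sample)
for row in R:
    print(row)
```

The matrix is symmetric with ones on the diagonal, which is what makes it a similarity (reflexive, symmetric) relation rather than an equivalence relation.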
Step 5:
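The fuzzy similarity matrix is converted into a fuzzy equivalence matrix; the formula is as follows:
where $b = 1, 2, \ldots, n$; $n$ is a natural number; $p$ is the number of rows of $R$;
The computation is repeated until the condition $R^{b} \circ R^{b} = R^{b}$ is satisfied, at which point the matrix has converged;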
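One common way to obtain a fuzzy equivalence matrix, assumed here rather than taken from the text, is the transitive closure by repeated max-min squaring: compose the matrix with itself until composing no longer changes it. The function names `compose` and `transitive_closure` are hypothetical, and the sample matrix is made up.

```python
def compose(a, b):
    """Max-min composition of two square fuzzy matrices."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(r):
    """Square r under max-min composition until it stops changing.

    Returns the converged matrix and the power of the original matrix
    that it corresponds to (1, 2, 4, 8, ...).
    """
    power = 1
    while True:
        squared = compose(r, r)
        if squared == r:  # converged: composing with itself changes nothing
            return r, power
        r, power = squared, power * 2


similarity = [
    [1.0, 0.9, 0.5],
    [0.9, 1.0, 0.8],
    [0.5, 0.8, 1.0],
]
closure, power = transitive_closure(similarity)
print(power)
for row in closure:
    print(row)
```

Because max and min only select existing entries, exact equality is a safe convergence test here; no new floating-point values are created during the iteration.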
Step 6:
For the converged matrix, every confidence level $\lambda \in [0, 1]$ is taken in turn and the corresponding $\lambda$-cut matrix is computed;
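A λ-cut replaces every entry greater than or equal to λ with 1 and every other entry with 0. The sketch below takes the candidate levels to be the distinct values that occur in the converged matrix, sorted from largest to smallest; the function names and the sample matrix are illustrative assumptions.

```python
def confidence_levels(matrix):
    """Distinct entries of the matrix, from largest to smallest."""
    return sorted({value for row in matrix for value in row}, reverse=True)


def lambda_cut(matrix, lam):
    """Boolean cut matrix: 1 where the entry is >= lam, else 0."""
    return [[1 if value >= lam else 0 for value in row] for row in matrix]


equivalence = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.8],
    [0.8, 0.8, 1.0],
]
for lam in confidence_levels(equivalence):
    print(lam, lambda_cut(equivalence, lam))
```

In practice the entries may carry floating-point noise, so rounding them before comparing against λ is a reasonable precaution.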
Step 7:
For each $\lambda$-cut matrix, clustering produces multiple sets. The first website of each set is taken in turn and manually judged to be a fraud web page or a non-fraud web page. If it is a fraud web page, the whole set is considered to belong to the fraud web pages; if it is a non-fraud web page, the whole set is considered to belong to the non-fraud web pages.
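The clustering and labeling of Step 7 can be sketched as below. It relies on the cut matrix of a fuzzy equivalence matrix being an ordinary equivalence relation, so the ones in a row list exactly the members of that row's set. The `judge` callable is a hypothetical stand-in for the manual judgment of the first website in each set.

```python
def clusters_from_cut(cut):
    """Read the equivalence classes off a 0/1 cut matrix."""
    seen, sets = set(), []
    for i, row in enumerate(cut):
        if i in seen:
            continue
        members = [j for j, value in enumerate(row) if value == 1]
        seen.update(members)
        sets.append(members)
    return sets


def label_sets(sets, judge):
    """Assign each set the verdict given to its first member.

    `judge` stands in for the manual decision and returns "s" (fraud)
    or "f" (non-fraud) for a single row index.
    """
    return [(members, judge(members[0])) for members in sets]


cut = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
sets = clusters_from_cut(cut)
labeled = label_sets(sets, lambda index: "s" if index == 2 else "f")
print(labeled)
```

Only one manual judgment is needed per set, which is where the method saves effort compared with judging every page.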
The positive effect of the present invention is as follows: the fraud web page identification problem is solved by combining division of labor and cooperation with fuzzy theory. The quality of a web page is decided by different users, and the data set labeled by those users is analyzed by computer, which solves the technical problem that existing fraud web page identification methods depend heavily on the web page itself. The technical scheme is simple and effective and has important practical value for future search engines.
Detailed description of the embodiments
To illustrate the technical scheme of the present invention more clearly, three embodiments that follow the steps of the technical scheme are given below. A person of ordinary skill in the art can apply this technical scheme to practical engineering without creative effort.
Embodiment 1
Step 1: After browsing a web page, a user chooses, according to their evaluation of the page, one of the four labels (f, s, b, u) preset for the page. For example, "362 f u" indicates that the website with id 362 has been labeled by two users, with labels f and u respectively.
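A record of this kind can be parsed with a few lines of code. The sketch assumes a whitespace-separated layout, an integer id followed by one label per user; the function name `parse_record` and the validation against the four labels are illustrative additions.

```python
VALID_LABELS = {"f", "s", "b", "u"}


def parse_record(line):
    """Split a record such as '362 f u' into (website id, list of labels)."""
    parts = line.split()
    site_id, labels = int(parts[0]), parts[1:]
    unknown = [label for label in labels if label not in VALID_LABELS]
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    return site_id, labels


site_id, labels = parse_record("362 f u")
print(site_id, labels)
```

The number of labels in the parsed list is what later determines which matrix the record is assigned to.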
Step 2: To meet the requirements of the embodiment, we use the data set WEBSPAM-UK2007 ("Web Spam Collections", http://chato.cl/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/) to verify the recognition rate of the clustering experiment.
Step 3: Select from the data set 50 records that each carry labels from 2 users, producing a 50×2 matrix $M$.
Step 4: Compute the fuzzy similarity matrix of this matrix according to the formulas, obtaining a 50×50 matrix $R$.
The formulas are:
$$r_{ij} = \begin{cases} 1, & i = j \\ 1 - 0.1 \cdot d(N_i, N_j), & i \neq j \end{cases}$$
where $i, j = 1, 2, \ldots, n$, and $n$ is the number of rows of $N$;
$$d(N_i, N_j) = \sum_{k=1}^{m} \left| n_{ik} - n_{jk} \right|$$
where $i, j = 1, 2, \ldots, n$; $n$ is the number of rows of $N$, and $m$ is the number of columns of $N$;
Step 5: For the matrix $R$ produced in Step 4, compute the fuzzy equivalence matrix using the formula. The result of the calculation is m = 8, i.e. $R^{8} \circ R^{8} = R^{8}$; at this point $R$ is still a 50×50 matrix.
The formula is as follows:
where $n$ is a natural number and $p$ is the number of rows of $R$;
The computation is repeated until the condition $R^{b} \circ R^{b} = R^{b}$ is satisfied, at which point the matrix has converged;
Step 6: The distinct elements of the matrix, arranged from largest to smallest and denoted $\lambda$, are 1 > 0.9 > 0.8. Taking $\lambda$ = 1, 0.9, 0.8 in turn, the cut matrix is computed for each. When $\lambda = 1$, all values in the matrix less than 1 are replaced by 0, producing the first $\lambda$-cut matrix. When $\lambda = 0.9$, all values greater than or equal to 0.9 are replaced by 1 and all values less than 0.9 are replaced by 0, producing the second $\lambda$-cut matrix. When $\lambda = 0.8$, all values greater than or equal to 0.8 are replaced by 1, producing the third $\lambda$-cut matrix.
Step 7:
When $\lambda = 1$, clustering produces 5 sets. The first website of each set is taken in turn and manually judged to be a fraud web page or a non-fraud web page. If it is a fraud web page, the whole set is considered to belong to the fraud web pages; if it is a non-fraud web page, the whole set is considered to belong to the non-fraud web pages. The results of the embodiment are as follows (for each website in each set, we use the judgment given in the data set to verify the corresponding recognition rate):
When $\lambda = 0.9$, clustering produces 4 sets. The first website of each set is taken in turn and manually judged to be a fraud web page or a non-fraud web page. If it is a fraud web page, the whole set is considered to belong to the fraud web pages; if it is a non-fraud web page, the whole set is considered to belong to the non-fraud web pages. The results of the embodiment are as follows (for each website in each set, we use the judgment given in the data set to verify the corresponding recognition rate):
When $\lambda = 0.8$, clustering produces 1 set, which marks the completion of Embodiment 1.
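The text does not define how the recognition rate is computed. One plausible reading, assumed here, is the fraction of websites whose judgment in the data set agrees with the verdict assigned to their set. The function name `recognition_rate` and the sample values are hypothetical and do not reproduce any result of the embodiment.

```python
def recognition_rate(sets, assigned, truth):
    """Fraction of websites whose data-set label matches their set's verdict.

    sets     -- list of lists of website indices
    assigned -- verdict per set, in the same order ("s" or "f")
    truth    -- mapping from website index to the data-set judgment
    """
    total = sum(len(members) for members in sets)
    correct = sum(
        1
        for members, verdict in zip(sets, assigned)
        for index in members
        if truth[index] == verdict
    )
    return correct / total


# Made-up values for illustration only.
sets = [[0, 1], [2]]
assigned = ["f", "s"]
truth = {0: "f", 1: "s", 2: "s"}
print(recognition_rate(sets, assigned, truth))
```

Computing this once per confidence level gives a simple way to compare how coarse or fine a clustering can be before the set-level verdicts start to disagree with the data set.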
Embodiment 2
Step 1: After browsing a web page, a user chooses, according to their evaluation of the page, one of the four labels (f, s, b, u) preset for the page. For example, "362 f u" indicates that the website with id 362 has been labeled by two users, with labels f and u respectively.
Step 2: To meet the requirements of the embodiment, we use the data set WEBSPAM-UK2007 ("Web Spam Collections", http://chato.cl/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/) to verify the recognition rate of the clustering experiment.
Step 3: Select from the data set 100 records that each carry labels from 2 users, producing a 100×2 matrix $M$.
Step 4: Compute the fuzzy similarity matrix of this matrix according to the formulas, obtaining a 100×100 matrix $R$.
The formulas are:
$$r_{ij} = \begin{cases} 1, & i = j \\ 1 - 0.1 \cdot d(N_i, N_j), & i \neq j \end{cases}$$
where $i, j = 1, 2, \ldots, n$, and $n$ is the number of rows of $N$;
$$d(N_i, N_j) = \sum_{k=1}^{m} \left| n_{ik} - n_{jk} \right|$$
where $i, j = 1, 2, \ldots, n$; $n$ is the number of rows of $N$, and $m$ is the number of columns of $N$;
Step 5: For the matrix $R$ produced in Step 4, compute the fuzzy equivalence matrix using the formula. The result of the calculation is m = 16, i.e. $R^{16} \circ R^{16} = R^{16}$; at this point $R$ is still a 100×100 matrix.
The formula is as follows:
where $n$ is a natural number and $p$ is the number of rows of $R$;
The computation is repeated until the condition $R^{b} \circ R^{b} = R^{b}$ is satisfied, at which point the matrix has converged;
Step 6: The distinct elements of the matrix, arranged from largest to smallest and denoted $\lambda$, are 1 > 0.9 > 0.8. Taking $\lambda$ = 1, 0.9, 0.8 in turn, the cut matrix is computed for each. When $\lambda = 1$, all values in the matrix less than 1 are replaced by 0, producing the first $\lambda$-cut matrix. When $\lambda = 0.9$, all values greater than or equal to 0.9 are replaced by 1 and all values less than 0.9 are replaced by 0, producing the second $\lambda$-cut matrix. When $\lambda = 0.8$, all values greater than or equal to 0.8 are replaced by 1, producing the third $\lambda$-cut matrix.
Step 7:
When $\lambda = 1$, clustering produces 8 sets. The first website of each set is taken in turn and manually judged to be a fraud web page or a non-fraud web page. If it is a fraud web page, the whole set is considered to belong to the fraud web pages; if it is a non-fraud web page, the whole set is considered to belong to the non-fraud web pages. The results of the embodiment are as follows (for each website in each set, we use the judgment given in the data set to verify the corresponding recognition rate):
When $\lambda = 0.9$, clustering produces 2 sets. The first website of each set is taken in turn and manually judged to be a fraud web page or a non-fraud web page. If it is a fraud web page, the whole set is considered to belong to the fraud web pages; if it is a non-fraud web page, the whole set is considered to belong to the non-fraud web pages. The results of the embodiment are as follows (for each website in each set, we use the judgment given in the data set to verify the corresponding recognition rate):
When $\lambda = 0.8$, clustering produces 1 set, which marks the completion of Embodiment 2.
Embodiment 3
Step 1: After browsing a web page, a user chooses, according to their evaluation of the page, one of the four labels (f, s, b, u) preset for the page. For example, "362 f u" indicates that the website with id 362 has been labeled by two users, with labels f and u respectively.
Step 2: To meet the requirements of the embodiment, we use the data set WEBSPAM-UK2007 ("Web Spam Collections", http://chato.cl/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/) to verify the recognition rate of the clustering experiment.
Step 3: Select from the data set 200 records that each carry labels from 2 users, producing a 200×2 matrix $M$.
Step 4: Compute the fuzzy similarity matrix of this matrix according to the formulas, obtaining a 200×200 matrix $R$.
The formulas are:
$$r_{ij} = \begin{cases} 1, & i = j \\ 1 - 0.1 \cdot d(N_i, N_j), & i \neq j \end{cases}$$
where $i, j = 1, 2, \ldots, n$, and $n$ is the number of rows of $N$;
$$d(N_i, N_j) = \sum_{k=1}^{m} \left| n_{ik} - n_{jk} \right|$$
where $i, j = 1, 2, \ldots, n$; $n$ is the number of rows of $N$, and $m$ is the number of columns of $N$;
Step 5: For the matrix $R$ produced in Step 4, compute the fuzzy equivalence matrix using the formula. The result of the calculation is m = 8, i.e. $R^{8} \circ R^{8} = R^{8}$; at this point $R$ is still a 200×200 matrix.
The formula is as follows:
where $n$ is a natural number and $p$ is the number of rows of $R$;
The computation is repeated until the condition $R^{b} \circ R^{b} = R^{b}$ is satisfied, at which point the matrix has converged;
Step 6: The distinct elements of the matrix, arranged from largest to smallest and denoted $\lambda$, are 1 > 0.9 > 0.8. Taking $\lambda$ = 1, 0.9, 0.8 in turn, the cut matrix is computed for each. When $\lambda = 1$, all values in the matrix less than 1 are replaced by 0, producing the first $\lambda$-cut matrix. When $\lambda = 0.9$, all values greater than or equal to 0.9 are replaced by 1 and all values less than 0.9 are replaced by 0, producing the second $\lambda$-cut matrix. When $\lambda = 0.8$, all values greater than or equal to 0.8 are replaced by 1, producing the third $\lambda$-cut matrix.
Step 7:
When $\lambda = 1$, clustering produces 9 sets. The first website of each set is taken in turn and manually judged to be a fraud web page or a non-fraud web page. If it is a fraud web page, the whole set is considered to belong to the fraud web pages; if it is a non-fraud web page, the whole set is considered to belong to the non-fraud web pages. The results of the embodiment are as follows (for each website in each set, we use the judgment given in the data set to verify the corresponding recognition rate):
When $\lambda = 0.9$, clustering produces 3 sets. The first website of each set is taken in turn and manually judged to be a fraud web page or a non-fraud web page. If it is a fraud web page, the whole set is considered to belong to the fraud web pages; if it is a non-fraud web page, the whole set is considered to belong to the non-fraud web pages. The results of the embodiment are as follows (for each website in each set, we use the judgment given in the data set to verify the corresponding recognition rate):
When $\lambda = 0.8$, clustering produces 1 set, which marks the completion of Embodiment 3.

Claims (1)

1. A method for identifying fraud web pages using fuzzy theory, comprising the following steps:
Step 1:
After browsing a web page, a user evaluates it and assigns a user label, one of: "non-fraud web page" (f), "fraud web page" (s), "ambiguous" (b), or "unknown" (u);
Step 2:
At the end of each month, the data set of all user labels for that month is downloaded through the search engine;
Step 3:
The data set is divided into several matrices $M_i$, where $i = 1, 2, \ldots, n$, according to the number of distinct user labels each web page has received;
Step 4:
Each matrix $M_i$, denoted $N$, is converted into a fuzzy similarity matrix $R$ with elements $r_{ij}$, where $i, j = 1, 2, \ldots, n$. The formulas are:
$$r_{ij} = \begin{cases} 1, & i = j \\ 1 - 0.1 \cdot d(N_i, N_j), & i \neq j \end{cases}$$
where $i, j = 1, 2, \ldots, n$, and $n$ is the number of rows of $N$;
$$d(N_i, N_j) = \sum_{k=1}^{m} \left| n_{ik} - n_{jk} \right|$$
where $i, j = 1, 2, \ldots, n$; $n$ is the number of rows of $N$, and $m$ is the number of columns of $N$;
Step 5:
The fuzzy similarity matrix is converted into a fuzzy equivalence matrix; the formula is as follows:
where $b = 1, 2, \ldots, n$; $n$ is a natural number; $p$ is the number of rows of $R$;
The computation is repeated until the condition $R^{b} \circ R^{b} = R^{b}$ is satisfied, at which point the matrix has converged;
Step 6:
For the converged matrix, every confidence level $\lambda \in [0, 1]$ is taken in turn and the corresponding $\lambda$-cut matrix is computed;
Step 7:
For each $\lambda$-cut matrix, clustering produces multiple sets. The first website of each set is taken in turn and manually judged to be a fraud web page or a non-fraud web page. If it is a fraud web page, the whole set is considered to belong to the fraud web pages; if it is a non-fraud web page, the whole set is considered to belong to the non-fraud web pages.
CN201611046454.8A 2016-11-23 2016-11-23 Method for identifying fraud web pages using fuzzy theory Expired - Fee Related CN106355095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611046454.8A CN106355095B (en) 2016-11-23 2016-11-23 Method for identifying fraud web pages using fuzzy theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611046454.8A CN106355095B (en) 2016-11-23 2016-11-23 Method for identifying fraud web pages using fuzzy theory

Publications (2)

Publication Number Publication Date
CN106355095A (en) 2017-01-25
CN106355095B (en) 2018-10-19

Family

ID=57862809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611046454.8A Expired - Fee Related CN106355095B (en) 2016-11-23 2016-11-23 Method for identifying fraud web pages using fuzzy theory

Country Status (1)

Country Link
CN (1) CN106355095B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194281A (en) * 2017-05-25 2017-09-22 成都知道创宇信息技术有限公司 A kind of anti-fake system based on block chain technology
CN108985815A (en) * 2018-06-06 2018-12-11 阿里巴巴集团控股有限公司 A kind of user identification method, device and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592067A (en) * 2011-01-17 2012-07-18 腾讯科技(深圳)有限公司 Webpage recognition method, device and system
CN103634306A (en) * 2013-11-18 2014-03-12 北京奇虎科技有限公司 Security detection method and security detection server for network data
CN104486461A (en) * 2014-12-29 2015-04-01 北京奇虎科技有限公司 Domain name classification method and device and domain name recognition method and system
CN103425736B (en) * 2013-06-24 2016-02-17 腾讯科技(深圳)有限公司 A kind of web information recognition, Apparatus and system
CN105827611A (en) * 2016-04-06 2016-08-03 清华大学 Distributed rejection service network attack detection method and system based on fuzzy inference
CN106021487A (en) * 2016-05-19 2016-10-12 浙江工业大学 Internet of Things semantic event detection method based on fuzzy theory

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592067A (en) * 2011-01-17 2012-07-18 腾讯科技(深圳)有限公司 Webpage recognition method, device and system
CN103425736B (en) * 2013-06-24 2016-02-17 腾讯科技(深圳)有限公司 A kind of web information recognition, Apparatus and system
CN103634306A (en) * 2013-11-18 2014-03-12 北京奇虎科技有限公司 Security detection method and security detection server for network data
CN104486461A (en) * 2014-12-29 2015-04-01 北京奇虎科技有限公司 Domain name classification method and device and domain name recognition method and system
CN105827611A (en) * 2016-04-06 2016-08-03 清华大学 Distributed rejection service network attack detection method and system based on fuzzy inference
CN106021487A (en) * 2016-05-19 2016-10-12 浙江工业大学 Internet of Things semantic event detection method based on fuzzy theory

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
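Zhao Lei: "A clustering algorithm based on the transitive closure of fuzzy equivalence matrices", Computer Knowledge and Technology *
Lei Yingjie et al.: "Methods for constructing intuitionistic fuzzy equivalence matrices", Systems Engineering - Theory & Practice *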

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194281A (en) * 2017-05-25 2017-09-22 成都知道创宇信息技术有限公司 A kind of anti-fake system based on block chain technology
CN107194281B (en) * 2017-05-25 2019-07-16 成都知道创宇信息技术有限公司 A kind of anti-fake system based on block chain technology
CN108985815A (en) * 2018-06-06 2018-12-11 阿里巴巴集团控股有限公司 A kind of user identification method, device and equipment

Also Published As

Publication number Publication date
CN106355095B (en) 2018-10-19

Similar Documents

Publication Publication Date Title
CN103020164B (en) Semantic search method based on multi-semantic analysis and personalized sequencing
CN104123332B (en) The display methods and device of search result
CN106294883B (en) Based on user behavior data to the method and system analyzed on user behavior figure
CN101894134B (en) Spatial layout-based fishing webpage detection and implementation method
CN109934619A (en) User's portrait tag modeling method, apparatus, electronic equipment and readable storage medium storing program for executing
CN105653562B (en) The calculation method and device of correlation between a kind of content of text and inquiry request
CN106407349A (en) Product recommendation method and device
CN106021374A (en) Underlay recall method and device for query result
CN104462611A (en) Modeling method, ranking method, modeling device and ranking device for information ranking model
CN103279879A (en) Method for online valuation of used cars
CN104166732A (en) Project collaboration filtering recommendation method based on global scoring information
CN106021329A (en) A user similarity-based sparse data collaborative filtering recommendation method
CN105893585A (en) Label data-based bipartite graph model academic paper recommendation method
CN103778262A (en) Information retrieval method and device based on thesaurus
CN103164537B (en) A kind of method of search engine logs data mining of user oriented information requirement
CN103365842B (en) A kind of page browsing recommends method and device
Wu et al. How Web 1.0 fails: the mismatch between hyperlinks and clickstreams
CN106355095A (en) Method for identifying fraud website by utilizing fuzzy theory
CN103353865A (en) Barter electronic trading commodity recommendation method based on position
CN104123321B (en) A kind of determining method and device for recommending picture
CN104933149B (en) A kind of information search method and device
CN109034908A (en) A kind of film ranking prediction technique of combination sequence study
CN105653600A (en) Generation method and device of test question digest information
CN103093236B (en) A kind of pornographic filter method of mobile terminal analyzed based on image, semantic
CN101639856B (en) Webpage correlation evaluation device for detecting internet information spreading

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181019

Termination date: 20201123