CN106355095B - Method for identifying fraud webpages using fuzzy theory - Google Patents

Method for identifying fraud webpages using fuzzy theory

Info

Publication number
CN106355095B
Authority
CN
China
Prior art keywords
webpage
matrix
fraud
fraud webpage
fuzzy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201611046454.8A
Other languages
Chinese (zh)
Other versions
CN106355095A (en)
Inventor
尚靖博
左祥麟
左万利
王英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University
Priority to CN201611046454.8A
Publication of CN106355095A
Application granted
Publication of CN106355095B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a method for identifying fraud webpages using fuzzy theory. It relates to a fraud webpage identification technique that does not depend on webpage features. The identification problem is solved by combining the idea of division of labor and collaboration with fuzzy theory: the quality of a webpage is judged by different users, and the data set of user labels is then analyzed by computer. This addresses the technical problem that existing fraud webpage identification methods depend heavily on the webpage itself. The technical solution is simple and effective and has significant practical value for future search engines.

Description

Method for identifying fraud webpages using fuzzy theory
Technical field
The present invention discloses a method for identifying fraud webpages using fuzzy theory. It relates to a fraud webpage identification technique that does not depend on webpage features, and belongs to the technical field of Internet security and services.
Background art
Search engines have become an indispensable tool for Internet users, but, driven by profit, fraud webpages are mixed into the Internet in large numbers. Cheaters use improper means to manipulate webpage ranking by exploiting the ranking strategies of search engines, obtaining a high rank that is out of proportion to the page's actual standing. This interferes with users' access to information and can even harm users' interests. Such pages are called fraud webpages. The methods cheaters use fall into four kinds: content-based, link-based, concealment-based (cloaking), and redirection-based. Previous anti-fraud research identifies these four kinds of deception in ways that depend too heavily on the webpage itself, so the recognition results remain valid only briefly. Finding a fraud webpage identification method that does not depend on webpage features is a major problem that urgently needs to be solved.
Summary of the invention
The present invention provides a method for identifying fraud webpages using fuzzy theory, that is, a fraud webpage identification method that does not depend on webpage features. It solves the problem that previous identification methods depend too heavily on the webpage itself and produce results that are valid only briefly.
The technical solution of the method for identifying fraud webpages using fuzzy theory according to the present invention comprises the following steps:
Step 1:
After browsing a webpage, a user evaluates it and assigns a user label, one of: "non-fraud webpage F", "fraud webpage S", "borderline B", or "unknown U";
Step 2:
At the end of each month, the search engine downloads the data set of all user labels for that month;
Step 3:
The data set is divided into several matrices M_i according to the number of different users who labelled each webpage, where i = 1, 2, ..., n;
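Steps 1 to 3 can be sketched as follows. The record format (a page id followed by one label per user, as in `362 F U`) is taken from the embodiments; the function name `partition_by_label_count` and the numeric encoding in `LABEL_VALUES` are assumptions made for illustration only, since the text does not state how labels are converted into matrix entries.

```python
from collections import defaultdict

# Hypothetical numeric encoding of the four labels; the source does not specify one.
LABEL_VALUES = {"F": 1.0, "B": 0.5, "U": 0.25, "S": 0.0}

def partition_by_label_count(records):
    """Group label records by how many users labelled each page.

    Each record is a string such as "362 F U": a page id followed by
    one label per user. Returns {label_count: (page_ids, matrix)}, where
    matrix has one row per page and one column per user.
    """
    groups = defaultdict(lambda: ([], []))
    for line in records:
        page_id, *labels = line.split()
        if not labels:
            continue  # skip pages that nobody labelled
        ids, rows = groups[len(labels)]
        ids.append(page_id)
        rows.append([LABEL_VALUES[label] for label in labels])
    return dict(groups)

records = ["362 F U", "17 S S", "90 F F F", "5 B S"]
matrices = partition_by_label_count(records)
ids, M = matrices[2]
print(ids)  # ids of the pages that have exactly two labels
print(M)    # the matching matrix, one row per page
```

Keeping the page ids next to each matrix makes it possible to map clusters found later back to concrete pages.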
Step 4:
Each matrix M_i, denoted N, is converted into a fuzzy similarity matrix R with elements R_ij, where i, j = 1, 2, ..., n and n ∈ R. The calculation formulas include:
[formula missing from the source text]
where i, j = 1, 2, ..., n, and n is the number of rows of N;
[formula missing from the source text]
where i, j = 1, 2, ..., n, n is the number of rows of N, and m is the number of columns of N;
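Because the formulas of Step 4 did not survive in this text, the sketch below substitutes a common textbook construction, the absolute-value subtraction method, purely to show the shape of the computation (an n x m data matrix in, an n x n similarity matrix out). The function name `fuzzy_similarity`, the scaling constant `c`, and the choice of method are all assumptions, not the patent's own formulas.

```python
def fuzzy_similarity(N, c=0.1):
    """Build an n x n fuzzy similarity matrix from an n x m data matrix N.

    Uses the absolute-value subtraction method,
        r_ij = 1 - c * sum_k |x_ik - x_jk|,
    as an assumed stand-in for the formulas missing from the text.
    c must be small enough that every r_ij stays inside [0, 1].
    """
    n = len(N)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            distance = sum(abs(a - b) for a, b in zip(N[i], N[j]))
            r = 1.0 - c * distance
            if not 0.0 <= r <= 1.0:
                raise ValueError("c is too large for this data")
            R[i][j] = r
    return R

# Three pages, two user labels each (hypothetical encoded values).
N = [[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
R = fuzzy_similarity(N, c=0.1)
for row in R:
    print([round(v, 2) for v in row])
```

The result is reflexive (ones on the diagonal) and symmetric, which are the two properties a fuzzy similarity matrix needs before Step 5 can make it transitive.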
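Step 5:
The fuzzy similarity matrix is converted into a fuzzy equivalence matrix using the following formula:
[formula missing from the source text]
where b = 1, 2, ..., n; n is a natural number; and p is the number of rows of R;
The computation is repeated until the condition R^b * R^b = R^b is met, at which point the matrix has converged;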
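One standard way to carry out Step 5 is the squaring method for the transitive closure: compose the matrix with itself under max-min composition until it no longer changes. Whether the missing formula is exactly this composition is an assumption, although the convergence at powers 8 and 16 reported in the embodiments is consistent with repeated squaring. The names `max_min_compose` and `transitive_closure` are illustrative.

```python
def max_min_compose(A, B):
    """Max-min composition: C[i][j] = max over k of min(A[i][k], B[k][j])."""
    n = len(A)
    return [[max(min(A[i][k], B[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def transitive_closure(R):
    """Square R under max-min composition until it stops changing.

    Assumes R is reflexive (ones on the diagonal), which guarantees
    that the loop terminates. Returns the converged matrix and the
    power b (1, 2, 4, 8, ...) for which R^b composed with itself
    equals R^b.
    """
    power = 1
    while True:
        squared = max_min_compose(R, R)
        if squared == R:
            return R, power
        R, power = squared, power * 2

R = [[1.0, 0.9, 0.8],
     [0.9, 1.0, 0.9],
     [0.8, 0.9, 1.0]]
closed, b = transitive_closure(R)
print(b)
for row in closed:
    print(row)
```

Exact equality is safe here because max and min only select existing entries and never create new floating-point values.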
Step 6:
For the converged matrix, every confidence value λ in [0, 1] is taken in turn and the corresponding λ-cut matrix is computed;
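A minimal sketch of Step 6, assuming the confidence values are the distinct entries of the converged matrix (as the embodiments do with 1 > 0.9 > 0.8). The helper names `lambda_cut` and `confidence_levels` and the small example matrix are hypothetical.

```python
def lambda_cut(R, lam):
    """Return the Boolean cut matrix: 1 where R[i][j] >= lam, otherwise 0."""
    return [[1 if value >= lam else 0 for value in row] for row in R]

def confidence_levels(R):
    """Distinct values occurring in R, largest first."""
    return sorted({value for row in R for value in row}, reverse=True)

# A small fuzzy equivalence matrix (already transitive under max-min).
R = [[1.0, 0.9, 0.8],
     [0.9, 1.0, 0.8],
     [0.8, 0.8, 1.0]]
for lam in confidence_levels(R):
    print(lam, lambda_cut(R, lam))
```

Lowering λ can only turn zeros into ones, so the clusters read off successive cut matrices merge and never split.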
Step 7:
For each λ-cut matrix, clustering generates multiple sets. The first website of each set is taken in turn and judged manually to be either a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the whole set is considered to consist of fraud webpages; if it is a non-fraud webpage, the whole set is considered to consist of non-fraud webpages.
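Step 7 can be sketched as below, assuming clusters are read off a cut matrix by grouping pages whose rows are identical, which holds when the cut comes from a fuzzy equivalence matrix. The manual judgment is modelled by a caller-supplied function `judge`; that function, the helper names, and the page ids are hypothetical.

```python
def clusters_from_cut(cut, page_ids):
    """Group pages whose rows in the cut matrix are identical.

    For a cut of a fuzzy equivalence matrix, identical rows mark one
    equivalence class. Order of first appearance is preserved.
    """
    groups = {}
    for page_id, row in zip(page_ids, cut):
        groups.setdefault(tuple(row), []).append(page_id)
    return list(groups.values())

def label_clusters(clusters, judge):
    """Ask `judge` about the first page of each cluster and give
    every page in that cluster the same verdict."""
    verdicts = {}
    for members in clusters:
        verdict = judge(members[0])
        for page_id in members:
            verdicts[page_id] = verdict
    return verdicts

cut = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
clusters = clusters_from_cut(cut, ["362", "17", "90"])
manual = {"362": "non-fraud", "90": "fraud"}  # hypothetical manual judgments
print(clusters)
print(label_clusters(clusters, manual.__getitem__))
```

Only one page per cluster is ever shown to the human judge, which is where the method saves manual effort.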
The beneficial effects of the present invention are as follows. The fraud webpage identification problem is solved by combining the idea of division of labor and collaboration with fuzzy theory: the quality of a webpage is judged by different users, and the data set of user labels is analyzed by computer. This addresses the technical problem that existing fraud webpage identification methods depend heavily on the webpage itself. The technical solution is simple and effective and has significant practical value for future search engines.
Detailed description of the embodiments
To illustrate the technical solution of the present invention more clearly, three embodiments are given below according to the technique described in the technical solution. A person of ordinary skill in the art can also apply the technical solution in practical engineering without creative effort.
Embodiment 1
Step 1: After browsing a webpage, the user evaluates it by choosing one of the four preset webpage labels (F, S, B, U). For example, the record 362 F U means that the website with id 362 has labels from two users, F and U respectively.
Step 2: To meet the requirements of the embodiment, we use the data set webspam-uk2007 ("WebSpam Collections", http://chato.cl/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/) to verify the recognition rate of the clustering experiment.
Step 3: 50 records with 2 user labels each are selected from the data set, generating a 50*2 matrix M.
Step 4: The fuzzy similarity matrix of this matrix is calculated according to the formulas, giving a 50*50 matrix R.
The calculation formulas include:
[formula missing from the source text]
where i, j = 1, 2, ..., n, and n is the number of rows of N;
[formula missing from the source text]
where i, j = 1, 2, ..., n, n is the number of rows of N, and m is the number of columns of N;
Step 5: For the matrix R produced in Step 4, the fuzzy equivalence matrix is calculated using the formula. The result is m = 8, that is, R^8 · R^8 = R^8; R is still a 50*50 matrix at this point.
The formula is as follows:
[formula missing from the source text]
where b = 1, 2, ..., n; n is a natural number; and p is the number of rows of R;
The computation is repeated until the condition R^b * R^b = R^b is met, at which point the matrix has converged;
Step 6: The elements contained in the matrix, arranged from largest to smallest and denoted λ, are 1 > 0.9 > 0.8. Taking λ = 1, 0.9, 0.8 in turn, the corresponding cut matrix is calculated for each. When λ = 1, all values in the matrix less than 1 are replaced with 0, generating the first cut matrix. When λ = 0.9, all values greater than or equal to 0.9 are replaced with 1 and all values less than 0.9 are replaced with 0, generating the second cut matrix. When λ = 0.8, all values greater than or equal to 0.8 are replaced with 1, generating the third cut matrix.
Step 7:
When λ = 1, clustering generates 5 sets. The first website of each set is taken in turn and judged manually to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the set is considered to consist of fraud webpages; if it is a non-fraud webpage, the set is considered to consist of non-fraud webpages. The results of the embodiment are given in the following table (for each website in each set, we verify the corresponding recognition rate against the judgment provided by the data set):
[table missing from the source text]
When λ = 0.9, clustering generates 4 sets. The first website of each set is taken in turn and judged manually to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the set is considered to consist of fraud webpages; if it is a non-fraud webpage, the set is considered to consist of non-fraud webpages. The results of the embodiment are given in the following table (for each website in each set, we verify the corresponding recognition rate against the judgment provided by the data set):
[table missing from the source text]
When λ = 0.8, clustering generates 1 set, which marks the completion of Embodiment 1.
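The verification described in Step 7 (checking each site in a set against the judgment provided by the data set) could be computed along the following lines. This is only a sketch: the manual judgment of the first site is simulated by looking up its data-set label, and the page ids, labels, and function name `recognition_rates` are hypothetical and not taken from webspam-uk2007 or from the missing result tables.

```python
def recognition_rates(clusters, truth):
    """For each cluster, use the first page's true label as the cluster
    verdict and report the fraction of members whose true label agrees."""
    report = []
    for members in clusters:
        verdict = truth[members[0]]
        correct = sum(1 for page_id in members if truth[page_id] == verdict)
        report.append((verdict, len(members), correct / len(members)))
    return report

# Hypothetical page ids and labels, used only to exercise the function.
truth = {"a": "non-fraud", "b": "non-fraud",
         "c": "fraud", "d": "fraud", "e": "non-fraud"}
clusters = [["a", "b"], ["c", "d", "e"]]
for verdict, size, rate in recognition_rates(clusters, truth):
    print(verdict, size, round(rate, 2))
```

A rate of 1.0 for a set means every member shares the verdict given to its first website; lower values show how much a coarser λ mixes fraud and non-fraud pages.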
Embodiment 2
Step 1: After browsing a webpage, the user evaluates it by choosing one of the four preset webpage labels (F, S, B, U). For example, the record 362 F U means that the website with id 362 has labels from two users, F and U respectively.
Step 2: To meet the requirements of the embodiment, we use the data set webspam-uk2007 ("WebSpam Collections", http://chato.cl/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/) to verify the recognition rate of the clustering experiment.
Step 3: 100 records with 2 user labels each are selected from the data set, generating a 100*2 matrix M.
Step 4: The fuzzy similarity matrix of this matrix is calculated according to the formulas, giving a 100*100 matrix R.
The calculation formulas include:
[formula missing from the source text]
where i, j = 1, 2, ..., n, and n is the number of rows of N;
[formula missing from the source text]
where i, j = 1, 2, ..., n, n is the number of rows of N, and m is the number of columns of N;
Step 5: For the matrix R produced in Step 4, the fuzzy equivalence matrix is calculated using the formula. The result is m = 16, that is, R^16 · R^16 = R^16; R is still a 100*100 matrix at this point.
The formula is as follows:
[formula missing from the source text]
where b = 1, 2, ..., n; n is a natural number; and p is the number of rows of R;
The computation is repeated until the condition R^b * R^b = R^b is met, at which point the matrix has converged;
Step 6: The elements contained in the matrix, arranged from largest to smallest and denoted λ, are 1 > 0.9 > 0.8. Taking λ = 1, 0.9, 0.8 in turn, the corresponding cut matrix is calculated for each. When λ = 1, all values in the matrix less than 1 are replaced with 0, generating the first cut matrix. When λ = 0.9, all values greater than or equal to 0.9 are replaced with 1 and all values less than 0.9 are replaced with 0, generating the second cut matrix. When λ = 0.8, all values greater than or equal to 0.8 are replaced with 1, generating the third cut matrix.
Step 7:
When λ = 1, clustering generates 8 sets. The first website of each set is taken in turn and judged manually to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the set is considered to consist of fraud webpages; if it is a non-fraud webpage, the set is considered to consist of non-fraud webpages. The results of the embodiment are given in the following table (for each website in each set, we verify the corresponding recognition rate against the judgment provided by the data set):
[table missing from the source text]
When λ = 0.9, clustering generates 2 sets. The first website of each set is taken in turn and judged manually to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the set is considered to consist of fraud webpages; if it is a non-fraud webpage, the set is considered to consist of non-fraud webpages. The results of the embodiment are given in the following table (for each website in each set, we verify the corresponding recognition rate against the judgment provided by the data set):
[table missing from the source text]
When λ = 0.8, clustering generates 1 set, which marks the completion of Embodiment 2.
Embodiment 3
Step 1: After browsing a webpage, the user evaluates it by choosing one of the four preset webpage labels (F, S, B, U). For example, the record 362 F U means that the website with id 362 has labels from two users, F and U respectively.
Step 2: To meet the requirements of the embodiment, we use the data set webspam-uk2007 ("WebSpam Collections", http://chato.cl/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.di.unimi.it/) to verify the recognition rate of the clustering experiment.
Step 3: 200 records with 2 user labels each are selected from the data set, generating a 200*2 matrix M.
Step 4: The fuzzy similarity matrix of this matrix is calculated according to the formulas, giving a 200*200 matrix R.
The calculation formulas include:
[formula missing from the source text]
where i, j = 1, 2, ..., n, and n is the number of rows of N;
[formula missing from the source text]
where i, j = 1, 2, ..., n, n is the number of rows of N, and m is the number of columns of N;
Step 5: For the matrix R produced in Step 4, the fuzzy equivalence matrix is calculated using the formula. The result is m = 8, that is, R^8 · R^8 = R^8; R is still a 200*200 matrix at this point.
The formula is as follows:
[formula missing from the source text]
where b = 1, 2, ..., n; n is a natural number; and p is the number of rows of R;
The computation is repeated until the condition R^b * R^b = R^b is met, at which point the matrix has converged;
Step 6: The elements contained in the matrix, arranged from largest to smallest and denoted λ, are 1 > 0.9 > 0.8. Taking λ = 1, 0.9, 0.8 in turn, the corresponding cut matrix is calculated for each. When λ = 1, all values in the matrix less than 1 are replaced with 0, generating the first cut matrix. When λ = 0.9, all values greater than or equal to 0.9 are replaced with 1 and all values less than 0.9 are replaced with 0, generating the second cut matrix. When λ = 0.8, all values greater than or equal to 0.8 are replaced with 1, generating the third cut matrix.
Step 7:
When λ = 1, clustering generates 9 sets. The first website of each set is taken in turn and judged manually to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the set is considered to consist of fraud webpages; if it is a non-fraud webpage, the set is considered to consist of non-fraud webpages. The results of the embodiment are given in the following table (for each website in each set, we verify the corresponding recognition rate against the judgment provided by the data set):
[table missing from the source text]
When λ = 0.9, clustering generates 3 sets. The first website of each set is taken in turn and judged manually to be a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the set is considered to consist of fraud webpages; if it is a non-fraud webpage, the set is considered to consist of non-fraud webpages. The results of the embodiment are given in the following table (for each website in each set, we verify the corresponding recognition rate against the judgment provided by the data set):
[table missing from the source text]
When λ = 0.8, clustering generates 1 set, which marks the completion of Embodiment 3.

Claims (1)

1. A method for identifying fraud webpages using fuzzy theory, comprising the following steps:
Step 1:
After browsing a webpage, a user evaluates it and assigns a user label, one of: "non-fraud webpage F", "fraud webpage S", "borderline B", or "unknown U";
Step 2:
At the end of each month, the search engine downloads the data set of all user labels for that month;
Step 3:
The data set is divided into several matrices M_i according to the number of different users who labelled each webpage, where i = 1, 2, ..., n;
Step 4:
Each matrix M_i, denoted N, is converted into a fuzzy similarity matrix R with elements R_ij, where i, j = 1, 2, ..., n and n ∈ R. The calculation formulas include:
[formula missing from the source text]
where i, j = 1, 2, ..., n, and n is the number of rows of N;
[formula missing from the source text]
where i, j = 1, 2, ..., n, n is the number of rows of N, and m is the number of columns of N;
Step 5:
The fuzzy similarity matrix is converted into a fuzzy equivalence matrix using the following formula:
[formula missing from the source text]
where b = 1, 2, ..., n; n is a natural number; and p is the number of rows of R;
The computation is repeated until the condition R^b * R^b = R^b is met, at which point the matrix has converged;
Step 6:
For the converged matrix, every confidence value λ in [0, 1] is taken in turn and the corresponding λ-cut matrix is computed;
Step 7:
For each λ-cut matrix, clustering generates multiple sets. The first website of each set is taken in turn and judged manually to be either a fraud webpage or a non-fraud webpage. If it is a fraud webpage, the whole set is considered to consist of fraud webpages; if it is a non-fraud webpage, the whole set is considered to consist of non-fraud webpages.
CN201611046454.8A 2016-11-23 2016-11-23 Method for identifying fraud webpages using fuzzy theory Expired - Fee Related CN106355095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611046454.8A CN106355095B (en) 2016-11-23 2016-11-23 Method for identifying fraud webpages using fuzzy theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611046454.8A CN106355095B (en) 2016-11-23 2016-11-23 Method for identifying fraud webpages using fuzzy theory

Publications (2)

Publication Number Publication Date
CN106355095A CN106355095A (en) 2017-01-25
CN106355095B true CN106355095B (en) 2018-10-19

Family

ID=57862809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611046454.8A Expired - Fee Related CN106355095B (en) 2016-11-23 2016-11-23 Method for identifying fraud webpages using fuzzy theory

Country Status (1)

Country Link
CN (1) CN106355095B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194281B (en) * 2017-05-25 2019-07-16 成都知道创宇信息技术有限公司 An anti-counterfeiting system based on blockchain technology
CN108985815A (en) * 2018-06-06 2018-12-11 阿里巴巴集团控股有限公司 A user identification method, device and equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102592067A (en) * 2011-01-17 2012-07-18 腾讯科技(深圳)有限公司 Webpage recognition method, device and system
CN103425736B (en) * 2013-06-24 2016-02-17 腾讯科技(深圳)有限公司 A web information recognition method, apparatus and system
CN103634306A (en) * 2013-11-18 2014-03-12 北京奇虎科技有限公司 Security detection method and security detection server for network data
CN104486461A (en) * 2014-12-29 2015-04-01 北京奇虎科技有限公司 Domain name classification method and device and domain name recognition method and system
CN105827611A (en) * 2016-04-06 2016-08-03 清华大学 Distributed denial-of-service network attack detection method and system based on fuzzy inference
CN106021487A (en) * 2016-05-19 2016-10-12 浙江工业大学 Internet of Things semantic event detection method based on fuzzy theory

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
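A clustering algorithm based on the transitive closure of fuzzy equivalence matrices; 赵磊; 《电脑知识与技术》 (Computer Knowledge and Technology); 2010-09-30; full text *
A construction method for intuitionistic fuzzy equivalence matrices; 雷英杰 et al.; 《系统工程理论与实践》 (Systems Engineering - Theory & Practice); 2007-07-31; full text *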

Also Published As

Publication number Publication date
CN106355095A (en) 2017-01-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181019

Termination date: 20201123

CF01 Termination of patent right due to non-payment of annual fee