CN105119909B - Counterfeit website detection method and system based on page visual similarity

Info

Publication number
CN105119909B
Authority
CN
China
Prior art keywords
similarity
station address
detected
list
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510434950.XA
Other languages
Chinese (zh)
Other versions
CN105119909A (en)
Inventor
高胜
胡俊
何世平
徐原
赵慧
徐晓燕
刘婧
陈阳
李世淙
党向磊
饶毓
赵宸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN201510434950.XA
Publication of CN105119909A
Application granted
Publication of CN105119909B
Expired - Fee Related
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0004 Industrial image inspection
    • G06T7/001 Industrial image inspection using an image reference approach

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to a counterfeit website detection method and system based on page visual similarity. The method comprises: obtaining a list of website addresses to be detected; matching the addresses in the list against a preset whitelist one by one and, if an address matches, allowing the user to access it and adding a whitelist label to it; otherwise matching the current address against a preset blacklist and, if it matches, forbidding the user from accessing it and adding a blacklist label to it; if the current address is in neither the preset whitelist nor the preset blacklist, performing a similarity measurement between the web page content corresponding to the address to be detected and the web page content corresponding to the preset whitelist, obtaining the maximum similarity value and comparing it with a preset threshold T; and judging whether the list still contains an address without a label, in which case detection is repeated, and otherwise ending. Detecting counterfeit websites with this method reduces, to a certain extent, the probability that users access counterfeit websites and the losses caused by straying onto them.

Description

Counterfeit website detection method and system based on page visual similarity
Technical field
The present invention relates to a detection method and system, and in particular to a counterfeit website detection method and system based on web page visual similarity.
Background
Counterfeit websites are a means of attack by which phishers carry out online fraud, mainly to steal users' private information such as e-mail account passwords, credit card account passwords and e-commerce website account passwords. Counterfeit websites are mainly spread in the following ways:
1. links distributed by e-mail;
2. posts and replies in web forums;
3. counterfeit site information spread through social groups on social networks;
4. instant messaging (IM) tools, where online transactions or e-commerce are used to lure users to illegal counterfeit websites.
These websites usually make subtle alterations to the domain name of the counterfeited website, or their pages visually imitate the counterfeited pages very closely, so as to gain the user's trust by deception, obtain user information and harm the user's interests.
At present, the industry generally agrees on the following definitions:
Counterfeit website: a website whose address name or page content is visually very similar to that of a legitimate commercial website and which is intended to harm users' economic or other interests.
Whitelist: the list of legitimate websites to be protected, verified as legitimate by a certification authority. Generally, the website addresses that need protection are those that appear frequently in online transactions or e-commerce, for example e-commerce websites such as Taobao, eBay and JD.com (Jingdong), bank transaction systems such as Industrial and Commercial Bank of China and Bank of China, and popular entertainment pages such as "The Voice of China" and "China's Strongest Voice". These are all targets of counterfeiters.
Blacklist: the list of counterfeit websites verified by the relevant authorities. Websites on this list are obtained through user complaints, web price competition, manual screening or other means, and are confirmed as counterfeit by the regulatory authorities.
Existing counterfeit website detection and identification techniques are mostly based on blacklist and whitelist mechanisms. Given a website address to be detected, the whitelist or blacklist is queried to determine whether the address appears in the list, thereby identifying it as a legitimate or a counterfeit website. However, existing blacklist and whitelist techniques can only identify counterfeit websites already present in the blacklist; a counterfeit website that is not in the blacklist cannot be identified. Websites on the network change quickly, and criminals can continue their fraud by registering new website addresses, whereas existing identification techniques can update the blacklist database only after a report has been received or an incident has occurred. They therefore cannot detect and identify counterfeit websites in advance or give early risk warnings.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a counterfeit website detection method and system based on page visual similarity. The method can detect zero-day phishing websites and effectively reduces the losses of network users.
The object of the present invention is achieved by the following technical solution:
A counterfeit website detection method based on page visual similarity, the method comprising:
(1) searching for websites to be detected and obtaining a list of website addresses to be detected;
(2) matching the website addresses in the list one by one against a preset whitelist and judging whether they match; if an address matches, allowing the user to access it, adding a whitelist label to it and going to step (6); otherwise continuing with step (3);
(3) matching the current website address against a preset blacklist and judging whether it matches; if it matches, forbidding the user from accessing the address and adding a blacklist label to it;
(4) if the current website address is in neither the preset whitelist nor the preset blacklist, performing a similarity measurement between the web page content corresponding to the website address to be detected and the web page content corresponding to the preset whitelist, and obtaining the maximum similarity value;
(5) comparing the maximum similarity value with a preset threshold T;
(6) judging whether the list of website addresses to be detected still contains an address without a label; if so, executing step (2); otherwise ending.
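Steps (2) to (6) can be read as a single labelling loop. The following is a minimal sketch under stated assumptions, not the patented implementation: the function name `detect_counterfeits`, the label strings and the `max_similarity` callback (which stands in for the measurement of step (4)) are all hypothetical, and the "below T means non-counterfeit" rule is taken from the preferred comparison described for step (5).

```python
from typing import Callable, Dict, Iterable, Set


def detect_counterfeits(
    addresses: Iterable[str],
    whitelist: Set[str],
    blacklist: Set[str],
    max_similarity: Callable[[str], float],  # hypothetical stand-in for step (4)
    threshold: float,                        # preset threshold T
) -> Dict[str, str]:
    """Attach one label to every address in the list to be detected."""
    labels: Dict[str, str] = {}
    for address in addresses:  # step (6): continue until no unlabelled address remains
        if address in whitelist:  # step (2): access allowed
            labels[address] = "whitelist"
        elif address in blacklist:  # step (3): access forbidden
            labels[address] = "blacklist"
        elif max_similarity(address) < threshold:  # steps (4) and (5)
            labels[address] = "non-counterfeit"
        else:
            labels[address] = "counterfeit"
    return labels
```

Exact set membership is used here only for brevity; how an address is matched against a list entry is not specified in the text.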
Preferably, the website addresses to be detected in step (1) come from sources including instant messaging (IM) tools, mail servers, forums and virtual communities.
Preferably, the similarity measurement in step (4) uses the Earth Mover's Distance (EMD) algorithm and specifically comprises:
acquiring, one by one, the web page images corresponding to all websites in the preset whitelist;
extracting uniform resource locator (URL) information from the list of website addresses to be detected, and obtaining the web page image corresponding to the website to be detected by downloading it over the network, wherein the web page image comprises an overall image and an image of the visible area of the browser window;
matching the overall images corresponding to all websites in the preset whitelist, in turn, against the overall image corresponding to the website to be detected, to obtain an overall-image similarity sequence S1 consisting of multiple similarity values, then sorting the similarity values in S1 in descending order and choosing the maximum value;
matching the visible-area images of the browser window corresponding to all websites in the preset whitelist, in turn, against the visible-area image of the browser window corresponding to the website to be detected, to obtain a visible-area image similarity sequence S2 consisting of multiple similarity values, then sorting the similarity values in S2 in descending order and choosing the maximum value.
Further, obtaining the similarity values comprises extracting the RGB color space of the web page images corresponding to all websites in the preset whitelist and to the website to be detected, respectively, and recording the frequency with which the R, G and B components of the RGB space occur.
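Recording the frequency of the R, G and B components amounts to building a per-channel color histogram. The sketch below is illustrative only: the function name `rgb_frequencies`, the representation of an image as an iterable of `(r, g, b)` tuples with 8-bit values, and the `bins` parameter are assumptions, since the text does not say how the frequencies are binned or normalised.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # assumed 8-bit (r, g, b) values


def rgb_frequencies(pixels: Iterable[Pixel], bins: int = 8) -> Dict[str, List[float]]:
    """Return a normalised histogram for each of the R, G and B channels."""
    if bins <= 0 or 256 % bins != 0:
        raise ValueError("bins must be a positive divisor of 256")
    width = 256 // bins  # number of intensity levels per bin
    counts = {"R": Counter(), "G": Counter(), "B": Counter()}
    total = 0
    for r, g, b in pixels:
        counts["R"][r // width] += 1
        counts["G"][g // width] += 1
        counts["B"][b // width] += 1
        total += 1
    if total == 0:
        raise ValueError("image has no pixels")
    # Dividing by the pixel count makes images of different sizes comparable.
    return {ch: [counts[ch][i] / total for i in range(bins)] for ch in "RGB"}
```

With a real screenshot, the pixel tuples would come from an image library; that step is left out to keep the example dependency-free.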
Preferably, comparing the maximum similarity value with the preset threshold T in step (5) comprises: if either of the maximum similarity values of the similarity sequences S1 and S2 is less than the preset threshold T, determining that the current website address is a non-counterfeit website and adding a non-counterfeit label; otherwise, determining that the current website address is a counterfeit website and adding a counterfeit label.
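Read literally, this rule labels an address as counterfeit only when the maxima of both S1 and S2 reach the threshold. A minimal sketch of that reading follows; the function name `is_counterfeit` is hypothetical, and treating a value exactly equal to T as counterfeit follows from the "less than" wording.

```python
from typing import Sequence


def is_counterfeit(s1: Sequence[float], s2: Sequence[float], threshold: float) -> bool:
    """Apply the step (5) rule to the two similarity sequences.

    s1: overall-image similarities against every whitelisted site.
    s2: visible-area similarities against every whitelisted site.
    """
    if not s1 or not s2:
        raise ValueError("both similarity sequences must be non-empty")
    # Non-counterfeit if either maximum is below T; counterfeit otherwise.
    return max(s1) >= threshold and max(s2) >= threshold
```

Taking `max` directly is equivalent to sorting in descending order and choosing the first value, as the text describes.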
A counterfeit website detection system based on page visual similarity, comprising: a search module, for searching for websites to be detected and obtaining a list of website addresses to be detected;
a whitelist processing module, for matching the website addresses in the list one by one against a preset whitelist and judging whether they match and, if an address matches, allowing the user to access it and adding a whitelist label to it;
a blacklist processing module, for matching the current website address against a preset blacklist and judging whether it matches and, if it matches, forbidding the user from accessing the address and adding a blacklist label to it;
a detection module, for performing, if the current website address is in neither the preset whitelist nor the preset blacklist, a similarity measurement between the web page content corresponding to the website address to be detected and the web page content corresponding to the preset whitelist, and obtaining the maximum similarity value;
an evaluation module, for comparing the maximum similarity value with a preset threshold T.
Compared with the prior art, the present invention achieves the following beneficial effects:
The counterfeit website detection method and system disclosed by the invention detect counterfeit websites before the user's personal interests are harmed, reducing user losses.
The web page image of the target website is matched against the web page image of the website to be detected using an EMD-based distance algorithm, so that counterfeit websites can be detected on the basis of their essential characteristic, visual similarity. This reduces, to a certain extent, the probability that users access counterfeit websites, and thereby effectively spares users the disruption to their work and the economic losses caused by straying onto phishing websites.
Brief description of the drawings
Fig. 1 is a flow chart of the counterfeit website detection method based on page visual similarity provided by the invention.
Detailed description of the embodiments
Specific embodiments of the invention are described in further detail below with reference to the accompanying drawing.
As shown in Fig. 1, a counterfeit website detection method based on page visual similarity comprises:
(1) Websites to be detected are searched for, and a list of website addresses to be detected is obtained. The website addresses to be detected come from sources including instant messaging (IM) tools, mail servers, forums and virtual communities.
(2) The website addresses in the list are matched one by one against a preset whitelist to judge whether they match; if an address matches, the user is allowed to access it, a whitelist label is added to it, and the method goes to step (6); otherwise the method continues with step (3). The whitelist is organized mainly by application service: when the application is a banking system, the official website of China Everbright Bank, http://www.cebbank.com, the official website of China Construction Bank, "http://www.ccb.com", and the like can be added to the whitelist of this patent;
when the application is an entertainment website, the official website of "Where Are We Going, Dad?", "http://www.hunantv.com/v/2013/bbqne", can be added to the whitelist of this patent.
(3) The current website address is matched against a preset blacklist to judge whether it matches; if it matches, the user is forbidden from accessing the address and a blacklist label is added to it.
The blacklist consists mainly of the known counterfeit websites corresponding to each protected website in the whitelist; these counterfeit websites are usually reported by network users and listed after confirmation by the regulatory authorities.
(4) If the current website address is in neither the preset whitelist nor the preset blacklist, a similarity measurement is performed between the web page content corresponding to the website address to be detected and the web page content corresponding to the preset whitelist, and the maximum similarity value is obtained. The similarity measurement in step (4) uses the Earth Mover's Distance (EMD) algorithm and specifically comprises:
acquiring, one by one, the web page images corresponding to all websites in the preset whitelist;
extracting uniform resource locator (URL) information from the list of website addresses to be detected, and obtaining the web page image corresponding to the website to be detected by downloading it over the network, wherein the web page image comprises an overall image and an image of the visible area of the browser window;
matching the overall images corresponding to all websites in the preset whitelist, in turn, against the overall image corresponding to the website to be detected, to obtain an overall-image similarity sequence S1 consisting of multiple similarity values, then sorting the similarity values in S1 in descending order and choosing the maximum value;
matching the visible-area images of the browser window corresponding to all websites in the preset whitelist, in turn, against the visible-area image of the browser window corresponding to the website to be detected, to obtain a visible-area image similarity sequence S2 consisting of multiple similarity values, then sorting the similarity values in S2 in descending order and choosing the maximum value.
Obtaining the similarity values comprises extracting the RGB color space of the web page images corresponding to all websites in the preset whitelist and to the website to be detected, respectively, and recording the frequency with which the R, G and B components of the RGB space occur.
(5) The maximum similarity value is compared with a preset threshold T. The comparison in step (5) comprises: if either of the maximum similarity values of the similarity sequences S1 and S2 is less than the preset threshold T, determining that the current website address is a non-counterfeit website and adding a non-counterfeit label; otherwise, determining that the current website address is a counterfeit website and adding a counterfeit label.
(6) It is judged whether the list of website addresses to be detected still contains an address without a label; if so, step (2) is executed; otherwise the method ends.
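The EMD-based measurement of step (4) can be illustrated for one-dimensional color histograms. This is a sketch under stated assumptions rather than the method of the patent: it uses the known closed form of the Earth Mover's Distance for two normalised histograms with unit distance between adjacent bins (the sum of absolute cumulative differences), and the mapping from distance to a similarity in [0, 1] is an assumption, because the text does not define how a similarity value is derived from the EMD. The names `emd_1d`, `similarity` and `best_match` are hypothetical.

```python
from itertools import accumulate
from typing import Dict, Sequence, Tuple


def emd_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """EMD between two normalised histograms over the same ordered bins."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    # The running sum of (p_i - q_i) is the mass carried across each bin boundary.
    carried = accumulate(a - b for a, b in zip(p, q))
    return sum(abs(c) for c in carried)


def similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Assumed mapping: 1 for identical histograms, 0 at the largest possible EMD."""
    if len(p) < 2:
        raise ValueError("need at least two bins")
    return 1.0 - emd_1d(p, q) / (len(p) - 1)


def best_match(
    candidate: Sequence[float],
    whitelist_histograms: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the whitelisted site most similar to the candidate, with its score."""
    if not whitelist_histograms:
        raise ValueError("whitelist is empty")
    scores = {site: similarity(candidate, h) for site, h in whitelist_histograms.items()}
    site = max(scores, key=scores.get)
    return site, scores[site]
```

Running `best_match` once on overall-image histograms and once on visible-area histograms would yield the two maxima that step (5) compares with T.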
A counterfeit website detection system based on page visual similarity comprises: a search module, for searching for websites to be detected and obtaining a list of website addresses to be detected;
a whitelist processing module, for matching the website addresses in the list one by one against a preset whitelist and judging whether they match and, if an address matches, allowing the user to access it and adding a whitelist label to it;
a blacklist processing module, for matching the current website address against a preset blacklist and judging whether it matches and, if it matches, forbidding the user from accessing the address and adding a blacklist label to it;
a detection module, for performing, if the current website address is in neither the preset whitelist nor the preset blacklist, a similarity measurement between the web page content corresponding to the website address to be detected and the web page content corresponding to the preset whitelist, and obtaining the maximum similarity value;
an evaluation module, for comparing the maximum similarity value with a preset threshold T.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently substituted, and that any modification or equivalent substitution that does not depart from the spirit and scope of the invention falls within the scope of the claims of the invention.

Claims (4)

1. A counterfeit website detection method based on page visual similarity, characterized in that the method comprises:
(1) searching for websites to be detected and obtaining a list of website addresses to be detected;
(2) matching the website addresses in the list one by one against a preset whitelist and judging whether they match; if an address matches, allowing the user to access it, adding a whitelist label to it and going to step (6); otherwise continuing with step (3);
(3) matching the current website address against a preset blacklist and judging whether it matches; if it matches, forbidding the user from accessing the address and adding a blacklist label to it;
(4) if the current website address is in neither the preset whitelist nor the preset blacklist, performing a similarity measurement between the web page content corresponding to the website address to be detected and the web page content corresponding to the preset whitelist, and obtaining the maximum similarity value;
wherein the similarity measurement in step (4) uses the Earth Mover's Distance (EMD) algorithm, specifically:
acquiring, one by one, the web page images corresponding to all websites in the preset whitelist;
extracting uniform resource locator (URL) information from the list of website addresses to be detected, and obtaining the web page image corresponding to the website to be detected by downloading it over the network, wherein the web page image comprises an overall image and an image of the visible area of the browser window;
matching the overall images corresponding to all websites in the preset whitelist, in turn, against the overall image corresponding to the website to be detected, to obtain an overall-image similarity sequence S1 consisting of multiple similarity values, then sorting the similarity values in S1 in descending order and choosing the maximum value;
matching the visible-area images of the browser window corresponding to all websites in the preset whitelist, in turn, against the visible-area image of the browser window corresponding to the website to be detected, to obtain a visible-area image similarity sequence S2 consisting of multiple similarity values, then sorting the similarity values in S2 in descending order and choosing the maximum value;
(5) comparing the maximum similarity value with a preset threshold T; if either of the maximum similarity values of the similarity sequences S1 and S2 is less than the preset threshold T, determining that the current website address is a non-counterfeit website and adding a non-counterfeit label; otherwise, determining that the current website address is a counterfeit website and adding a counterfeit label;
(6) judging whether the list of website addresses to be detected still contains an address without a label; if so, executing step (2); otherwise ending.
2. The method according to claim 1, characterized in that the website addresses to be detected in step (1) come from sources including instant messaging (IM) tools, mail servers, forums and virtual communities.
3. The method according to claim 1, characterized in that obtaining the similarity values comprises: extracting the RGB color space of the web page images corresponding to all websites in the preset whitelist and to the website to be detected, respectively; and recording the frequency with which the R, G and B components of the RGB space occur.
4. A counterfeit website detection system based on page visual similarity, characterized by comprising:
a search module, for searching for websites to be detected and obtaining a list of website addresses to be detected;
a whitelist processing module, for matching the website addresses in the list one by one against a preset whitelist and judging whether they match and, if an address matches, allowing the user to access it and adding a whitelist label to it;
a blacklist processing module, for matching the current website address against a preset blacklist and judging whether it matches and, if it matches, forbidding the user from accessing the address and adding a blacklist label to it;
a detection module, for performing, if the current website address is in neither the preset whitelist nor the preset blacklist, a similarity measurement between the web page content corresponding to the website address to be detected and the web page content corresponding to the preset whitelist, and obtaining the maximum similarity value;
wherein the similarity measurement uses the Earth Mover's Distance (EMD) algorithm and specifically comprises: acquiring, one by one, the web page images corresponding to all websites in the preset whitelist;
extracting uniform resource locator (URL) information from the list of website addresses to be detected, and obtaining the web page image corresponding to the website to be detected by downloading it over the network, wherein the web page image comprises an overall image and an image of the visible area of the browser window;
matching the overall images corresponding to all websites in the preset whitelist, in turn, against the overall image corresponding to the website to be detected, to obtain an overall-image similarity sequence S1 consisting of multiple similarity values, then sorting the similarity values in S1 in descending order and choosing the maximum value;
matching the visible-area images of the browser window corresponding to all websites in the preset whitelist, in turn, against the visible-area image of the browser window corresponding to the website to be detected, to obtain a visible-area image similarity sequence S2 consisting of multiple similarity values, then sorting the similarity values in S2 in descending order and choosing the maximum value;
an evaluation module, for comparing the maximum similarity value with a preset threshold T; if either of the maximum similarity values of the similarity sequences S1 and S2 is less than the preset threshold T, determining that the current website address is a non-counterfeit website and adding a non-counterfeit label; otherwise, determining that the current website address is a counterfeit website and adding a counterfeit label.
CN201510434950.XA 2015-07-22 2015-07-22 A kind of counterfeit website detection method and system based on page visual similarity Expired - Fee Related CN105119909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510434950.XA CN105119909B (en) 2015-07-22 2015-07-22 A kind of counterfeit website detection method and system based on page visual similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510434950.XA CN105119909B (en) 2015-07-22 2015-07-22 A kind of counterfeit website detection method and system based on page visual similarity

Publications (2)

Publication Number Publication Date
CN105119909A CN105119909A (en) 2015-12-02
CN105119909B true CN105119909B (en) 2019-02-19

Family

ID=54667797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510434950.XA Expired - Fee Related CN105119909B (en) 2015-07-22 2015-07-22 A kind of counterfeit website detection method and system based on page visual similarity

Country Status (1)

Country Link
CN (1) CN105119909B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105763543B (en) * 2016-02-03 2019-08-30 百度在线网络技术(北京)有限公司 A kind of method and device identifying fishing website
CN107786529B (en) * 2016-08-31 2020-12-01 阿里巴巴集团控股有限公司 Website detection method, device and system
CN107968769A (en) * 2016-10-19 2018-04-27 中兴通讯股份有限公司 Webpage security detection method and device
CN106603490A (en) * 2016-11-10 2017-04-26 上海斐讯数据通信技术有限公司 Phishing website detecting method and system
CN110020075A (en) * 2017-10-20 2019-07-16 南京烽火软件科技有限公司 Device is excavated in illegal website automatically
CN107992547A (en) * 2017-11-27 2018-05-04 深信服科技股份有限公司 Apply dispositions method and device in a kind of website
CN110535806B (en) * 2018-05-24 2022-04-01 中国移动通信集团重庆有限公司 Method, device and equipment for monitoring abnormal website and computer storage medium
CN110855716B (en) * 2019-11-29 2020-11-06 北京邮电大学 Self-adaptive security threat analysis method and system for counterfeit domain names
CN113132340B (en) * 2020-01-16 2022-06-28 中国科学院信息工程研究所 Phishing website identification method based on vision and host characteristics and electronic device
CN111698256B (en) * 2020-06-17 2022-05-10 绿盟科技集团股份有限公司 Method and device for detecting illegal link
CN112348104B (en) * 2020-11-17 2023-08-18 百度在线网络技术(北京)有限公司 Identification method, device, equipment and storage medium for counterfeit program
CN114124564B (en) * 2021-12-03 2023-11-28 北京天融信网络安全技术有限公司 Method and device for detecting counterfeit website, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001045085A1 (en) * 1999-12-17 2001-06-21 Sony Corporation Method and apparatus for information processing, and medium for storing program
CN101145902A (en) * 2007-08-17 2008-03-19 东南大学 Fishing webpage detection method based on image processing
CN101820366A (en) * 2010-01-27 2010-09-01 南京邮电大学 Pre-fetching-based phishing web page detection method
CN103442014A (en) * 2013-09-03 2013-12-11 中国科学院信息工程研究所 Method and system for automatic detection of suspected counterfeit websites

Also Published As

Publication number Publication date
CN105119909A (en) 2015-12-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190219

CF01 Termination of patent right due to non-payment of annual fee