WO2021258838A1 - Phishing website detection method and apparatus, and device and computer readable storage medium

Info

Publication number
WO2021258838A1
Authority
WO
WIPO (PCT)
Prior art keywords
website
information
tested
confidence value
phishing
Application number
PCT/CN2021/088986
Other languages
French (fr)
Chinese (zh)
Inventor
梁杰
范渊
Original Assignee
杭州安恒信息技术股份有限公司
Application filed by 杭州安恒信息技术股份有限公司
Publication of WO2021258838A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Abstract

A phishing website detection method, a phishing website detection apparatus, a computer device, and a computer readable storage medium. The phishing website detection method comprises: obtaining multiple pieces of feature information of a website to be detected; obtaining a confidence value and a weight value corresponding to each piece of feature information; determining a weighted confidence value of said website according to the confidence values and the weight values respectively corresponding to the multiple pieces of feature information; and in the case that the weighted confidence value is greater than a preset threshold, determining that said website is a phishing website.

Description

Phishing website detection method, apparatus, device, and computer-readable storage medium
Related application
This application claims priority to Chinese patent application No. 202010572801.0, filed on June 22, 2020 and entitled "Phishing website detection method, apparatus, device, and computer-readable storage medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of Internet security technology, and in particular to a phishing website detection method, a phishing website detection apparatus, a computer device, and a computer-readable storage medium.
Background
Related terms are explained as follows:
URL (Uniform Resource Locator): a network address.
Threat intelligence: According to Gartner's definition, threat intelligence is evidence-based knowledge about existing or emerging threats or hazards to assets. This knowledge includes context, mechanisms, indicators, implications, and actionable advice, and it can inform decisions on how to respond to a threat. From a security practitioner's perspective, threat intelligence refers to indicators of compromise that can be used to determine whether a target under test poses a security threat to a system.
IOC (Indicators of Compromise): the specific indicator items within threat intelligence.
Confidence: the degree to which the true value of a parameter falls, with a certain probability, within a range around the measured result.
Phishing website: a fake website that deceives users. Its interface is essentially identical to that of a genuine website, and it is used to defraud consumers or to steal the account and password information that visitors submit.
Current mainstream phishing website identification schemes fall into the following two types:
One is blacklist-based detection. This technique relies on websites that disclose phishing links, such as PhishTank, which list a large number of phishing websites and are updated frequently; the currently visited link is compared directly against the list to determine whether it is a phishing link. This method is intuitive, but its biggest problem is a high miss rate. Few websites in China disclose phishing links, and for some industries, such as finance, there are hardly any. Because the sample size is small and disclosure happens only after the fact, only a small fraction of the phishing websites across the whole network can be intercepted.
The other is offline detection based on link features. A feature model is built from the link length, the correlation between the link path and the host, and whether an encrypted protocol is used, and judgments are made according to that model. This approach is widely applicable, but it cannot incorporate existing threat intelligence in real time. Offline judgment produces a large number of false interceptions, which require additional manual review and confirmation, resulting in high cost and delay.
At present, no effective solution has been proposed for the low accuracy of phishing website detection results in the related art.
Summary
The embodiments of this application provide a phishing website detection method, a phishing website detection apparatus, a computer device, and a computer-readable storage medium, to at least solve the problem of low accuracy of phishing website detection results in the related art.
In a first aspect, an embodiment of this application provides a phishing website detection method, including: obtaining multiple pieces of feature information of a website to be tested; obtaining a confidence value and a weight value corresponding to each piece of feature information; determining a weighted confidence value of the website to be tested according to the confidence values and weight values respectively corresponding to the multiple pieces of feature information; and, when the weighted confidence value is greater than a preset threshold, determining that the website to be tested is a phishing website.
In some embodiments, the multiple pieces of feature information include first feature information, and obtaining the multiple pieces of feature information of the website to be tested includes: obtaining the URL of the website; and obtaining the first feature information according to the URL, where the first feature information includes at least one of the following: an IP, a domain name, a file hash value of an executable file, and Whois information.
In some embodiments, the multiple pieces of feature information further include second feature information, and obtaining the multiple pieces of feature information of the website to be tested further includes at least one of the following: obtaining, from an intelligence database, domain name information associated with the IP and IP information associated with the domain name, where the second feature information includes the domain name information associated with the IP and the IP information associated with the domain name; obtaining, through reverse Whois lookup, IP and domain name information associated with the Whois information, where the second feature information includes the IP and domain name information associated with the Whois information; and obtaining, through dynamic analysis of the executable file, the IP and domain name information that the executable file connects back to, where the second feature information includes the IP and domain name information that the executable file connects back to.
In some embodiments, obtaining the confidence value and weight value corresponding to each piece of feature information includes: matching the multiple pieces of feature information of the website to be tested against preset feature information in the intelligence database; and obtaining, from the matching result, the confidence value and weight value corresponding to each piece of feature information of the website to be tested.
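The matching step can be pictured as a keyed lookup. The following is a minimal sketch, assuming the intelligence database is a simple in-memory mapping from a (feature type, value) pair to a (confidence, weight) pair; the name `INTEL_DB`, the sample entries, and the choice to omit unmatched features are all illustrative assumptions, not details given in the application.

```python
# Hypothetical in-memory stand-in for the intelligence database.
# Keys are (feature type, value); values are (confidence, weight).
INTEL_DB = {
    ("domain", "login-example-bank.test"): (0.9, 5.0),
    ("ip", "203.0.113.7"): (0.6, 3.0),
}


def match_features(features, intel_db=INTEL_DB):
    """Return {feature: (confidence, weight)} for features found in intel_db.

    Features with no entry are left out of the result (an assumption;
    the application does not say how unmatched features are scored).
    """
    return {f: intel_db[f] for f in features if f in intel_db}


site_features = [("domain", "login-example-bank.test"), ("ip", "198.51.100.1")]
matched = match_features(site_features)
```

Here only the domain is found in the sample database, so `matched` holds a single entry.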
In some embodiments, obtaining the confidence value and weight value corresponding to each piece of feature information further includes: determining third feature information from the multiple pieces of feature information, where the third feature information includes at least one of the following obtained from the URL: network transmission protocol information and port information; and obtaining, according to the third feature information, the confidence value and weight value corresponding to the third feature information.
In some embodiments, after the weighted confidence value of the website to be tested is determined according to the confidence values and weight values respectively corresponding to the multiple pieces of feature information, the method further includes: determining whether the weighted confidence value is greater than a first preset threshold; if it is, determining that the website to be tested is a phishing website and denying access to it; determining whether the weighted confidence value is greater than a second preset threshold; and, if it is, determining that the website to be tested is a suspected phishing website and issuing alarm information indicating that the website to be tested is a suspected phishing website.
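The two-threshold decision can be sketched as follows. This is an illustration only: it assumes the first threshold is the higher of the two, and the numeric values and the labels `"phishing"`, `"suspected"`, and `"pass"` are hypothetical.

```python
def classify(weighted_confidence, first_threshold=0.8, second_threshold=0.5):
    """Map a weighted confidence value to an action label.

    Assumes first_threshold > second_threshold; both defaults are
    illustrative values, not values taken from the application.
    """
    if weighted_confidence > first_threshold:
        return "phishing"    # deny access to the website
    if weighted_confidence > second_threshold:
        return "suspected"   # issue alarm information
    return "pass"
```

Both comparisons are strict, matching the "greater than" wording, so a score exactly equal to a threshold falls into the lower tier.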
In some embodiments, after the website to be tested is determined to be a phishing website or a suspected phishing website, the method further includes: adding the URL of the website to be tested and its multiple pieces of feature information to the intelligence database.
In a second aspect, an embodiment of this application provides a phishing website detection apparatus, including: a first obtaining module, configured to obtain multiple pieces of feature information of a website to be tested; a second obtaining module, configured to obtain a confidence value and a weight value corresponding to each piece of feature information; a first determining module, configured to determine a weighted confidence value of the website to be tested according to the confidence values and weight values respectively corresponding to the multiple pieces of feature information; and a second determining module, configured to determine that the website to be tested is a phishing website when the weighted confidence value is greater than a preset threshold.
In a third aspect, an embodiment of this application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the phishing website detection method described in the first aspect.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the phishing website detection method described in the first aspect.
Compared with the related art, the phishing website detection method, phishing website detection apparatus, computer device, and computer-readable storage medium provided by the embodiments of this application obtain multiple pieces of feature information of the website to be tested; obtain the confidence value and weight value corresponding to each piece of feature information; determine the weighted confidence value of the website to be tested according to those confidence values and weight values; and determine that the website to be tested is a phishing website when the weighted confidence value is greater than a preset threshold. This solves the problem of low accuracy of phishing website detection results in the related art and improves the accuracy of those results.
Details of one or more embodiments of this application are set forth in the following drawings and description, so that other features, objects, and advantages of this application become easier to understand.
Description of the drawings
The drawings described here provide a further understanding of this application and form part of it. The exemplary embodiments of this application and their descriptions are used to explain this application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a phishing website detection method according to an embodiment of this application.
Fig. 2 is a schematic diagram of the associations among multiple pieces of feature information of a website to be tested according to an embodiment of this application.
Fig. 3 is a schematic flowchart of a phishing website detection method according to an optional embodiment of this application.
Fig. 4 is a structural block diagram of a phishing website detection apparatus according to an embodiment of this application.
Fig. 5 is a schematic diagram of the hardware structure of a computer device according to an embodiment of this application.
Detailed description
To make the objectives, technical solutions, and advantages of this application clearer, this application is described below with reference to the drawings and embodiments. The specific embodiments described here are only used to explain this application and are not intended to limit it. All other embodiments that a person of ordinary skill in the art obtains from the embodiments provided in this application without creative effort fall within the scope of protection of this application.
A reference to an "embodiment" in this application means that a specific feature, structure, or characteristic described in connection with that embodiment may be included in at least one embodiment of this application. Occurrences of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor do they refer to separate or alternative embodiments that are mutually exclusive with other embodiments. A person of ordinary skill in the art understands, explicitly and implicitly, that the embodiments described in this application can be combined with other embodiments where no conflict arises.
Unless otherwise defined, the technical or scientific terms used in this application have the ordinary meaning understood by a person of ordinary skill in the technical field to which this application belongs. Words such as "a", "an", "one", and "the" do not indicate a limit on quantity and may denote the singular or the plural. The terms "include", "comprise", "have", and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules (units) is not limited to the listed steps or units, and may also include steps or units that are not listed or that are inherent to the process, method, product, or device. Words such as "connected" and "coupled" are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects before and after it. The terms "first", "second", "third", and so on merely distinguish similar objects and do not represent a specific ordering of the objects.
This embodiment provides a phishing website detection method. Fig. 1 is a flowchart of a phishing website detection method according to an embodiment of this application. As shown in Fig. 1, the process includes the following steps:
Step S101: Obtain multiple pieces of feature information of the website to be tested. The feature information includes an IP, a domain name, and a file hash value.
An IP has a form such as 192.168.1.1 and is the only addressing method by which a user client (for example, a browser) locates a target server. The IP of a server on the Internet is globally unique. IPs are divided into IPv4 and IPv6, and this embodiment applies to both.
A domain name has a form such as www.baidu.com. A DNS service resolves the domain name to an IP before the request for the service resource is made. Domain names and IPs are therefore associated with each other. At any one time, a domain name can be configured to resolve to multiple IPs, which gives a one-to-many relationship. Over time, the server at a given IP may have been the resolution target of several different domain names, which gives a many-to-many relationship.
A file hash value is a character string, made up of letters and/or digits, that a message digest algorithm computes from the content of a file. Under current conditions, a computed file hash value corresponds one-to-one with a file; that is, the file hash value can serve as a unique identifier that distinguishes the file from other files. This embodiment may use a message digest algorithm such as SHA (Secure Hash Algorithm) or MD5 (Message-Digest Algorithm) to obtain the file hash value. Different message digest algorithms produce digests of different lengths, and the longer the digest, the more unique it is. Of the two algorithms above, SHA-256 is preferred: an MD5 digest is relatively short, so two different files may have the same MD5 value, while algorithms with longer digests waste storage space. This embodiment therefore uses SHA-256 to balance collision resistance against storage cost.
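As an illustration of computing such a file hash value, the sketch below uses Python's standard `hashlib` module. The function name `sha256_of_file` and the chunk size are illustrative choices; the application does not prescribe an implementation.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 digest of a file as a 64-character hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large executables need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A SHA-256 digest is 32 bytes (64 hexadecimal characters), compared with 16 bytes for MD5, which is the length trade-off the paragraph above describes.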
Step S102: Obtain the confidence value and weight value corresponding to each piece of feature information. In this embodiment, each piece of feature information is an indicator of compromise, and each indicator is assigned a confidence value and a weight value. The confidence value represents how accurate the indicator is with respect to the decision that the website to be tested is a phishing website; the weight value represents how important the indicator is to that decision.
Step S103: Determine the weighted confidence value of the website to be tested according to the confidence values and weight values respectively corresponding to the multiple pieces of feature information. For each piece of feature information, this embodiment takes the product of its confidence value and its weight share as that feature's individual weighted term, and sums the individual terms to obtain the weighted confidence value of the website to be tested. The weight share of a piece of feature information is the ratio of its weight value to the sum of the weight values of all pieces of feature information.
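The calculation in step S103 can be sketched as follows, taking a list of (confidence, weight) pairs as input. The function name, the sample numbers, and the choice to return 0.0 when all weights are zero are assumptions made for illustration.

```python
def weighted_confidence(scored_features):
    """Sum of confidence * (weight / total weight) over all features.

    scored_features: iterable of (confidence, weight) pairs.
    Returns 0.0 when the total weight is zero (an assumption; the
    application does not discuss this case).
    """
    pairs = list(scored_features)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return 0.0
    return sum(conf * (weight / total_weight) for conf, weight in pairs)


# Three hypothetical indicators with weights 5, 3, and 2 (shares 0.5, 0.3, 0.2).
score = weighted_confidence([(0.9, 5.0), (0.6, 3.0), (0.2, 2.0)])
```

With these sample values the individual terms are 0.45, 0.18, and 0.04, so the score is approximately 0.67.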
The related art usually makes a judgment directly from a single piece of threat intelligence or a single feature, which leads to large errors. This embodiment considers multiple pieces of feature information together with the confidence and weight of each, and decides the result from the weighted confidence value, which helps to reduce errors.
Step S104: When the weighted confidence value is greater than a preset threshold, determine that the website to be tested is a phishing website. The preset threshold may be an empirical value obtained from repeated tests or a value set in advance.
The above steps solve the problem of low accuracy of phishing website detection results in the related art and improve the accuracy of those results.
Malicious files usually operate either offline or online. Offline operation aims to damage the user's computer, degrading the user's experience or making the computer unusable. Online operation aims to collect user information illegally: it typically steals data and passwords from the user's computer, locates and connects to a remote server by URL, and uploads the collected user information to the malicious server. The feature information can therefore be obtained item by item by following the associations among the pieces of feature information of the website to be tested.
Fig. 2 is a schematic diagram of the associations among multiple pieces of feature information of the website to be tested. As shown in Fig. 2, different pieces of feature information are associated with each other and can be derived from one another. The following embodiments describe how to extract multiple pieces of feature information by using these associations.
In some embodiments, the multiple pieces of feature information include first feature information, and obtaining the multiple pieces of feature information of the website to be tested includes: obtaining the URL of the website; and obtaining the first feature information according to the URL, where the first feature information includes, but is not limited to, at least one of the following: an IP, a domain name, a file hash value of an executable file, and Whois information.
The first feature information is the directly related, precise information obtained from the URL of the website. Obtaining the URL and obtaining the first feature information are described in turn below.
(1) Obtain the website URL. A URL has the form <protocal>://[host]:[port]/[path]. Here, protocal is the protocol, for example http (Hypertext Transfer Protocol), https (Hypertext Transfer Protocol over Secure Socket Layer), or ftp (File Transfer Protocol). host is the host, which is either a domain name that resolves to an IP or an IP itself, for example www.baidu.com or 180.101.49.11. port is the port number; common website ports include 80, 443, and 8080, and the defaults 80 (http) and 443 (https) can be omitted. path is the path, for example index.html.
Phishing websites usually lure users into clicking in a browser, which triggers the malicious behavior. A URL uniquely identifies a web page on the Internet and can be obtained in two ways. One is to copy it from the address bar at the top of the browser while browsing the page; the other, in automated scenarios, is to intercept the URLs that users access by means of a firewall device.
After the URL is obtained, multi-dimensional feature extraction is performed. The names of the directly obtained first feature information begin with $base.
(2) Split the URL to obtain $base_protocal, $base_host, $base_port, and $base_path. This can be done by splitting on delimiters or by extraction with a regular expression. A regular expression is a pattern-matching technique that can be used to test whether a string with specified characteristics matches and to extract specified information from it.
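As one possible way to perform the split, the sketch below uses `urlsplit` from Python's standard library instead of hand-written delimiters or a regular expression; this is a swapped-in technique, not the method the application specifies. The dictionary keys mirror the `$base_*` names above, and the `DEFAULT_PORTS` table is an illustrative helper.

```python
from urllib.parse import urlsplit

# Ports implied by the scheme when the URL omits them.
DEFAULT_PORTS = {"http": 80, "https": 443}


def split_url(url):
    """Split a URL into the $base_* fields described in the text."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    return {
        "base_protocal": scheme,
        "base_host": parts.hostname,
        # Fall back to the scheme's default port when none is given.
        "base_port": parts.port or DEFAULT_PORTS.get(scheme),
        "base_path": parts.path.lstrip("/"),
    }


info = split_url("https://www.baidu.com:8080/index.html")
```

A URL such as `http://180.101.49.11/` carries no explicit port, so the sketch reports port 80 for it.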
(3) Extract the domain name and IP. If $base_host is a domain name, then $base_domain=$base_host, and $base_domain is resolved through DNS (Domain Name System). The resolution can be done with the nslookup command on Windows or Linux, or with an online tool. More than one IP may be returned; the result is recorded as $base_ip. If $base_host is an IP, then $base_ip=$base_host.
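The branch between "host is a domain name" and "host is an IP" can be sketched as follows. The sketch uses Python's `ipaddress` and `socket.getaddrinfo` rather than the nslookup command mentioned above, and the function name `resolve_host` and its return shape are assumptions for illustration. The DNS branch needs network access.

```python
import ipaddress
import socket


def resolve_host(base_host):
    """Return (base_domain, base_ip_list) for a host taken from a URL."""
    try:
        ipaddress.ip_address(base_host)
        # The host is already an IPv4 or IPv6 literal: no domain, one IP.
        return None, [base_host]
    except ValueError:
        pass
    # The host is a domain name: resolve it (may return several IPs).
    infos = socket.getaddrinfo(base_host, None, proto=socket.IPPROTO_TCP)
    ips = sorted({info[4][0] for info in infos})
    return base_host, ips
```

For example, `resolve_host("180.101.49.11")` returns the address unchanged without any DNS query, while a domain name yields whatever set of addresses the resolver reports at that moment.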
(4) Extract Whois information. Whois information describes the registered owner of a domain name, usually a company or an individual. It may reveal the registration time, expiration time, email address and similar details. It can be queried with a whois command on Linux or Windows, or with an online tool, and is recorded as $base_whois.
(5) Extract the hash values of files offered on the web page. Financial websites, especially online banking sites, commonly provide a security-shield plug-in for users to download and install in order to protect their information. A counterfeit phishing site will also offer a "security shield" download link, but the executable it provides is likely to be a virus or Trojan. In some embodiments, the HTML (Hyper Text Markup Language) code of a page of the website to be tested is obtained by requesting the URL over http or by rendering it in a browser, and the page is searched for executable download links. The downloaded executable is recorded as $base_file, and its file hash value is calculated and recorded as $base_hash.
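The link search and hashing in step (5) might look like the sketch below. It assumes that executables are recognized by file suffix (the `EXECUTABLE_SUFFIXES` tuple is an illustrative guess, not a list from the application) and uses SHA-256, the algorithm the text later names for matching $base_hash. Downloading the file is left out; `sha256_hex` simply takes the downloaded bytes.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed suffixes that mark a link as an executable download.
EXECUTABLE_SUFFIXES = (".exe", ".msi", ".dmg", ".apk")


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_executable_links(html, page_url):
    """Return absolute URLs of links that appear to point at executables."""
    collector = LinkCollector()
    collector.feed(html)
    links = [urljoin(page_url, href) for href in collector.hrefs]
    # Ignore any query string when checking the suffix.
    return [u for u in links
            if u.lower().split("?", 1)[0].endswith(EXECUTABLE_SUFFIXES)]


def sha256_hex(data):
    """Hash the downloaded file content ($base_file) to get $base_hash."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    page = '<a href="/dl/shield.exe">Shield</a><a href="about.html">About</a>'
    print(find_executable_links(page, "http://example.com/index.html"))
```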
In some of the embodiments, the multiple pieces of feature information further include second feature information, and obtaining the feature information of the website to be tested further includes, but is not limited to, at least one of the following: obtaining, from the intelligence database, domain name information associated with the IP and IP information associated with the domain name, in which case the second feature information includes that domain name and IP information; obtaining, through Whois reverse lookup, the IP and domain name information associated with the Whois information, in which case the second feature information includes that IP and domain name information; and obtaining, through dynamic analysis of the executable file, the IP and domain name information that the executable file connects back to, in which case the second feature information includes that IP and domain name information.
The second feature information is indirectly related, coarse information, in contrast to the first feature information. During the multi-dimensional feature extraction that follows acquisition of the URL, information obtained indirectly through lookups is given names beginning with $ext. The ways of obtaining the second feature information are described below.
(6) Extract historically resolved IPs and domain names. The steps above yield a domain name or an IP. The binding between a domain name and an IP is not permanent: a domain owner can change the IP the domain points to at any time, and the same IP may be leased to different users at different times and bound to different domain names. In some embodiments, based on this binding relationship, the domain name information associated with the IP and the IP information associated with the domain name are obtained from the intelligence database, which includes a threat intelligence database; they may also be obtained from websites that index such information. These historically resolved domain names and IPs are recorded in $ext_domain and $ext_ip respectively.
(7) Whois reverse lookup. An individual or organization may register multiple domain names. Using Whois reverse lookup, the Whois information can be used to find further domain names registered to the same individual or organization, which are recorded in $ext_domain; these domain names can in turn be resolved to their IPs, which are recorded in $ext_ip.
(8) Extract the server addresses contacted by the executable file. A malicious executable is likely to upload user information to a malicious server over the network in order to collect sensitive user data. Dynamic analysis of the executable file is used to determine the domain names or IPs that it connects back to, which are recorded in $ext_domain or $ext_ip. Dynamic analysis of executables involves binary analysis, disassembly, sandboxing and related techniques.
Through the above steps, directly related, precise $base_ data and indirectly related, coarse $ext_ data have been obtained. Table 1 summarizes the two types of feature information obtained.
Table 1 Directly related and indirectly related feature information
[Table 1 is provided as image PCTCN2021088986-appb-000001 in the original publication.]
The following embodiments use the two types of feature information in Table 1 for further matching analysis.
In some of the embodiments, obtaining the confidence value and weight value corresponding to each piece of feature information includes: matching the multiple pieces of feature information of the website to be tested against the preset feature information in the intelligence database; and obtaining, from the matching result, the confidence value and weight value corresponding to each piece of feature information of the website to be tested.
The intelligence database in this embodiment includes a threat intelligence database, which may comprise a self-built intelligence database and third-party intelligence databases. What these have in common is that they record, along similar dimensions, core intrusion threat indicators such as domain names, IPs and file hash values that exhibit malicious behavior. This embodiment optionally uses a self-built intelligence database. In some embodiments, in addition to the core indicators above, the self-built database also holds Whois information, historically resolved domain names and IPs, intelligence confidence values and malicious URLs, which assist both the acquisition of feature information in the preceding steps and the subsequent weighted decision on that information. The executable file obtained in the preceding steps can also be uploaded to the self-built intelligence database to analyze whether it behaves maliciously.
The following describes how the weighted decision is made on the acquired feature information.
(1) Principle: when first feature information ($base_) hits the intelligence database, its confidence value is assigned a high weight value; when second feature information ($ext_) hits the intelligence database, its confidence value is assigned a low weight value.
(2) Formula: let the confidence value be c, 0≤c≤1; the weight value be w, 1≤w≤10; and the decision score be S, 0≤S≤1, where S is the weighted confidence value. Using the weighted average algorithm:
$$S = \frac{\sum_{i=1}^{n} c_i w_i}{\sum_{i=1}^{n} w_i}$$
Here, n is a natural number (the number of terms in the weighted average).
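To make the weighted average concrete, the sketch below computes S from a list of (c, w) pairs, assuming the formula is the standard weighted average S = Σ(c·w) / Σw implied by the stated ranges of c, w and S. The function name and the decision to return 0.0 for an empty list are assumptions made for the example.

```python
def weighted_confidence(items):
    """Compute S = sum(c * w) / sum(w) for an iterable of (c, w) pairs."""
    items = list(items)
    total_weight = sum(w for _, w in items)
    if total_weight == 0:
        # No terms at all: treat the score as 0 (an assumption of this sketch).
        return 0.0
    return sum(c * w for c, w in items) / total_weight


if __name__ == "__main__":
    # One feature matched with c = 0.9 (w = 8), one unmatched with w = 6,
    # and one low-weight term with c = 1 (w = 2):
    # (0.9*8 + 0*6 + 1*2) / (8 + 6 + 2) = 9.2 / 16 = 0.575
    print(weighted_confidence([(0.9, 8), (0.0, 6), (1.0, 2)]))
```

Because every c lies in [0, 1] and the weights are positive, the result also stays in [0, 1], which agrees with the stated range of S.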
(3) Weighting of the first feature information: $base_domain, $base_ip, $base_whois, $base_hash and $base_path are kinds of information that the intelligence database can contain.
$base_domain, $base_ip and $base_hash are highly accurate and form the core of the matching, so exact matching can be used: $base_domain must be exactly equal to a domain in the intelligence database, and likewise for $base_ip. A file hash can take several forms; this embodiment uses the SHA256 algorithm and matches exactly against the SHA256 values in the intelligence database. For matched data, c is taken from the preset confidence field in the intelligence database, which holds the confidence value of the corresponding feature information under current conditions. For unmatched data, c is 0. In both cases w is 8.
$base_whois is matched mainly on the registered owner and the registered owner's contact details, and exact matching can be used. For matched data, c is taken from the preset confidence field in the intelligence database; for unmatched data, c is 0. In both cases w is 6.
$base_path is matched by containment. On a match, c is taken from the preset confidence field in the intelligence database; otherwise c is 0. In both cases w is 4.
Note that the values of c above are determined by the preset confidence field in the intelligence database; the confidence value for a given piece of feature information is calculated from how credible that information is under current conditions. w may be taken from a preset value in the intelligence database, or adjusted as appropriate according to how accurately the feature information is detected.
(4) Weighting of the second feature information: these domain names and IPs are obtained from historical resolution or from analysis of the executable file's network behavior. Compared with the first feature information obtained from the URL, the second feature information is somewhat less relevant and comprises multiple entries, so this embodiment assigns it a lower weight value. When second feature information matches an intrusion threat indicator in the intelligence database, c is taken from the preset confidence field in the intelligence database; otherwise c is 0. In both cases w is 2. Each matched intrusion threat indicator contributes one term to the weighted average.
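The matching rules in (3) and (4) can be sketched as a function that turns feature values into (c, w) terms. The layout of the intelligence database used here (a dict mapping a feature kind to a dict of value → confidence) is purely hypothetical, as are the function and variable names. Two further assumptions are made where the text is open to interpretation: path matching checks whether the URL path contains a recorded path, and unmatched second feature entries are kept as terms with c = 0 and w = 2.

```python
# Weights stated in the text for first ($base_) and second ($ext_) features.
BASE_WEIGHTS = {"domain": 8, "ip": 8, "hash": 8, "whois": 6, "path": 4}
EXT_WEIGHT = 2


def score_terms(base, ext, intel):
    """Return the list of (c, w) terms for the weighted average.

    base:  dict kind -> single value, e.g. {"domain": "evil.example"}
    ext:   dict kind -> list of values, e.g. {"ip": ["203.0.113.5"]}
    intel: dict kind -> {value: confidence}  (hypothetical layout)
    """
    terms = []
    for kind, value in base.items():
        known = intel.get(kind, {})
        if kind == "path":
            # Containment match: take the highest confidence among
            # recorded paths that occur inside the URL path.
            c = max((conf for bad, conf in known.items() if bad in value),
                    default=0.0)
        else:
            # Exact match; unmatched values get c = 0.
            c = known.get(value, 0.0)
        terms.append((c, BASE_WEIGHTS[kind]))
    for kind, values in ext.items():
        for value in values:
            terms.append((intel.get(kind, {}).get(value, 0.0), EXT_WEIGHT))
    return terms


if __name__ == "__main__":
    intel = {"domain": {"evil.example": 0.9}, "ip": {"203.0.113.5": 0.7}}
    print(score_terms({"domain": "evil.example"},
                      {"ip": ["203.0.113.5"]}, intel))
```

The resulting list can be passed to any weighted-average routine to obtain S.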
https transport encryption is now widely deployed, and financial websites use the https protocol by preference. A typical browser such as Google Chrome marks websites that do not use https as not secure. An https certificate must be issued by a trusted authority to be trusted by the browser; for an untrusted or mismatched certificate, the browser may refuse the connection.
It follows that the network transport protocol and port information can also serve as feature information for the weighted decision. In some of the embodiments, obtaining the confidence value and weight value corresponding to each piece of feature information further includes: determining third feature information from the multiple pieces of feature information, where the third feature information includes, but is not limited to, at least one of the network transport protocol information and the port information obtained from the URL; and obtaining, according to the third feature information, the confidence value and weight value corresponding to it.
The third feature information is non-intelligence feature information, that is, the network transport protocol information and port information obtained from the URL.
If the website to be tested uses only the http protocol, let c=1 and w=2. If it uses the https protocol but does not have a certificate issued by a trusted authority, it is judged untrustworthy, and c=1, w=3.
For ease of access, websites usually use meaningful, memorable domain names and the default ports, which need not be typed: http normally uses port 80 and https normally uses port 443. If the website to be tested can only be accessed by IP, let c=1 and w=3. If it uses a non-default port such as 8080, 9090 or 8888, let c=1 and w=3.
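The four non-intelligence rules above can be written as a small function, sketched below. The text does not say what happens when a rule does not fire, so this sketch assumes that such a rule simply contributes no term; the function name, the parameters and the `cert_trusted` flag (the result of a certificate check performed elsewhere) are hypothetical.

```python
# Assumed default ports; the text names 80 for http and 443 for https.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def non_intel_terms(protocol, host_is_ip, port, cert_trusted=None):
    """Return (c, w) terms for protocol, certificate, host and port rules."""
    terms = []
    if protocol == "http":
        terms.append((1.0, 2))   # plain http only
    elif protocol == "https" and cert_trusted is False:
        terms.append((1.0, 3))   # https without a trusted certificate
    if host_is_ip:
        terms.append((1.0, 3))   # reachable only by IP, no domain name
    if port is not None and port != DEFAULT_PORTS.get(protocol):
        terms.append((1.0, 3))   # non-default port such as 8080
    return terms


if __name__ == "__main__":
    print(non_intel_terms("http", True, 8080))
```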
In some of the embodiments, after the weighted confidence value of the website to be tested is determined from the confidence values and weight values of the multiple pieces of feature information, the method further includes: judging whether the weighted confidence value is greater than a first preset threshold and, if so, determining that the website to be tested is a phishing website and denying access to it; and judging whether the weighted confidence value is greater than a second preset threshold and, if so, determining that the website to be tested is a suspected phishing website and issuing an alert indicating this.
The weighted items above are substituted into the weighted average formula to obtain the score S. S can then be graded against the preset thresholds, and a policy for the website to be tested is chosen according to the grade.
The first preset threshold may be 0.8. If S≥0.8, the website to be tested is determined to be a phishing website, and user access can be denied outright.
The second preset threshold may be 0.6. If S≥0.6, the website to be tested is determined to be a suspected phishing website, and an alert can be issued to warn that the website may be fraudulent and should be treated with caution.
In some of the embodiments, after the website to be tested is determined to be a phishing website or a suspected phishing website, the method further includes: adding the URL of the website to be tested and its multiple pieces of feature information to the intelligence database.
In this embodiment, a third preset threshold may also be set, for example 0.4. If S≥0.4, the feature information of the website to be tested is added to a suspicious-information database for further verification and for later tuning of the algorithm parameters. If S<0.4, the feature information obtained from the website to be tested is discarded.
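The three thresholds can be combined into a single grading function, sketched below with the example values 0.8, 0.6 and 0.4 from the text. The comparison uses ≥, following the numeric examples; the returned labels and the function name are made up for the illustration.

```python
def grade(score, block=0.8, warn=0.6, record=0.4):
    """Map the weighted confidence value S to a handling policy."""
    if score >= block:
        return "phishing"    # deny access
    if score >= warn:
        return "suspected"   # raise an alert
    if score >= record:
        return "record"      # keep in the suspicious-information database
    return "discard"         # drop the collected feature information


if __name__ == "__main__":
    for s in (0.85, 0.65, 0.45, 0.10):
        print(s, grade(s))
```

Checking the thresholds from highest to lowest ensures that a score of 0.8 or more is treated as phishing rather than merely suspected.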
The open-source or commercial threat intelligence used in the related art usually checks websites offline. Because its intrusion threat indicators lag behind, the false alarm rate gradually rises over time. This embodiment instead uses a self-built intelligence database containing intrusion threat indicators and feeds the suspicious information detected on the current website to be tested back into that database. This keeps the indicators up to date, addresses the lag and helps reduce the false alarm rate; it also enriches the intelligence database, improves the precision and breadth of the decision, and sustains a virtuous cycle.
The embodiments of the present application are described and illustrated below by way of an optional embodiment.
FIG. 3 is a schematic flowchart of a phishing website detection method according to an optional embodiment of the present application. As shown in FIG. 3, the process includes the following steps:
Step S301: Obtain the URL of the website.
Step S302: Split the URL to obtain $base_protocal, $base_host, $base_port and $base_path.
Step S303: Obtain $base_domain, $base_ip, $base_whois and $base_hash directly or by resolution.
Step S304: Obtain $ext_domain and $ext_ip indirectly or by resolution.
Step S305: Match against the intelligence database to obtain the confidence value c and weight value w; for non-intelligence feature information, determine the confidence value c and weight value w directly.
Step S306: Calculate the weighted confidence value with the weighted average algorithm, that is, calculate S by the following formula:
$$S = \frac{\sum_{i=1}^{n} c_i w_i}{\sum_{i=1}^{n} w_i}$$
Step S307: Make the access restriction decision based on the score S.
Step S308: Feed the information of websites determined to be phishing or suspected phishing websites back into the intelligence database.
In some of the embodiments, the IOC (intrusion threat indicator) information in the intelligence database can be pushed directly to a web (World Wide Web) firewall, which blocks traffic as soon as matching feature information is encountered, achieving precise interception.
Compared with the related art, the embodiments of the present application have the following advantages:
(1) The confidence values built into the self-built intelligence database for each intrusion threat indicator help determine how suspicious the target website is.
(2) Directly associated information that can be matched against the intelligence database is extracted from the URL of the website to be tested, and additional associated information is obtained from the attributes of that directly associated information, so that a comprehensive, multi-dimensional decision can be made. Compared with traditional approaches, phishing websites are found more accurately, which helps protect users from being defrauded by counterfeit websites and helps protect normal economic order.
(3) The confidence values of the various kinds of feature information are weighted, and the detection result is decided by the weighted score, which addresses the large errors that arise when a decision is made directly from a single piece of threat intelligence or a single feature.
(4) The results of each detection can be stored to extend the underlying database for later phishing decisions; false positives can be addressed by adjusting the weight values.
This embodiment also provides a phishing website detection apparatus for implementing the above embodiments and optional implementations; what has already been described is not repeated. As used below, the terms "module", "sub-module" and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 4 is a structural block diagram of a phishing website detection apparatus according to an embodiment of the present application. As shown in FIG. 4, the apparatus includes:
A first acquisition module 41, configured to obtain multiple pieces of feature information of the website to be tested.
A second acquisition module 42, coupled to the first acquisition module 41 and configured to obtain the confidence value and weight value corresponding to each piece of feature information.
A first determination module 43, coupled to the second acquisition module 42 and configured to determine the weighted confidence value of the website to be tested from the confidence values and weight values corresponding to the multiple pieces of feature information.
A second determination module 44, coupled to the first determination module 43 and configured to determine that the website to be tested is a phishing website when the weighted confidence value is greater than a preset threshold.
In some of the embodiments, the apparatus includes: a first acquisition sub-module, configured to obtain the URL of the website; and a second acquisition sub-module, configured to obtain the first feature information from the URL, where the first feature information includes, but is not limited to, at least one of the following: IP, domain name, file hash value of an executable file, and Whois information.
In some of the embodiments, the apparatus includes: a third acquisition sub-module, configured to obtain, from the intelligence database, domain name information associated with the IP and IP information associated with the domain name, the second feature information including that domain name and IP information; a Whois reverse lookup module, configured to obtain, through Whois reverse lookup, the IP and domain name information associated with the Whois information, the second feature information including that IP and domain name information; and an executable file dynamic analysis module, configured to obtain, through dynamic analysis of the executable file, the IP and domain name information that the executable file connects back to, the second feature information including that IP and domain name information.
In some of the embodiments, the apparatus includes: a matching module, configured to match the multiple pieces of feature information of the website to be tested against the preset feature information in the intelligence database, and to obtain, from the matching result, the confidence value and weight value corresponding to each piece of feature information of the website to be tested.
In some of the embodiments, the apparatus includes: a determination sub-module, configured to determine third feature information from the multiple pieces of feature information, where the third feature information includes, but is not limited to, at least one of the network transport protocol information and the port information obtained from the URL; and a fourth acquisition sub-module, configured to obtain, according to the third feature information, the confidence value and weight value corresponding to it.
In some of the embodiments, the apparatus further includes: a first judgment module, configured to judge whether the weighted confidence value is greater than the first preset threshold and, if so, to determine that the website to be tested is a phishing website and deny access to it; and a second judgment module, configured to judge whether the weighted confidence value is greater than the second preset threshold and, if so, to determine that the website to be tested is a suspected phishing website and issue an alert indicating this.
In some of the embodiments, the apparatus further includes: a recording module, configured to add the URL of the website to be tested and its multiple pieces of feature information to the intelligence database.
Note that each of the above modules may be a functional module or a program module, and may be implemented in software or in hardware. For modules implemented in hardware, the modules may all be located in the same processor, or may be distributed across different processors in any combination.
In addition, the phishing website detection method of the embodiments of the present application described with reference to FIG. 1 can be implemented by a computer device. FIG. 5 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present application.
The computer device may include a processor 51 and a memory 52 storing computer program instructions.
Specifically, the processor 51 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 52 may include mass storage for data or instructions. By way of example and not limitation, the memory 52 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disc, a magneto-optical disc, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, the memory 52 may include removable or non-removable (fixed) media, and may be internal or external to the data processing apparatus. In particular embodiments, the memory 52 is non-volatile memory. In particular embodiments, the memory 52 includes read-only memory (ROM) and random access memory (RAM).
存储器52可以用来存储或者缓存需要处理和/或通信使用的各种数据文件,以及处理器51所执行的可能的计算机程序指令。The memory 52 may be used to store or cache various data files that need to be processed and/or used for communication, and possible computer program instructions executed by the processor 51.
处理器51通过读取并执行存储器52中存储的计算机程序指令,以实现上述实施例中的任意一种钓鱼网站的检测方法。The processor 51 reads and executes the computer program instructions stored in the memory 52 to implement any one of the phishing website detection methods in the foregoing embodiments.
在其中一些实施例中,钓鱼网站的检测设备还可包括通信接口53和总线50。其中,如图5所示,处理器51、存储器52、通信接口53通过总线50连接并完成相互间的通信。In some of the embodiments, the detection device of the phishing website may further include a communication interface 53 and a bus 50. Among them, as shown in FIG. 5, the processor 51, the memory 52, and the communication interface 53 are connected through the bus 50 and complete the mutual communication.
The communication interface 53 is used to implement communication between the modules, apparatuses, units, and/or devices in the embodiments of this application. The communication interface 53 can also carry data communication with other components, for example external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.
The bus 50 includes hardware, software, or both, and couples the components of the computer device to one another. The bus 50 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. Where appropriate, the bus 50 may include one or more buses. Although the embodiments of this application describe and illustrate a particular bus, this application contemplates any suitable bus or interconnect.
Based on the acquired URL of the website to be tested, the computer device can execute the phishing website detection method of the embodiments of this application, thereby implementing the phishing website detection method described with reference to FIG. 1.
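Because the method starts from the URL of the website to be tested, a minimal sketch of pulling URL-level features out of that URL may help. It uses Python's standard `urllib.parse` module; the function name `extract_url_features`, the returned dictionary keys, and the `DEFAULT_PORTS` table are assumptions made for illustration and are not taken from the application.

```python
from urllib.parse import urlsplit

# Assumed fallback ports when the URL does not state one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def extract_url_features(url):
    """Return protocol, host, and port parsed from a URL (illustrative only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # parts.port is None when the URL carries no explicit port.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return {"protocol": scheme, "host": parts.hostname, "port": port}


if __name__ == "__main__":
    print(extract_url_features("http://login.example.com:8080/account?id=1"))
```

The host obtained this way could then be resolved to an IP or looked up in Whois records by whatever tooling a deployment uses; those lookups are outside this sketch.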
In addition, in combination with the phishing website detection method in the foregoing embodiments, the embodiments of this application may provide a computer-readable storage medium for implementing it. The computer-readable storage medium stores computer program instructions; when executed by a processor, the computer program instructions implement any one of the phishing website detection methods in the foregoing embodiments.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments has been described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments represent only several implementations of this application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

  1. A method for detecting a phishing website, characterized in that the method comprises:
    obtaining a plurality of pieces of feature information of a website to be tested;
    obtaining a confidence value and a weight value corresponding to each piece of feature information;
    determining a weighted confidence value of the website to be tested according to the confidence values and weight values respectively corresponding to the plurality of pieces of feature information; and
    determining that the website to be tested is a phishing website when the weighted confidence value is greater than a preset threshold.
  2. The method for detecting a phishing website according to claim 1, wherein the plurality of pieces of feature information includes first feature information, and obtaining the plurality of pieces of feature information of the website to be tested comprises:
    obtaining the URL of the website; and
    obtaining the first feature information according to the URL, wherein the first feature information includes at least one of the following: an IP, a domain name, a file hash value of an executable file, and Whois information.
  3. The method for detecting a phishing website according to claim 2, wherein the plurality of pieces of feature information further includes second feature information, and obtaining the plurality of pieces of feature information of the website to be tested further comprises at least one of the following:
    obtaining, from an intelligence database, domain name information associated with the IP and IP information associated with the domain name, wherein the second feature information includes the domain name information associated with the IP and the IP information associated with the domain name;
    obtaining, through a reverse Whois lookup, IP and domain name information associated with the Whois information, wherein the second feature information includes the IP and domain name information associated with the Whois information; and
    obtaining, through dynamic analysis of the executable file, the IP and domain name information that the executable file connects back to, wherein the second feature information includes the IP and domain name information that the executable file connects back to.
  4. The method for detecting a phishing website according to claim 3, wherein obtaining the confidence value and the weight value corresponding to each piece of feature information comprises:
    matching the plurality of pieces of feature information of the website to be tested against preset feature information in the intelligence database; and
    obtaining, according to the matching result, the confidence value and the weight value corresponding to each piece of feature information of the website to be tested.
  5. The method for detecting a phishing website according to claim 4, wherein obtaining the confidence value and the weight value corresponding to each piece of feature information further comprises:
    determining third feature information from the plurality of pieces of feature information, wherein the third feature information includes at least one of the following obtained from the URL: network transmission protocol information and port information; and
    obtaining, according to the third feature information, a confidence value and a weight value corresponding to the third feature information.
  6. The method for detecting a phishing website according to claim 3, wherein, after determining the weighted confidence value of the website to be tested according to the confidence values and weight values respectively corresponding to the plurality of pieces of feature information, the method further comprises:
    judging whether the weighted confidence value is greater than a first preset threshold and, when the weighted confidence value is judged to be greater than the first preset threshold, determining that the website to be tested is a phishing website and denying access to the website to be tested; and
    judging whether the weighted confidence value is greater than a second preset threshold and, when the weighted confidence value is judged to be greater than the second preset threshold, determining that the website to be tested is a suspected phishing website and issuing alarm information indicating that the website to be tested is a suspected phishing website.
  7. The method for detecting a phishing website according to claim 6, wherein, after determining that the website to be tested is a phishing website or a suspected phishing website, the method further comprises:
    recording the URL of the website to be tested and the plurality of pieces of feature information of the website to be tested in the intelligence database.
  8. An apparatus for detecting a phishing website, characterized in that it comprises:
    a first obtaining module, configured to obtain a plurality of pieces of feature information of a website to be tested;
    a second obtaining module, configured to obtain a confidence value and a weight value corresponding to each piece of feature information;
    a first determining module, configured to determine a weighted confidence value of the website to be tested according to the confidence values and weight values respectively corresponding to the plurality of pieces of feature information; and
    a second determining module, configured to determine that the website to be tested is a phishing website when the weighted confidence value is greater than a preset threshold.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for detecting a phishing website according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the method for detecting a phishing website according to any one of claims 1 to 7.
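
As an informal illustration of the scoring recited in claims 1 and 6, the following sketch combines per-feature confidence and weight values into one score and compares it against two thresholds. The claims do not state the combining formula, so the normalized weighted average used here is an assumption, as are the function names, the threshold values 0.8 and 0.5, and the assumption that the first threshold is the higher of the two.

```python
def weighted_confidence(features):
    """Combine (confidence, weight) pairs into a single score.

    Assumption: a normalized weighted average; the claims do not fix a formula.
    """
    total_weight = sum(weight for _, weight in features)
    if total_weight <= 0:
        return 0.0  # no usable evidence
    return sum(conf * weight for conf, weight in features) / total_weight


def classify(score, first_threshold=0.8, second_threshold=0.5):
    """Map a score to a verdict using two hypothetical thresholds."""
    if score > first_threshold:
        return "phishing"   # claim 6: deny access
    if score > second_threshold:
        return "suspected"  # claim 6: raise an alert
    return "clean"


if __name__ == "__main__":
    # Hypothetical evidence: a matched IP (confidence 1.0, weight 3)
    # and an unmatched port (confidence 0.0, weight 1).
    score = weighted_confidence([(1.0, 3), (0.0, 1)])
    print(score, classify(score))
```

Checking the higher threshold first keeps the two outcomes mutually exclusive, which is one reasonable reading of the two judging steps in claim 6.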
PCT/CN2021/088986 2020-06-22 2021-04-22 Phishing website detection method and apparatus, and device and computer readable storage medium WO2021258838A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010572801.0 2020-06-22
CN202010572801.0A CN111756724A (en) 2020-06-22 2020-06-22 Detection method, device and equipment for phishing website and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2021258838A1 true WO2021258838A1 (en) 2021-12-30

Family

ID=72675590

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088986 WO2021258838A1 (en) 2020-06-22 2021-04-22 Phishing website detection method and apparatus, and device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111756724A (en)
WO (1) WO2021258838A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112217815B (en) * 2020-10-10 2022-09-13 杭州安恒信息技术股份有限公司 Phishing website identification method and device and computer equipment
CN112491784A (en) * 2020-10-14 2021-03-12 新浪网技术(中国)有限公司 Request processing method and device of Web site and computer readable storage medium
CN112785130B (en) * 2021-01-13 2024-04-16 上海派拉软件股份有限公司 Website risk level identification method, device, equipment and storage medium
CN113225343B (en) * 2021-05-10 2022-09-20 广州掌动智能科技有限公司 Risk website identification method and system based on identity characteristic information
CN113765752B (en) * 2021-09-03 2023-02-10 四川英得赛克科技有限公司 Operation and maintenance equipment identification method and system, electronic equipment and storage medium
CN113810395B (en) * 2021-09-06 2023-06-16 安天科技集团股份有限公司 Threat information detection method and device and electronic equipment
CN115801455B (en) * 2023-01-31 2023-05-26 北京微步在线科技有限公司 Method and device for detecting counterfeit website based on website fingerprint

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102833262A (en) * 2012-09-04 2012-12-19 珠海市君天电子科技有限公司 Whois information-based phishing website gathering, identification method and system
CN103546446A (en) * 2012-07-17 2014-01-29 腾讯科技(深圳)有限公司 Phishing website detection method, device and terminal
CN103634317A (en) * 2013-11-28 2014-03-12 北京奇虎科技有限公司 Method and system of performing safety appraisal on malicious web site information on basis of cloud safety
CN104954372A (en) * 2015-06-12 2015-09-30 中国科学院信息工程研究所 Method and system for performing evidence acquisition and verification on phishing website
US20160261618A1 (en) * 2015-03-05 2016-09-08 Maxim G. Koshelev System and method for selectively evolving phishing detection rules
CN110855716A (en) * 2019-11-29 2020-02-28 北京邮电大学 Self-adaptive security threat analysis method and system for counterfeit domain names

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7634810B2 (en) * 2004-12-02 2009-12-15 Microsoft Corporation Phishing detection, prevention, and notification
CN102647422B (en) * 2012-04-10 2014-09-17 中国科学院计算机网络信息中心 Phishing website detection method and device
CN106888220A (en) * 2017-04-12 2017-06-23 恒安嘉新(北京)科技股份公司 A kind of detection method for phishing site and equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103546446A (en) * 2012-07-17 2014-01-29 腾讯科技(深圳)有限公司 Phishing website detection method, device and terminal
CN102833262A (en) * 2012-09-04 2012-12-19 珠海市君天电子科技有限公司 Whois information-based phishing website gathering, identification method and system
CN103634317A (en) * 2013-11-28 2014-03-12 北京奇虎科技有限公司 Method and system of performing safety appraisal on malicious web site information on basis of cloud safety
US20160261618A1 (en) * 2015-03-05 2016-09-08 Maxim G. Koshelev System and method for selectively evolving phishing detection rules
CN104954372A (en) * 2015-06-12 2015-09-30 中国科学院信息工程研究所 Method and system for performing evidence acquisition and verification on phishing website
CN110855716A (en) * 2019-11-29 2020-02-28 北京邮电大学 Self-adaptive security threat analysis method and system for counterfeit domain names

Also Published As

Publication number Publication date
CN111756724A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
WO2021258838A1 (en) Phishing website detection method and apparatus, and device and computer readable storage medium
Ramesh et al. An efficacious method for detecting phishing webpages through target domain identification
WO2019134334A1 (en) Network abnormal data detection method and apparatus, computer device and storage medium
US9621566B2 (en) System and method for detecting phishing webpages
JP6068506B2 (en) System and method for dynamic scoring of online fraud detection
JP4358188B2 (en) Invalid click detection device in Internet search engine
US10298599B1 (en) Systems for detecting a headless browser executing on a client computer
CN110537180B (en) System and method for tagging elements in internet content within a direct browser
EP4319054A2 (en) Identifying legitimate websites to remove false positives from domain discovery analysis
US8051484B2 (en) Method and security system for indentifying and blocking web attacks by enforcing read-only parameters
CN109768992B (en) Webpage malicious scanning processing method and device, terminal device and readable storage medium
WO2015096528A1 (en) Method and device for detecting security of online shopping environment
CN109690547A (en) For detecting the system and method cheated online
WO2014036801A1 (en) Method for detecting phishing website without depending on sample
WO2014063520A1 (en) Method and apparatus for determining phishing website
WO2014032619A1 (en) Web address access method and system
CN111753171B (en) Malicious website identification method and device
CN112929390B (en) Network intelligent monitoring method based on multi-strategy fusion
WO2018077035A1 (en) Malicious resource address detecting method and apparatus, and storage medium
CN106789939A (en) A kind of detection method for phishing site and device
Jain et al. Detection of phishing attacks in financial and e-banking websites using link and visual similarity relation
CN108270754B (en) Detection method and device for phishing website
CN109495471B (en) Method, device and equipment for judging WEB attack result and readable storage medium
WO2016201994A1 (en) Method and device for determining domain name credibility
Shahriar et al. Information source-based classification of automatic phishing website detectors

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21830042

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21830042

Country of ref document: EP

Kind code of ref document: A1