WO2020082763A1 - Decision trees-based method and apparatus for detecting phishing website, and computer device - Google Patents

Decision trees-based method and apparatus for detecting phishing website, and computer device

Info

Publication number
WO2020082763A1
Authority
WO
WIPO (PCT)
Prior art keywords
website
url
detected
information
phishing
Prior art date
Application number
PCT/CN2019/091878
Other languages
French (fr)
Chinese (zh)
Inventor
谭杰
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020082763A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Definitions

  • the present application relates to the field of intelligent decision-making technology, in particular to a method, device and computer equipment for detecting a phishing website based on a decision tree.
  • A phishing website is one in which criminals spoof the address and page content of a real website, or exploit vulnerabilities in the server program of a real website to insert dangerous HTML code into certain pages of the site, in order to trick users into disclosing bank or credit card account numbers, passwords and other private information.
  • Among the phishing website detection methods in the related art, one scheme determines whether the website to be detected is a phishing website by comparing its domain name information and content identification information with those of a target website.
  • However, detection by comparison with a target website yields results of low accuracy.
  • the purpose of this application is to provide a method, device and computer equipment for detecting a phishing website based on a decision tree to solve the problems in the prior art.
  • the present application provides a method for detecting a phishing website based on a decision tree, including the following steps:
  • Step 01 Construct a random forest in advance, and the constructed random forest includes several decision trees;
  • Step 02 Determine the webpage information of the website to be tested
  • Step 03 Extract the feature information of the website to be detected according to the webpage information of the website to be detected;
  • Step 04 Use each decision tree of the random forest to classify and vote on the extracted feature information
  • Step 05 When the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website, it is determined that the website to be detected is a phishing website;
  • the construction of the random forest includes:
  • Step 011 randomly selecting n samples from the sample set with replacement, and the sample set includes webpage information of several phishing websites and webpage information of several normal websites;
  • Step 012 randomly select k feature information from all the set feature information, and use the randomly selected k feature information to build a decision tree on the selected n samples;
  • Step 013 repeat steps 011-012 m times to generate m decision trees, and the generated m decision trees constitute a random forest;
  • n, k, m are all positive integers.
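  • Steps 011 to 013 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: features are assumed to be already converted to 0/1 values, labels are 1 for phishing and 0 for normal, the split criterion is Gini impurity (the application does not name one), and the function names `build_tree`, `predict_tree` and `build_forest` are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels (assumed criterion)."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, features):
    """Build a tree over 0/1 feature values; a leaf is a class label."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    best = None
    for f in features:
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        if not left or not right:
            continue  # this feature does not separate the samples
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:
        return majority
    f = best[1]
    remaining = [g for g in features if g != f]
    children = {}
    for value in (0, 1):
        subset = [(x, y) for x, y in zip(rows, labels) if x[f] == value]
        children[value] = build_tree(
            [x for x, _ in subset], [y for _, y in subset], remaining
        )
    return (f, children)


def predict_tree(tree, x):
    """Walk from the root to a leaf and return that leaf's class label."""
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[x[feature]]
    return tree


def build_forest(rows, labels, n, k, m, seed=0):
    """Return m trees, each grown on n bootstrap samples and k random features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(m):  # Step 013: repeat m times
        picked = [rng.randrange(len(rows)) for _ in range(n)]  # Step 011: with replacement
        features = rng.sample(range(len(rows[0])), k)  # Step 012: k random features
        forest.append(
            build_tree([rows[i] for i in picked], [labels[i] for i in picked], features)
        )
    return forest
```

  • Each tree returned by `build_forest` casts one vote through `predict_tree`; the seed parameter is only there to make the sketch reproducible.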
  • the present application also provides a phishing website detection device based on decision tree, including:
  • the random forest construction module is used to construct a random forest in advance to obtain a random forest classifier, and the constructed random forest includes several decision trees;
  • Webpage information determination module used to determine the webpage information of the website to be tested
  • a feature information extraction module configured to extract feature information of the website to be detected according to the webpage information of the website to be detected
  • the random forest classifier is used to classify and vote on the extracted feature information by using each decision tree of the random forest;
  • the detection result determination module is used to determine that the website to be tested is a phishing website when the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website;
  • the random forest construction module is specifically used to construct a random forest in the following ways:
  • Step 011 randomly selecting n samples from the sample set with replacement, and the sample set includes webpage information of several phishing websites and webpage information of several normal websites;
  • Step 012 randomly select k feature information from all the set feature information, and use the randomly selected k feature information to build a decision tree on the selected n samples;
  • Step 013 repeat steps 011-012 m times to generate m decision trees, and the generated m decision trees constitute a random forest;
  • n, k, m are all positive integers.
  • the present application also provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the following steps are implemented:
  • Step 01 Construct a random forest in advance, and the constructed random forest includes several decision trees;
  • Step 02 Determine the webpage information of the website to be tested
  • Step 03 Extract the feature information of the website to be detected according to the webpage information of the website to be detected;
  • Step 04 Use each decision tree of the random forest to classify and vote on the extracted feature information
  • Step 05 When the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website, it is determined that the website to be detected is a phishing website;
  • Step 01 includes:
  • Step 011 randomly selecting n samples from the sample set with replacement, and the sample set includes webpage information of several phishing websites and webpage information of several normal websites;
  • Step 012 randomly select k feature information from all the set feature information, and use the randomly selected k feature information to build a decision tree on the selected n samples;
  • Step 013 repeat steps 011-012 m times to generate m decision trees, and the generated m decision trees constitute a random forest;
  • n, k, m are all positive integers.
  • the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps of a method for detecting a phishing website based on a decision tree are implemented:
  • Step 01 Construct a random forest in advance, and the constructed random forest includes several decision trees;
  • Step 02 Determine the webpage information of the website to be tested
  • Step 03 Extract the feature information of the website to be detected according to the webpage information of the website to be detected;
  • Step 04 Use each decision tree of the random forest to classify and vote on the extracted feature information
  • Step 05 When the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website, it is determined that the website to be detected is a phishing website;
  • step 01 includes:
  • Step 011 randomly selecting n samples from the sample set with replacement, and the sample set includes webpage information of several phishing websites and webpage information of several normal websites;
  • Step 012 randomly select k feature information from all the set feature information, and use the randomly selected k feature information to build a decision tree on the selected n samples;
  • Step 013 repeat steps 011-012 m times to generate m decision trees, and the generated m decision trees constitute a random forest;
  • n, k, m are all positive integers.
  • With the method, device and computer equipment for detecting a phishing website based on a decision tree provided by this application, the webpage information of the website to be detected is determined, the feature information of the website to be detected is extracted according to that webpage information, and the extracted feature information is classified and voted on using a constructed random forest that includes several decision trees.
  • When the result of the classification vote is that the phishing website receives more votes than the normal website, the website to be detected can be determined to be a phishing website.
  • The random forest is constructed by building decision trees from a large number of website samples.
  • Because the samples cover diverse types of phishing websites, classification and voting with the random forest has a high accuracy rate.
  • FIG. 1 is a flowchart of Embodiment 1 of a method for detecting a phishing website based on a decision tree;
  • FIG. 2 is a schematic diagram of a program module of the first embodiment of a phishing website detection device based on a decision tree of the present application;
  • FIG. 3 is a schematic diagram of another program module of the first embodiment of a phishing website detection device based on a decision tree of the present application;
  • FIG. 4 is a schematic diagram of a hardware structure of a first embodiment of a phishing website detection device based on a decision tree according to the present application;
  • FIG. 5 is a flowchart of Embodiment 2 of a method for detecting a phishing website based on a decision tree in this application.
  • The method, device and computer equipment for detecting a phishing website based on a decision tree provided by this application are applicable to the field of intelligent decision-making technology; they detect whether a website is a phishing website through classification and voting by each decision tree in a random forest.
  • This application determines the webpage information of the website to be detected, extracts the feature information of the website to be detected based on that webpage information, and uses a constructed random forest including several decision trees to classify and vote on the extracted feature information.
  • When the voting result is that the phishing website has more votes than the normal website, the website to be detected can be determined to be a phishing website.
  • the random forest is constructed by establishing a decision tree through a large number of website samples.
  • the types of phishing websites included are diverse, and the use of random forests for classification and voting has a high accuracy rate.
  • a method for detecting a phishing website based on a decision tree in this embodiment includes the following steps:
  • Step 01 Construct a random forest in advance, and the constructed random forest includes several decision trees.
  • a random forest is a classifier that contains multiple decision trees. It uses multiple decision trees to train samples and implement predictions.
  • the output category is determined by the mode of the category output by the individual trees.
  • random forest can be used to detect whether the website to be detected is a phishing website.
  • Step 011 Randomly select n samples from the sample set with replacement; the sample set includes webpage information of several phishing websites and webpage information of several normal websites.
  • the webpage information included in the sample set may be a URL.
  • The phishing websites included in the sample set can be collected through accumulated experience, or they can be obtained from the blacklist of Google Safe API.
  • The Google Safe API blacklist includes several URLs, and the websites corresponding to these URLs are all phishing sites. Therefore, when constructing the sample set, all or part of the URLs in the Google Safe API blacklist can be added to the sample set.
  • To make the classification and voting of the decision trees built from the sample set more accurate, the sample set also needs to include the webpage information of several normal websites, which can be obtained from websites that are not on the Google Safe API blacklist.
  • Step 012 randomly select k feature information from all the set feature information, and use the randomly selected k feature information to establish a decision tree for the selected n samples.
  • the data input to the classifier is the feature information, and the decision of each node in the established decision tree is determined based on these feature information.
  • the feature information includes at least one of the following information:
  • Each information resource has a uniform and unique address on the Internet.
  • This address is called a URL (Uniform Resource Locator), the uniform resource locator of the WWW, that is, a network address.
  • The most commonly used transmission protocol for URLs is HTTP, which is currently the most widely used protocol on the WWW.
  • Common URL formats include the http, file, ftp and gopher formats. According to experience, when the URL is in IP format, the website corresponding to the URL may be a phishing website; whether the URL is in IP format is therefore used as feature information for establishing the decision tree.
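  • The "URL is in IP format" feature can be sketched as below, assuming it means that the host part of the URL is a literal IP address and not a domain name (the application does not define the term more precisely). The function name `url_is_ip_format` is hypothetical; only Python standard-library calls are used.

```python
import ipaddress
from urllib.parse import urlsplit


def url_is_ip_format(url: str) -> bool:
    """Return True when the URL's host is a literal IPv4 or IPv6 address."""
    try:
        host = urlsplit(url).hostname
        if host is None:
            return False  # no network location, e.g. a bare path
        ipaddress.ip_address(host)  # raises ValueError for domain names
        return True
    except ValueError:
        return False
```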
  • The length of time for which the URL domain name has existed can also be used as feature information for establishing the decision tree, namely whether that time period is less than a set number of days.
  • The set number of days can be, for example, 30 days.
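  • A minimal sketch of the domain-age feature follows. It assumes the registration date of the domain has already been obtained elsewhere (for example from a WHOIS record; that lookup is not shown and is not described in the application). The name `domain_is_young` and its parameters are hypothetical.

```python
from datetime import date


def domain_is_young(registered_on: date, today: date, max_age_days: int = 30) -> bool:
    """Return True when the domain has existed for fewer than max_age_days days."""
    return (today - registered_on).days < max_age_days
```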
  • When the URL contains the @ character, the website may be a phishing website. Therefore, whether the URL contains the @ character is also used as feature information for establishing a decision tree.
  • Some phishing websites use multiple domain names as a disguise. For example, clicking the URL of such a website triggers multiple jumps along the way. Therefore, when the URL includes at least two domain names, the website corresponding to the URL may be a phishing website.
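  • The two URL-string features above might be checked as follows. The @ test is a plain substring check. For "at least two domain names" the application gives no exact rule, so this sketch assumes one heuristic: count the distinct hosts that follow an `http://` or `https://` prefix anywhere in the percent-decoded URL. Both function names are hypothetical.

```python
import re
from urllib.parse import unquote

# Host part that follows a scheme prefix, wherever it appears in the URL.
_HOST_PATTERN = re.compile(r"https?://([^/?#&\s]+)", re.IGNORECASE)


def url_contains_at(url: str) -> bool:
    """Return True when the URL contains the @ character."""
    return "@" in url


def url_has_multiple_domains(url: str) -> bool:
    """Heuristic (assumption): two or more distinct hosts embedded in the URL."""
    hosts = {h.lower() for h in _HOST_PATTERN.findall(unquote(url))}
    return len(hosts) >= 2
```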
  • The purpose of a typical phishing website is to steal users' account and password information. Therefore, if a form on the website includes account password information, the website may be a phishing website, and whether account password information is included in the form is used as feature information for establishing the decision tree.
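  • One way to approximate the form feature is to look for a password input inside a form element of the page. This is a sketch under that assumption; the application does not say how "account password information" is recognized. The names `_PasswordFormFinder` and `form_requests_password` are hypothetical.

```python
from html.parser import HTMLParser


class _PasswordFormFinder(HTMLParser):
    """Records whether any <input type="password"> appears inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0:
            input_type = dict(attrs).get("type") or ""
            if input_type.lower() == "password":
                self.found = True

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def form_requests_password(html: str) -> bool:
    """Return True when the page contains a form with a password field."""
    finder = _PasswordFormFinder()
    finder.feed(html)
    return finder.found
```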
  • For example, if the value before the URL jump is "Taobao" and the value after the URL jump is not "Taobao", the website may be a phishing website that uses the value before the URL jump to deceive the user.
  • The value after the URL jump can be obtained by parsing the opened web page.
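  • The comparison of the value before and after the jump could look like the sketch below. The application does not specify which element of the opened page carries the "value", so this example assumes it is the page title and treats a case-insensitive match of the earlier value inside the title as "the same". The names `_TitleReader` and `jump_value_changed` are hypothetical.

```python
from html.parser import HTMLParser


class _TitleReader(HTMLParser):
    """Collects the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def jump_value_changed(value_before: str, opened_page_html: str) -> bool:
    """Return True when the value shown before the jump is absent from the opened page's title."""
    reader = _TitleReader()
    reader.feed(opened_page_html)
    return value_before.strip().lower() not in reader.title.lower()
```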
  • the best splitting method may be calculated according to the k randomly selected feature information.
  • Splitting refers to the process of splitting the training data set into two sub-data sets again and again during the training process of the decision tree.
  • The number k of randomly selected feature information may be the square root of N rounded to an integer, where N is the number of all the set feature information.
  • The square root of N may be rounded up or rounded down, as preset in advance.
  • For example, if the number N of all the set feature information is 10, the square root of 10 is approximately 3.16.
  • With rounding up, the result is 4, so the number k of randomly selected feature information is 4.
  • With rounding down, the result is 3, so the number k of randomly selected feature information is 3.
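  • The rounding rule for k can be written in a few lines. The helper name `feature_subset_size` and the `round_up` flag are hypothetical; integer square roots are used to avoid floating-point edge cases.

```python
import math


def feature_subset_size(total_features: int, round_up: bool) -> int:
    """Return k as the square root of N, rounded up or down as configured."""
    root = math.isqrt(total_features)  # floor of the square root
    if round_up and root * root != total_features:
        return root + 1
    return root
```

  • With N = 10 this gives k = 4 when rounding up and k = 3 when rounding down, matching the example above.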
  • Step 013 Repeat steps 011-012 m times to generate m decision trees.
  • the generated m decision trees form a random forest; where m is a positive integer.
  • a random forest composed of m decision trees is a random forest classifier.
  • Each splitting process of a decision tree in the random forest does not use all the candidate feature information; instead, a certain amount of feature information is randomly selected from all the candidates, and the best feature information is then chosen from that selection. This makes the decision trees in the random forest differ from one another, improves the diversity of the system, and thus improves classification performance.
  • The samples in the sample set that were not sampled can be used as test data to verify the accuracy of the random forest classifier. For example, select the webpage information of several unsampled websites in the sample set whose types (phishing website or normal website) are known, extract the feature information of each website, and input the extracted feature information into the random forest classifier; then compare the type detected by the random forest classifier with the true type. If the accuracy rate exceeds the set probability, the accuracy of the random forest classifier meets the requirements and the classifier can be used.
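  • The verification on unsampled samples can be sketched as follows. The classifier is passed in as any callable that maps a feature vector to 1 (phishing) or 0 (normal); that interface, the function names and the `min_accuracy` parameter are assumptions made for illustration.

```python
def unsampled_indices(sample_count, sampled):
    """Indices of the sample set that were never drawn during sampling."""
    drawn = set(sampled)
    return [i for i in range(sample_count) if i not in drawn]


def accuracy(classify, feature_rows, true_labels):
    """Fraction of test samples whose predicted type equals the true type."""
    correct = sum(1 for x, y in zip(feature_rows, true_labels) if classify(x) == y)
    return correct / len(true_labels)


def classifier_is_usable(classify, feature_rows, true_labels, min_accuracy):
    """True when the accuracy rate exceeds the set probability."""
    return accuracy(classify, feature_rows, true_labels) > min_accuracy
```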
  • Step 02 Determine the webpage information of the website to be detected.
  • the determined webpage information of the website to be detected may be a URL.
  • A blacklist of phishing websites may be constructed in advance, where the blacklist includes the URLs of several websites, all of which have been determined to be phishing websites.
  • The URL of the website to be detected can be obtained according to its webpage information and compared with the pre-built blacklist. If the URL of the website to be detected is included in the blacklist, the website to be detected is determined to be a phishing website; if it is not included in the blacklist, the website to be detected needs to be checked further, and step 03 is performed.
  • the blacklist may be a blacklist of Google Safe API.
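  • The blacklist comparison amounts to a set lookup before any feature extraction. In this sketch the blacklist is assumed to be a local set of URL strings that was loaded beforehand; no call to an external blacklist service is shown. The name `primary_detection` is hypothetical.

```python
from typing import Optional, Set


def primary_detection(url: str, blacklist: Set[str]) -> Optional[bool]:
    """Return True for a blacklisted URL, or None when step 03 must decide."""
    if url.strip() in blacklist:
        return True
    return None
```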
  • Step 03 Extract the feature information of the website to be detected according to the webpage information of the website to be detected.
  • The extracted feature information may be the same as the feature information in step 012, and may include at least one of the following: (1) whether the URL is in IP format; (2) whether the time period for which the URL domain name has existed is less than the set number of days; (3) whether the URL contains the @ character; (4) whether the URL includes at least two domain names; (5) whether account password information is included in the form; (6) whether the value after the URL jump is the same as before the jump.
  • the feature information extracted for the website to be detected is the above six pieces of information.
  • the extracted feature information may be Booleanized and converted into corresponding feature values. For example, for the above six feature information:
  • if the URL of the website to be detected is in IP format, it is converted to feature value 1; if the URL is not in IP format, it is converted to feature value 0;
  • if the time period for which the URL domain name of the website to be detected has existed is less than the set number of days, it is converted to feature value 1; if it is not less than the set number of days, it is converted to feature value 0;
  • if the URL contains the @ character, it is converted to feature value 1; if the URL does not contain the @ character, it is converted to feature value 0;
  • if the URL includes at least two domain names, it is converted to feature value 1; if the URL does not include at least two domain names, it is converted to feature value 0;
  • if account password information is included in the form, it is converted to feature value 1; if account password information is not included in the form, it is converted to feature value 0;
  • if the value after the URL jump is not the same as before the jump, it is converted to feature value 1; if the value after the URL jump is the same as before the jump, it is converted to feature value 0.
  • the six feature values can also be converted into feature vectors.
  • For example, suppose the six pieces of extracted feature information are: the URL is not in IP format, the URL domain name has existed for no less than the set number of days, the URL does not contain the @ character, the URL includes one domain name, the form does not include account password information, and the value after the URL jump is not the same as before the jump. The converted feature vector is then [0,0,0,0,0,1].
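  • Booleanization and conversion to a feature vector can be illustrated with a small data class. The class and field names are hypothetical; the order of the fields follows the order (1) to (6) in which the feature information is listed in this description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WebsiteFeatures:
    url_is_ip_format: bool
    domain_younger_than_limit: bool
    url_contains_at: bool
    url_has_multiple_domains: bool
    form_requests_password: bool
    jump_value_changed: bool

    def to_vector(self) -> List[int]:
        """Convert each Boolean feature to feature value 1 (True) or 0 (False)."""
        return [
            int(self.url_is_ip_format),
            int(self.domain_younger_than_limit),
            int(self.url_contains_at),
            int(self.url_has_multiple_domains),
            int(self.form_requests_password),
            int(self.jump_value_changed),
        ]


# The example from the description: only the jump value differs.
example = WebsiteFeatures(False, False, False, False, False, True)
```

  • Calling `example.to_vector()` yields `[0, 0, 0, 0, 0, 1]`, the vector given in the example above.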
  • Step 04 Use each decision tree of the random forest to classify and vote on the extracted feature information.
  • Step 05 When the result of the classification vote is that the number of votes of the phishing website is more than the number of votes of the normal website, it is determined that the website to be detected is a phishing website.
  • If the classification voting result is that the phishing website has more votes than the normal website, the website to be detected is determined to be a phishing website; if the phishing website has fewer votes than the normal website, the website to be detected is determined to be a normal website.
  • the random forest classifier detects whether the website is a phishing website or a normal website by using type identifiers 1, 0, where output 1 indicates that the website is a phishing website and output 0 indicates that the website is a normal website.
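  • Steps 04 and 05 reduce to counting the type identifiers output by the individual trees. The sketch below assumes each vote is already 1 or 0. The handling of a tie is an assumption: this embodiment does not address it, and the sketch returns 0 (normal website) in that case. The name `majority_vote` is hypothetical.

```python
from typing import Iterable


def majority_vote(votes: Iterable[int]) -> int:
    """Return 1 (phishing) if phishing votes outnumber normal votes, else 0."""
    votes = list(votes)
    phishing_votes = sum(1 for v in votes if v == 1)
    normal_votes = len(votes) - phishing_votes
    # Assumption: a tie is treated as a normal website.
    return 1 if phishing_votes > normal_votes else 0
```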
  • The method may further include: when the website to be detected is determined to be a phishing website according to the voting result, adding the URL of the website to be detected to the pre-built blacklist.
  • In this embodiment, the feature information of the website to be detected is extracted according to its webpage information, and a constructed random forest including a plurality of decision trees classifies and votes on the extracted feature information.
  • the random forest is constructed by establishing a decision tree through a large number of website samples. The types of phishing websites included are diverse, and the use of random forests for classification and voting has a high accuracy rate.
  • The phishing website detection device 10 based on a decision tree may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete the present application and implement the above-mentioned decision tree-based phishing website detection method.
  • the program module referred to in this application refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable for describing the execution process of the phishing website detection device 10 based on the decision tree in the storage medium than the program itself. The following description will specifically introduce the functions of the program modules of this embodiment:
  • the random forest construction module 11 is used to construct a random forest in advance to obtain a random forest classifier 14, and the constructed random forest includes several decision trees;
  • the webpage information determination module 12 is used to determine the webpage information of the website to be detected
  • the feature information extraction module 13 is configured to extract feature information of the website to be detected according to the webpage information of the website to be detected;
  • the random forest classifier 14 is used to classify and vote on the extracted feature information by using each decision tree of the random forest;
  • the detection result determination module 15 is used to determine that the website to be detected is a phishing website when the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website;
  • The random forest construction module 11 is specifically used to: randomly select n samples from the sample set with replacement, where the sample set includes webpage information of several phishing websites and webpage information of several normal websites; randomly select k feature information from all the set feature information, and use the randomly selected k feature information to build a decision tree on the selected n samples; and repeat the above steps m times to generate m decision trees, which constitute the random forest; where n, k, and m are all positive integers.
  • The phishing website detection device 10 based on the decision tree may further include a Boolean processing module 16, which is used to Booleanize the extracted feature information, convert it into corresponding feature values, and output the converted feature values to the random forest classifier 14.
  • The phishing website detection device 10 based on the decision tree may further include a primary detection module 17, configured to obtain the URL of the website to be detected according to its webpage information and compare that URL with a pre-built blacklist. If the URL of the website to be detected is included in the blacklist, the website to be detected is determined to be a phishing website; if the URL is not included in the blacklist, the webpage information of the website to be detected is output to the feature information extraction module 13.
  • The phishing website detection device 10 based on the decision tree may further include a blacklist adding module 18, used to add the URL of the website to be detected to the pre-built blacklist when the website to be detected is determined to be a phishing website based on the voting result.
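  • How the modules hand data to one another can be sketched as a single class. This is an illustration only: the class name `PhishingDetectionDevice` is hypothetical, feature extraction and classification are injected as callables and are not implemented here, and the comments map each step to the module numbers used in this description.

```python
from typing import Callable, List, Set


class PhishingDetectionDevice:
    def __init__(
        self,
        extract_features: Callable[[str], List[int]],
        classify: Callable[[List[int]], int],
        blacklist: Set[str],
    ):
        self.extract_features = extract_features
        self.classify = classify
        self.blacklist = blacklist

    def detect(self, url: str) -> bool:
        """Return True when the website is judged to be a phishing website."""
        if url in self.blacklist:  # primary detection module 17
            return True
        vector = self.extract_features(url)  # modules 12, 13 and 16
        is_phishing = self.classify(vector) == 1  # modules 14 and 15
        if is_phishing:
            self.blacklist.add(url)  # blacklist adding module 18
        return is_phishing
```

  • Because a detected URL is added to the blacklist, a repeated check of the same URL is answered by the primary detection step without running the classifier again.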
  • This embodiment also provides a computer device, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, or a tower server (including an independent server or a server cluster composed of multiple servers).
  • the computer device 20 of this embodiment includes at least but not limited to: a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus, as shown in FIG. 4. It should be noted that FIG. 4 only shows the computer device 20 having the components 21-22, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
  • The memory 21 (that is, a readable storage medium) includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or memory of the computer device 20.
  • The memory 21 may also be an external storage device of the computer device 20, for example, a plug-in hard disk equipped on the computer device 20, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc.
  • the memory 21 may also include both the internal storage unit of the computer device 20 and its external storage device.
  • the memory 21 is generally used to store the operating system and various application software installed in the computer device 20, such as the program code of the phishing website detection apparatus 10 based on the decision tree in the first embodiment.
  • the memory 21 may also be used to temporarily store various types of data that have been output or will be output.
  • the processor 22 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 22 is generally used to control the overall operation of the computer device 20.
  • the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the decision tree-based phishing website detection device 10, so as to implement the decision tree-based phishing website detection method of Embodiment 1.
  • This embodiment also provides a computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server, App store, etc., on which a computer program is stored; when the program is executed by a processor, the corresponding function is realized.
  • the computer-readable storage medium of this embodiment is used to store a decision tree-based phishing website detection device 10, and when executed by a processor, implements the decision tree-based phishing website detection method of Embodiment 1.
  • the method for detecting a phishing website based on a decision tree in this embodiment is based on Embodiment 1, and includes the following steps:
  • Step 01 Construct a random forest.
  • the constructed random forest includes several decision trees.
  • Step 011 Randomly select n samples from the sample set with replacement; the sample set includes webpage information of several phishing websites and webpage information of several normal websites.
  • the webpage information included in the sample set may be a URL.
  • Step 012 randomly select k feature information from all the set feature information, and use the randomly selected k feature information to establish a decision tree for the selected n samples.
  • The feature information includes at least one of the following: (1) whether the URL is in IP format; (2) whether the time period for which the URL domain name has existed is less than the set number of days; (3) whether the URL contains the @ character; (4) whether the URL includes at least two domain names; (5) whether account password information is included in the form; (6) whether the value after the URL jump is the same as before the jump.
  • Step 013. Repeat steps 011-012 m times to generate m decision trees.
  • the generated m decision trees form a random forest; where m is a positive integer.
  • Step 02 Pre-construct a blacklist including webpage information of several phishing websites.
  • the blacklist may be a blacklist of Google Safe API.
  • Step 03 Determine the webpage information of the website to be detected.
  • Step 04 Obtain the URL of the website to be detected according to the webpage information of the website to be detected, and compare the URL of the website to be detected with the pre-built blacklist; if the blacklist includes the URL of the website to be detected, determine that the website to be detected is a phishing website; if the blacklist does not include the URL of the website to be detected, perform step 05.
  • Step 05 Extract the feature information of the website to be detected according to the webpage information of the website to be detected.
  • the extracted feature information may be the same as the feature information in step 012.
  • Step 06 Booleanize the extracted feature information to convert to corresponding feature values.
  • if the time period for which the URL domain name of the website to be detected has existed is not less than the set number of days, it is converted to feature value 0; if the URL contains the @ character, it is converted to feature value 1, and if the URL does not contain the @ character, it is converted to feature value 0; if the URL includes at least two domain names, it is converted to feature value 1, and if the URL does not include at least two domain names, it is converted to feature value 0; if the account password information is included in the form, it is converted to feature value 1, and if the account password information is not included in the form, it is converted to feature value 0; if the value after the URL jump is not the same as before the jump, it is converted to feature value 1, and if the value after the URL jump is the same as before the jump, it is converted to feature value 0.
  • Step 07 Use each decision tree of the random forest to classify and vote on the extracted feature information.
  • Step 08 When the voting result is that the number of votes for phishing website is more than the number of votes for normal website, the website to be detected is determined to be a phishing website; otherwise, it is determined to be a normal website; if it is a phishing website, step 09 is performed; if it is a normal website, the user is prompted to visit the website.
  • the random forest classifier detects whether the website is a phishing website or a normal website by using type identifiers 1, 0, where output 1 indicates that the website is a phishing website and output 0 indicates that the website is a normal website.
  • Step 09 Add the URL of the website to be detected to the pre-built blacklist.
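The flow of steps 04 to 09 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: `extract_features` and `forest_votes` are hypothetical callables standing in for steps 05-06 and step 07, and the blacklist is modelled as an in-memory set of URL strings.

```python
def detect_and_update(url, blacklist, extract_features, forest_votes):
    """Return 1 (phishing) or 0 (normal), updating the blacklist as in step 09."""
    if url in blacklist:                       # step 04: blacklist hit
        return 1
    features = extract_features(url)           # steps 05-06 (hypothetical helper)
    votes = forest_votes(features)             # step 07 (hypothetical helper)
    is_phishing = votes.count(1) > votes.count(0)   # step 08: majority vote
    if is_phishing:
        blacklist.add(url)                     # step 09: remember the verdict
    return 1 if is_phishing else 0

# Stand-in helpers for illustration only: one Boolean feature, three "trees".
def toy_features(url):
    return [int("@" in url)]

def toy_votes(features):
    return [features[0], features[0], 0]

blacklist = set()
first = detect_and_update("http://example.com@phish.example/", blacklist,
                          toy_features, toy_votes)
second = detect_and_update("https://www.example.org/", blacklist,
                           toy_features, toy_votes)
```

Once a URL has been added in step 09, a later request for the same URL is settled by the blacklist lookup alone, without running the forest again.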

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides a decision trees-based method and apparatus for detecting a phishing website, and a computer device, belonging to the technical field of intelligent decisions, said method comprising: pre-constructing a random forest as a classification model; determining webpage information about a website to be detected, and extracting feature information about said website according to the webpage information about said website; using the constructed random forest comprising several decision trees to perform classification vote on the extracted feature information; and when the result of the classification vote indicates that the votes of a phishing website are greater than the votes of a normal website, determining that said website is a phishing website. In the present application, the random forest is constructed by establishing decision trees using a large number of website samples, and comprises diverse types of phishing websites; and the random forest is used as a classification model to perform classification vote, achieving high accuracy.

Description

Decision tree-based method, apparatus, and computer device for detecting phishing websites
This application claims priority to Chinese patent application No. CN2018112561895, filed on October 26, 2018 and entitled "Decision tree-based method, apparatus, and computer device for detecting phishing websites", the entire content of which is incorporated herein by reference.
Technical field
The present application relates to the field of intelligent decision-making technology, and in particular to a decision tree-based method, apparatus, and computer device for detecting phishing websites.
Background
A "phishing website" is one in which criminals use various means to imitate the address and page content of a real website, or exploit vulnerabilities in a real website's server program to insert dangerous HTML code into some of the site's pages, in order to defraud users of private information such as bank account or credit card numbers and passwords.
Among the phishing website detection methods in the related art, there is a scheme that determines whether a website to be detected is a phishing website by comparing its domain name information and content identification information with those of a target website. However, phishing websites come in many varieties and new phishing techniques keep emerging, so detection by comparison with a target website has low accuracy.
Summary of the invention
The purpose of this application is to provide a decision tree-based method, apparatus, and computer device for detecting phishing websites, so as to solve the problems in the prior art.
To achieve the above purpose, the present application provides a decision tree-based method for detecting a phishing website, including the following steps:
Step 01: construct a random forest in advance, the constructed random forest including several decision trees;
Step 02: determine the webpage information of the website to be detected;
Step 03: extract the feature information of the website to be detected according to the webpage information of the website to be detected;
Step 04: use each decision tree of the random forest to cast a classification vote on the extracted feature information;
Step 05: when the result of the classification vote is that the phishing-website votes outnumber the normal-website votes, determine that the website to be detected is a phishing website;
Constructing the random forest includes:
Step 011: randomly select n samples from the sample set by sampling with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites;
Step 012: randomly select k pieces of feature information from all the set feature information, and use the k randomly selected pieces of feature information to build a decision tree on the n selected samples;
Step 013: repeat steps 011-012 m times to generate m decision trees, which together form the random forest;
where n, k, and m are all positive integers.
To achieve the above purpose, the present application further provides a decision tree-based apparatus for detecting a phishing website, including:
a random forest construction module, configured to construct a random forest in advance to obtain a random forest classifier, the constructed random forest including several decision trees;
a webpage information determination module, configured to determine the webpage information of the website to be detected;
a feature information extraction module, configured to extract the feature information of the website to be detected according to the webpage information of the website to be detected;
the random forest classifier, configured to use each decision tree of the random forest to cast a classification vote on the extracted feature information;
a detection result determination module, configured to determine that the website to be detected is a phishing website when the result of the classification vote is that the phishing-website votes outnumber the normal-website votes;
the random forest construction module being specifically configured to construct the random forest as follows:
Step 011: randomly select n samples from the sample set by sampling with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites;
Step 012: randomly select k pieces of feature information from all the set feature information, and use the k randomly selected pieces of feature information to build a decision tree on the n selected samples;
Step 013: repeat steps 011-012 m times to generate m decision trees, which together form the random forest;
where n, k, and m are all positive integers.
To achieve the above purpose, the present application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the following steps of the decision tree-based method for detecting a phishing website:
Step 01: construct a random forest in advance, the constructed random forest including several decision trees;
Step 02: determine the webpage information of the website to be detected;
Step 03: extract the feature information of the website to be detected according to the webpage information of the website to be detected;
Step 04: use each decision tree of the random forest to cast a classification vote on the extracted feature information;
Step 05: when the result of the classification vote is that the phishing-website votes outnumber the normal-website votes, determine that the website to be detected is a phishing website;
where step 01 includes:
Step 011: randomly select n samples from the sample set by sampling with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites;
Step 012: randomly select k pieces of feature information from all the set feature information, and use the k randomly selected pieces of feature information to build a decision tree on the n selected samples;
Step 013: repeat steps 011-012 m times to generate m decision trees, which together form the random forest;
where n, k, and m are all positive integers.
To achieve the above purpose, the present application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps of the decision tree-based method for detecting a phishing website:
Step 01: construct a random forest in advance, the constructed random forest including several decision trees;
Step 02: determine the webpage information of the website to be detected;
Step 03: extract the feature information of the website to be detected according to the webpage information of the website to be detected;
Step 04: use each decision tree of the random forest to cast a classification vote on the extracted feature information;
Step 05: when the result of the classification vote is that the phishing-website votes outnumber the normal-website votes, determine that the website to be detected is a phishing website;
where step 01 includes:
Step 011: randomly select n samples from the sample set by sampling with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites;
Step 012: randomly select k pieces of feature information from all the set feature information, and use the k randomly selected pieces of feature information to build a decision tree on the n selected samples;
Step 013: repeat steps 011-012 m times to generate m decision trees, which together form the random forest;
where n, k, and m are all positive integers.
The decision tree-based method, apparatus, and computer device for detecting phishing websites provided by this application determine the webpage information of the website to be detected, extract the feature information of that website from its webpage information, and use the constructed random forest, which includes several decision trees, to cast classification votes on the extracted feature information; when the result of the classification vote is that the phishing-website votes outnumber the normal-website votes, the website to be detected can be determined to be a phishing website. In this application, the random forest is constructed by building decision trees from a large number of website samples that cover diverse types of phishing websites, so classification voting with the random forest achieves relatively high accuracy.
Brief description of the drawings
FIG. 1 is a flowchart of Embodiment 1 of the decision tree-based method for detecting a phishing website of the present application;
FIG. 2 is a schematic diagram of the program modules of Embodiment 1 of the decision tree-based apparatus for detecting a phishing website of the present application;
FIG. 3 is a schematic diagram of another set of program modules of Embodiment 1 of the decision tree-based apparatus for detecting a phishing website of the present application;
FIG. 4 is a schematic diagram of a hardware structure of Embodiment 1 of the decision tree-based apparatus for detecting a phishing website of the present application;
FIG. 5 is a flowchart of Embodiment 2 of the decision tree-based method for detecting a phishing website of the present application.
Detailed description
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present application and do not limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application, without creative work, fall within the scope of protection of this application.
The decision tree-based method, apparatus, and computer device for detecting phishing websites provided by the present application apply to the field of intelligent decision-making technology; the method detects phishing websites by having each decision tree in a random forest cast a classification vote. This application determines the webpage information of the website to be detected, extracts the feature information of that website from its webpage information, and uses the constructed random forest, which includes several decision trees, to cast classification votes on the extracted feature information; when the result of the classification vote is that the phishing-website votes outnumber the normal-website votes, the website to be detected can be determined to be a phishing website. In this application, the random forest is constructed by building decision trees from a large number of website samples that cover diverse types of phishing websites, so classification voting with the random forest achieves relatively high accuracy.
Embodiment 1
Referring to FIG. 1, the decision tree-based method for detecting a phishing website of this embodiment includes the following steps:
Step 01: construct a random forest in advance, the constructed random forest including several decision trees.
In machine learning, a random forest is a classifier containing multiple decision trees; the trees are trained on samples and used for prediction, and the output class is the mode of the classes output by the individual trees. In this embodiment, a random forest can be used to detect whether the website to be detected is a phishing website.
In this embodiment, to detect whether the website to be detected is a phishing website, the random forest can be constructed in at least the following way:
Step 011: randomly select n samples from the sample set by sampling with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites.
In this embodiment, the webpage information included in the sample set may be a URL.
The phishing websites in the sample set may have been collected over time through accumulated experience, or may be obtained from the blacklist of the Google safe api. The blacklist of the Google safe api includes several URLs whose corresponding websites are all phishing websites; therefore, when constructing the sample set, all or some of the URLs in that blacklist can be added to the sample set.
To make the voting classification of the decision trees built from the sample set more accurate, the sample set also needs to include webpage information of several normal websites, which can be obtained from websites not on the blacklist of the Google safe api.
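Sampling with replacement in step 011 can be sketched as follows. The sample URLs and labels are made up for illustration, and the function name `bootstrap_sample` is an assumption rather than anything named in the application.

```python
import random

def bootstrap_sample(sample_set, n, rng):
    """Draw n samples with replacement, so the same sample may appear twice."""
    return [rng.choice(sample_set) for _ in range(n)]

# Hypothetical labelled sample set: (URL, label) with 1 = phishing, 0 = normal.
sample_set = [
    ("http://192.0.2.7/login", 1),
    ("http://example.com@phish.example/", 1),
    ("https://www.example.org/", 0),
    ("https://shop.example.net/cart", 0),
]

rng = random.Random(0)          # fixed seed only to make the sketch repeatable
drawn = bootstrap_sample(sample_set, n=6, rng=rng)
```

Because six samples are drawn from a set of four, at least one sample is necessarily repeated, which is the defining property of sampling with replacement.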
Step 012: randomly select k pieces of feature information from all the set feature information, and use the k randomly selected pieces of feature information to build a decision tree on the n selected samples.
In a classification problem, the data input to the classifier is the feature information, and the decision at each node of a decision tree is made on the basis of this feature information.
In this embodiment, to build decision trees capable of detecting whether a website is a phishing website, the feature information includes at least one of the following:
(1) whether the URL is in IP format;
On the WWW, every information resource has a uniform address that is unique on the network. This address is called a URL (Uniform Resource Locator); it is the WWW's uniform resource locator, that is, the network address.
The transfer protocol most commonly used in URLs is HTTP, currently the most widely used protocol on the WWW; common URL formats include the http, file, ftp, and gopher formats. Experience shows that when a URL is in IP format, the corresponding website may be a phishing website.
(2) whether the length of time the URL's domain name has existed is less than a set number of days;
In general, the longer a phishing website's domain name has existed, the more likely it is to have been reported; the shorter the time a URL's domain name has existed, the greater the probability that it belongs to a phishing website. The length of time the domain name has existed can therefore also serve as feature information for building decision trees. The set number of days may be 30.
(3) whether the URL contains the @ character;
In this embodiment, when a URL contains the @ character, the website may be a phishing website; whether the URL contains the @ character is therefore also used as feature information for building decision trees.
(4) whether the URL includes at least two domain names;
Some phishing websites disguise themselves with multiple domain names; for example, clicking a website's URL may trigger several intermediate jumps. Therefore, when a URL includes at least two domain names, the corresponding website may be a phishing website.
(5) whether the form includes account and password information;
Phishing websites generally aim to steal users' account and password information; if a website's form includes account and password information, the website may be a phishing website. Whether the form includes account and password information is therefore used as feature information for building decision trees.
(6) whether the value after the URL jump is the same as before the jump.
For example, if the value before the URL jump is "Taobao" and, after the URL link is clicked, the value after the jump is not "Taobao", the website may be a phishing website that uses the value before the jump to deceive users. The value after the URL jump can be obtained by parsing the opened web page.
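Several of the features above can be read directly from the URL string or the page's HTML. The sketch below covers features (1), (3), (4), and (5) using only the Python standard library. The application does not say how each check is implemented, so the regular expression for domain-like tokens, the decision to count domains only in the authority and query parts, and the use of a password input field as a proxy for account and password information are all assumptions.

```python
import ipaddress
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Heuristic pattern for a domain-like token (an assumption, not from the source).
DOMAIN_RE = re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,}", re.IGNORECASE)

def url_is_ip_format(url):
    """Feature (1): the host part of the URL is a literal IP address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def url_contains_at(url):
    """Feature (3): the URL contains the @ character."""
    return "@" in url

def count_domains(url):
    """Feature (4) helper: distinct domain-like tokens in authority and query."""
    parts = urlsplit(url)
    text = parts.netloc + " " + parts.query   # the path is skipped on purpose
    return len({m.lower() for m in DOMAIN_RE.findall(text)})

class _PasswordFieldFinder(HTMLParser):
    """Records whether any <input type="password"> element was seen."""
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.found = True

def form_has_password_field(html):
    """Feature (5), approximated by the presence of a password input."""
    finder = _PasswordFieldFinder()
    finder.feed(html)
    return finder.found
```

Features (2) and (6) need data from outside the URL string, namely the domain's registration date and the page reached after the jump, so they are left out of this sketch.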
In this embodiment, the best split can be computed from the k randomly selected pieces of feature information. Splitting refers to the process, repeated throughout the training of a decision tree, of dividing the training data set into two sub-data sets.
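The application does not name the criterion used to find the best split. The sketch below assumes Gini impurity, a common choice, and Boolean feature values; `best_split` and the toy data are illustrative names and values only.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2p(1-p) for two classes."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels, candidate_features):
    """Pick the candidate feature whose 0/1 split has the lowest weighted Gini."""
    best_feature, best_score = None, float("inf")
    for f in candidate_features:
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        if not left or not right:
            continue  # this feature does not divide the data into two subsets
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_feature, best_score = f, score
    return best_feature, best_score

# Toy data: feature 0 separates the two classes perfectly, feature 1 does not.
rows = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
labels = [1, 1, 0, 0]
chosen = best_split(rows, labels, candidate_features=[0, 1])
```

Each split divides the training data into the subset where the chosen feature is 0 and the subset where it is 1, matching the two sub-data sets described above.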
In this embodiment, if the total number of set feature information items is N, the number k of randomly selected items may be the square root of N rounded to an integer. The rounding may be up or down, as preset. For example, if N equals 10, the square root of 10 is approximately 3.16; rounding up gives 4, so k is 4, while rounding down gives 3, so k is 3.
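The choice of k can be written in a few lines. The function name and the `round_up` flag are illustrative assumptions; the arithmetic follows the example in the text.

```python
import math

def feature_subset_size(total_features, round_up=True):
    """Return k: the square root of N, rounded up or down as preset."""
    root = math.sqrt(total_features)
    return math.ceil(root) if round_up else math.floor(root)

k_up = feature_subset_size(10, round_up=True)     # sqrt(10) ~ 3.16, rounded up
k_down = feature_subset_size(10, round_up=False)  # sqrt(10) ~ 3.16, rounded down
```

With the six features listed in this embodiment, the same rule gives k = 3 when rounding up and k = 2 when rounding down.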
where n and k are both positive integers.
Step 013: repeat steps 011-012 m times to generate m decision trees, which together form the random forest; where m is a positive integer.
The random forest composed of the m decision trees is the random forest classifier.
In this embodiment, each split of a decision tree in the random forest does not use all the candidate feature information; instead, a certain number of pieces of feature information are randomly selected from all the candidates, and the optimal one is then chosen from among those randomly selected. This makes the decision trees in the random forest differ from one another, increasing the diversity of the system and thereby improving classification performance.
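Steps 011 to 013 can be put together in a compact sketch. It assumes Boolean feature vectors, Gini impurity as the split criterion, and a tree encoded as nested tuples; none of these details are given in the application. The k features are drawn once per tree, as step 012 describes; drawing a fresh subset at every split, as in the preceding paragraph, would mean moving the `rng.sample` call inside `build_tree`.

```python
import random
from collections import Counter

def gini(labels):
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def build_tree(rows, labels, features):
    """Grow a tree over Boolean features; a leaf is a class label (0 or 1)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    best = None
    for f in features:
        left = [i for i, x in enumerate(rows) if x[f] == 0]
        right = [i for i, x in enumerate(rows) if x[f] == 1]
        if not left or not right:
            continue
        score = sum(len(s) * gini([labels[i] for i in s]) for s in (left, right))
        if best is None or score < best[0]:
            best = (score, f, left, right)
    if best is None:
        return majority
    _, f, left, right = best
    rest = [g for g in features if g != f]
    return (f,
            build_tree([rows[i] for i in left], [labels[i] for i in left], rest),
            build_tree([rows[i] for i in right], [labels[i] for i in right], rest))

def build_forest(rows, labels, n, k, m, rng):
    forest = []
    for _ in range(m):                                        # step 013
        idx = [rng.randrange(len(rows)) for _ in range(n)]    # step 011
        feats = rng.sample(range(len(rows[0])), k)            # step 012
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], feats))
    return forest

def predict_tree(tree, x):
    """Walk from the root to a leaf: (feature, branch_if_0, branch_if_1)."""
    while isinstance(tree, tuple):
        f, zero_branch, one_branch = tree
        tree = one_branch if x[f] == 1 else zero_branch
    return tree

# Toy training data, for illustration only.
rows = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
labels = [1, 1, 0, 0, 1, 0]
forest = build_forest(rows, labels, n=8, k=2, m=5, rng=random.Random(1))
```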
In one embodiment of the present application, the samples in the sample set that were not sampled can serve as test data for the random forest classifier, to verify its accuracy. For example, the webpage information of several unsampled websites is selected from the sample set; the type of each selected website, that is, whether it is a phishing website or a normal website, is known. The feature information of each website is extracted and input into the random forest classifier, and the type detected by the classifier is compared with the true type. If the accuracy exceeds a set probability, the random forest classifier meets the accuracy requirement and can be used.
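The check described above can be sketched as follows. The helper names and the example threshold values are assumptions; the application only says that accuracy is compared with a set probability.

```python
import random

def split_bootstrap(num_samples, n, rng):
    """Return the drawn indices and the indices that were never drawn."""
    drawn = [rng.randrange(num_samples) for _ in range(n)]
    unsampled = sorted(set(range(num_samples)) - set(drawn))
    return drawn, unsampled

def accuracy(predicted, actual):
    """Fraction of test samples whose detected type equals the true type."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def meets_requirement(predicted, actual, required_accuracy):
    """True when the accuracy exceeds the set probability."""
    return accuracy(predicted, actual) > required_accuracy

drawn, unsampled = split_bootstrap(num_samples=10, n=10, rng=random.Random(0))

# Hypothetical classifier outputs and true types for four test websites.
predicted = [1, 0, 1, 1]
actual = [1, 0, 0, 1]
usable = meets_requirement(predicted, actual, required_accuracy=0.7)
```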
Step 02: determine the webpage information of the website to be detected.
The determined webpage information of the website to be detected may be a URL.
In one embodiment of the present application, to detect more efficiently whether the website to be detected is a phishing website, a blacklist of phishing websites may be constructed in advance. The blacklist includes the URLs of several websites, all of which have already been confirmed to be phishing websites. When a website needs to be checked, the URL of the website to be detected is first obtained from its webpage information and compared with the pre-built blacklist. If the blacklist includes the URL of the website to be detected, the website is determined to be a phishing website; if the blacklist does not include the URL, the website needs further detection, that is, step 03 is performed.
The blacklist may be the blacklist of the Google safe api.
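The blacklist pre-check amounts to a membership test before any feature extraction. The sketch below assumes an in-memory set and exact string comparison; the application does not say whether URLs are normalised before they are compared, and the URLs shown are made up.

```python
def needs_classification(url, blacklist):
    """Return False when the blacklist already settles the verdict."""
    return url not in blacklist

# Hypothetical pre-built blacklist of confirmed phishing URLs.
blacklist = {
    "http://192.0.2.7/login",
    "http://example.com@phish.example/",
}

known = needs_classification("http://192.0.2.7/login", blacklist)
unknown = needs_classification("https://www.example.org/", blacklist)
```

A set gives constant-time lookups on average, so the pre-check stays cheap even for a large blacklist, which fits its purpose of improving detection efficiency.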
步骤03,根据所述待检测网站的网页信息,提取所述待检测网站的特征信息。Step 03: Extract the feature information of the website to be detected according to the webpage information of the website to be detected.
在本实施例中,提取的特征信息可以与步骤012中的特征信息相同,特征信息可以包括下述信息中的至少一个:(1)URL是否为IP格式;(2)URL域名存在的时间段是否小于设定天数;(3)URL中是否包含@字符;(4)URL中是否包括至少两个域名;(5)表单中是否包括账号密码信息;(6)URL跳转后的值与跳转前是否相同。In this embodiment, the extracted feature information may be the same as the feature information in step 012, and the feature information may include at least one of the following information: (1) whether the URL is in IP format; (2) the time period when the URL domain name exists Whether it is less than the set number of days; (3) whether the URL contains the @ character; (4) whether the URL includes at least two domain names; (5) whether the account password information is included in the form; (6) the value and jump after the URL jump Is it the same before transfer?
优选地,针对待检测网站提取的特征信息为上述六个信息。Preferably, the feature information extracted for the website to be detected is the above six pieces of information.
In this embodiment, to make it easier for the random forest classifier to classify the website to be detected by voting, the extracted feature information may be Booleanized, that is, converted into corresponding feature values. For example, for the six items of feature information above:
If the URL of the website to be detected is in IP format, the feature value is 1; if it is not in IP format, the feature value is 0;
If the URL's domain name of the website to be detected has existed for less than the set number of days, the feature value is 1; if it has existed for no less than the set number of days, the feature value is 0;
If the URL contains the @ character, the feature value is 1; if the URL does not contain the @ character, the feature value is 0;
If the URL includes at least two domain names, the feature value is 1; if the URL does not include at least two domain names, the feature value is 0;
If the form includes account and password information, the feature value is 1; if the form does not include account and password information, the feature value is 0;
If the URL after redirection differs from the URL before redirection, the feature value is 1; if the URL after redirection is the same as before, the feature value is 0.
Further, the six feature values may be assembled into a feature vector. For example, suppose the six extracted items are: the URL is not in IP format; the URL's domain name has existed for no less than the set number of days; the URL does not contain the @ character; the URL includes one domain name; the form does not include account and password information; and the URL after redirection differs from the URL before redirection. The resulting feature vector is then [0,0,0,0,0,1].
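The Booleanization described above can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the function names, the 30-day default for the "set number of days", and the choice to pass the domain age, domain count, form flag, and post-redirect URL in as arguments (instead of looking them up) are all assumptions made for the example.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(url: str) -> bool:
    """Feature 1: True when the URL's host is a literal IP address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def to_feature_vector(url, domain_age_days, domain_count,
                      form_has_credentials, final_url, min_age_days=30):
    """Convert the six items of feature information into 0/1 feature values.

    min_age_days stands in for the "set number of days"; 30 is an assumed
    value, not one given in the text.
    """
    return [
        int(host_is_ip(url)),                 # (1) URL is in IP format
        int(domain_age_days < min_age_days),  # (2) domain younger than threshold
        int("@" in url),                      # (3) URL contains the @ character
        int(domain_count >= 2),               # (4) URL includes at least two domain names
        int(bool(form_has_credentials)),      # (5) form asks for account and password
        int(final_url != url),                # (6) URL changed after redirection
    ]


# The worked example from the text: only feature (6) fires.
vector = to_feature_vector("http://example.com/login", 400, 1, False,
                           "http://other.example/landing")
print(vector)  # [0, 0, 0, 0, 0, 1]
```

Keeping the external lookups (domain registration age, page form inspection, redirect resolution) outside the function keeps the conversion itself deterministic and easy to check.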
Step 04: use each decision tree of the random forest to classify and vote on the extracted feature information.
The extracted feature information is input into the random forest classifier. Each decision tree in the random forest classifies the extracted feature information and casts a vote, and the votes of all decision trees are tallied.
Step 05: when the number of votes for "phishing website" exceeds the number of votes for "normal website", determine that the website to be detected is a phishing website.
In the classification vote, if the number of votes for "phishing website" exceeds the number of votes for "normal website", the website to be detected is determined to be a phishing website; if the number of votes for "phishing website" is less than the number of votes for "normal website", the website to be detected is determined to be a normal website.
In this embodiment, the random forest classifier may use the type identifiers 1 and 0 to distinguish a phishing website from a normal website: an output of 1 indicates that the website is a phishing website, and an output of 0 indicates that it is a normal website.
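The vote counting in steps 04 and 05 can be sketched as below. The function name is hypothetical. The text of this embodiment does not say what happens on a tie; the sketch assumes a tie is treated as normal, which matches the "otherwise" branch of step 08 in Embodiment 2.

```python
from collections import Counter

PHISHING, NORMAL = 1, 0  # type identifiers used by the classifier output


def majority_vote(votes):
    """Return 1 (phishing) only when phishing votes strictly outnumber normal votes."""
    counts = Counter(votes)
    # Assumption: a tie falls through to NORMAL.
    return PHISHING if counts[PHISHING] > counts[NORMAL] else NORMAL


print(majority_vote([1, 1, 0]))  # 1
print(majority_vote([0, 0, 1]))  # 0
```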
In one embodiment of the present application, to further improve detection efficiency and enrich the blacklist, the method may further include: when the website to be detected is determined to be a phishing website according to the voting result, adding the URL of the website to be detected to the pre-built blacklist.
In the embodiments of the present application, the webpage information of the website to be detected is determined, the feature information of the website is extracted from that webpage information, and a pre-built random forest comprising several decision trees classifies and votes on the extracted feature information. When the number of votes for "phishing website" exceeds the number of votes for "normal website", the website to be detected can be determined to be a phishing website. In the present application, the random forest is built by constructing decision trees from a large number of website samples, so the phishing websites it covers are of diverse types, and classification by random forest voting achieves relatively high accuracy.
Referring now to FIG. 2, a decision tree-based phishing website detection apparatus is shown. In this embodiment, the decision tree-based phishing website detection apparatus 10 may include, or be divided into, one or more program modules. The one or more program modules are stored in a storage medium and executed by one or more processors to carry out the present application and implement the decision tree-based phishing website detection method described above. A program module, as used in the present application, is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself to describing how the decision tree-based phishing website detection apparatus 10 executes in the storage medium. The functions of the program modules of this embodiment are described below:
A random forest construction module 11, configured to build a random forest in advance to obtain a random forest classifier 14, the random forest comprising several decision trees;
A webpage information determination module 12, configured to determine the webpage information of the website to be detected;
A feature information extraction module 13, configured to extract the feature information of the website to be detected according to the webpage information of the website to be detected;
The random forest classifier 14, configured to use each decision tree of the random forest to classify and vote on the extracted feature information;
A detection result determination module 15, configured to determine that the website to be detected is a phishing website when the number of votes for "phishing website" exceeds the number of votes for "normal website";
The random forest construction module 11 is specifically configured to: select n samples from a sample set by random sampling with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites; randomly select k items of feature information from all the set feature information, and build a decision tree on the n selected samples using the k randomly selected items; and repeat the above steps m times to generate m decision trees, which together form the random forest; where n, k, and m are all positive integers.
In one embodiment of the present application, referring to FIG. 3, the decision tree-based phishing website detection apparatus 10 may further include a Booleanization module 16, configured to Booleanize the extracted feature information into corresponding feature values and output the converted feature values to the random forest classifier 14.
In one embodiment of the present application, referring to FIG. 3, the decision tree-based phishing website detection apparatus 10 may further include a first-level detection module 17, configured to obtain the URL of the website to be detected from its webpage information and compare that URL with a pre-built blacklist. If the blacklist includes the URL of the website to be detected, the website is determined to be a phishing website; if the blacklist does not include the URL, the webpage information of the website to be detected is output to the feature information extraction module 13.
In one embodiment of the present application, referring to FIG. 3, the decision tree-based phishing website detection apparatus 10 may further include a blacklist adding module 18, configured to add the URL of the website to be detected to the pre-built blacklist when the website is determined to be a phishing website according to the voting result.
This embodiment further provides a computer device capable of executing programs, such as a smartphone, tablet computer, notebook computer, desktop computer, rack server, blade server, tower server, or cabinet server (including a standalone server or a server cluster composed of multiple servers). The computer device 20 of this embodiment includes at least, but is not limited to, a memory 21 and a processor 22 that are communicatively connected to each other through a system bus, as shown in FIG. 4. Note that FIG. 4 shows only the computer device 20 with components 21-22; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
In this embodiment, the memory 21 (that is, a readable storage medium) includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as its hard disk or internal memory. In other embodiments, the memory 21 may be an external storage device of the computer device 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 20. The memory 21 may also include both an internal storage unit of the computer device 20 and an external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the application software installed on the computer device 20, such as the program code of the decision tree-based phishing website detection apparatus 10 of Embodiment 1. The memory 21 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 20. In this embodiment, the processor 22 runs the program code or processes the data stored in the memory 21, for example running the decision tree-based phishing website detection apparatus 10, to implement the decision tree-based phishing website detection method of Embodiment 1.
This embodiment further provides a computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, a server, or an app store, on which a computer program is stored that implements the corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment stores the decision tree-based phishing website detection apparatus 10, which, when executed by a processor, implements the decision tree-based phishing website detection method of Embodiment 1.
Embodiment 2
Referring to FIG. 5, the decision tree-based phishing website detection method of this embodiment builds on Embodiment 1 and includes the following steps:
Step 01: build a random forest comprising several decision trees.
In this embodiment, to detect whether a website to be detected is a phishing website, the random forest may be built in at least the following manner:
Step 011: select n samples from a sample set by random sampling with replacement; the sample set includes webpage information of several phishing websites and webpage information of several normal websites.
In this embodiment, the webpage information included in the sample set may be a URL.
Step 012: randomly select k items of feature information from all the set feature information, and build a decision tree on the n selected samples using the k randomly selected items.
In this embodiment, to build decision trees capable of detecting whether a website is a phishing website, the feature information includes at least one of the following: (1) whether the URL is in IP format; (2) whether the URL's domain name has existed for less than a set number of days; (3) whether the URL contains the @ character; (4) whether the URL includes at least two domain names; (5) whether the form includes account and password information; (6) whether the URL after redirection is the same as the URL before redirection.
Step 013: repeat steps 011-012 m times to generate m decision trees, which together form the random forest; where m is a positive integer.
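Steps 011 to 013 can be sketched in plain Python as follows. Several points are assumptions made for illustration: the samples are taken to be already Booleanized into 0/1 feature vectors with labels 1 (phishing) and 0 (normal); the split criterion is Gini impurity, which the text does not specify; and, when k is not given, it defaults to the rounded square root of the number of features, as in claim 3. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def majority(labels):
    """Most common label in a list (first seen wins a tie)."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, features):
    """Grow a tree on 0/1 features. A leaf is a label; a node is (feature, branch0, branch1)."""
    if len(set(labels)) == 1 or not features:
        return majority(labels)

    def gini_after_split(f):
        # Weighted Gini impurity of the two partitions induced by feature f.
        score = 0.0
        for value in (0, 1):
            side = [y for x, y in zip(rows, labels) if x[f] == value]
            if side:
                p = sum(side) / len(side)  # fraction labeled phishing
                score += len(side) / len(labels) * 2 * p * (1 - p)
        return score

    best = min(features, key=gini_after_split)
    rest = [f for f in features if f != best]
    branches = []
    for value in (0, 1):
        idx = [i for i, x in enumerate(rows) if x[best] == value]
        if idx:
            branches.append(build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx], rest))
        else:
            branches.append(majority(labels))  # empty partition: fall back to parent majority
    return (best, branches[0], branches[1])


def predict_tree(tree, x):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        feature, zero_branch, one_branch = tree
        tree = one_branch if x[feature] else zero_branch
    return tree


def build_forest(rows, labels, n, m, k=None, seed=0):
    """Repeat steps 011-012 m times and return the m trees (step 013)."""
    rng = random.Random(seed)
    total = len(rows[0])
    if k is None:
        k = max(1, round(math.sqrt(total)))  # assumed default, following claim 3
    forest = []
    for _ in range(m):
        picks = [rng.randrange(len(rows)) for _ in range(n)]  # step 011: sample with replacement
        features = rng.sample(range(total), k)                # step 012: k random features
        forest.append(build_tree([rows[i] for i in picks],
                                 [labels[i] for i in picks], features))
    return forest
```

A call such as `build_forest(rows, labels, n=len(rows), m=100)` would return a list of trees whose individual predictions from `predict_tree` can then be tallied by majority vote.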
Step 02: build in advance a blacklist that includes webpage information of several phishing websites.
The blacklist may be the blacklist of the Google safe API.
Step 03: determine the webpage information of the website to be detected.
Step 04: obtain the URL of the website to be detected from its webpage information and compare the URL with the pre-built blacklist. If the blacklist includes the URL of the website to be detected, determine that the website is a phishing website; if the blacklist does not include the URL, perform step 05.
Step 05: extract the feature information of the website to be detected according to its webpage information.
In this embodiment, the extracted feature information may be the same as the feature information in step 012.
Step 06: Booleanize the extracted feature information to convert it into corresponding feature values.
For the six items of feature information above: if the URL of the website to be detected is in IP format, the feature value is 1, otherwise 0; if the URL's domain name has existed for less than the set number of days, the feature value is 1, and if it has existed for no less than the set number of days, the feature value is 0; if the URL contains the @ character, the feature value is 1, otherwise 0; if the URL includes at least two domain names, the feature value is 1, otherwise 0; if the form includes account and password information, the feature value is 1, otherwise 0; if the URL after redirection differs from the URL before redirection, the feature value is 1, and if it is the same, the feature value is 0.
Step 07: use each decision tree of the random forest to classify and vote on the extracted feature information.
Step 08: when the number of votes for "phishing website" exceeds the number of votes for "normal website", determine that the website to be detected is a phishing website; otherwise, determine that it is a normal website. If it is a phishing website, notify the user, block access to the website, and perform step 09; if it is a normal website, notify the user that the website may be accessed.
In this embodiment, the random forest classifier may use the type identifiers 1 and 0 to distinguish a phishing website from a normal website: an output of 1 indicates that the website is a phishing website, and an output of 0 indicates that it is a normal website.
Step 09: add the URL of the website to be detected to the pre-built blacklist.
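The two-stage flow of steps 04 to 09 can be sketched as follows. This is only an outline under stated assumptions: the blacklist is modeled as an in-memory set of URL strings, and `extract_features` and `forest_votes` are hypothetical callables standing in for feature extraction with Booleanization (steps 05-06) and per-tree voting (step 07). Notifying the user and blocking access are left out.

```python
def detect(url, blacklist, extract_features, forest_votes):
    """Return 1 for a phishing website and 0 for a normal one."""
    # Step 04: first-level check against the pre-built blacklist.
    if url in blacklist:
        return 1
    # Steps 05-06: extract and Booleanize the feature information.
    vector = extract_features(url)
    # Step 07: collect one vote (1 or 0) per decision tree.
    votes = forest_votes(vector)
    # Step 08: phishing only when phishing votes outnumber normal votes.
    if votes.count(1) > votes.count(0):
        blacklist.add(url)  # Step 09: remember the URL for future first-level checks.
        return 1
    return 0


# Usage with stand-in callables (illustrative only).
known_bad = {"http://bad.example/"}
result = detect("http://new.example/", known_bad,
                lambda u: [0, 0, 0, 0, 0, 1],
                lambda v: [1, 1, 0])
print(result, "http://new.example/" in known_bad)  # 1 True
```

Because step 09 feeds newly detected URLs back into the blacklist, a repeat visit to the same URL is resolved at step 04 without running the forest again.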
The numbering of the above embodiments of the present application is for description only and does not indicate the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and can also be implemented by hardware, although in many cases the former is the better implementation.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A decision tree-based method for detecting a phishing website, characterized in that it comprises the following steps:
    Step 01: building a random forest in advance, the random forest comprising several decision trees;
    Step 02: determining webpage information of a website to be detected;
    Step 03: extracting feature information of the website to be detected according to the webpage information of the website to be detected;
    Step 04: using each decision tree of the random forest to classify and vote on the extracted feature information;
    Step 05: when the number of votes for "phishing website" exceeds the number of votes for "normal website", determining that the website to be detected is a phishing website;
    wherein building the random forest comprises:
    Step 011: selecting n samples from a sample set by random sampling with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites;
    Step 012: randomly selecting k items of feature information from all the set feature information, and building a decision tree on the n selected samples using the k randomly selected items;
    Step 013: repeating steps 011-012 m times to generate m decision trees, the m decision trees forming the random forest;
    wherein n, k, and m are all positive integers.
  2. The decision tree-based method for detecting a phishing website according to claim 1, characterized in that the feature information includes at least one of the following: whether the URL is in IP format, whether the URL's domain name has existed for less than a set number of days, whether the URL contains the @ character, whether the URL includes at least two domain names, whether the form includes account and password information, and whether the URL after redirection is the same as the URL before redirection.
  3. The decision tree-based method for detecting a phishing website according to claim 1, characterized in that the value of k in step 012 is the square root of N rounded to an integer, where N is the total number of set items of feature information.
  4. The decision tree-based method for detecting a phishing website according to claim 1, characterized in that, before step 04, the method further comprises: Booleanizing the extracted feature information to convert it into corresponding feature values, and performing step 04 on the converted feature values.
  5. The decision tree-based method for detecting a phishing website according to claim 1 or 3, characterized in that, before step 03, the method further comprises: obtaining the URL of the website to be detected from the webpage information of the website to be detected, and comparing the URL with a pre-built blacklist; if the blacklist includes the URL of the website to be detected, determining that the website to be detected is a phishing website; if the blacklist does not include the URL of the website to be detected, performing step 03.
  6. The decision tree-based method for detecting a phishing website according to claim 5, characterized in that, after step 05, the method further comprises: when the website to be detected is determined to be a phishing website according to the voting result, adding the URL of the website to be detected to the pre-built blacklist.
  7. A decision tree-based apparatus for detecting a phishing website, characterized in that it comprises:
    a random forest construction module, configured to build a random forest in advance to obtain a random forest classifier, the random forest comprising several decision trees;
    a webpage information determination module, configured to determine webpage information of a website to be detected;
    a feature information extraction module, configured to extract feature information of the website to be detected according to the webpage information of the website to be detected;
    the random forest classifier, configured to use each decision tree of the random forest to classify and vote on the extracted feature information;
    a detection result determination module, configured to determine that the website to be detected is a phishing website when the number of votes for "phishing website" exceeds the number of votes for "normal website";
    wherein the random forest construction module is specifically configured to build the random forest as follows:
    Step 011: selecting n samples from a sample set by random sampling with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites;
    Step 012: randomly selecting k items of feature information from all the set feature information, and building a decision tree on the n selected samples using the k randomly selected items;
    Step 013: repeating steps 011-012 m times to generate m decision trees, the m decision trees forming the random forest;
    wherein n, k, and m are all positive integers.
  8. The decision tree-based apparatus for detecting a phishing website according to claim 7, characterized in that it further comprises:
    a Booleanization module, configured to Booleanize the extracted feature information into corresponding feature values and output the converted feature values to the random forest classifier.
  9. The decision tree-based apparatus for detecting a phishing website according to claim 7, characterized in that it further comprises:
    a first-level detection module, configured to obtain the URL of the website to be detected from the webpage information of the website to be detected and compare the URL with a pre-built blacklist; if the blacklist includes the URL of the website to be detected, output a notification that the website to be detected is a phishing website; if the blacklist does not include the URL of the website to be detected, output the webpage information of the website to be detected to the feature information extraction module.
  10. The decision tree-based apparatus for detecting a phishing website according to claim 7, characterized in that it further comprises:
    a blacklist adding module, configured to add the URL of the website to be detected to the pre-built blacklist when the website to be detected is determined to be a phishing website according to the voting result.
  11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the following steps of a decision tree-based method for detecting a phishing website:
    Step 01: building a random forest in advance, the random forest comprising several decision trees;
    Step 02: determining webpage information of a website to be detected;
    Step 03: extracting feature information of the website to be detected according to the webpage information of the website to be detected;
    Step 04: using each decision tree of the random forest to classify and vote on the extracted feature information;
    Step 05: when the number of votes for "phishing website" exceeds the number of votes for "normal website", determining that the website to be detected is a phishing website;
    wherein step 01 comprises:
    Step 011: selecting n samples from a sample set by random sampling with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites;
    Step 012: randomly selecting k items of feature information from all the set feature information, and building a decision tree on the n selected samples using the k randomly selected items;
    Step 013: repeating steps 011-012 m times to generate m decision trees, the m decision trees forming the random forest;
    wherein n, k, and m are all positive integers.
  12. The computer device according to claim 11, wherein the feature information includes at least one of the following: whether the URL is in IP-address format, whether the age of the URL's domain name is less than a set number of days, whether the URL contains the @ character, whether the URL includes at least two domain names, whether a form includes account and password information, and whether the URL after redirection is the same as before redirection.
  13. The computer device according to claim 11, wherein in step 012 the value of k is the square root of N rounded to an integer, where N is the total number of set feature information items.
  14. The computer device according to claim 11, further comprising, before step 04: converting the extracted feature information into Boolean form to obtain corresponding feature values, and performing step 04 on the converted feature values.
  15. The computer device according to claim 11 or 14, further comprising, before step 03: obtaining the URL of the website to be detected from the webpage information of the website to be detected, and comparing the URL with a pre-built blacklist; if the blacklist includes the URL of the website to be detected, determining that the website to be detected is a phishing website; if the blacklist does not include the URL of the website to be detected, performing step 03.
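The blacklist pre-check, feature extraction and Boolean conversion described in the dependent claims above might look roughly like this. It is a sketch under assumptions: the domain age, the form contents and the post-redirect URL are taken as inputs (obtaining them needs WHOIS, HTML parsing and HTTP requests, which are left out), the 30-day threshold is a placeholder for the "set number of days", the "at least two domain names" feature is omitted, and the choice of which outcome maps to 1 is an assumption. The function names are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(url):
    """Return True when the URL's host is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def extract_features(url, domain_age_days, form_has_credentials, final_url,
                     max_age_days=30):
    """Extract a subset of the listed features and convert each to 0 or 1."""
    flags = [
        host_is_ip(url),                 # URL is in IP-address format
        domain_age_days < max_age_days,  # domain younger than the set number of days
        "@" in url,                      # URL contains the @ character
        form_has_credentials,            # form asks for account/password data
        final_url != url,                # URL changed after redirection
    ]
    return [int(bool(flag)) for flag in flags]


def screen(url, blacklist, **page_facts):
    """Check the blacklist first; only extract features when the URL is not listed."""
    if url in blacklist:
        return "phishing", None
    return "unknown", extract_features(url, **page_facts)
```

Keeping the blacklist as a `set` makes the membership test a constant-time lookup, so known phishing URLs are rejected before any feature extraction is attempted.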
  16. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the following steps of a decision-tree-based method for detecting a phishing website:
    Step 01: constructing a random forest in advance, the constructed random forest including several decision trees;
    Step 02: determining webpage information of a website to be detected;
    Step 03: extracting feature information of the website to be detected according to the webpage information of the website to be detected;
    Step 04: using each decision tree of the random forest to cast a classification vote on the extracted feature information;
    Step 05: when the classification voting result shows more votes for phishing website than for normal website, determining that the website to be detected is a phishing website;
    wherein step 01 includes:
    Step 011: randomly selecting n samples from a sample set by sampling with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites;
    Step 012: randomly selecting k items of feature information from all the set feature information, and building a decision tree on the n selected samples using the k randomly selected items;
    Step 013: repeating steps 011-012 m times to generate m decision trees, the m generated decision trees forming the random forest;
    wherein n, k and m are all positive integers.
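The voting in steps 04 and 05 can be illustrated with a short Python sketch. The tree representation is an assumption made for the example: a leaf is an int label (1 for phishing, 0 for normal) and an internal node is a tuple `(feature_index, subtree_if_false, subtree_if_true)` over 0/1 feature values. The names `predict_tree` and `is_phishing` are hypothetical.

```python
from collections import Counter

PHISHING, NORMAL = 1, 0  # assumed label encoding


def predict_tree(tree, features):
    """Walk one tree down to a leaf and return its label."""
    while isinstance(tree, tuple):
        index, if_false, if_true = tree
        tree = if_true if features[index] else if_false
    return tree


def is_phishing(forest, features):
    """Step 04: every tree votes. Step 05: phishing needs strictly more votes."""
    votes = Counter(predict_tree(tree, features) for tree in forest)
    return votes[PHISHING] > votes[NORMAL]


# Three hand-written trees over two boolean features.
forest = [(0, 0, 1), (1, 0, 1), 0]
print(is_phishing(forest, [1, 1]))  # two of three trees vote phishing
print(is_phishing(forest, [1, 0]))  # only one tree votes phishing
```

Because step 05 requires more phishing votes than normal votes, a tie is treated as not phishing in this sketch.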
  17. The computer-readable storage medium according to claim 16, wherein the feature information includes at least one of the following: whether the URL is in IP-address format, whether the age of the URL's domain name is less than a set number of days, whether the URL contains the @ character, whether the URL includes at least two domain names, whether a form includes account and password information, and whether the URL after redirection is the same as before redirection.
  18. The computer-readable storage medium according to claim 16, wherein in step 012 the value of k is the square root of N rounded to an integer, where N is the total number of set feature information items.
  19. The computer-readable storage medium according to claim 16, further comprising, before step 04: converting the extracted feature information into Boolean form to obtain corresponding feature values, and performing step 04 on the converted feature values.
  20. The computer-readable storage medium according to claim 16 or 18, further comprising, before step 03: obtaining the URL of the website to be detected from the webpage information of the website to be detected, and comparing the URL with a pre-built blacklist; if the blacklist includes the URL of the website to be detected, determining that the website to be detected is a phishing website; if the blacklist does not include the URL of the website to be detected, performing step 03.
PCT/CN2019/091878 2018-10-26 2019-06-19 Decision trees-based method and apparatus for detecting phishing website, and computer device WO2020082763A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811256189.5A CN109450880A (en) 2018-10-26 2018-10-26 Detection method for phishing site, device and computer equipment based on decision tree
CN201811256189.5 2018-10-26

Publications (1)

Publication Number Publication Date
WO2020082763A1 true WO2020082763A1 (en) 2020-04-30

Family

ID=65548383

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/091878 WO2020082763A1 (en) 2018-10-26 2019-06-19 Decision trees-based method and apparatus for detecting phishing website, and computer device

Country Status (2)

Country Link
CN (1) CN109450880A (en)
WO (1) WO2020082763A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109450880A (en) * 2018-10-26 2019-03-08 平安科技(深圳)有限公司 Detection method for phishing site, device and computer equipment based on decision tree
CN110061975A (en) * 2019-03-29 2019-07-26 中国科学院计算技术研究所 A kind of counterfeit website identification method and system based on offline flow Packet analyzing
CN113676374B (en) * 2021-08-13 2024-03-22 杭州安恒信息技术股份有限公司 Target website clue detection method, device, computer equipment and medium
CN115001763B (en) * 2022-05-20 2024-03-19 北京天融信网络安全技术有限公司 Phishing website attack detection method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049484A (en) * 2012-11-30 2013-04-17 北京奇虎科技有限公司 Method and device for recognizing webpage risks
WO2013106354A1 (en) * 2012-01-12 2013-07-18 Microsoft Corporation Machine-learning based classification of user accounts based on email addresses and other account information
CN108306878A (en) * 2018-01-30 2018-07-20 平安科技(深圳)有限公司 Detection method for phishing site, device, computer equipment and storage medium
CN109450880A (en) * 2018-10-26 2019-03-08 平安科技(深圳)有限公司 Detection method for phishing site, device and computer equipment based on decision tree

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10157280B2 (en) * 2009-09-23 2018-12-18 F5 Networks, Inc. System and method for identifying security breach attempts of a website
CN104217160B (en) * 2014-09-19 2017-11-28 中国科学院深圳先进技术研究院 A kind of Chinese detection method for phishing site and system
CN107404473A (en) * 2017-06-06 2017-11-28 西安电子科技大学 Based on Mshield machine learning multi-mode Web application means of defences
CN107566389A (en) * 2017-09-19 2018-01-09 济南互信软件有限公司 A kind of imitation URL link fishing domain name recognition methods based on C4.5 decision trees
CN108319672B (en) * 2018-01-25 2023-04-18 南京邮电大学 Mobile terminal bad information filtering method and system based on cloud computing
CN108540451A (en) * 2018-03-13 2018-09-14 北京理工大学 A method of classification and Detection being carried out to attack with machine learning techniques

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周浩 (ZHOU, HAO): "基于决策树的搜索引擎恶意网页检测研究与实现 (The Research and Implementation of Malicious Web Pages Detection from Search Engine Based on Decision Tree)", 湖南大学硕士学位论文 (MASTER'S DISSERTATION OF HUNAN UNIVERSITY), 15 July 2014 (2014-07-15) *

Also Published As

Publication number Publication date
CN109450880A (en) 2019-03-08

Similar Documents

Publication Publication Date Title
CN109922052B (en) Malicious URL detection method combining multiple features
CN110808968B (en) Network attack detection method and device, electronic equipment and readable storage medium
WO2020082763A1 (en) Decision trees-based method and apparatus for detecting phishing website, and computer device
JP6530786B2 (en) System and method for detecting malicious elements of web pages
EP3065367B1 (en) System and method for automated phishing detection rule evolution
US20180219907A1 (en) Method and apparatus for detecting website security
CN108156131B (en) Webshell detection method, electronic device and computer storage medium
CN107707545B (en) Abnormal webpage access fragment detection method, device, equipment and storage medium
CN107204960B (en) Webpage identification method and device and server
US11212297B2 (en) Access classification device, access classification method, and recording medium
CN109768992B (en) Webpage malicious scanning processing method and device, terminal device and readable storage medium
CN105224600B (en) A kind of detection method and device of Sample Similarity
US9210189B2 (en) Method, system and client terminal for detection of phishing websites
CN107463844B (en) WEB Trojan horse detection method and system
US10462168B2 (en) Access classifying device, access classifying method, and access classifying program
US11516235B2 (en) System and method for detecting bots based on anomaly detection of JavaScript or mobile app profile information
Barlow et al. A novel approach to detect phishing attacks using binary visualisation and machine learning
JPWO2019013266A1 (en) Determination device, determination method, and determination program
JP2018041442A (en) System and method for detecting web page abnormal element
CN108234454B (en) Identity authentication method, server and client device
US20160028746A1 (en) Malicious code detection
CN114024761B (en) Network threat data detection method and device, storage medium and electronic equipment
CN114422271A (en) Data processing method, device, equipment and readable storage medium
CN110855635A (en) URL (Uniform resource locator) identification method and device and data processing equipment
CN107786529B (en) Website detection method, device and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19875482; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19875482; Country of ref document: EP; Kind code of ref document: A1)