WO2020082763A1 - Decision trees-based method and apparatus for detecting phishing website, and computer device

Info

Publication number: WO2020082763A1
Application number: PCT/CN2019/091878
Authority: WO
Inventors: 谭杰
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-10-26
Filing date: 2019-06-19
Publication date: 2020-04-30
Also published as: CN109450880A

Abstract

The present application provides a decision trees-based method and apparatus for detecting a phishing website, and a computer device, belonging to the technical field of intelligent decisions, said method comprising: pre-constructing a random forest as a classification model; determining webpage information about a website to be detected, and extracting feature information about said website according to the webpage information about said website; using the constructed random forest comprising several decision trees to perform classification vote on the extracted feature information; and when the result of the classification vote indicates that the votes of a phishing website are greater than the votes of a normal website, determining that said website is a phishing website. In the present application, the random forest is constructed by establishing decision trees using a large number of website samples, and comprises diverse types of phishing websites; and the random forest is used as a classification model to perform classification vote, achieving high accuracy.

Description

Method, device and computer equipment for detecting phishing website based on decision tree

This application claims priority to the Chinese patent application filed on October 26, 2018 with the application number CN2018112561895 and the title "decision tree-based phishing website detection method, device and computer equipment". The entire content of that Chinese patent application is incorporated into this application by reference.

Technical field

The present application relates to the field of intelligent decision-making technology, in particular to a method, device and computer equipment for detecting a phishing website based on a decision tree.

Background art

A "phishing website" is one in which criminals use various means to spoof the address and page content of a real website, or exploit vulnerabilities in the server program of a real website to insert dangerous HTML code into certain pages of the site, in order to trick users into giving up bank account or credit card numbers, passwords and other private information.

Among the phishing website detection methods in the related art, one scheme determines whether the website to be detected is a phishing website by comparing its domain name information and content identification information with those of a target website. However, phishing websites come in many types and phishing methods keep multiplying, so detection by comparison with a target website has low accuracy.

Summary of the invention

The purpose of this application is to provide a method, device and computer equipment for detecting a phishing website based on a decision tree to solve the problems in the prior art.

To achieve the above purpose, the present application provides a method for detecting a phishing website based on a decision tree, including the following steps:

Step 01: Construct a random forest in advance, and the constructed random forest includes several decision trees;

Step 02: Determine the webpage information of the website to be tested;

Step 03: Extract the feature information of the website to be detected according to the webpage information of the website to be detected;

Step 04: Use each decision tree of the random forest to classify and vote on the extracted feature information;

Step 05: When the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website, it is determined that the website to be detected is a phishing website;

The construction of the random forest includes:

Step 011, randomly selecting n samples from the sample set with replacement, and the sample set includes webpage information of several phishing websites and webpage information of several normal websites;

Step 012: randomly select k feature information from all the set feature information, and use the randomly selected k feature information to build a decision tree on the selected n samples;

Step 013, repeat steps 011-012 m times to generate m decision trees, and the generated m decision trees constitute a random forest;

Among them, n, k, m are all positive integers.

To achieve the above objective, the present application also provides a phishing website detection device based on decision tree, including:

The random forest construction module is used to construct a random forest in advance to obtain a random forest classifier, and the constructed random forest includes several decision trees;

Webpage information determination module, used to determine the webpage information of the website to be tested;

A feature information extraction module, configured to extract feature information of the website to be detected according to the webpage information of the website to be detected;

The random forest classifier is used to classify and vote on the extracted feature information by using each decision tree of the random forest;

The detection result determination module is used to determine that the website to be tested is a phishing website when the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website;

The random forest construction module is specifically used to construct a random forest in the following ways:

Among them, n, k, m are all positive integers.

In order to achieve the above object, the present application also provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor. When the processor executes the computer program, the following steps of the phishing website detection method are implemented:

Step 02: Determine the webpage information of the website to be tested;

Step 01 includes:

Among them, n, k, m are all positive integers.

To achieve the above object, the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps of a method for detecting a phishing website based on a decision tree are implemented:

Step 02: Determine the webpage information of the website to be tested;

Among them, step 01 includes:

Among them, n, k, m are all positive integers.

The method, device and computer equipment for detecting a phishing website based on a decision tree provided by this application determine the webpage information of the website to be detected, extract the feature information of the website to be detected according to that webpage information, and use the constructed random forest, which includes several decision trees, to classify and vote on the extracted feature information. When the result of the classification vote is that the number of votes for the phishing website is greater than the number of votes for the normal website, the website to be detected can be determined to be a phishing website. In this application, the random forest is constructed by building decision trees from a large number of website samples that cover diverse types of phishing websites, and using the random forest for classification and voting has a high accuracy rate.

Brief description of the drawings

FIG. 1 is a flowchart of Embodiment 1 of a method for detecting a phishing website based on a decision tree;

FIG. 2 is a schematic diagram of the program modules of Embodiment 1 of the phishing website detection device based on a decision tree of the present application;

FIG. 3 is a schematic diagram of further program modules of Embodiment 1 of the phishing website detection device based on a decision tree of the present application;

FIG. 4 is a schematic diagram of the hardware structure of Embodiment 1 of the phishing website detection device based on a decision tree of the present application;

FIG. 5 is a flowchart of Embodiment 2 of a method for detecting a phishing website based on a decision tree in this application.

Detailed description

In order to make the purpose, technical solutions and advantages of the present application more clear, the following describes the present application in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the scope of protection of this application.

The method, device and computer equipment for detecting a phishing website based on a decision tree provided by the present application are applicable to the field of intelligent decision-making technology. They detect whether a website is a phishing website by having each decision tree in a random forest classify it and vote. This application determines the webpage information of the website to be detected, extracts the feature information of the website to be detected based on that webpage information, and uses the constructed random forest, which includes several decision trees, to classify and vote on the extracted feature information. When the voting result is that the phishing website has more votes than the normal website, it can be determined that the website to be detected is a phishing website. In this application, the random forest is constructed by building decision trees from a large number of website samples that cover diverse types of phishing websites, and using the random forest for classification and voting has a high accuracy rate.

Embodiment 1

Referring to FIG. 1, a method for detecting a phishing website based on a decision tree in this embodiment includes the following steps:

Step 01: Construct a random forest in advance, and the constructed random forest includes several decision trees.

In machine learning, a random forest is a classifier that contains multiple decision trees. The trees are trained on samples and then used together for prediction: the output category is the mode of the categories output by the individual trees. In this embodiment, a random forest can be used to detect whether the website to be detected is a phishing website.

In this embodiment, in order to detect whether the website to be detected is a phishing website, the random forest can be constructed as follows:

In step 011, n samples are randomly selected from the sample set with replacement, and the sample set includes webpage information of several phishing websites and webpage information of several normal websites.

In this embodiment, the webpage information included in the sample set may be a URL.

The phishing websites included in the sample set can be collected and accumulated from experience, or they can be obtained from the blacklist of the Google Safe API. The Google Safe API blacklist includes several URLs, and the websites corresponding to these URLs are all phishing websites. Therefore, when constructing the sample set, all or part of the URLs in the Google Safe API blacklist can be added to the sample set.

To make the voting classification of the decision trees built from the sample set more accurate, the sample set also needs to include the webpage information of several normal websites. The webpage information of the normal websites can be obtained from websites that are not on the Google Safe API blacklist.
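The draw in step 011 can be illustrated with a minimal Python sketch. It assumes the sample set is a list of (URL, label) pairs; the function name `bootstrap_sample`, the label encoding and the example URLs are hypothetical and do not come from the application.

```python
import random


def bootstrap_sample(sample_set, n, rng):
    """Draw n samples from sample_set at random, with replacement (step 011)."""
    # Random.choices samples with replacement, so a sample may be drawn repeatedly
    # and some samples may not be drawn at all.
    return rng.choices(sample_set, k=n)


# Hypothetical sample set: (URL, label) pairs, label 1 = phishing, 0 = normal.
sample_set = [
    ("http://203.0.113.7/login", 1),
    ("http://user@phish.example/verify", 1),
    ("https://www.example.com/", 0),
    ("https://shop.example.org/cart", 0),
]

rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
drawn = bootstrap_sample(sample_set, n=4, rng=rng)
```

Because the draw is with replacement, some samples are typically left undrawn in each round; such samples are the ones that can later serve as test data.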

Step 012: randomly select k feature information from all the set feature information, and use the randomly selected k feature information to establish a decision tree for the selected n samples.

In the classification problem, the data input to the classifier is the feature information, and the decision of each node in the established decision tree is determined based on these feature information.

In this embodiment, in order to construct a decision tree that can detect whether a website is a phishing website, the feature information includes at least one of the following information:

(1) Whether the URL is in IP format;

On the WWW, each information resource has a uniform and unique address on the Internet. This address is called a URL (Uniform Resource Locator); it is the uniform resource locator of the WWW, that is, the network address.

The most commonly used transfer protocol for URLs is HTTP, which is currently the most widely used protocol on the WWW. Common URL formats include the http, file, ftp and gopher formats. Experience shows that when the URL is in IP format, that is, when it gives an IP address rather than a domain name, the website corresponding to the URL may be a phishing website.

(2) Whether the period of time when the URL domain name exists is less than the set number of days;

In general, the longer the phishing website domain name exists, the higher the possibility of being reported, and the shorter the URL domain name existence period, the greater the probability of being a phishing website. Therefore, the time period in which the URL domain name exists can also be used as the feature information for establishing the decision tree. The set number of days can be 30 days.

(3) Whether the @ character is included in the URL;

In this embodiment, when the URL contains the @ character, the website may be a phishing website. Therefore, whether the URL contains the @ character is also used as feature information for establishing a decision tree.

(4) Whether at least two domain names are included in the URL;

Some phishing websites use multiple domain names as a disguise. For example, clicking the URL of such a website triggers several intermediate jumps. Therefore, when at least two domain names are included in the URL, the website corresponding to the URL may be a phishing website.

(5) Whether the account password information is included in the form;

The purpose of a phishing website is generally to steal users' account and password information. Therefore, if a form on the website asks for account and password information, the website may be a phishing website, and whether the form includes account password information is used as feature information for building the decision tree.

(6) Whether the value after the URL jump is the same as before the jump.

For example, if the value before the URL jump is "Taobao" and, after the URL link is clicked, the value after the jump is not "Taobao", the website may be a phishing website that uses the value shown before the jump to deceive the user. The value after the URL jump can be obtained by parsing the opened web page.
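Three of the features listed above, (1), (3) and (4), can be read from the URL string alone. The following Python sketch shows one possible way to check them. The function names are hypothetical, and the test for "at least two domain names" (a second `://` embedded in the URL) is only an assumed heuristic, since the application does not say how domain names are counted. Features (2), (5) and (6) need registration data or the page content and are not covered here.

```python
import ipaddress
from urllib.parse import urlsplit


def url_is_ip_format(url):
    """Feature (1): True if the host part of the URL is an IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_contains_at(url):
    """Feature (3): True if the URL contains the @ character."""
    return "@" in url


def url_has_second_domain(url):
    """Feature (4), assumed heuristic: True if another URL is embedded in this one."""
    return url.count("://") >= 2
```

For instance, `url_is_ip_format("http://203.0.113.7/login")` returns `True`, while the same call on `"https://www.example.com/"` returns `False`.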

In this embodiment, the best split may be calculated from the k randomly selected pieces of feature information. Splitting refers to the process of dividing the training data set into two sub-data sets, again and again, while the decision tree is being trained.

In this embodiment, assuming that the total number of set pieces of feature information is N, the number k of randomly selected pieces of feature information may be the square root of N rounded to an integer. Whether the square root is rounded up or rounded down may be set in advance. For example, if the total number N of set pieces of feature information is 10, the square root of 10 is approximately 3.16. With rounding up, the result is 4, so the number k of randomly selected pieces of feature information is 4; with rounding down, the result is 3, so k is 3.
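The choice of k can be written as a short Python sketch. The function name `feature_subset_size` and the `round_up` flag are hypothetical; integer arithmetic is used so that perfect squares are handled exactly.

```python
import math


def feature_subset_size(total_features, round_up):
    """Return k: the square root of N rounded up or down to an integer."""
    root = math.isqrt(total_features)  # integer floor of the square root
    if round_up and root * root < total_features:
        root += 1  # N is not a perfect square, so rounding up adds one
    return root


k_up = feature_subset_size(10, round_up=True)     # 4, as in the example above
k_down = feature_subset_size(10, round_up=False)  # 3, as in the example above
```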

Here, n and k are positive integers.

Step 013: Repeat steps 011-012 m times to generate m decision trees. The generated m decision trees form a random forest, where m is a positive integer.

Among them, a random forest composed of m decision trees is a random forest classifier.

In this embodiment, each splitting process of a decision tree in the random forest does not use all the candidate feature information. Instead, a certain number of pieces of feature information are selected at random from all the candidates, and the best one is then chosen from those selected. This makes the decision trees in the random forest differ from one another, increases the diversity of the system, and thus improves the classification performance.
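Steps 011 to 013 can be put together in a small, self-contained Python sketch. It is an illustration under stated assumptions, not the applicant's implementation: features are taken to be 0/1 values, Gini impurity is assumed as the split criterion (the application does not name one), and the k features are drawn once per tree as in step 012. All function names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, features):
    """Build a tree on 0/1 feature rows, splitting only on the given feature indices."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority  # leaf: a class label
    best = None
    for f in features:
        zero = [i for i, r in enumerate(rows) if r[f] == 0]
        one = [i for i, r in enumerate(rows) if r[f] == 1]
        if not zero or not one:
            continue  # this feature does not separate the rows
        score = (len(zero) * gini([labels[i] for i in zero])
                 + len(one) * gini([labels[i] for i in one])) / len(rows)
        if best is None or score < best[0]:
            best = (score, f, zero, one)
    if best is None:
        return majority
    _, f, zero, one = best
    rest = [x for x in features if x != f]
    return (f,
            build_tree([rows[i] for i in zero], [labels[i] for i in zero], rest),
            build_tree([rows[i] for i in one], [labels[i] for i in one], rest))


def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, tuple):
        f, zero_branch, one_branch = tree
        tree = one_branch if row[f] == 1 else zero_branch
    return tree


def build_forest(rows, labels, n, k, m, rng):
    """Repeat steps 011-012 m times to obtain m decision trees (step 013)."""
    forest = []
    for _ in range(m):
        idx = rng.choices(range(len(rows)), k=n)     # step 011: n samples, with replacement
        feats = rng.sample(range(len(rows[0])), k)   # step 012: k randomly selected features
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], feats))
    return forest


# Toy data (hypothetical): the label equals the first feature.
rows = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 1]]
labels = [1, 1, 0, 0, 0, 1]
forest = build_forest(rows, labels, n=6, k=2, m=5, rng=random.Random(0))
```

Drawing a fresh feature subset at every split, as the paragraph above describes, would be a variation of `build_tree` in which `features` is re-sampled inside each recursive call.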

In an embodiment of the present application, samples in the sample set that were not drawn during sampling can be used as test data to verify the accuracy of the random forest classifier. For example, webpage information of several unsampled websites is selected from the sample set; the type of each of these websites (phishing website or normal website) is already known. The feature information of each website is extracted and input into the random forest classifier, and the type output by the classifier is compared with the true type. If the resulting accuracy rate exceeds the set probability threshold, the accuracy of the random forest classifier meets the requirements and the classifier can be used.
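The acceptance check can be sketched in Python as follows. The function names and the threshold values are hypothetical; the application only says that the accuracy rate must exceed a set probability.

```python
def accuracy(predicted, actual):
    """Fraction of test websites whose predicted type matches the true type."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def classifier_is_usable(predicted, actual, threshold):
    """True if the accuracy rate exceeds the set threshold."""
    return accuracy(predicted, actual) > threshold


# Hypothetical classifier outputs and true types (1 = phishing, 0 = normal).
predicted = [1, 0, 1, 1]
actual = [1, 0, 0, 1]
rate = accuracy(predicted, actual)  # 3 of 4 correct
```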

Step 02: Determine the webpage information of the website to be detected.

The determined webpage information of the website to be detected may be a URL.

In an embodiment of the present application, to improve detection efficiency, a blacklist of phishing websites may be constructed in advance. The blacklist includes the URLs of several websites, all of which have already been determined to be phishing websites. When a website needs to be detected, its URL is obtained from its webpage information and compared with the pre-built blacklist. If the URL of the website to be detected is included in the blacklist, the website is determined to be a phishing website. If the URL is not included in the blacklist, the website needs further checking, and step 03 is performed.
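A minimal Python sketch of this pre-check is given below. It assumes the blacklist is held as a set of URL strings and that comparison is by exact match; the application does not specify how URLs are normalized or compared, and the names used here are hypothetical.

```python
def blacklist_precheck(url, blacklist):
    """Return 1 (phishing) on a blacklist hit, or None if step 03 is still needed."""
    if url in blacklist:
        return 1
    return None


# Hypothetical pre-built blacklist.
blacklist = {"http://203.0.113.7/login", "http://user@phish.example/verify"}

hit = blacklist_precheck("http://203.0.113.7/login", blacklist)
miss = blacklist_precheck("https://www.example.com/", blacklist)
```

A set gives constant-time lookups on average, which fits the stated aim of avoiding feature extraction and voting for URLs that are already known.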

Among them, the blacklist may be a blacklist of Google Safe API.

Step 03: Extract the feature information of the website to be detected according to the webpage information of the website to be detected.

In this embodiment, the extracted feature information may be the same as the feature information in step 012, and may include at least one of the following: (1) whether the URL is in IP format; (2) whether the period of time for which the URL domain name has existed is less than the set number of days; (3) whether the URL contains the @ character; (4) whether the URL includes at least two domain names; (5) whether the account password information is included in the form; (6) whether the value after the URL jump is the same as before the jump.

Preferably, the feature information extracted for the website to be detected is the above six pieces of information.

In this embodiment, in order to facilitate the random forest classifier to classify the website to be detected, the extracted feature information may be Booleanized and converted into corresponding feature values. For example, for the above six feature information:

If the URL of the website to be detected is in IP format, it is converted to feature value 1; if the URL of the website to be detected is not in IP format, it is converted to feature value 0;

If the period of time for which the URL domain name of the website to be detected has existed is less than the set number of days, it is converted to feature value 1; if that period is not less than the set number of days, it is converted to feature value 0;

If the @ character is included in the URL, it will be converted to feature value 1, if the @ character is not included in the URL, it will be converted to feature value 0;

If the URL includes at least two domain names, it will be converted to feature value 1, if the URL does not include at least two domain names, it will be converted to feature value 0;

If the account password information is included in the form, it will be converted to feature value 1, if the account password information is not included in the form, it will be converted to feature value 0;

If the value after the URL jumps is different from that before the jump, it is converted to the feature value 1, and if the value after the URL jump is the same as before the jump, it is converted to the feature value 0.

Further, the six feature values can also be converted into a feature vector. For example, suppose the six extracted pieces of feature information are: the URL is not in IP format; the URL domain name has existed for no less than the set number of days; the URL does not contain the @ character; the URL includes one domain name; the form does not include account password information; and the value after the URL jump is not the same as before the jump. The converted feature vector is then [0,0,0,0,0,1].
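The Booleanization and vector conversion can be sketched in Python as follows. The key names in `FEATURE_ORDER` are hypothetical; each is phrased so that `True` corresponds to feature value 1 in the rules above.

```python
# Hypothetical names for the six features, in the order (1) to (6).
FEATURE_ORDER = (
    "url_is_ip_format",
    "domain_age_below_set_days",
    "url_contains_at",
    "url_has_two_or_more_domains",
    "form_has_account_password",
    "value_differs_after_jump",
)


def to_feature_vector(observations):
    """Convert six boolean observations into a list of 0/1 feature values."""
    return [1 if observations[name] else 0 for name in FEATURE_ORDER]


# The example from the text: only the value after the URL jump differs.
observations = {
    "url_is_ip_format": False,
    "domain_age_below_set_days": False,
    "url_contains_at": False,
    "url_has_two_or_more_domains": False,
    "form_has_account_password": False,
    "value_differs_after_jump": True,
}
vector = to_feature_vector(observations)  # [0, 0, 0, 0, 0, 1]
```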

Step 04: Use each decision tree of the random forest to classify and vote on the extracted feature information.

Input the extracted feature information into the random forest classifier. Each decision tree in the random forest classifies the extracted feature information and counts the results of all decision trees voting on the extracted feature information.

Step 05: When the result of the classification vote is that the number of votes of the phishing website is more than the number of votes of the normal website, it is determined that the website to be detected is a phishing website.

In the classification vote, if the result is that the phishing website has more votes than the normal website, the website to be detected is determined to be a phishing website; if the result is that the phishing website has fewer votes than the normal website, the website to be detected is determined to be a normal website.

In this embodiment, the random forest classifier detects whether the website is a phishing website or a normal website by using type identifiers 1, 0, where output 1 indicates that the website is a phishing website and output 0 indicates that the website is a normal website.
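Steps 04 and 05 reduce to counting the type identifiers output by the trees. The Python sketch below uses the identifiers 1 and 0 described above; the function name is hypothetical, and treating a tie as a normal website is an assumption, since the application does not say what happens when the votes are equal.

```python
def majority_vote(votes):
    """Return 1 (phishing) if the phishing votes outnumber the normal votes, else 0."""
    phishing_votes = sum(1 for v in votes if v == 1)
    normal_votes = len(votes) - phishing_votes
    # Assumption: a tie is not "more phishing votes", so it yields 0 (normal).
    return 1 if phishing_votes > normal_votes else 0


result = majority_vote([1, 1, 0])  # two phishing votes against one normal vote
```

Choosing an odd number m of decision trees avoids ties altogether.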

In an embodiment of the present application, in order to further improve the efficiency of website detection and enrich the blacklist, the method may further include: when the website to be detected is determined to be a phishing website according to the voting result, adding the URL of the website to be detected to the pre-built blacklist.

In the embodiment of the present application, the webpage information of the website to be detected is determined, the feature information of the website to be detected is extracted according to that webpage information, and the constructed random forest, which includes a plurality of decision trees, is used to classify and vote on the extracted feature information. When the result of the classification vote is that the number of votes for the phishing website is greater than the number of votes for the normal website, it can be determined that the website to be detected is a phishing website. In this application, the random forest is constructed by building decision trees from a large number of website samples that cover diverse types of phishing websites, and using the random forest for classification and voting has a high accuracy rate.

Please continue to refer to FIG. 2, which shows a phishing website detection device based on a decision tree. In this embodiment, the phishing website detection device 10 based on a decision tree may include or be divided into one or more program modules. The one or more program modules are stored in a storage medium and executed by one or more processors to carry out the present application and implement the above decision tree-based phishing website detection method. A program module, as referred to in this application, is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself to describing how the phishing website detection device 10 based on the decision tree executes in the storage medium. The functions of the program modules of this embodiment are described below:

The random forest construction module 11 is used to construct a random forest in advance to obtain a random forest classifier 14, and the constructed random forest includes several decision trees;

The webpage information determination module 12 is used to determine the webpage information of the website to be detected;

The feature information extraction module 13 is configured to extract feature information of the website to be detected according to the webpage information of the website to be detected;

The random forest classifier 14 is used to classify and vote on the extracted feature information by using each decision tree of the random forest;

The detection result determination module 15 is used to determine that the website to be detected is a phishing website when the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website;

Specifically, the random forest construction module 11 is used to: randomly select n samples from the sample set with replacement, the sample set including webpage information of several phishing websites and webpage information of several normal websites; randomly select k pieces of feature information from all the set feature information, and use the randomly selected k pieces of feature information to build a decision tree on the selected n samples; and repeat the above steps m times to generate m decision trees, the m decision trees forming the random forest; where n, k and m are all positive integers.

In an embodiment of the present application, referring to FIG. 3, the phishing website detection device 10 based on the decision tree may further include a Boolean processing module 16, which is used to Booleanize the extracted feature information, convert it into the corresponding feature values, and output the converted feature values to the random forest classifier 14.

In an embodiment of the present application, referring to FIG. 3, the phishing website detection device 10 based on the decision tree may further include a primary detection module 17, which is used to obtain the URL of the website to be detected according to the webpage information of the website to be detected and compare that URL with a pre-built blacklist. If the URL of the website to be detected is included in the blacklist, the website to be detected is determined to be a phishing website. If the URL of the website to be detected is not included in the blacklist, the webpage information of the website to be detected is output to the feature information extraction module 13.

In an embodiment of the present application, referring to FIG. 3, the phishing website detection device 10 based on the decision tree may further include a blacklist adding module 18, which is used to add the URL of the website to be detected to the pre-built blacklist when the website to be detected is determined to be a phishing website according to the voting result.

This embodiment also provides a computer device, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, or a tower server (including an independent server, or a server cluster composed of multiple servers). The computer device 20 of this embodiment includes at least, but is not limited to, a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus, as shown in FIG. 4. It should be noted that FIG. 4 only shows the computer device 20 with the components 21-22, but it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.

In this embodiment, the memory 21 (that is, a readable storage medium) includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or memory of the computer device 20. In other embodiments, the memory 21 may also be an external storage device of the computer device 20, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 20. Of course, the memory 21 may also include both the internal storage unit of the computer device 20 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and various application software installed in the computer device 20, such as the program code of the phishing website detection device 10 based on the decision tree in Embodiment 1. In addition, the memory 21 may also be used to temporarily store various types of data that have been output or will be output.

The processor 22 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 22 is generally used to control the overall operation of the computer device 20. In this embodiment, the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the decision tree-based phishing website detection device 10, so as to implement the decision tree-based phishing website detection method of Embodiment 1.

This embodiment also provides a computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server, App store, etc., on which a computer program is stored. When the program is executed by a processor, the corresponding function is realized. The computer-readable storage medium of this embodiment is used to store the phishing website detection device 10 based on the decision tree, which, when executed by a processor, implements the decision tree-based phishing website detection method of Embodiment 1.

Embodiment 2

Referring to FIG. 5, the method for detecting a phishing website based on a decision tree in this embodiment is based on Embodiment 1, and includes the following steps:

Step 01: Construct a random forest. The constructed random forest includes several decision trees.

Step 011: Randomly select n samples from the sample set with replacement. The sample set includes webpage information of several phishing websites and webpage information of several normal websites.

Step 012: Randomly select k items of feature information from all the set feature information, and use the randomly selected k items of feature information to build a decision tree on the selected n samples.

In this embodiment, in order to construct a decision tree that can detect whether a website is a phishing website, the feature information includes at least one of the following: (1) whether the URL is in IP format; (2) whether the URL domain name has existed for less than a set number of days; (3) whether the URL contains the @ character; (4) whether the URL includes at least two domain names; (5) whether the form includes account password information; (6) whether the value after the URL jump is the same as before the jump.

Step 013: Repeat steps 011-012 m times to generate m decision trees. The generated m decision trees form a random forest, where m is a positive integer.
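The construction in steps 011-013 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the tuple-based tree representation, and the use of Gini impurity as the split criterion are all hypothetical choices, and the samples are assumed to be already converted to 0/1 feature values. The value of k is taken as the rounded square root of the number of features, as stated in the claims.

```python
import math
import random
from collections import Counter


def majority_label(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity of a list of 0/1 labels (assumed split criterion)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def build_tree(rows, labels, features):
    """Build a tree over Boolean features.

    A leaf is an int label; an internal node is the tuple
    (feature_index, subtree_if_0, subtree_if_1).
    """
    if len(set(labels)) == 1 or not features:
        return majority_label(labels)
    best = None
    for f in features:
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        if not left or not right:
            continue  # this feature does not separate the samples
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:
        return majority_label(labels)
    f = best[1]
    rest = [g for g in features if g != f]
    subtrees = []
    for value in (0, 1):
        sub_rows = [x for x in rows if x[f] == value]
        sub_labels = [y for x, y in zip(rows, labels) if x[f] == value]
        subtrees.append(build_tree(sub_rows, sub_labels, rest))
    return (f, subtrees[0], subtrees[1])


def build_forest(rows, labels, n, m, rng=None):
    """Steps 011-013: build m trees, each on n bootstrap samples and k features."""
    rng = rng or random.Random()
    num_features = len(rows[0])
    k = max(1, round(math.sqrt(num_features)))
    forest = []
    for _ in range(m):
        # Step 011: draw n samples with replacement.
        idx = [rng.randrange(len(rows)) for _ in range(n)]
        # Step 012: pick k features at random and build one tree.
        feats = rng.sample(range(num_features), k)
        forest.append(
            build_tree([rows[i] for i in idx], [labels[i] for i in idx], feats)
        )
    # Step 013: the m trees together form the random forest.
    return forest
```

Drawing a fresh bootstrap sample and a fresh feature subset for every tree is what makes the trees differ from one another, so that their votes are not simply m copies of the same decision.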

Step 02: Pre-construct a blacklist including webpage information of several phishing websites.

The blacklist may be, for example, a blacklist provided by the Google Safe API.

Step 03: Determine the webpage information of the website to be detected.

Step 04: Obtain the URL of the website to be detected according to the webpage information of the website to be detected, and compare the URL with the pre-built blacklist. If the blacklist includes the URL of the website to be detected, the website to be detected is determined to be a phishing website. If the blacklist does not include the URL of the website to be detected, step 05 is performed.

Step 05: Extract the feature information of the website to be detected according to the webpage information of the website to be detected.

In this embodiment, the extracted feature information may be the same as the feature information in step 012.

Step 06: Booleanize the extracted feature information to convert it into corresponding feature values.

For the above six items of feature information: if the URL of the website to be detected is in IP format, it is converted to feature value 1, and if the URL is not in IP format, it is converted to feature value 0; if the URL domain name of the website to be detected has existed for less than the set number of days, it is converted to feature value 1, and if the domain name has existed for not less than the set number of days, it is converted to feature value 0; if the URL contains the @ character, it is converted to feature value 1, and if the URL does not contain the @ character, it is converted to feature value 0; if the URL includes at least two domain names, it is converted to feature value 1, and if the URL does not include at least two domain names, it is converted to feature value 0; if the form includes account password information, it is converted to feature value 1, and if the form does not include account password information, it is converted to feature value 0; if the value after the URL jump is not the same as before the jump, it is converted to feature value 1, and if the value after the URL jump is the same as before the jump, it is converted to feature value 0.
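The Boolean conversion described above can be sketched as follows. This is only an illustration under assumptions: the `PageInfo` container, its field names, and the `min_age_days` parameter (standing in for the "set number of days") are hypothetical, and the domain age, the count of domain names in the URL, and the presence of an account password form are taken as already-collected inputs rather than computed here.

```python
import ipaddress
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class PageInfo:
    """Hypothetical container for the raw webpage information of one website."""
    url: str
    final_url: str                    # URL value after any jump (redirect)
    domain_age_days: int              # how long the URL domain name has existed
    domain_count: int                 # number of domain names found in the URL
    has_account_password_form: bool   # form asks for account password information


def host_is_ip(url):
    """Return True if the host part of the URL is a literal IP address."""
    try:
        host = urlsplit(url).hostname
        if host is None:
            return False
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def booleanize(info, min_age_days):
    """Convert the six items of feature information to 0/1 feature values."""
    return [
        int(host_is_ip(info.url)),                  # (1) URL in IP format
        int(info.domain_age_days < min_age_days),   # (2) domain younger than threshold
        int("@" in info.url),                       # (3) URL contains '@'
        int(info.domain_count >= 2),                # (4) at least two domain names
        int(info.has_account_password_form),        # (5) account password form
        int(info.final_url != info.url),            # (6) value changed after jump
    ]


example = PageInfo(
    url="http://192.168.0.1/login@x",
    final_url="http://other.example/",
    domain_age_days=3,
    domain_count=1,
    has_account_password_form=True,
)
print(booleanize(example, min_age_days=30))  # -> [1, 1, 1, 0, 1, 1]
```

In every position, 1 marks the outcome that the description associates with phishing, so the resulting vector can be passed directly to the decision trees.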

Step 07: Use each decision tree of the random forest to classify and vote on the extracted feature information.

Step 08: When the voting result is that the number of votes for the phishing website is greater than the number of votes for the normal website, the website to be detected is determined to be a phishing website; otherwise, it is determined to be a normal website. If it is a phishing website, step 09 is performed; if it is a normal website, the user is prompted to visit the website.
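The vote in steps 07-08 can be sketched as follows. This is a minimal illustration, assuming each decision tree is available as a callable that returns 1 for "phishing" and 0 for "normal"; the function name and the stand-in trees are hypothetical. A tie is treated as "normal", since the website is only determined to be a phishing website when the phishing votes are strictly greater.

```python
def classify_by_vote(forest, feature_values):
    """Return True if more trees vote 'phishing' (1) than 'normal' (0)."""
    votes = [tree(feature_values) for tree in forest]
    phishing_votes = sum(1 for v in votes if v == 1)
    normal_votes = len(votes) - phishing_votes
    return phishing_votes > normal_votes


# Three hypothetical stand-in trees, each looking at a single feature value.
forest = [
    lambda x: x[0],  # votes phishing if the URL is in IP format
    lambda x: x[2],  # votes phishing if the URL contains '@'
    lambda x: 0,     # always votes normal
]
print(classify_by_vote(forest, [1, 0, 1, 0, 0, 0]))  # -> True (2 votes to 1)
```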

Step 09: Add the URL of the website to be detected to the pre-built blacklist.
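Steps 04 to 09 together form a two-level check: a blacklist lookup first, and the random forest vote only when the lookup misses. The flow can be sketched as follows, assuming the blacklist is held as an in-memory set and that feature extraction and voting are supplied as callables; `detect`, `extract_features`, and `vote_is_phishing` are hypothetical names used only for this illustration.

```python
def detect(url, blacklist, extract_features, vote_is_phishing):
    """Return True if the website is determined to be a phishing website."""
    # Step 04: first-level check against the pre-built blacklist.
    if url in blacklist:
        return True
    # Steps 05-06: extract the feature information and convert it to values.
    feature_values = extract_features(url)
    # Steps 07-08: classification vote by the random forest.
    is_phishing = vote_is_phishing(feature_values)
    # Step 09: remember newly detected phishing URLs.
    if is_phishing:
        blacklist.add(url)
    return is_phishing


# Usage with stand-in callables; the call counter shows the short circuit.
calls = []

def fake_extract(url):
    calls.append(url)
    return [int("@" in url)]

def fake_vote(values):
    return values[0] == 1

blacklist = set()
first = detect("http://a.example/@login", blacklist, fake_extract, fake_vote)
second = detect("http://a.example/@login", blacklist, fake_extract, fake_vote)
```

Because a detected URL is added to the blacklist, the second call for the same URL is answered by the lookup alone and the random forest is not consulted again.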

The sequence numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods in the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.

The above are only the preferred embodiments of the present application and do not limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made by using the description and drawings of this application, or any direct or indirect use in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

A method for detecting a phishing website based on a decision tree, characterized in that it includes the following steps:

Step 01: Construct a random forest in advance, and the constructed random forest includes several decision trees;

Step 02: Determine the webpage information of the website to be tested;

Step 03: Extract the feature information of the website to be detected according to the webpage information of the website to be detected;

Step 04: Use each decision tree of the random forest to classify and vote on the extracted feature information;

Step 05: When the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website, it is determined that the website to be detected is a phishing website;

The construction of the random forest includes:

Step 011, randomly selecting n samples from the sample set with replacement, and the sample set includes webpage information of several phishing websites and webpage information of several normal websites;

Step 012: randomly select k feature information from all the set feature information, and use the randomly selected k feature information to build a decision tree on the selected n samples;

Step 013, repeat steps 011-012 m times to generate m decision trees, and the generated m decision trees constitute a random forest;

Among them, n, k, m are all positive integers.
The method for detecting a phishing website based on a decision tree according to claim 1, wherein the feature information includes at least one of the following information: whether the URL is in IP format, whether the URL domain name has existed for less than a set number of days, whether the URL contains the @ character, whether the URL includes at least two domain names, whether the form includes account password information, and whether the value after the URL jump is the same as before the jump.
The method for detecting a phishing website based on a decision tree according to claim 1, wherein in step 012, the value of k is the square root of N, rounded, where N is the number of all the set feature information.
The method for detecting a phishing website based on a decision tree according to claim 1, wherein before step 04, the method further comprises: booleanizing the extracted feature information to convert it into corresponding feature values, and performing step 04 according to the converted feature values.
The method for detecting a phishing website based on a decision tree according to claim 1 or 3, wherein before step 03, the method further comprises: obtaining the URL of the website to be detected according to the webpage information of the website to be detected, and comparing the URL of the website to be detected with a pre-built blacklist; if the blacklist includes the URL of the website to be detected, it is determined that the website to be detected is a phishing website, and if the blacklist does not include the URL of the website to be detected, step 03 is performed.
The method for detecting a phishing website based on a decision tree according to claim 5, wherein after step 05, the method further comprises: when the website to be detected is determined to be a phishing website according to the voting result, adding the URL of the website to be detected to the pre-built blacklist.
A phishing website detection device based on decision tree is characterized in that it includes:

The random forest construction module is used to construct a random forest in advance to obtain a random forest classifier, and the constructed random forest includes several decision trees;

Webpage information determination module, used to determine the webpage information of the website to be tested;

A feature information extraction module, configured to extract feature information of the website to be detected according to the webpage information of the website to be detected;

The random forest classifier is used to classify and vote on the extracted feature information by using each decision tree of the random forest;

The detection result determination module is used to determine that the website to be tested is a phishing website when the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website;

The random forest construction module is specifically used to construct a random forest in the following ways:

Step 011, randomly selecting n samples from the sample set with replacement, and the sample set includes webpage information of several phishing websites and webpage information of several normal websites;

Step 012: randomly select k feature information from all the set feature information, and use the randomly selected k feature information to build a decision tree on the selected n samples;

Step 013, repeat steps 011-012 m times to generate m decision trees, and the generated m decision trees constitute a random forest;

Among them, n, k, m are all positive integers.
The phishing website detection device based on the decision tree according to claim 7, further comprising:

The Boolean processing module is used to Booleanize the extracted feature information to convert to corresponding feature values, and output the converted feature values to the random forest classifier.
The phishing website detection device based on the decision tree according to claim 7, further comprising:

The first-level detection module is used to obtain the URL of the website to be detected according to the webpage information of the website to be detected, and to compare the URL of the website to be detected with a pre-built blacklist; if the blacklist includes the URL of the website to be detected, it outputs a notification that the website to be detected is a phishing website, and if the blacklist does not include the URL of the website to be detected, it outputs the webpage information of the website to be detected to the feature information extraction module.
The phishing website detection device based on the decision tree according to claim 7, further comprising:

The blacklist adding module is used to add the URL of the website to be detected to the pre-built blacklist when it is determined that the website to be detected is a phishing website according to the voting result.
A computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that, when the processor executes the computer program, the following steps of a method for detecting a phishing website based on a decision tree are implemented:

Step 01: Construct a random forest in advance, and the constructed random forest includes several decision trees;

Step 02: Determine the webpage information of the website to be tested;

Step 03: Extract the feature information of the website to be detected according to the webpage information of the website to be detected;

Step 04: Use each decision tree of the random forest to classify and vote on the extracted feature information;

Step 05: When the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website, it is determined that the website to be detected is a phishing website;

Step 01 includes:

Step 011, randomly selecting n samples from the sample set with replacement, and the sample set includes webpage information of several phishing websites and webpage information of several normal websites;

Step 012: randomly select k feature information from all the set feature information, and use the randomly selected k feature information to build a decision tree on the selected n samples;

Step 013, repeat steps 011-012 m times to generate m decision trees, and the generated m decision trees constitute a random forest;

Among them, n, k, m are all positive integers.
The computer device according to claim 11, wherein the feature information includes at least one of the following information: whether the URL is in IP format, whether the URL domain name has existed for less than a set number of days, whether the URL contains the @ character, whether the URL includes at least two domain names, whether the form includes account password information, and whether the value after the URL jump is the same as before the jump.
The computer device according to claim 11, wherein the value of k in step 012 is the square root of N, rounded, where N is the number of all the set feature information.
The computer device according to claim 11, characterized in that before step 04, the method further comprises: performing Boolean conversion on the extracted feature information to convert it into corresponding feature values, and performing step 04 according to the converted feature values.
The computer device according to claim 11 or 14, wherein before step 03, the method further comprises: acquiring the URL of the website to be detected according to the webpage information of the website to be detected, and comparing the URL of the website to be detected with a pre-built blacklist; if the blacklist includes the URL of the website to be detected, it is determined that the website to be detected is a phishing website, and if the blacklist does not include the URL of the website to be detected, step 03 is performed.
A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the following steps of a method for detecting a phishing website based on a decision tree are implemented:

Step 01: Construct a random forest in advance, and the constructed random forest includes several decision trees;

Step 02: Determine the webpage information of the website to be tested;

Step 03: Extract the feature information of the website to be detected according to the webpage information of the website to be detected;

Step 04: Use each decision tree of the random forest to classify and vote on the extracted feature information;

Step 05: When the classified voting result is that the number of votes of the phishing website is more than the number of votes of the normal website, it is determined that the website to be detected is a phishing website;

Among them, step 01 includes:

Step 011, randomly selecting n samples from the sample set with replacement, and the sample set includes webpage information of several phishing websites and webpage information of several normal websites;

Step 012: randomly select k feature information from all the set feature information, and use the randomly selected k feature information to build a decision tree on the selected n samples;

Step 013, repeat steps 011-012 m times to generate m decision trees, and the generated m decision trees constitute a random forest;

Among them, n, k, m are all positive integers.
The computer-readable storage medium according to claim 16, wherein the feature information includes at least one of the following information: whether the URL is in IP format, whether the URL domain name has existed for less than a set number of days, whether the URL contains the @ character, whether the URL includes at least two domain names, whether the form includes account password information, and whether the value after the URL jump is the same as before the jump.
The computer-readable storage medium according to claim 16, wherein the value of k in step 012 is the square root of N, rounded, where N is the number of all the set feature information.
The computer-readable storage medium according to claim 16, wherein before step 04, the method further comprises: performing Boolean conversion on the extracted feature information to convert it into corresponding feature values, and performing step 04 according to the converted feature values.
The computer-readable storage medium according to claim 16 or 18, characterized in that before step 03, the method further comprises: acquiring the URL of the website to be detected according to the webpage information of the website to be detected, and comparing the URL of the website to be detected with a pre-built blacklist; if the blacklist includes the URL of the website to be detected, the website to be detected is determined to be a phishing website, and if the blacklist does not include the URL of the website to be detected, step 03 is performed.