CN113132410A

CN113132410A - Method for detecting a phishing website

Info

Publication number: CN113132410A
Application number: CN202110476844.3A
Authority: CN
Inventors: 胡光武; 廖龙; 马全; 延霞; 刘云霞
Original assignee: Shenzhen Institute of Information Technology
Current assignee: Shenzhen Institute of Information Technology
Priority date: 2021-04-29
Filing date: 2021-04-29
Publication date: 2021-07-16
Anticipated expiration: 2041-04-29
Also published as: CN113132410B

Abstract

The invention provides a method for detecting a phishing website. The method is applied to the URL character string of a website and is used for judging whether the website indicated by the URL character string is a phishing website, and comprises the following steps: establishing a deep learning phishing website detection classifier model based on a parallel combination of a convolutional neural network and a multi-head self-attention network; training the classifier model by using a pre-prepared balanced URL sample set; and using the trained classifier model to judge whether a URL character string of unknown nature indicates a phishing website. Because the convolutional neural network and the multi-head self-attention network respectively learn the character string features and the feature weights of the URL and can run in parallel, the classifier model effectively saves analysis and processing time, has high judgment accuracy, and has the prospect of being applied to real-time detection in large-scale Internet networks.

Description

Method for detecting a phishing website

Technical Field

The invention relates to the technical field of network security, in particular to a method for detecting a phishing website.

Background

A phishing website attack is a type of social engineering attack. It refers to a malicious attack in which an attacker imitates the website address and content of a legitimate website and induces users to access the imitation through means such as e-mail, two-dimensional codes, instant messaging tools, and domain name hijacking, so that the user's privacy is leaked, identity is stolen, and property is lost.

In recent years, with the rapid development of the Internet, especially the mobile Internet, phishing attacks have become increasingly severe. According to the report on China's Internet network security situation in the first half of 2019 published by the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT), CNCERT's own monitoring found about 4.6 million counterfeit pages targeting domestic websites in the first half of 2019, an increase of 8.2% compared with the end of 2018. Internationally, the phishing activity trends report for the third quarter of 2019 from the Anti-Phishing Working Group (APWG) shows that the number of phishing websites worldwide grew continuously from January to September 2019 and reached the highest level since 2016; the September figure was nearly 100% higher than the January figure, reaching 86276. At the same time, an online fraud report by RSA shows that phishing attacks in 2018 caused losses of about $46 billion to organizations worldwide, and this figure is still growing.

Taking the website address of the Industrial and Commercial Bank of China (http://www.icbc.com) as an example, an attacker first clones the content of the bank's website, sets the address of the cloned website to http://www.1cbc.com, and induces users to access it by means of other deception techniques such as DNS hijacking, URL hijacking, two-dimensional codes, and network fraud. Because the content and appearance of the cloned website are highly similar to those of the real website, a victim may mistake it for the real website and log in. The attacker can then obtain the victim's user name, login password, bank card number, and even payment password, and can use this information to perform illegal operations (such as account transfers). With the rise of mobile networks and two-dimensional codes in recent years, emerging phishing attacks based on two-dimensional codes are difficult for users to perceive, which further increases the harm caused by phishing websites.

Current phishing website detection methods can be classified into five main types: schemes based on blacklists and whitelists, schemes based on feature heuristics, schemes based on graph features, schemes based on machine learning, and schemes based on deep learning. The blacklist and whitelist scheme judges the validity of the current link address mainly by maintaining a list of illegal URLs, IP addresses, or keywords, so it cannot detect the many phishing websites that are randomly generated or not yet listed; that is, it cannot prevent zero-day phishing attacks. The feature heuristic scheme relies on characteristics that phishing websites exhibit and heuristically verifies multiple such characteristics to judge the authenticity of a website, but it produces false positives and its judgment precision is insufficient. The graph feature scheme judges whether the current website is a phishing website by drawing a link relation graph between the phishing website and the victim website; its computation requires a large amount of time and memory, so the computation delay is severe. The machine learning scheme analyzes the URL or content of a website with a mature machine learning model. Its greatest advantages are that it can identify zero-day attacks and that the trained classifier is efficient; its greatest disadvantage is that the features to be learned from a sample must be manually specified by a researcher, so the judgment precision depends on expert knowledge and varies between machine learning methods, and designing a reasonable classifier is one of the factors limiting improvement of the detection rate. The deep learning scheme can autonomously extract features from the samples, so classification training can be carried out more fully and a more accurate classification result can be obtained. However, the design of the deep learning model and whether the sample data set is balanced are important factors in whether a deep learning scheme achieves high-precision results.

The prior art lacks a phishing website detection method that offers both efficiency and precision.

The above background disclosure is only for the purpose of assisting understanding of the concept and technical solution of the present invention and does not necessarily belong to the prior art of the present patent application, and should not be used for evaluating the novelty and inventive step of the present application in the case that there is no clear evidence that the above content is disclosed at the filing date of the present patent application.

Disclosure of Invention

The invention provides a method for detecting a phishing website, aiming at solving the existing problems.

In order to solve the above problems, the technical solution adopted by the present invention is as follows:

A method for detecting a phishing website is applied to the URL character string of a website and is used for judging whether the website indicated by the URL character string is a phishing website, and specifically comprises the following steps:

S1: establishing a deep learning phishing website detection classifier model based on a parallel combination of a convolutional neural network and a multi-head self-attention network;

S2: training the classifier model by using a pre-prepared balanced URL sample set;

S3: using the trained classifier model to judge whether a URL character string of unknown nature indicates a phishing website.

Preferably, the classifier model comprises four layers: a feature embedding layer, a feature learning layer, a feature weight learning layer, and a classification layer. The feature embedding layer is used for forming a feature matrix from the preprocessed URL character string and reducing the dimension of the feature matrix; the feature learning layer comprises five stacked URL feature learning networks, uses a convolutional neural network to autonomously learn the features in the feature matrix, and outputs the learning result of the URL features; the feature weight learning layer comprises five stacked URL feature weight learning networks, uses a multi-head self-attention network to autonomously learn the weights of the URL features in the feature matrix, and outputs the learning result of the URL feature weights; and the classification layer is used for integrating the outputs of the feature learning layer and the feature weight learning layer to generate the final judgment result.

Preferably, the preprocessing of the URL character string includes: limiting the length of the URL character string so that its total length does not exceed 100 characters; encoding the length-limited URL character string by one-hot encoding; and, for a URL character string shorter than 100 characters, padding zeros in front of the whole code after one-hot encoding, so that the URL code reaches a length of 100 through the default mapping.

Preferably, after the URL character string is preprocessed, the feature embedding layer converts it into a 97 × 100 feature matrix and reduces the matrix dimension to 64 × 100, which specifically includes: first, establishing a 97 × 100 matrix whose elements are all 0, with the one-hot codes 0 to 96 as rows and the positions of the one-hot code string of the URL character string as columns; then, for each code of the URL one-hot code string, setting to 1 the element in the row corresponding to (code + 1) and the column corresponding to the sequence position of that code in the code string, which forms the feature matrix of the URL. Here L is the length of the character string and, after formatting, L = 100, so the size of the feature matrix is 97 × 100.

The dimension of the 97 × 100 feature matrix is then reduced to 64 × 100, that is, the number of rows is reduced from 97 to 64, by multiplying the feature matrix by a parameter matrix V, which changes the original sparse matrix into a dense matrix; p is the number of rows of matrix V, and m is the number of columns of matrix V.
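The dimension reduction described above can be sketched as a plain matrix product. This is a minimal, dependency-free illustration rather than the patented implementation: the names `X` (feature matrix) and `E` (reduced matrix) are assumptions, `V` is filled with random values as a stand-in for learned parameters, and the shape of `V` is taken to be p = 64 rows by m = 97 columns so that the product has 64 rows.

```python
import random


def matmul(a, b):
    """Multiply matrix a (p x m) by matrix b (m x n); both are lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


random.seed(0)
P, M, N = 64, 97, 100  # rows of V, columns of V, URL length

# Stand-in one-hot feature matrix X (97 x 100): exactly one 1 per column.
hot_rows = [random.randrange(M) for _ in range(N)]
X = [[1.0 if hot_rows[j] == i else 0.0 for j in range(N)] for i in range(M)]

# Assumption: V would be learned during training; random values are used here.
V = [[random.gauss(0.0, 0.1) for _ in range(M)] for _ in range(P)]

E = matmul(V, X)  # dense 64 x 100 matrix
print(len(E), len(E[0]))  # 64 100
```

Because each column of `X` contains a single 1, column j of `E` is simply the column of `V` selected by the code at position j, which is why this step behaves like an embedding lookup.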

Preferably, each URL feature learning network in the feature learning layer consists of four layers, namely a convolutional layer, a residual layer, a fully connected layer, and a residual layer, and is specifically designed as follows: the convolutional layer has 5 kernels, receives as input the feature matrix or the output of the previous network, and selects features from the 5 feature maps by max pooling; the residual layer after the convolutional layer adds the input of the convolutional layer to the output of the convolutional layer;

the fully connected layer applies a linear transformation to the output of the residual layer:

$$y = \max(0,\ W^{T}x + b)$$

where x is the learned URL feature matrix, W^T is the learned URL feature weight, and b is the offset constant; and the second residual layer adds the input and the output of the fully connected layer and then outputs the result.
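The fully connected step and the residual step that follows it can be illustrated for a single input vector. The sketch below is an assumption-laden toy: the function names `relu_dense` and `residual` are hypothetical, the weights are hand-picked rather than learned, and the convolutional layer is omitted. The formula y = max(0, W^T x + b) is an affine map followed by clamping negative values to zero, commonly called ReLU.

```python
def relu_dense(x, W, b):
    """Compute y = max(0, W^T x + b) for one input vector x.

    W is stored as len(x) rows by len(b) columns, so W^T x has len(b) entries.
    """
    return [
        max(0.0, sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
        for j in range(len(b))
    ]


def residual(layer_input, layer_output):
    """Residual layer: element-wise sum of a layer's input and its output."""
    return [a + c for a, c in zip(layer_input, layer_output)]


x = [1.0, -2.0, 0.5]
W = [[0.5, 0.0, -1.0],
     [0.0, 1.0, 0.0],
     [2.0, 0.0, 1.0]]
b = [0.0, 0.5, 0.0]

y = relu_dense(x, W, b)
z = residual(x, y)
print(y)  # [1.5, 0.0, 0.0]
print(z)  # [2.5, -2.0, 0.5]
```

The residual sum only works when the fully connected layer keeps the input dimension, which is why `W` is square in this toy example.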

Preferably, each URL feature weight learning network in the feature weight learning layer consists of four layers, namely a multi-head self-attention layer, a residual layer, a fully connected layer, and a residual layer, and is specifically designed as follows: the multi-head self-attention layer receives as input the feature matrix or the output of the previous network, makes three copies of the input feature matrix, assigns them respectively to the three parameters required by the multi-head self-attention model, and autonomously learns the feature weights of the feature matrix; the residual layer adds the output and the input of the multi-head self-attention layer; the fully connected layer applies a linear transformation to the output of the residual layer; and the second residual layer adds the input and the output of the fully connected layer and outputs the result.
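To make the "three copies of the input" concrete, the sketch below implements the widely used scaled dot-product form of multi-head self-attention, in which the same input serves as query, key, and value. The patent text does not state which attention formula is used, so this choice is an assumption; the learned projection matrices of a full implementation are also omitted, and each row of `x` is taken to be one position of the URL.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of row vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def multi_head_self_attention(x, num_heads):
    """Split the feature dimension into heads, attend per head, then concatenate."""
    dim = len(x[0])
    if dim % num_heads:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    q, k, v = x, x, x  # the three copies of the input feature matrix
    heads = []
    for h in range(num_heads):
        part = slice(h * head_dim, (h + 1) * head_dim)
        heads.append(attention_head([r[part] for r in q],
                                    [r[part] for r in k],
                                    [r[part] for r in v]))
    return [[val for head in heads for val in head[pos]] for pos in range(len(x))]


x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
attended = multi_head_self_attention(x, num_heads=2)
# Residual layer: add the attention output back onto its input.
residual_out = [[a + b for a, b in zip(xi, ai)] for xi, ai in zip(x, attended)]
print(len(attended), len(attended[0]))  # 3 4
```

Each output row is a weighted average of the input rows, with weights that sum to 1, which is the sense in which the layer learns how much each position contributes.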

Preferably, the classification layer comprises a five-layer network in which a multi-head self-attention layer, a residual layer, a fully connected layer, a residual layer, and a fully connected layer are connected in sequence, and is specifically designed as follows: the multi-head self-attention layer takes the output of the feature learning layer and the output of the feature weight learning layer as the values assigned to its three parameters; the first residual layer adds the output of the multi-head self-attention layer and the output result of the feature learning layer; the first fully connected layer applies a linear transformation to the output of that residual layer; the second residual layer takes the output of the first fully connected layer and the output of the first residual layer as its inputs and further calculates the result; the final fully connected layer transforms the result of the previous layer and converts the output into a floating point number between 0 and 1 by using a Sigmoid activation function. The Sigmoid activation function is:

$$y = \frac{1}{1 + e^{-(W^{T}x + b)}}$$

where x is the learned URL feature matrix, W^T is the learned URL feature weight, and b is the offset constant;

the output of the Sigmoid function is between 0 and 1; if the result is greater than or equal to 0.5, the input URL character string is judged to be a legal website; otherwise, it is judged to be a phishing website.
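The final activation and the 0.5 decision rule can be sketched in a few lines. The function names `sigmoid`, `score`, and `classify` are hypothetical, and the weights in the example are arbitrary numbers chosen for illustration, not trained values.

```python
import math


def sigmoid(z):
    """Map any real number into (0, 1) without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(x, w, b):
    """Sigmoid of the affine combination w^T x + b for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def classify(result, threshold=0.5):
    """Results at or above the threshold are legal websites, the rest phishing."""
    return "legal" if result >= threshold else "phishing"


print(classify(score([1.0, 2.0], [0.5, -0.25], 0.0)))  # legal
print(classify(score([1.0, 2.0], [-1.0, -1.0], 0.0)))  # phishing
```

In the first call the affine combination is exactly 0, so the Sigmoid output is exactly 0.5, which falls on the "legal" side because the rule uses greater than or equal to.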

Preferably, the balanced URL sample set includes a certain number of legal websites as positive samples and phishing websites as negative samples, and the numbers of the two are the same.

Preferably, training the classifier model by using the pre-prepared balanced URL sample set specifically includes: after encoding preprocessing, the balanced URL sample set is used as the input of the classifier model, and the known nature of each URL character string in the balanced URL sample set is assigned as the expected output, that is, the output is 1 when the URL character string is a legal website and 0 when the URL character string is a phishing website.

Preferably, the trained classifier model classifies a URL character string of unknown nature and generates an output result; if the output result is greater than or equal to 0.5, the URL character string of unknown nature is a legal website; if the output result is less than 0.5, it is determined to be a phishing website.

The invention has the beneficial effects that: the method for detecting the phishing websites is provided, the characteristics and the weight of the URL character strings of the phishing websites are learned through the deep learning network, so that whether the websites are the phishing websites or not is judged, the characteristic learning network and the characteristic weight learning network can be performed in parallel, the processing time is effectively saved, and the method has the characteristics of high accuracy and short judging time.

Furthermore, the invention introduces a multi-head self-attention network into the phishing detection scenario for the first time and combines it in parallel with a convolutional neural network, which greatly improves processing efficiency.

Furthermore, the invention can obtain the discrimination result with high accuracy and high efficiency only by taking the URL character string of the target website as input, and is suitable for large-scale network real-time detection and prevents the phishing website from infringing users in the network.

Drawings

Fig. 1 is a schematic diagram of a method for detecting a phishing website in an embodiment of the invention.

FIG. 2 is a flow chart illustrating a method for detecting a phishing website according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating a classifier model for detecting phishing sites in an embodiment of the present invention.

FIG. 4 is a diagram illustrating a website address converted into a feature matrix according to an embodiment of the present invention.

FIG. 5 is a schematic structural diagram of a feature learning layer in a classifier model for detecting phishing websites in an embodiment of the present invention.

FIG. 6 is a schematic structural diagram of a feature weight learning layer in a classifier model for detecting phishing websites according to an embodiment of the present invention.

FIG. 7 is a diagram illustrating a structure of a classification layer in a classifier model for detecting phishing websites in an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.

It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.

As shown in fig. 1, the present invention provides a method for detecting a phishing website, which is applied to a URL character string of a website for determining whether the website indicated by the URL character string is a phishing website, and specifically includes the following steps:

S1: establishing a deep learning phishing website detection classifier model based on a parallel combination of a convolutional neural network and a multi-head self-attention network;

S2: training the classifier model by using a pre-prepared balanced URL sample set;

S3: using the trained classifier model to judge whether a URL character string of unknown nature indicates a phishing website.

The method autonomously learns the features of the URL character string through the convolutional neural network and autonomously learns the feature weights through the multi-head self-attention network, so as to judge whether the website is a phishing website. Because the two networks can run in parallel, processing time is effectively saved, and the method has the characteristics of high accuracy and high efficiency.

In the traditional scheme based on machine learning, an expert is required to manually specify sample characteristics such as URL (uniform resource locator) length, HTML (hypertext markup language) content, the number of links in a webpage and the like, so that the accuracy of sample characteristic learning depends on the knowledge of the expert; the model based on deep learning does not depend on expert knowledge to specify the sample characteristics, but cannot distinguish the weights of the sample characteristics, so that the judgment accuracy is insufficient; the method combines the characteristics that the convolutional neural network can independently learn the sample characteristics and the multi-head self-attention network can learn the characteristic weight of the character string sample, and has the advantages of high efficiency and accuracy.

Fig. 2 is a flow chart of the method for detecting a phishing website according to the present invention. The invention designs a phishing website classifier model based on a convolutional neural network (CNN) and multi-head self-attention (MHSA), and then performs phishing website detection based on the classifier model. The method comprises four stages: a URL sample set preparation stage, a URL preprocessing stage, a model training stage, and a model detection stage. In the URL sample set preparation stage, large and approximately equal numbers of phishing websites and legal websites are collected and prepared as a sample set; in the URL preprocessing stage, the input URL character string is formatted and encoded by one-hot encoding; in the model training stage, the sample set is used to train the classifier model; and in the model detection stage, the trained model is used to judge whether a URL of unknown nature is a phishing website.

In one embodiment of the present invention, the URL sample set preparation stage requires collecting URL samples of websites of known nature to form a balanced URL sample set, which includes a certain number of legal websites as positive samples and phishing websites as negative samples, the numbers of the two being the same. In addition, the more samples the sample set contains, the more fully the classifier model is trained and the higher the accuracy of its classification judgment.

It is understood that balance here means that the numbers of positive and negative samples are substantially the same; the sample set is considered balanced as long as the two numbers differ by less than a factor of 10.
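The balance criterion can be checked with a small helper. This is a sketch under one reading of the text, namely that a set counts as balanced when the larger class is less than 10 times the smaller one; the function name `is_balanced` and the `max_ratio` parameter are hypothetical. The example counts 45000 and 43984 are the sample sizes quoted in the specific embodiment of this document.

```python
def is_balanced(num_legal, num_phishing, max_ratio=10.0):
    """Return True when the larger class is less than max_ratio times the smaller."""
    if num_legal <= 0 or num_phishing <= 0:
        return False  # an empty class can never be balanced
    larger = max(num_legal, num_phishing)
    smaller = min(num_legal, num_phishing)
    return larger / smaller < max_ratio


print(is_balanced(45000, 43984))  # True
print(is_balanced(45000, 1200))   # False
```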

In an embodiment of the present invention, for an unbalanced sample set in which the ratio of positive to negative samples differs greatly because phishing website samples are few and normal website samples are many, the present invention uses a generative adversarial network (GAN) to generate phishing website samples. The specific method for generating the phishing website samples is as follows:

A GAN is typically composed of two parts: a generator G, which receives noise z and generates simulated data, and a classifier D, which separates the simulated data from the real data as well as possible. Through continual competition between the two, the generator G eventually generates simulated data realistic enough that the classifier D cannot distinguish it from real data. The training process of G is as follows:

1) Generate m data items from a Gaussian random distribution to form a data set Z, where Z_i ∈ Z, i = 1, …, m.

2) Input Z into G to obtain G(Z), which is a set of 97-dimensional vectors. For each vector in the set, the largest of its 97 elements is set to 1 and the rest are set to 0, so that any vector can represent a URL.

3) Take a set (matrix) X of real phishing URL sample vectors, where X_i ∈ X, i = 1, …, m.

4) Train the classifier D using X and G(Z), updating D by ascending its stochastic gradient.

5) Generate noise data from a Gaussian random distribution to form a set Z, input Z into G to obtain G(Z), and update G by ascending its stochastic gradient.

6) Train D and G alternately. When the accuracy of D remains at 0.5, training can end: D can no longer distinguish real phishing URLs from automatically generated ones, and G can be used to generate phishing URLs.
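Step 2) of the procedure above, which turns each real-valued 97-dimensional generator output into a one-hot vector, can be sketched as follows. No generator network is implemented here: `fake_batch` is random noise standing in for G(Z), and the helper name `to_one_hot` is hypothetical.

```python
import random


def to_one_hot(vector):
    """Set the largest element to 1 and every other element to 0."""
    top = max(range(len(vector)), key=vector.__getitem__)
    return [1 if i == top else 0 for i in range(len(vector))]


random.seed(0)
# Stand-in for G(Z): four 97-dimensional real-valued vectors.
fake_batch = [[random.gauss(0.0, 1.0) for _ in range(97)] for _ in range(4)]
one_hot_batch = [to_one_hot(v) for v in fake_batch]

print(to_one_hot([0.1, 0.7, 0.2]))        # [0, 1, 0]
print([sum(v) for v in one_hot_batch])    # [1, 1, 1, 1]
```

When several elements tie for the maximum, this helper keeps the first one, so every output still contains exactly one 1.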

In a more specific embodiment, 45000 normal website addresses are obtained from 5000Best Websites (http://5000best.com/websites/) and 43984 phishing website addresses are obtained from PhishTank (https://www.phishtank.com).

In a specific embodiment, training the classifier model using a pre-prepared equalized URL sample set includes:

After encoding preprocessing, the balanced URL sample set is used as the input of the classifier model, and the known nature of each URL character string in the balanced URL sample set is assigned as the expected output, that is, the output is 1 when the URL character string is a legal website and 0 when the URL character string is a phishing website.

The trained classifier model classifies a URL character string of unknown nature and generates an output result; if the output result is greater than or equal to 0.5, the URL character string of unknown nature is a legal website; if the output result is less than 0.5, it is determined to be a phishing website.

Fig. 3 is a schematic diagram of a classifier model for detecting phishing websites in the present invention. The phishing website classifier model based on the convolutional neural network (CNN) and multi-head self-attention (MHSA) comprises four layers: a feature embedding layer, a feature learning layer, a feature weight learning layer, and a classification layer.

The feature embedding layer is used for forming a feature matrix from the preprocessed URL character string and reducing the dimension of the feature matrix;

the feature learning layer comprises five stacked URL feature learning networks, uses a convolutional neural network to autonomously learn the features in the feature matrix, and outputs the learning result of the URL features;

the feature weight learning layer comprises five stacked URL feature weight learning networks, runs in parallel with the feature learning layer, uses a multi-head self-attention network to autonomously learn the weights of the features in the feature matrix, and outputs the learning result of the URL feature weights;

and the classification layer is used for integrating the outputs of the feature learning layer and the feature weight learning layer to generate the final judgment result.
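The data flow between the four layers can be sketched as two independent branches that read the same embedded matrix and are merged at the end. Everything inside the three functions below is a placeholder chosen only so the example runs; none of it is the patented network. The use of `ThreadPoolExecutor` is likewise an assumption made to show that the branches do not depend on each other.

```python
from concurrent.futures import ThreadPoolExecutor


def feature_branch(matrix):
    """Placeholder for the five CNN-based URL feature learning networks."""
    return [sum(row) for row in matrix]


def weight_branch(matrix):
    """Placeholder for the five MHSA-based URL feature weight learning networks."""
    total = sum(sum(row) for row in matrix) or 1
    return [sum(row) / total for row in matrix]


def classification_layer(features, weights):
    """Placeholder for the classification layer that merges both branch outputs."""
    return sum(f * w for f, w in zip(features, weights))


def run_model(embedded):
    # Neither branch needs the other's result, so both can be submitted at once.
    with ThreadPoolExecutor(max_workers=2) as pool:
        features = pool.submit(feature_branch, embedded)
        weights = pool.submit(weight_branch, embedded)
        return classification_layer(features.result(), weights.result())


embedded = [[1, 0, 0], [0, 1, 1]]
print(run_model(embedded))
```

In CPython, threads do not speed up CPU-bound work; a real deployment would rely on the deep learning framework or the accelerator to execute the two branches concurrently.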

In one embodiment of the present invention, the preprocessing of the URL string includes:

limiting the length of the URL character string to ensure that the total length of the URL character string does not exceed 100 characters;

encoding the URL character string with the limited length by adopting one-hot encoding;

and, for a URL character string shorter than 100 characters, padding zeros in front of the whole code after one-hot encoding, so that the URL code reaches a length of 100 through the default mapping.

The specific process is as follows:

(1) If the length of the URL character string exceeds 100 characters, the first 100 characters are kept and the rest is truncated.

(2) The length-limited URL character string is encoded by a one-hot encoding mapping scheme: each character of the URL character string S_url is mapped to a corresponding number, i.e. S_url -> g'. The mapping relationship is shown in Table 1:

TABLE 1 one-hot code mapping relationship

For example, for the URL string "http://", "h" corresponds to 8, "t" corresponds to 20, "p" corresponds to 16, ":" corresponds to 69, and "/" corresponds to 72, so the entire string corresponds to (8,20,20,16,69,72,72).

For example, if the length of the website address "http://...com" exceeds 100, the first 100 characters are kept and the rest is truncated; each character in the string is then replaced with its one-hot code, namely h->8, t->20, p->16, …, which finally yields a one-dimensional array (8,20,20,16,69,72,72,…,3,15,13).
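The preprocessing steps can be sketched as follows. Table 1 is not reproduced in this text, so the mapping below is only partly grounded: the codes for lowercase letters (a = 1 through z = 26) are inferred from the quoted values h = 8, t = 20, p = 16, c = 3, o = 15, m = 13, and the codes 69 for ":" and 72 for "/" are taken from the example. Mapping every other character to 0 is purely an assumption, as are the names `CHAR_TO_CODE` and `preprocess_url`.

```python
import string

MAX_LEN = 100  # truncation length stated in the text

# Partial, hypothetical mapping table (see the assumptions above).
CHAR_TO_CODE = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
CHAR_TO_CODE.update({":": 69, "/": 72})
UNKNOWN_CODE = 0  # assumption: characters missing from this partial table


def preprocess_url(url, max_len=MAX_LEN):
    """Keep the first max_len characters, encode them, and pad zeros in front."""
    truncated = url[:max_len]
    codes = [CHAR_TO_CODE.get(ch, UNKNOWN_CODE) for ch in truncated]
    padding = [0] * (max_len - len(codes))
    return padding + codes


codes = preprocess_url("http://")
print(len(codes), codes[-7:])  # 100 [8, 20, 20, 16, 69, 72, 72]
```

Padding in front keeps the end of a short URL aligned with the last columns of the code string, which matches the "zeros in front of the whole code" wording.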

In the invention, a classifier training stage selects a URL sample set prepared in advance, extracts each URL character string, and trains a classifier model after processing according to a preprocessing stage. The trained classifier can judge the URL character string with unknown properties, the final result is output as a floating point number between 0 and 1, the result of [0.5,1] is regarded as a normal website, and the result of [0,0.5) is regarded as a phishing website.

It can be understood that, in the invention, the sample set is firstly equalized, then the URL character string is preprocessed, then the URL is converted into the characteristic matrix by adopting the one-hot coding, then the characteristic matrix is respectively input into the network combining the CNN and the MHSA to respectively obtain the URL characteristic and the URL characteristic weight, and finally the multi-head self-attention network is adopted to synthesize the data of the two to generate the final judgment result.

It should be noted that the URL truncation optimum length setting of 100 is experimentally calculated, and should not be considered as a limitation of the present invention, and the change of the length value should not be taken as a feature for distinguishing the present invention, and can be changed according to specific situations. It is understood that when the length of the URL string exceeds 100, the processing efficiency of the classifier model is affected, and the contribution to the accuracy of the judgment is not particularly significant, whereas when the length of the string truncation is too short, the accuracy of the judgment classification is significantly affected. Therefore, in the present invention, when the length of the URL character string exceeds 100, the URL character string is cut off, and when the length of the URL character string is less than 100, 0 is supplemented in front of the encoded string to the length of 100.

In the invention, whether the URL character string is a sample used for training or a URL of unknown nature to be judged, after preprocessing it enters the feature embedding layer of the classifier model. This layer converts the encoded URL character string into a 97 × 100 feature matrix and then reduces the matrix dimension to 64 × 100, which facilitates processing by the neural network model. The length of the URL character string is 100 and the length of the mapping table is 97; each character in the URL is regarded as a column, and the row corresponding to the character is marked with 1, which forms a 97 × 100 feature matrix. Because the model can only handle matrices of fixed size whose number of rows is a power of two (2^n), the original 97 × 100 sparse matrix is reduced to 64 × 100. In an embodiment of the present invention, the feature embedding layer converts the preprocessed URL character string into a 97 × 100 feature matrix and reduces the matrix dimension to 64 × 100, which specifically includes:

First, a 97 × 100 matrix with all elements 0 is built, with the one-hot codes 0 to 96 as rows and the positions of the URL's code string as columns. For each code in the URL's code string, the element in the row corresponding to (code + 1) and the column corresponding to that code's position in the string is set to 1, which forms the feature matrix of the URL.

Specifically, the code mapping 0 to 96 is laid out along each column: if a certain character appears, the position corresponding to its code is marked 1 and the remaining 96 positions in the column are filled with 0, so that each character in the URL string is transformed into a one-dimensional array of length 97. For example, the lowercase letter "c" can be expressed as g' = (0, 0, 0, 1, 0, …, 0, 0, 0).

FIG. 4 is a schematic diagram of the transformed feature matrix for the website "http://...com", whose corresponding one-dimensional code array is (8, 20, 20, 16, 69, 72, 72, …, 3, 15, 13).
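The construction of the one-hot feature matrix can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function name `one_hot_matrix` is hypothetical, and treating the padding code 0 as an all-zero column is an assumption about how the zero-padding rule and the matrix construction combine.

```python
ROWS, COLS = 97, 100  # mapping-table size and fixed URL length from the text


def one_hot_matrix(codes, rows=ROWS, cols=COLS):
    """Build a rows x cols 0/1 matrix with one column per URL position."""
    if len(codes) != cols:
        raise ValueError("expected exactly %d codes" % cols)
    matrix = [[0] * cols for _ in range(rows)]
    for position, code in enumerate(codes):
        if code == 0:
            continue  # assumption: padding code 0 leaves an all-zero column
        # 0-based row index `code` is the (code + 1)-th row of the matrix.
        matrix[code][position] = 1
    return matrix


# Codes 8, 20, 20, 16 are the first four codes of the FIG. 4 example,
# left-padded with 0 to length 100.
codes = [0] * 96 + [8, 20, 20, 16]
matrix = one_hot_matrix(codes)
column = [matrix[r][96] for r in range(ROWS)]  # column for code 8
```

Each non-padding column contains exactly one 1, in the same position as in the array given for the letter "c" (a 1 at 0-based index 3).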

To allow uniform processing, for a URL string shorter than 100 characters the feature embedding layer also supplements the feature matrix with 100 - L all-zero columns, so that the number of columns reaches 100, where L is the length of the character string. After this formatting the effective length is 100, and the feature matrix has a size of 97 × 100.

The 97 × 100 feature matrix formed from the URL string is sparse because it contains a large number of 0s, so the relations between characters in the URL string are not apparent, either spatially or semantically; in addition, the deep learning model can only process matrices of dimension 2^n. The present invention therefore reduces the 97 × 100 matrix to 64 × 100, that is, the number of rows goes from 97 to 64. Mathematically, the original matrix is multiplied by a parameter matrix V, which turns the sparse matrix into a dense one:

(100 × 97) · V → (100 × 64)

where the left-hand operand is the feature matrix (written with one row per character position), V is the parameter matrix, p is the number of rows of V, and m is the number of columns of V; for the product to have the stated size, p = 97 and m = 64.
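The effect of multiplying a one-hot matrix by the parameter matrix V can be shown with a small Python sketch. The values of V are learned during training in the method described; here they are random placeholders, and the helper `matmul` is a hypothetical plain matrix product written only for illustration.

```python
import random


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns]
            for row in a]


random.seed(0)
P, M = 97, 64  # rows and columns of the parameter matrix V
# Placeholder values: in the described method V is a learned parameter.
V = [[random.uniform(-1.0, 1.0) for _ in range(M)] for _ in range(P)]

codes = [8, 20, 20, 16]  # first four codes of the FIG. 4 example
one_hot_rows = [[1 if i == c else 0 for i in range(P)] for c in codes]

dense = matmul(one_hot_rows, V)  # 4 x 97 times 97 x 64 gives 4 x 64
```

Because each input row contains a single 1, each output row is simply the row of V selected by the character code, so equal characters (the two codes 20) map to equal dense vectors.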

The classifier model then feeds the feature matrix in parallel into a five-layer feature learning layer and a five-layer feature weight learning layer, each of which processes it separately.

FIG. 5 is a schematic structural diagram of the feature learning layer in the classifier model of the present invention. Each URL feature learning network in the feature learning layer is composed of 4 layers, namely a convolutional layer (Convolutional Layer), a residual layer (Residual Layer), a fully connected layer (Fully Connected Layer) and a second residual layer, and is designed as follows:

The convolutional layer has 5 kernels (Kernel). It accepts the feature matrix, or the result output by the previous network layer, as input, and uses max pooling (Max Pool) to select features from the 5 feature maps.

A residual layer is provided after the convolutional layer to add the input of the convolutional layer to its output, which addresses the degradation of learning accuracy.

The fully connected layer applies a linear transformation followed by rectification to the output of the residual layer:

y = max(0, w^T x + b)

where x is the learned URL feature matrix, w^T is the learned URL feature weight, and b is the bias constant.

In order to avoid the problem of precision degradation, the last residual layer superposes the input and the output of the full connection layer and then outputs the result.

It is understood that only the structure of one of the five network layers of the feature learning layer is shown here; the feature matrix passes through all five before the feature learning result is output. The number of feature learning layers may vary with the specific situation and may be more or fewer than 5.
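The fully connected step y = max(0, w^T x + b) and the residual superposition that follows it can be sketched in a few lines of Python. The function names and the small weight values are hypothetical and chosen only so the arithmetic is easy to follow; the convolution and pooling steps are omitted.

```python
def dense_relu(x, weights, bias):
    """Compute y = max(0, w^T x + b) for every output unit."""
    return [max(0.0, sum(w * xi for w, xi in zip(unit_weights, x)) + b)
            for unit_weights, b in zip(weights, bias)]


def residual(layer_input, layer_output):
    """Superpose (add element-wise) a layer's input and its output."""
    return [a + b for a, b in zip(layer_input, layer_output)]


x = [1.0, -2.0, 0.5]                 # hypothetical input vector
weights = [[0.5, 0.0, 1.0],          # one weight vector per output unit
           [1.0, 1.0, 0.0],
           [0.0, -1.0, 2.0]]
bias = [0.0, 0.5, -1.0]

y = dense_relu(x, weights, bias)     # [1.0, 0.0, 2.0]
out = residual(x, y)                 # [2.0, -2.0, 2.5]
```

The second unit evaluates to -0.5 before rectification and is clipped to 0, while the residual addition keeps the original input available to the next layer.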

FIG. 6 is a schematic structural diagram of the feature weight learning layer in the classifier model of the present invention. Each URL feature weight learning network in the feature weight learning layer includes a multi-head self-attention layer (Multi-head Self-Attention Layer), a residual layer (Residual Layer), a fully connected layer (Fully Connected Layer) and a second residual layer, and is designed as follows:

The multi-head self-attention layer receives the feature matrix, or the result output by the previous network layer, as input, makes 3 copies of the input feature matrix, and assigns them to the three parameters required by the multi-head self-attention model. In a specific embodiment, the number of heads of this layer is set to 8, so that the feature weights of the feature matrix are learned autonomously. The formulas for multi-head self-attention are well known, and the three parameters are Q, K and V, so they are not described further here.
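As a rough illustration of the well-known formulation referred to above, the following Python sketch applies scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, with Q, K and V all set to copies of the same input and the feature dimension split across 8 heads. It is a simplification under stated assumptions: the learned projection matrices of a full multi-head attention layer are omitted, and the function names and the 5 × 64 input are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
                  for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


def multi_head_self_attention(x, heads=8):
    """Split features into heads, attend per head, concatenate results."""
    d = len(x[0])
    if d % heads != 0:
        raise ValueError("feature size must be divisible by the head count")
    size = d // heads
    out = [[] for _ in x]
    for h in range(heads):
        part = [row[h * size:(h + 1) * size] for row in x]
        # Q, K and V are three copies of the same input slice.
        head_out = attention(part, part, part)
        for row_out, piece in zip(out, head_out):
            row_out.extend(piece)
    return out


x = [[math.sin(i + j) for j in range(64)] for i in range(5)]  # 5 positions
y = multi_head_self_attention(x, heads=8)
```

The output has the same shape as the input, and every output value is a weighted average of the input values in the same feature column, with the weights produced by the attention scores.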

A residual layer superposes the output and input of the multi-head self-attention layer, which addresses accuracy degradation.

The fully connected layer applies a linear transformation to the output of the residual layer.

To avoid accuracy degradation, the last residual layer superposes the input and output of the fully connected layer and outputs the result.

It is to be understood that only the structure of one of the five layers of the feature weight learning layer is shown here, and the number of layers of the feature weight learning layer may vary depending on the situation, such as more or less than 5 layers.

FIG. 7 is a schematic structural diagram of the classification layer in the classifier model of the present invention. The classification layer comprises a five-layer network built from multi-head self-attention, residual and fully connected layers, specifically:

The multi-head self-attention layer takes the output of the feature weight learning layer (A-output) and the output of the feature learning layer (F-output) as its three parameters: Q is assigned A-output, K is assigned a copy of A-output, and V is assigned F-output.

The residual layer superposes the output of the multi-head self-attention layer and the output result of the feature learning layer, which eliminates the accuracy degradation problem.

The next residual layer takes the output of the preceding fully connected layer and the output of the previous residual layer as its inputs to compute a further result, which again eliminates the accuracy degradation problem.

Finally, the fully connected layer transforms the result of the previous layer and uses a Sigmoid activation function to convert the output into a floating-point number between 0 and 1. The Sigmoid activation function is:

y = 1 / (1 + e^(-(w^T x + b)))

where x is the learned URL feature matrix, w^T is the learned URL feature weight, b is the bias constant, and e is the natural constant. The output of the Sigmoid function lies between 0 and 1: if the result is greater than or equal to 0.5, the input URL string is judged to be a legitimate website; otherwise, it is judged to be a phishing website.
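The final decision rule can be sketched in Python as follows. The function names are hypothetical, and the scalar `z` stands for the pre-activation value w^T x + b produced by the last fully connected layer.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative z.
    e = math.exp(z)
    return e / (1.0 + e)


def classify(score, threshold=0.5):
    """Apply the 0.5 threshold described in the text."""
    return "legitimate" if score >= threshold else "phishing"


label_a = classify(sigmoid(2.0))    # score above 0.5
label_b = classify(sigmoid(-2.0))   # score below 0.5
```

A pre-activation of exactly 0 gives a score of 0.5, which falls on the legitimate side because the comparison is "greater than or equal to".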

The invention obtains 68030 normal websites from the 5000best website (http://5000best.com/websites/) and 12003 phishing websites from the PhishTank website (https://www.phishtank.com), and divides them into 5 training sets to train the model. The training data sets are as follows:

After training, the classification model of the invention obtains the following results on five indexes: accuracy (Acc), false positive rate (FPR), recall (Rec), precision (Pre) and F1:

Data set	Accuracy (Acc)	False positive rate (FPR)	Recall (Rec)	Precision (Pre)	F1
D01	97.20%	1.2%	95.6%	98.76%	97.15%
D02	96.76%	0.86%	94.38%	99.1%	96.68%
D03	96.41%	1.23%	94.05%	98.71%	96.32%
D04	92.05%	0.47%	84.57%	99.44%	91.41%
D05	90.54%	0.22%	81.3%	99.73%	89.58%

Note: the smaller the FPR value, the better; for the other indexes, the larger the value, the better.
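For reference, the five indexes in the table follow the standard confusion-matrix definitions, sketched below in Python. The counts passed to `metrics` are hypothetical and are not taken from the experiments; which class counts as "positive" for these indexes is not stated in the text and is left open here.

```python
def metrics(tp, fp, tn, fn):
    """Return Acc, FPR, Rec, Pre and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    fpr = fp / (fp + tn)
    rec = tp / (tp + fn)
    pre = tp / (tp + fp)
    f1 = 2 * pre * rec / (pre + rec)
    return {"Acc": acc, "FPR": fpr, "Rec": rec, "Pre": pre, "F1": f1}


def f1_from(pre, rec):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * pre * rec / (pre + rec)


example = metrics(tp=90, fp=5, tn=95, fn=10)  # hypothetical counts
d01_f1 = f1_from(0.9876, 0.956)               # Pre and Rec of D01 in the table
```

With the D01 precision and recall from the table, `f1_from` gives about 97.15%, which agrees with the F1 value listed for D01.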

The experimental results show that, on a realistic unbalanced data sample set, the model achieves a best value of 97.2% on accuracy (Acc), the most representative index, and the average accuracy is reported as 95.6%.

An embodiment of the present application further provides a control apparatus, including a processor and a storage medium for storing a computer program; wherein a processor is adapted to perform at least the method as described above when executing the computer program.

Embodiments of the present application also provide a storage medium for storing a computer program, which when executed performs at least the method described above.

Embodiments of the present application further provide a processor, where the processor executes a computer program to perform at least the method described above.

The storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The storage media described in the embodiments of the invention are intended to include, without being limited to, these and any other suitable types of memory.

In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.

Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.

The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all the properties or uses are considered to be within the scope of the invention.

Claims

1. A method for detecting a phishing website, which is applied to a website URL character string and used for judging whether the website indicated by the URL character string is the phishing website, specifically comprises the following steps:

2. The method for detecting a phishing website of claim 1, wherein the classifier model comprises four layers: a feature embedding layer, a feature learning layer, a feature weight learning layer and a classification layer;

the feature learning layer comprises five layers of URL feature learning networks, uses a convolutional neural network to autonomously learn the features in the feature matrix, and outputs the learning result of the URL features;

the feature weight learning layer comprises five layers of URL feature weight learning networks, uses a multi-head self-attention network to autonomously learn the weights of the URL features in the feature matrix, and outputs the learning result of the URL feature weights;

and the classification layer is used for integrating the output results of the feature learning layer and the feature weight learning layer so as to generate a final judgment result.

3. The method for detecting a phishing website as claimed in claim 2, wherein the preprocessing of the URL string comprises:

limiting the length of the URL character string to enable the total length of the URL character string not to exceed 100 characters;

4. The method for detecting phishing websites of claim 3, wherein the feature embedding layer converts the URL string after preprocessing into a 97 x 100 feature matrix and reduces the matrix dimension to 64 x 100, and specifically comprises:

firstly, establishing a 97 × 100 matrix whose elements are all 0, taking the one-hot codes 0 to 96 as rows and the positions of the URL's one-hot code string as columns; for each code of the URL's one-hot code string, setting to 1 the element in the row corresponding to (code + 1) and the column corresponding to the position of the current code in the code string, thereby forming the feature matrix of the URL;

then reducing the number of rows from 97 to 64, so that the original sparse matrix is changed into a dense matrix.

5. The method for detecting a phishing website as claimed in claim 4, wherein each URL feature learning network of the feature learning layer is composed of 4 layers, namely a convolutional layer, a residual layer, a fully connected layer and a residual layer, and is specifically designed as follows:

the convolutional layer is provided with 5 kernels and is used for receiving the input of the feature matrix or the result output by the previous network layer, and for selecting features from the 5 feature maps by max pooling;

providing one of said residual layers after said convolutional layer for adding an input of said convolutional layer to an output of said convolutional layer;

y＝max(0，ω^Tx+b)

6. The method according to claim 5, wherein each URL feature weight learning network of the feature weight learning layer comprises four layers, namely a multi-head self-attention layer, a residual layer, a fully connected layer and a residual layer, and is specifically designed as follows:

the multi-head self-attention layer is used for receiving the input of the feature matrix or the result output by the previous network layer, making 3 copies of the input feature matrix, assigning them respectively to the three parameters required by the multi-head self-attention model, and autonomously learning the feature weights of the feature matrix;

the residual layer is used for superposing the output and the input of the multi-head self-attention layer;

and the residual layer superposes the input and the output of the full connecting layer and outputs a result.

7. The method for detecting a phishing website according to claim 6, wherein the classification layer comprises a five-layer network including a multi-head self-attention layer, a residual layer and a fully connected layer, and is specifically designed as follows:

the multi-head self-attention layer takes the output result of the feature learning layer and the output result of the feature weight learning layer as the assignment of the three parameters of the multi-head self-attention layer respectively;

the residual layer superposes the output of the multi-head self-attention layer and the output result of the feature learning layer;

the residual layer uses the output of the preceding fully connected layer and the output of the previous residual layer as the input of the residual layer to further calculate the result;

the fully connected layer transforms the result of the previous layer and converts the output result into a floating-point number between 0 and 1 by adopting a Sigmoid activation function; the Sigmoid activation function is:

8. The method for detecting a phishing website according to any of claims 1 to 7, wherein the balanced URL sample set comprises a number of legitimate website samples as positive samples and phishing website samples as negative samples, and the numbers of legitimate website samples and phishing website samples are the same.

9. The method for detecting a phishing website of claim 8, wherein the classifier model is trained using a pre-prepared balanced URL sample set, comprising the steps of:

the balanced URL sample set is used as the input of the classifier model after being subjected to encoding preprocessing, and the nature of each URL string in the balanced URL sample set is assigned as the output result, namely, when the URL string is a legitimate website, the output result is 1; and when the URL string is a phishing website, the output result is 0.

10. The method for detecting a phishing website as claimed in claim 9, wherein the trained classifier model classifies the URL string of unknown nature and generates an output result; if the output result is greater than or equal to 0.5, the URL string of unknown nature is a legitimate website; and if the output result is less than 0.5, it is a phishing website.