CN110650108A

CN110650108A - Icon-based phishing page identification method and related equipment

Info

Publication number: CN110650108A
Application number: CN201810671754.8A
Authority: CN
Inventors: 马长春
Original assignee: Sangfor Technologies Co Ltd
Current assignee: Sangfor Technologies Co Ltd
Priority date: 2018-06-26
Filing date: 2018-06-26
Publication date: 2020-01-03

Abstract

The embodiment of the invention provides an icon-based phishing page identification method and related equipment. In the embodiment of the invention, the login page to be detected undergoes real-time anti-counterfeiting identification from multiple dimensions. A server collects the icon corresponding to each secure login page in a preset secure white list and characterizes each icon according to a preset algorithm to generate a corresponding standard feature vector set. Taking the standard feature vector sets as references, the server judges the similarity between the login page to be detected and each secure login page from at least two dimensions and allocates a corresponding weight to the login page to be detected according to the judgment result of each dimension. Finally, the server counts the sum of the weights obtained by the login page to be detected to judge whether it is a phishing page, which improves the accuracy of phishing page identification.

Description

Icon-based phishing page identification method and related equipment

Technical Field

The invention relates to the technical field of network security, in particular to an icon-based phishing page identification method and related equipment.

Background

Phishing is an attack that entices recipients to give up sensitive information (such as user names, passwords, account IDs, ATM PIN codes or credit card details) by mass-sending fraudulent spam purporting to come from banks or other well-known institutions. Hackers often forge phishing pages; when a user accesses a forged page and enters sensitive information, the page stores that information, thereby achieving the purpose of stealing it.

Existing webpage counterfeiting detection schemes are usually based on blacklist technology: screening is mainly carried out against a blacklist established by a security vendor. The vendor's blacklist is usually updated only after a phishing website has already caused harm, so a phishing website cannot be identified when it first appears.

In view of the above, a new phishing page identification method is needed to reduce the risk of phishing.

Disclosure of Invention

The embodiment of the invention provides an icon-based phishing page identification method and related equipment, which are used for identifying phishing pages.

A first aspect of the embodiment of the invention provides an icon-based phishing page identification method, characterized by comprising the following steps:

respectively collecting the icons corresponding to the safe login pages in a preset safe white list to form a first icon set;

respectively characterizing the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets;

acquiring the icon of the login page to be detected, and characterizing that icon according to the preset algorithm to generate a corresponding second feature vector set;

judging the similarity between the login page to be detected and each safe login page from at least two dimensions according to the second characteristic vector set and the standard characteristic vector set, and distributing corresponding weight values for the login page to be detected according to the judgment result of each dimension;

and counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold value, judging that the login page to be detected is a phishing page.

Optionally, as a possible implementation manner, in the embodiment of the present invention, the determining, from at least two dimensions, the similarity between the login page to be detected and each secure login page, and allocating a corresponding weight to the login page to be detected according to the determination result of each dimension includes:

if a third feature vector set exists among the standard feature vector sets corresponding to the first icon set such that the number of feature vectors successfully matched between the second feature vector set and the third feature vector set is not less than a first preset threshold value, allocating a first weight to the login page to be detected, wherein two feature vectors are judged to be successfully matched if the similarity between them is greater than a second preset threshold value;

and judging whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and distributing a second weight to the login page to be detected according to a judgment result.

Optionally, as a possible implementation manner, the method for identifying a phishing page based on an icon in the embodiment of the present invention further includes:

and judging whether the similarity of the texture features of the icon corresponding to the login page corresponding to the third feature vector set and the icon of the login page to be detected is greater than a fourth preset threshold value, and distributing a third weight to the login page to be detected according to the judgment result.

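Optionally, as a possible implementation manner, the method in the embodiment of the present invention further includes:

acquiring page information of a page to be detected, wherein the page information at least comprises a hypertext markup language (HTML) file corresponding to the page to be detected;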

extracting a file text and a Uniform Resource Locator (URL) address from the HTML file;

counting the number of preset key words contained in the file body of the HTML file;

judging whether the ratio of the number of URL addresses in the HTML file to the number of bytes of the file text is larger than a fifth preset threshold value or not, and distributing a fourth weight to the page to be detected according to the judgment result;

judging whether the number of preset key words contained in the file text in the HTML file is larger than a sixth preset threshold value or not, and distributing a fifth weight to the page to be detected according to the judgment result;

and counting the sum of the weights obtained by the page to be detected in each detection process, and if the sum of the weights is not less than a seventh preset threshold value, judging that the page to be detected is a login page to be detected.

inputting the screenshot of the login page into a preset Convolutional Neural Network (CNN) classifier model for classification, and distributing a sixth weight to the page to be detected according to a classification result.
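As a non-limiting illustration, the login page screening steps described above (the ratio of the number of URL addresses to the number of bytes of file text, and the count of preset keywords) can be sketched in Python as follows. The class name PageStats, the function looks_like_login_page, the keyword list, the default weights and thresholds, and the choice to count only href and src attributes as URL addresses are all assumptions made for this sketch and are not specified by the embodiment; the CNN classification step is omitted.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; the embodiment only gives examples of such words.
LOGIN_KEYWORDS = ("login", "register", "forget password", "remember password")


class PageStats(HTMLParser):
    """Counts URL-bearing attributes and collects visible text from an HTML file."""

    def __init__(self):
        super().__init__()
        self.url_count = 0
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            # Assumption: only href/src attributes are treated as URL addresses.
            if name in ("href", "src") and value:
                self.url_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def looks_like_login_page(html_text, ratio_threshold, keyword_threshold,
                          fourth_weight=1, fifth_weight=1, total_threshold=2):
    """Returns True if the summed weights reach the (assumed) seventh threshold."""
    stats = PageStats()
    stats.feed(html_text)
    # Normalize whitespace and case before counting bytes and keywords.
    text = " ".join(" ".join(stats.text_parts).split()).lower()
    text_bytes = len(text.encode("utf-8"))
    keyword_hits = sum(text.count(word) for word in LOGIN_KEYWORDS)

    score = 0
    if text_bytes and stats.url_count / text_bytes > ratio_threshold:
        score += fourth_weight  # many links relative to little text
    if keyword_hits > keyword_threshold:
        score += fifth_weight   # enough login-related keywords
    return score >= total_threshold
```

In this sketch a page is treated as a login page to be detected only when both checks succeed, because each check contributes a weight of 1 and the total threshold is 2; other weight configurations are equally possible.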

A second aspect of an embodiment of the present invention provides a server, including:

the first acquisition module is used for respectively acquiring the icons corresponding to the safe login pages in a preset safe white list to form a first icon set;

the calculation module is used for respectively characterizing the icons in the first icon set according to a preset algorithm and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets;

the second acquisition module is used for acquiring the icon of the login page to be detected and characterizing that icon according to the preset algorithm to generate a corresponding second feature vector set;

the distribution module is used for judging the similarity between the icon corresponding to the second feature vector set and the icon corresponding to the standard feature vector set from at least two dimensions, and distributing corresponding weight values for the login page to be detected according to the judgment result of each dimension;

and the first counting module is used for counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold value, the login page to be detected is judged to be a phishing page.

Optionally, as a possible implementation manner, the allocation module in the embodiment of the present invention includes:

the first allocation unit is used for allocating a first weight to the login page to be detected if a third feature vector set exists among the standard feature vector sets corresponding to the first icon set such that the number of feature vectors successfully matched between the second feature vector set and the third feature vector set is not less than a first preset threshold value, wherein two feature vectors are judged to be successfully matched if the similarity between them is greater than a second preset threshold value;

the second allocation unit is used for judging whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and allocating a second weight to the login page to be detected according to the judgment result.

Optionally, as a possible implementation manner, the allocation module in the embodiment of the present invention further includes:

and the third distribution unit is used for judging whether the similarity of the texture features of the icon corresponding to the login page corresponding to the third feature vector set and the icon of the login page to be detected is greater than a fourth preset threshold value or not, and distributing a third weight to the login page to be detected according to the judgment result.

Optionally, as a possible implementation manner, the server in the embodiment of the present invention further includes:

the third acquisition module is used for acquiring page information of a page to be detected, wherein the page information at least comprises a hypertext markup language (HTML) file corresponding to the page to be detected;

the extraction module is used for extracting a file text and a Uniform Resource Locator (URL) address from the HTML file;

the second statistical module is used for counting the number of preset key words contained in the document body of the HTML document;

the fourth distribution module is used for judging whether the ratio of the number of the URL addresses in the HTML file to the number of bytes in the file text is larger than a fifth preset threshold value or not and distributing a fourth weight to the page to be detected according to the judgment result;

the fifth distribution module is used for judging whether the number of preset key words contained in the file text in the HTML file is larger than a sixth preset threshold value or not and distributing a fifth weight to the page to be detected according to the judgment result;

and the third counting module is used for counting the sum of the weights obtained by the page to be detected in each detection process, and if the sum of the weights is not less than a seventh preset threshold value, the page to be detected is judged to be a login page to be detected.

and the sixth distribution module is used for inputting the screenshot of the login page into a preset Convolutional Neural Network (CNN) classifier model for classification and distributing a sixth weight to the page to be detected according to a classification result.

A third aspect of an embodiment of the present invention provides a computer apparatus, characterized in that the computer apparatus includes a processor, and the processor is configured to implement the steps of the first aspect or of any one of the possible implementations of the first aspect when executing a computer program stored in a memory.

A fourth aspect of embodiments of the present invention provides a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the first aspect or of any one of the possible implementations of the first aspect.

According to the technical scheme, the embodiment of the invention has the following advantages:

In the embodiment of the invention, the login page to be detected undergoes real-time anti-counterfeiting identification from multiple dimensions. A server collects the icon corresponding to each secure login page in a preset secure white list and characterizes each icon according to a preset algorithm to generate a corresponding standard feature vector set. Taking the standard feature vector sets as references, the server judges the similarity between the login page to be detected and each secure login page from at least two dimensions and allocates a corresponding weight to the login page to be detected according to the judgment result of each dimension. Finally, the server counts the sum of the weights obtained by the login page to be detected to judge whether it is a phishing page, which improves the accuracy of phishing page identification.

Drawings

FIG. 1 is a schematic diagram of an embodiment of the icon-based phishing page identification method in the embodiment of the invention;

FIG. 2 is a schematic diagram of another embodiment of the icon-based phishing page identification method in the embodiment of the invention;

FIG. 3 is a schematic diagram of another embodiment of the icon-based phishing page identification method in the embodiment of the invention;

FIG. 4 is a schematic diagram of an embodiment of a method for identifying the login page to be detected in the embodiment of the present invention;

FIG. 5 is a diagram of an embodiment of a server in an embodiment of the invention;

FIG. 6 is a diagram of another embodiment of a server in an embodiment of the invention;

FIG. 7 is a diagram of another embodiment of a server in an embodiment of the invention;

FIG. 8 is a diagram of a computer device according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

For ease of understanding, a specific flow in the embodiment of the present invention is described below. Referring to FIG. 1, an embodiment of the icon-based phishing page identification method in the embodiment of the present invention may include:

101. Respectively collecting the icons corresponding to the safe login pages in the preset safe white list to form a first icon set.

In the embodiment of the invention, in order to identify the login page to be detected, the security icons corresponding to target applications that are frequently phished need to be collected in advance as references. Here an icon refers to the icon displayed on a browser tab, which may be a company logo or a separately designed image. Common classes of phishing target applications include banking (e.g., China Merchants Bank), instant messaging (e.g., QQ), document sharing (e.g., Baidu Cloud), mailbox (e.g., Sina Mail), shopping (e.g., Taobao), and so forth. The server can collect the security icons corresponding to the login pages of these common phishing target applications to form the first icon set.

The icon may be acquired by using a crawler engine to capture the icon from the login page according to the URL address of the login page, or by reading the storage location of the icon from the corresponding HTML file, which is not limited herein. For example, when the icon is obtained from the HTML file, the head tag of the page contains a tag of the form <link rel="shortcut icon" href="http://(address of the picture)">, where the address corresponds to the site's own directory, so the corresponding picture can be obtained simply by finding that tag.
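As a non-limiting illustration, locating the icon address in the HTML file can be sketched in Python with the standard library as follows. The names IconLinkParser and find_icon_urls and the example URLs are hypothetical and are used only for this sketch; downloading the picture itself is omitted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Collects the href of every <link> tag whose rel attribute names an icon."""

    def __init__(self):
        super().__init__()
        self.icon_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        # rel may hold several tokens, e.g. "shortcut icon".
        rel_tokens = attr_map.get("rel", "").lower().split()
        if "icon" in rel_tokens and attr_map.get("href"):
            self.icon_hrefs.append(attr_map["href"])


def find_icon_urls(html_text, page_url):
    """Returns absolute icon URLs declared in the page, resolved against page_url."""
    parser = IconLinkParser()
    parser.feed(html_text)
    return [urljoin(page_url, href) for href in parser.icon_hrefs]
```

For example, find_icon_urls(html, "https://login.example.com/account/") resolves a relative href such as "/favicon.ico" against the address of the login page.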

102. Characterizing the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets.

To allow the login page to be detected to be identified automatically later, the icons in the first icon set need to be characterized, that is, the feature vectors of each icon need to be extracted. Algorithms that can be used in the embodiment of the present invention include hash algorithms, for example the LSH algorithm (Locality-Sensitive Hashing), the SH algorithm (Spectral Hashing) and the AGH algorithm (Anchor Graph Hashing), as well as the SURF algorithm (Speeded Up Robust Features) and the SIFT algorithm (Scale-Invariant Feature Transform). These algorithms are prior art and are not described herein. It can be understood that there are many icon characterization algorithms, but the same algorithm must be used to characterize every icon so that the resulting feature vectors can be matched later.

For example, when the SURF algorithm is adopted for characterization, dozens of feature vectors can be extracted from each icon image; these are 158-dimensional SURF feature vectors. The feature vectors of the same icon are stored in an associated manner to form the respective standard feature vector sets.
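As a non-limiting illustration of characterizing every icon with the same algorithm, the following Python sketch uses a random-hyperplane (sign random projection) variant of locality-sensitive hashing rather than SURF. The function names make_hyperplanes and lsh_signature, the fixed seed, and the representation of an icon as a flat list of grayscale pixel values are assumptions made for this sketch only.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Builds num_bits random hyperplanes of dimension dim.

    A fixed seed makes the same hyperplanes available for white-list icons
    and for the icon of the login page to be detected.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(pixels, hyperplanes):
    """Maps a flat list of grayscale pixels to a list of 0/1 bits.

    Each bit records on which side of one hyperplane the mean-centered
    pixel vector lies.
    """
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]  # removes a uniform brightness offset
    signature = []
    for plane in hyperplanes:
        projection = sum(w * x for w, x in zip(plane, centered))
        signature.append(1 if projection >= 0 else 0)
    return signature
```

Because the hyperplanes are generated once and reused, two signatures are comparable bit by bit, for instance with the Hamming distance.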

103. Acquiring the icon of the login page to be detected, and characterizing it according to the preset algorithm to generate a corresponding second feature vector set.

When the page to be detected is determined to be a login page, the server can obtain the icon of the page to be detected and characterize it with the same preset algorithm to generate a corresponding second feature vector set.

104. Judging the similarity between the login page to be detected and each safe login page from at least two dimensions, and allocating corresponding weights to the login page to be detected according to the judgment result of each dimension.

After the second feature vector set and the standard feature vector set are obtained, the server may determine, from at least two dimensions, the similarity between the login page to be detected and each secure login page, and allocate a corresponding weight to the login page to be detected according to the determination result of each dimension, where the number of specific detection dimensions is not limited here, and an exemplary detection dimension will be described in detail in the following embodiments.

105. Counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold value, judging that the login page to be detected is a phishing page.

The server can count the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold value, the login page to be detected is judged to be a phishing page.

For ease of understanding, a specific process in an embodiment of the present invention is described in detail below. Referring to FIG. 2, another embodiment of the icon-based phishing page identification method in an embodiment of the present invention may include:

201. Respectively collecting the icons corresponding to the safe login pages in the preset safe white list to form a first icon set.

202. Characterizing the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets.

203. Acquiring the icon of the login page to be detected, and characterizing it according to the preset algorithm to generate a corresponding second feature vector set.

Steps 201 to 203 in the embodiment of the present invention are similar to those described in steps 101 to 103, and please refer to steps 101 to 103 for details, which are not described herein again.

When the page to be detected is determined to be a login page, the icon of the page to be detected can be obtained and characterized with the same preset algorithm to generate a corresponding second feature vector set.

204. Matching the second feature vector set against each standard feature vector set.

In the embodiment of the invention, the login page to be detected can be detected and identified from multiple dimensions. Specifically, the second feature vector set corresponding to the login page to be detected can be matched against each standard feature vector set. If a third feature vector set exists among the standard feature vector sets corresponding to the first icon set such that the number of feature vectors successfully matched between the second feature vector set and the third feature vector set is not less than a first preset threshold value, a first weight is allocated to the login page to be detected; two feature vectors are determined to be successfully matched if the similarity between them is greater than a second preset threshold value. Optionally, if no such third feature vector set exists among the standard feature vector sets corresponding to the first icon set, the first weight may be left unassigned, or a weight of zero may be assigned to the login page to be detected.

Optionally, to determine whether the similarity between two feature vectors is greater than the second preset threshold, the Euclidean distance between the two feature vectors may be calculated; if the Euclidean distance is smaller than a specific threshold, the similarity between the two feature vectors is determined to be greater than the second preset threshold, and the corresponding feature vectors are determined to be successfully matched. It is to be understood that, in the embodiment of the present invention, the algorithm for determining the vector similarity may be a Euclidean distance algorithm, or may also be a Manhattan distance algorithm, a Chebyshev distance algorithm, a Minkowski distance algorithm, a Mahalanobis distance algorithm, a Hamming distance algorithm, or the like, and is not limited herein.
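As a non-limiting illustration, the matching of step 204 with a Euclidean distance criterion can be sketched in Python as follows. The function names and the parameters max_distance (standing in for the specific distance threshold) and min_matches (standing in for the first preset threshold) are hypothetical names chosen for this sketch.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def count_matches(candidate_set, standard_set, max_distance):
    """Counts candidate vectors lying within max_distance of some standard vector."""
    matched = 0
    for vec in candidate_set:
        if any(euclidean_distance(vec, ref) < max_distance for ref in standard_set):
            matched += 1
    return matched


def icon_matches(candidate_set, standard_set, max_distance, min_matches):
    """True if the number of matched vectors is not less than min_matches."""
    return count_matches(candidate_set, standard_set, max_distance) >= min_matches
```

A standard feature vector set for which icon_matches returns True plays the role of the third feature vector set in this sketch.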

Further, in order to reduce the calculation amount, in the embodiment of the present invention a value range for each dimension of a feature vector satisfying the matching condition may be calculated from the second preset threshold and each standard feature vector set. The similarity between two feature vectors is first screened against this value range, and the full similarity calculation is performed only for feature vectors that fall within the range.
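As a non-limiting illustration of this preliminary screening, the Python sketch below assumes a Euclidean distance criterion with a hypothetical distance threshold max_distance. If any single dimension differs by max_distance or more, the Euclidean distance cannot be smaller than max_distance, so the pair can be rejected without the full calculation. The function names are assumptions for this sketch.

```python
def within_value_range(vec, ref, max_distance):
    """Cheap necessary condition: every dimension differs by less than max_distance."""
    return all(abs(x - y) < max_distance for x, y in zip(vec, ref))


def is_match(vec, ref, max_distance):
    """Full check, run only for vectors that pass the per-dimension screening."""
    if not within_value_range(vec, ref, max_distance):
        return False
    # Compare squared values to avoid computing a square root.
    squared = sum((x - y) ** 2 for x, y in zip(vec, ref))
    return squared < max_distance ** 2
```

The screening never rejects a true match; it only removes pairs that the full calculation would reject anyway.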

Specifically, for example, after SURF feature vectors are obtained by using the SURF algorithm, if the number of successfully matched SURF feature vectors in the second feature vector set is not less than one third of the number of feature vectors in the third feature vector set, it can be determined that the icon of the login page to be detected is similar to the icon of a preset secure login page, and the login page to be detected needs to be detected further.

205. Judging whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and allocating a second weight to the login page to be detected according to the judgment result.

The server can acquire, through the crawler engine, the domain names corresponding to the common phishing target applications and the domain name corresponding to the login page to be detected, judge whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and allocate a second weight to the login page to be detected according to the judgment result.
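As a non-limiting illustration, the domain name comparison of step 205 can be sketched in Python as follows. Treating the host name part of the URL as the domain name is an assumption of this sketch, and the function names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def hostname_of(url):
    """Returns the lower-cased host name of a URL, or an empty string if absent."""
    return (urlsplit(url).hostname or "").lower()


def same_domain(candidate_url, whitelisted_url):
    """True if both URLs point at the same host name (exact comparison)."""
    return hostname_of(candidate_url) == hostname_of(whitelisted_url)
```

A page whose icon matches a white-listed icon but for which same_domain returns False is the case to which the second weight is intended to apply.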

206. Counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold value, judging that the login page to be detected is a phishing page.

The server can count the sum of the weights obtained by the login page to be detected, and if the sum is not less than a third preset threshold value, the login page to be detected is judged to be a phishing page. For example, if the second feature vector set of the icon of the login page to be detected matches the third feature vector set as described in step 204, a first weight of 1 is assigned to the login page to be detected; if the domain name corresponding to the third feature vector set is different from the domain name corresponding to the login page to be detected, a second weight of 2 may be assigned. With the third preset threshold value set to 3, the sum of the weights is 1 + 2 = 3, which is not less than the threshold, so the login page to be detected is determined to be a phishing page.
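As a non-limiting illustration, the weight summation of step 206 with the example values above (first weight 1, second weight 2, third preset threshold 3) can be sketched in Python as follows. The function names are hypothetical, and assigning no weight when the domain names are the same is an assumption of this sketch.

```python
def total_weight(icon_matched, domain_same, first_weight=1, second_weight=2):
    """Sums the weights earned by the login page to be detected."""
    total = 0
    if icon_matched:
        total += first_weight
        # The domain comparison only applies once a matching icon was found.
        if not domain_same:
            total += second_weight
    return total


def is_phishing_page(icon_matched, domain_same, threshold=3):
    """True if the weight sum is not less than the threshold."""
    return total_weight(icon_matched, domain_same) >= threshold
```

With these values, a page that reuses a white-listed icon under a different domain name reaches the threshold, while the genuine page (same icon, same domain name) does not.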

In the embodiment of the invention, a server can collect in advance the icon corresponding to each safe login page in a preset safe white list and characterize each icon according to a preset algorithm to generate corresponding feature vectors, which form the standard feature vector sets. Taking the standard feature vector sets as references, the login page to be detected undergoes real-time anti-counterfeiting identification from multiple dimensions. The feature vector set of the icon of the login page to be detected can be matched against each standard feature vector set; if a successfully matched standard feature vector set exists, a first weight can be allocated to the login page to be detected. The domain name of the safe login page corresponding to the successfully matched icon is then compared with the domain name of the login page to be detected, and a second weight is allocated to the login page to be detected according to the comparison result. Finally, the sum of the weights obtained by the login page to be detected can be counted; if the sum is not less than the third preset threshold value, the login page to be detected is judged to be a phishing page, which improves the accuracy of phishing page identification.

On the basis of the embodiment shown in FIG. 2, in order to further improve the detection accuracy, whether the page to be detected is a phishing page may be detected from more dimensions. Referring to FIG. 3, another embodiment of the icon-based phishing page identification method in the embodiment of the present invention may include:

301. Respectively collecting the icons corresponding to the safe login pages in the preset safe white list to form a first icon set.

302. Characterizing the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets.

303. Acquiring the icon of the login page to be detected, and characterizing it according to the preset algorithm to generate a corresponding second feature vector set.

304. Matching the second feature vector set against each standard feature vector set.

Steps 301 to 304 in the embodiment of the present invention are similar to those described in steps 201 to 204, and please refer to steps 201 to 204 for details, which are not described herein again.

305. Judging whether the similarity between the texture features of the icon of the login page corresponding to the third feature vector set and those of the icon of the login page to be detected is greater than a fourth preset threshold value, and allocating a third weight to the login page to be detected according to the judgment result.

On the basis of step 304, when a third feature vector set exists, the login page to be detected needs to be detected further. In the embodiment of the present invention, the server may collect the texture features of the icon of the login page to be detected, judge whether the similarity between the texture features of the icon of the login page corresponding to the third feature vector set and those of the icon of the login page to be detected is greater than the fourth preset threshold, and allocate a third weight to the login page to be detected according to the judgment result. Optionally, if that similarity is smaller than the fourth preset threshold, a third weight of zero may be allocated to the login page to be detected.

The texture features may be extracted by the LBP algorithm (Local Binary Pattern), a gray-level co-occurrence matrix algorithm, a gray-level gradient co-occurrence matrix algorithm, a Gabor wavelet texture algorithm, or the like; these methods are prior art and are not described herein again.
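As a non-limiting illustration, a basic 8-neighbor LBP texture descriptor and one possible similarity measure can be sketched in Python as follows. Representing an icon as a 2D list of grayscale values, the function names, and the use of histogram intersection as the similarity measure are assumptions made for this sketch and are not specified by the embodiment.

```python
def lbp_histogram(gray):
    """Normalized 256-bin histogram of 8-neighbor LBP codes.

    gray is a 2D list of grayscale values; border pixels are skipped because
    they do not have all eight neighbors.
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(gray), len(gray[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = gray[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbor at least as bright as the center sets its bit.
                if gray[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms: 1.0 when identical, 0.0 when disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

In this sketch the value returned by histogram_intersection would be compared against the fourth preset threshold to decide whether the third weight is allocated.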

306. Judging whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and allocating a second weight to the login page to be detected according to the judgment result.

The server can acquire, through the crawler engine, the domain names corresponding to the common phishing target applications and the domain name corresponding to the login page to be detected, judge whether the domain name of the safe login page corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and allocate a second weight to the login page to be detected according to the judgment result.

307. Counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold value, judging that the login page to be detected is a phishing page.

The server may count a sum of weights obtained by the login page to be detected, in this embodiment, a sum of the first weight, the second weight, and the third weight may be counted, if the sum of the weights is not less than a fourth preset threshold, the page to be detected is determined to be the login page, and a specific fourth preset threshold may be set reasonably according to the configuration of the actual weight, which is not limited herein.

For example, for the login page to be detected, if the third feature vector set described in step 304 matches the second feature vector set of the icon of the login page to be detected, a first weight of 1 is assigned to the login page to be detected. If the domain name corresponding to the login page to be detected is not the same as the domain name corresponding to the third feature vector set, a second weight of 2 may be assigned. If the similarity between the texture features of the icon of the login page to be detected and those of the icon corresponding to the third feature vector set is greater than the fourth preset threshold, a third weight of 1 may be assigned. With the third preset threshold set to 3, the sum of the weights is 4, which is not less than 3, so the login page to be detected may be determined to be a phishing page.
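The arithmetic of this example can be checked with a short sketch. The function name and the default threshold are hypothetical; the weight values 1, 2, and 1 and the threshold 3 are taken from the example above.

```python
def is_phishing(first_weight, second_weight, third_weight, sum_threshold=3):
    """Judge a login page as phishing when the weight sum is not less than the threshold."""
    total = first_weight + second_weight + third_weight
    return total >= sum_threshold

# Icon matches (1), domain name differs (2), texture is similar (1): 4 >= 3.
suspicious = is_phishing(1, 2, 1)
# Icon matches (1), domain name is the same (0), texture is similar (1): 2 < 3.
genuine = is_phishing(1, 0, 1)
```

With these values, a page that reuses a whitelisted icon on its own legitimate domain stays below the threshold, while the same icon on a different domain reaches it.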

In practical application, the page to be detected may be either a login page or a non-login page, whereas the detection object in the embodiments shown in fig. 1 to fig. 3 is preferably a login page. To improve the detection accuracy, the pages to be detected may be preliminarily screened to select the login pages to be detected. Referring to fig. 4, on the basis of the embodiments shown in fig. 1 to fig. 3, an embodiment of identifying the login page to be detected in the embodiment of the present invention may include:

401. acquiring page information of a page to be detected, wherein the page information at least comprises a hypertext markup language (HTML) file corresponding to the page to be detected;

The text content of a login page is relatively small, and a login page often contains preset key words related to login, such as "login", "register", "forget password", "automatically login", and "remember password", or translations of these key words into other languages. A login page also often has internal links to pages such as a main page, a registration page, a password recovery page, and a partner account login page.

The server can collect page information of the page to be detected, the page information can include an HTML file of the page to be detected, and then the page text and a URL (internal link) address contained in the HTML file can be extracted from the corresponding HTML file. Optionally, the server may build a crawler engine to crawl page information of the page to be detected.

402. And extracting the page text and the URL address from the HTML file.

After the server acquires the HTML file, the text and the URL address of the page to be detected can be extracted from the HTML file. Specifically, the URL address in the file may be extracted according to the HTML syntax rule.

Optionally, the server may convert the code in the HTML file into a preset format for storage and remove the scripts and special characters in the code; after the HTML format tags are also removed, the page text of the page to be detected is obtained. Optionally, according to the distribution of line spacing and word spacing of the code in the HTML file, the portions of code with larger line spacing and larger word spacing are removed, so as to further refine the obtained page text. It can be understood that the method for extracting the text of the page to be detected from the HTML file can be reasonably adjusted according to the actual encoding mode of the HTML file, and is not limited herein.
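A minimal sketch of the extraction in step 402, using Python's standard `html.parser` module, is shown below. The class and function names are hypothetical, and the choice of the `href`, `src`, and `action` attributes as URL sources is an assumption; the embodiment only states that URL addresses are extracted according to HTML syntax rules.

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collect visible text and URL-bearing attribute values from an HTML file."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            # Assumed set of attributes that carry URL addresses.
            if name in ("href", "src", "action") and value:
                self.urls.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

def extract_text_and_urls(html):
    """Return (page_text, urls) for an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.urls
```

The line-spacing and word-spacing refinement mentioned above is not covered by this sketch.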

403. And counting the number of preset key words contained in the page body of the HTML file.

After extracting the page text of the HTML file, the server can count the number of preset key words contained in the page text of the HTML file, wherein the preset key words can be words such as 'login', 'register', 'forget password', 'automatically login', 'remember password', and the like, and translations of the preset key words in languages of other countries. It can be understood that the preset key words can be reasonably set according to different types of languages and different login pages, and the specific setting is not limited herein.
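The counting in step 403 might look like the following sketch. The keyword tuple reuses the English examples from the text; the function name and the case-insensitive substring matching are assumptions.

```python
PRESET_KEYWORDS = (
    "login",
    "register",
    "forget password",
    "automatically login",
    "remember password",
)

def count_keywords(page_text, keywords=PRESET_KEYWORDS):
    """Count case-insensitive occurrences of each preset key word in the page text.

    Overlapping key words are counted independently, so the phrase
    "automatically login" also contributes one hit for "login".
    """
    text = page_text.lower()
    return sum(text.count(keyword) for keyword in keywords)
```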

404. Judging whether the ratio of the number of URL addresses in the HTML file to the number of bytes of the page text is larger than a fifth preset threshold value or not, and distributing a fourth weight to the page to be detected according to the judgment result;

The page text of a common login page is relatively short, and a login page links to a plurality of internal pages such as a main page, a registration page, a password recovery page, and a partner account login page. For a given amount of page text in the page to be detected, the more internal links there are, the more likely the page to be detected is a login page. In the embodiment of the invention, a multi-dimensional detection mode is adopted, and a weight is allocated to the detection result of each dimension. The server can judge whether the ratio of the number of URL addresses in the HTML file to the number of bytes of the page text is larger than a fifth preset threshold, and allocate a fourth weight to the page to be detected according to the judgment result. For example, when the ratio is larger than the fifth preset threshold, a non-zero fourth weight is allocated to the page to be detected, and when the ratio is smaller than the fifth preset threshold, a fourth weight of zero is allocated to the page to be detected.
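The ratio test in step 404 can be sketched as below. The function name, the default threshold and weight, the UTF-8 byte count, and the handling of an empty page text are all assumptions for illustration.

```python
def fourth_weight(urls, page_text, ratio_threshold=0.01, weight=1):
    """Return `weight` if (number of URLs) / (bytes of page text) exceeds the threshold.

    The byte count uses UTF-8 encoding (an assumption). A page with links but
    no text is treated as exceeding the threshold, which is also an assumption.
    """
    text_bytes = len(page_text.encode("utf-8"))
    if text_bytes == 0:
        return weight if urls else 0
    return weight if len(urls) / text_bytes > ratio_threshold else 0
```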

405. Judging whether the number of preset key words contained in the page text in the HTML file is larger than a sixth preset threshold value or not, and distributing a fifth weight to the page to be detected according to the judgment result;

The server may determine whether the number of preset key words contained in the page text in the HTML file is greater than a sixth preset threshold, and allocate a fifth weight to the page to be detected according to the determination result. Optionally, when the number of preset key words contained in the page text is not less than the sixth preset threshold, a fixed fifth weight may be allocated to the page to be detected; alternatively, it may be set that the larger the number of preset key words contained in the page text, the larger the allocated fifth weight. The specific setting is not limited herein.
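The two allocation options described for the fifth weight can be contrasted in a short sketch. All names and numeric defaults are hypothetical, and the upper cap in the scaled variant is an added assumption to keep one dimension from dominating the sum.

```python
def fifth_weight_fixed(keyword_count, count_threshold=2, weight=1):
    """Fixed option: a constant weight once the count reaches the threshold."""
    return weight if keyword_count >= count_threshold else 0

def fifth_weight_scaled(keyword_count, count_threshold=2,
                        weight_per_keyword=0.5, max_weight=2.0):
    """Scaled option: the weight grows with the keyword count, up to `max_weight`."""
    if keyword_count < count_threshold:
        return 0.0
    return min(keyword_count * weight_per_keyword, max_weight)
```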

406. Inputting the screenshot of the login page into a preset CNN classifier model for classification, and distributing a sixth weight to the page to be detected according to the classification result.

Optionally, in order to further improve the detection accuracy, a convolutional neural network CNN classifier may be introduced to detect whether the page to be detected is a landing page. Specifically, the server can collect a preset number of login page images as positive samples and a preset number of non-login pages as negative samples; and inputting the positive sample and the negative sample into an original CNN classifier model for training to obtain a preset CNN classifier model.

Specifically, the server may characterize the obtained positive sample and the negative sample according to a preset algorithm, such as a hash algorithm, a surf algorithm, a sift algorithm, and the like, to generate a corresponding feature vector, and after the sample is vectorized, the feature vector is recorded as X, and the label of manual classification is recorded as Y;

The vector X and the label Y are input into a classifier model for training. For example, they are input into a CNN classifier model, and the CNN model calculates, according to a preset algorithm, the parameters required to map the vector X to the label Y, finally yielding the preset CNN model. The trained model implements a mapping lr: x -> y from an unknown feature vector set X to the label set Y. The algorithmic principle of the CNN classifier model is prior art and is not described herein.

After the preset CNN classifier model is obtained through training, the server can obtain the page screenshot of the page to be detected through the rendering engine, can input the login page screenshot into the preset CNN classifier model for classification, and allocates a sixth weight to the page to be detected according to the classification result, for example, if the CNN classifier classifies the page screenshot of the page to be detected as a non-login interface, the sixth weight allocated to the page to be detected by the server may be zero.

407. And inputting the URL address in the HTML file into a long-short term memory network LSTM classifier model for classification, and distributing a seventh weight value to the page to be detected according to the classification result.

Optionally, in order to further improve the detection accuracy, a long-short term memory network (LSTM) classifier model may be introduced to detect whether the page to be detected is a login page. Specifically, the server may collect URL addresses of a preset number of login pages as positive samples and URL addresses of a preset number of non-login pages as negative samples, and input the positive samples and the negative samples into an original LSTM classifier model for training to obtain a preset LSTM classifier model.

The server can input the previously obtained URL address in the HTML file of the page to be detected into the LSTM classifier model for classification, and allocate a seventh weight to the page to be detected according to the classification result. For example, if the LSTM classifier classifies the URL address of the page to be detected as belonging to a non-login page, the seventh weight allocated by the server to the page to be detected may be zero.

408. And counting the sum of the weights obtained by the page to be detected, and if the sum of the weights is not less than a seventh preset threshold value, judging that the page to be detected is a login page.

The server may count the sum of the weights obtained by the page to be detected; in this embodiment, the sum of the fourth weight, the fifth weight, the sixth weight, and the seventh weight may be counted. If the sum of the weights is not less than the seventh preset threshold, the page to be detected is determined to be a login page. The specific seventh preset threshold may be set reasonably according to the configuration of the actual weights, which is not limited herein.

In the embodiment of the invention, page information of the page to be detected can be collected; the page information can include an HTML (hypertext markup language) file of the page to be detected and a page screenshot of the page to be detected. Detection is performed in four dimensions based on this page information, and four weights are allocated to the page to be detected according to the detection result of each dimension. Finally, the sum of the weights obtained by the page to be detected is counted, and if the sum is not less than a seventh preset threshold, the page to be detected is judged to be a login page, so that whether the page to be detected is a login page is detected from multiple dimensions.

It should be understood that, in various embodiments of the present invention, the sequence numbers of the above steps do not mean the execution sequence, and the execution sequence of each step should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

The foregoing embodiments describe the phishing page identification method based on icon icons in the embodiment of the present invention. The server in the embodiment of the present invention is described below with reference to fig. 5; an embodiment of the server in the embodiment of the present invention may include:

a first collecting module 501, configured to collect icon icons corresponding to security login pages in a preset security white list respectively, so as to form a first icon set;

the calculation module 502 is used for characterizing the icon in the first icon set according to a preset algorithm, and storing each feature vector of each icon in an associated manner to form a respective standard feature vector set;

the second acquisition module 503 is configured to acquire icon icons of the login page to be detected, and generate a corresponding second feature vector set according to the characterization of a preset algorithm;

the distribution module 504 is configured to determine, from at least two dimensions, a similarity between an icon corresponding to the second feature vector set and an icon corresponding to the standard feature vector set, and distribute a corresponding weight to the login page to be detected according to a determination result of each dimension;

the first counting module 505 is configured to count a sum of weights obtained by the login page to be detected, and if the sum of weights is not less than a third preset threshold, determine that the login page to be detected is a phishing page.

Optionally, as a possible implementation manner, referring to fig. 6, the allocating module 504 in the embodiment of the present invention includes:

a first allocating unit 5041, configured to allocate a first weight to the login page to be detected if a third feature vector set exists in each standard feature vector set corresponding to the icon set, and the number of feature vectors successfully matched in the second feature vector set and the third feature vector set is not less than a first preset threshold, where the similarity between two feature vectors is greater than a second preset threshold, and it is determined that the corresponding feature vectors are successfully matched;

the second allocating unit 5042 is configured to determine whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and allocate a second weight to the login page to be detected according to the determination result;

optionally, as a possible implementation manner, referring to fig. 6, the allocating module 504 in the embodiment of the present invention further includes:

a third allocating unit 5043, configured to determine whether a similarity between texture features of an icon corresponding to the login page corresponding to the third feature vector set and the icon of the login page to be detected is greater than a fourth preset threshold, and allocate a third weight to the login page to be detected according to the determination result.

Optionally, as a possible implementation manner, referring to fig. 7, the server in the embodiment of the present invention further includes:

a third collecting module 506, configured to collect page information of the page to be detected, where the page information at least includes a hypertext markup language HTML file corresponding to the page to be detected;

an extracting module 507, configured to extract the page text and the uniform resource locator (URL) addresses from the HTML file;

the second statistical module 508 is used for counting the number of preset key words contained in the text of the HTML file;

a fourth distribution module 509, configured to determine whether the ratio of the number of URL addresses in the HTML file to the number of bytes in the text of the file is greater than a fifth preset threshold, and distribute a fourth weight to the page to be detected according to the determination result;

a fifth distribution module 510, configured to determine whether the number of preset key words included in a document body in the HTML document is greater than a sixth preset threshold, and distribute a fifth weight to the page to be detected according to the determination result;

the third counting module 511 is configured to count a sum of weights obtained by the page to be detected in each detection process, and if the sum of weights is not less than a seventh preset threshold, determine that the page to be detected is a login page to be detected.

and a sixth allocating module 512, configured to input the screenshot of the login page into a preset convolutional neural network CNN classifier model for classification, and allocate a sixth weight to the page to be detected according to the classification result.

The server apparatus in the embodiment of the present invention is described above from the perspective of the modular functional entity, and the computer apparatus in the embodiment of the present invention is described below from the perspective of hardware processing:

For convenience of description, fig. 8 shows only the portion related to the embodiment of the present invention; for specific technical details that are not disclosed, please refer to the method portion of the embodiment of the present invention. The computer device 8 is generally a computer device with high processing capability, such as a server.

Referring to fig. 8, the computer device 8 includes: a power supply 810, a memory 820, a processor 830, a wired or wireless network interface 840, and a computer program stored in the memory and executable on the processor. The processor, when executing the computer program, implements the steps of the above-described embodiments of the phishing page identification method based on icon icons, such as steps 201 to 206 shown in fig. 2. Alternatively, the processor, when executing the computer program, implements the functions of each module or unit in the above-described device embodiments.

In some embodiments of the present invention, the processor is specifically configured to implement the following steps:

acquiring icon icons of the login page to be detected, and generating a corresponding second characteristic vector set according to the preset algorithm characterization;

and counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold value, judging that the login page to be detected is a phishing page.

Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:

if a third feature vector set exists in each standard feature vector set corresponding to the icon set, and the number of feature vectors successfully matched in the second feature vector set and the third feature vector set is not less than a first preset threshold value, distributing a first weight value for the login page to be detected, wherein the similarity of the two feature vectors is greater than the second preset threshold value, and judging that the matching of the corresponding feature vectors is successful;

and judging whether the domain name corresponding to the third feature vector is the same as the domain name corresponding to the login page to be detected, and distributing a second weight to the login page to be detected according to the judgment result.

extracting a file text and a Uniform Resource Locator (URL) address from an HTML file;

counting the number of preset key words contained in the file text of the HTML file;

judging whether the ratio of the number of URL addresses in the HTML file to the number of bytes in the text of the file is larger than a fifth preset threshold value or not, and distributing a fourth weight to the page to be detected according to the judgment result;

and counting the sum of the weights obtained by the page to be detected in each detection process, and if the sum of the weights is not less than a seventh preset threshold value, judging that the page to be detected is the login page to be detected.

inputting the screenshot of the login page into a preset Convolutional Neural Network (CNN) classifier model for classification, and distributing a sixth weight to the page to be detected according to the classification result.

The computer device 8 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. Illustratively, a computer program may be partitioned into one or more modules/units, which are stored in a memory and executed by a processor. One or more modules/units may be a series of computer program instruction segments capable of performing certain functions, the instruction segments being used to describe the execution of a computer program in a computer device.

Those skilled in the art will appreciate that the configuration shown in fig. 8 does not constitute a limitation of the computer apparatus 8, that the computer apparatus 8 may comprise more or less components than those shown, or some components may be combined, or a different arrangement of components, e.g. the computer apparatus may further comprise input-output devices, buses, etc.

The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor or any conventional processor or the like. The processor is the control center of the computer device and connects the various parts of the overall computer device through various interfaces and lines.

The memory may be used to store computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory, as well as by invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

The present invention also provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor, performs the steps of:

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A phishing page identification method based on icon icons is characterized by comprising the following steps:

respectively characterizing the icon in the first icon set according to a preset algorithm, and storing each feature vector of each icon in an associated manner to form respective standard feature vector sets; acquiring icon icons of the login page to be detected, and characterizing the corresponding icon icons according to the preset algorithm to generate a corresponding second characteristic vector set;

2. The method according to claim 1, wherein the determining the similarity between the login page to be detected and each secure login page from at least two dimensions, and assigning the corresponding weight to the login page to be detected according to the determination result of each dimension comprises:

3. The method according to claim 2, before counting the sum of the weights obtained from the landing pages to be detected, further comprising:

4. The method according to claim 3, wherein before the collecting the icon of the landing page to be detected, further comprising:

5. The method according to claim 4, wherein the page information further includes a screenshot of the page to be detected, and before counting the sum of the weights obtained by the page to be detected, the method further comprises:

6. A server, comprising:

7. The server of claim 6, wherein the assignment module comprises:

and the second allocating unit is used for judging whether the domain name corresponding to the third characteristic vector set and the login page to be detected is the same or not and allocating a second weight to the login page to be detected according to the judgment result.

8. The server of claim 7, further comprising:

9. A computer arrangement, characterized in that the computer arrangement comprises a processor for implementing the steps of the method according to any one of claims 1 to 5 when executing a computer program stored in a memory.

10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when executed by a processor implementing the steps of the method according to any one of claims 1 to 5.