CN110650108A - Phishing page identification method based on icon and related equipment - Google Patents

Phishing page identification method based on icon and related equipment

Info

Publication number
CN110650108A
Authority
CN
China
Prior art keywords
page
detected
icon
login page
login
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810671754.8A
Other languages
Chinese (zh)
Inventor
马长春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN201810671754.8A priority Critical patent/CN110650108A/en
Publication of CN110650108A publication Critical patent/CN110650108A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the invention provides a phishing page identification method based on icons and related equipment. In the embodiment of the invention, real-time anti-counterfeiting identification is performed on the login page to be detected from multiple dimensions. A server can collect the icon corresponding to each secure login page in a preset security whitelist and extract features from each icon according to a preset algorithm to generate a corresponding standard feature vector set. Taking the standard feature vector sets as references, the similarity between the login page to be detected and each secure login page can be judged from at least two dimensions, and a corresponding weight is assigned to the login page to be detected according to the judgment result of each dimension. Finally, the sum of the weights obtained by the login page to be detected is counted to judge whether the login page to be detected is a phishing page, which improves the accuracy of phishing page identification.

Description

Phishing page identification method based on icon and related equipment
Technical Field
The invention relates to the technical field of network security, in particular to a phishing page identification method based on icons and related equipment.
Background
Phishing is an attack that entices recipients to give up sensitive information (such as user names, passwords, account IDs, ATM PIN codes or credit card details) by mass-sending fraudulent spam purporting to come from banks or other well-known institutions. Hackers often forge phishing pages; when a user accesses a forged phishing page and enters the corresponding sensitive information, the page stores it, so that the purpose of stealing the sensitive information is achieved.
Existing webpage counterfeiting detection schemes are usually based on blacklist technology: screening relies mainly on blacklists established by security vendors. A vendor's blacklist is usually updated only after a phishing website has already caused harm, so the website cannot be identified when it first appears.
In view of the above, a new phishing page identification method is needed to reduce the risk of phishing.
Disclosure of Invention
The embodiment of the invention provides a phishing page identification method based on icons and related equipment, which are used for identifying phishing pages.
A first aspect of the embodiment of the invention provides a phishing page identification method based on icons, which is characterized by comprising the following steps:
respectively collecting the icons corresponding to all the secure login pages in a preset security whitelist to form a first icon set;
respectively extracting features from the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets;
acquiring the icon of the login page to be detected, and extracting features from that icon according to the preset algorithm to generate a corresponding second feature vector set;
judging the similarity between the login page to be detected and each secure login page from at least two dimensions according to the second feature vector set and the standard feature vector sets, and assigning corresponding weights to the login page to be detected according to the judgment result of each dimension;
and counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold, judging that the login page to be detected is a phishing page.
Optionally, as a possible implementation manner, in the embodiment of the present invention, the judging of the similarity between the login page to be detected and each secure login page from at least two dimensions, and the assigning of corresponding weights to the login page to be detected according to the judgment result of each dimension, includes:
if a third feature vector set exists among the standard feature vector sets corresponding to the first icon set such that the number of successfully matched feature vectors between the second feature vector set and the third feature vector set is not less than a first preset threshold, assigning a first weight to the login page to be detected, wherein two feature vectors are judged to be successfully matched when the similarity between them is greater than a second preset threshold;
and judging whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and assigning a second weight to the login page to be detected according to the judgment result.
Optionally, as a possible implementation manner, the phishing page identification method based on icons in the embodiment of the present invention further includes:
judging whether the similarity between the texture features of the icon of the login page corresponding to the third feature vector set and the texture features of the icon of the login page to be detected is greater than a fourth preset threshold, and assigning a third weight to the login page to be detected according to the judgment result.
Optionally, as a possible implementation manner, the phishing page identification method based on icons in the embodiment of the present invention further includes:
acquiring page information of a page to be detected, wherein the page information at least comprises the Hypertext Markup Language (HTML) file corresponding to the page to be detected;
extracting the body text and the Uniform Resource Locator (URL) addresses from the HTML file;
counting the number of preset keywords contained in the body text of the HTML file;
judging whether the ratio of the number of URL addresses in the HTML file to the number of bytes of the body text is greater than a fifth preset threshold, and assigning a fourth weight to the page to be detected according to the judgment result;
judging whether the number of preset keywords contained in the body text of the HTML file is greater than a sixth preset threshold, and assigning a fifth weight to the page to be detected according to the judgment result;
and counting the sum of the weights obtained by the page to be detected in each detection process, and if the sum of the weights is not less than a seventh preset threshold, judging that the page to be detected is a login page to be detected.
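The login page pre-check described in the steps above can be sketched as follows. This is only an illustration under stated assumptions: the names `TextAndUrlParser` and `login_page_score`, the default weight values, the choice of `href`, `src` and `action` as the URL-bearing attributes, and the decision to count distinct keywords are all hypothetical and are not specified by the source.

```python
from html.parser import HTMLParser


class TextAndUrlParser(HTMLParser):
    """Collects visible text and URL-bearing attribute values from an HTML file."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.urls = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            # Assumption: these three attributes are treated as URL addresses.
            if name in ("href", "src", "action") and value:
                self.urls.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def login_page_score(html_text, keywords, ratio_threshold, keyword_threshold,
                     fourth_weight=1, fifth_weight=1):
    """Return the sum of the fourth and fifth weights earned by a page."""
    parser = TextAndUrlParser()
    parser.feed(html_text)
    parser.close()
    text = " ".join(parser.text_parts)
    text_bytes = len(text.encode("utf-8"))

    score = 0
    # Fourth weight: ratio of URL count to body-text byte count.
    if text_bytes and len(parser.urls) / text_bytes > ratio_threshold:
        score += fourth_weight
    # Fifth weight: number of distinct preset keywords found in the body text.
    lowered = text.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in lowered)
    if hits > keyword_threshold:
        score += fifth_weight
    return score
```

A caller would then compare the returned score with the seventh preset threshold to decide whether the page should be treated as a login page to be detected.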
Optionally, as a possible implementation manner, the phishing page identification method based on icons in the embodiment of the present invention further includes:
inputting a screenshot of the login page into a preset Convolutional Neural Network (CNN) classifier model for classification, and assigning a sixth weight to the page to be detected according to the classification result.
A second aspect of an embodiment of the present invention provides a server, including:
the first acquisition module is used for respectively collecting the icons corresponding to all the secure login pages in a preset security whitelist to form a first icon set;
the calculation module is used for respectively extracting features from the icons in the first icon set according to a preset algorithm and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets;
the second acquisition module is used for acquiring the icon of the login page to be detected and extracting features from that icon according to the preset algorithm to generate a corresponding second feature vector set;
the distribution module is used for judging, from at least two dimensions, the similarity between the icon corresponding to the second feature vector set and the icons corresponding to the standard feature vector sets, and assigning corresponding weights to the login page to be detected according to the judgment result of each dimension;
and the first counting module is used for counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold, judging that the login page to be detected is a phishing page.
Optionally, as a possible implementation manner, the distribution module in the embodiment of the present invention includes:
the first allocation unit is used for assigning a first weight to the login page to be detected if a third feature vector set exists among the standard feature vector sets corresponding to the first icon set such that the number of successfully matched feature vectors between the second feature vector set and the third feature vector set is not less than a first preset threshold, wherein two feature vectors are judged to be successfully matched when the similarity between them is greater than a second preset threshold;
the second allocation unit is used for judging whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and assigning a second weight to the login page to be detected according to the judgment result.
Optionally, as a possible implementation manner, the distribution module in the embodiment of the present invention further includes:
the third allocation unit is used for judging whether the similarity between the texture features of the icon of the login page corresponding to the third feature vector set and the texture features of the icon of the login page to be detected is greater than a fourth preset threshold, and assigning a third weight to the login page to be detected according to the judgment result.
Optionally, as a possible implementation manner, the server in the embodiment of the present invention further includes:
the third acquisition module is used for acquiring page information of a page to be detected, wherein the page information at least comprises the Hypertext Markup Language (HTML) file corresponding to the page to be detected;
the extraction module is used for extracting the body text and the Uniform Resource Locator (URL) addresses from the HTML file;
the second counting module is used for counting the number of preset keywords contained in the body text of the HTML file;
the fourth distribution module is used for judging whether the ratio of the number of URL addresses in the HTML file to the number of bytes of the body text is greater than a fifth preset threshold, and assigning a fourth weight to the page to be detected according to the judgment result;
the fifth distribution module is used for judging whether the number of preset keywords contained in the body text of the HTML file is greater than a sixth preset threshold, and assigning a fifth weight to the page to be detected according to the judgment result;
and the third counting module is used for counting the sum of the weights obtained by the page to be detected in each detection process, and if the sum of the weights is not less than a seventh preset threshold, judging that the page to be detected is a login page to be detected.
Optionally, as a possible implementation manner, the server in the embodiment of the present invention further includes:
the sixth distribution module is used for inputting a screenshot of the login page into a preset Convolutional Neural Network (CNN) classifier model for classification, and assigning a sixth weight to the page to be detected according to the classification result.
A third aspect of an embodiment of the present invention provides a computer apparatus, which is characterized in that the computer apparatus includes a processor, and the processor is configured to implement the steps in the first aspect and any one of the possible implementations of the first aspect when executing a computer program stored in a memory.
A fourth aspect of an embodiment of the present invention provides a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps in the first aspect and any one of the possible implementations of the first aspect.
According to the above technical solutions, the embodiment of the invention has the following advantages:
in the embodiment of the invention, real-time anti-counterfeiting identification is performed on the login page to be detected from multiple dimensions. A server can collect the icon corresponding to each secure login page in a preset security whitelist and extract features from each icon according to a preset algorithm to generate a corresponding standard feature vector set. Taking the standard feature vector sets as references, the similarity between the login page to be detected and each secure login page can be judged from at least two dimensions, and a corresponding weight is assigned to the login page to be detected according to the judgment result of each dimension. Finally, the sum of the weights obtained by the login page to be detected is counted to judge whether the login page to be detected is a phishing page, which improves the accuracy of phishing page identification.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a phishing page identification method based on icons in the embodiment of the invention;
FIG. 2 is a schematic diagram of another embodiment of a phishing page identification method based on icons in the embodiment of the invention;
FIG. 3 is a schematic diagram of another embodiment of a phishing page identification method based on icons in the embodiment of the invention;
FIG. 4 is a schematic diagram of an embodiment of a method for identifying the page to be detected in the embodiment of the present invention;
FIG. 5 is a diagram of an embodiment of a server in an embodiment of the invention;
FIG. 6 is a diagram of another embodiment of a server in an embodiment of the invention;
FIG. 7 is a diagram of another embodiment of a server in an embodiment of the invention;
FIG. 8 is a diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a phishing page identification method based on icons and related equipment, which are used for identifying phishing pages.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow in the embodiment of the present invention is described below. Referring to FIG. 1, an embodiment of a phishing page identification method based on icons in the embodiment of the present invention may include:
101. Respectively collecting the icons corresponding to all the secure login pages in the preset security whitelist to form a first icon set.
In the embodiment of the invention, in order to identify the login page to be detected, the security icons corresponding to target application programs that are easily phished need to be collected in advance as references. The icon here refers to the icon displayed in a browser window tab, which can be a company logo or a separately designed image. Common phishing target application classes may include a bank class (e.g., China Merchants Bank), an IM class (e.g., QQ), a document sharing class (e.g., Baidu Cloud), a mailbox class (e.g., Sina Mail), a shopping class (e.g., Taobao), and so forth. The server can collect the security icons corresponding to the login pages of common phishing target applications to form a first icon set.
The icon may be acquired by using a crawler engine to capture the icon of the login page according to the URL address of the login page, or by reading the storage location of the icon from the corresponding HTML file; this is not limited herein. For example, when the icon is obtained from the HTML file, the head tag of the page contains a tag of the form <link rel="shortcut icon" href="http://(address of the picture)">, where the address corresponds to the site's own directory, and the corresponding picture can be obtained simply by finding that tag.
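As a minimal sketch of the HTML-based acquisition route, the snippet below scans a page's link tags for a rel value containing "icon" and resolves the href against the page URL. The names `IconLinkParser` and `find_icon_urls` are hypothetical, and only the Python standard library is assumed; downloading the image itself is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Collects href values of <link> tags whose rel attribute contains "icon"."""

    def __init__(self):
        super().__init__()
        self.icon_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        rel_tokens = attr_map.get("rel", "").lower().split()
        # Covers both rel="icon" and rel="shortcut icon".
        if "icon" in rel_tokens and attr_map.get("href"):
            self.icon_hrefs.append(attr_map["href"])


def find_icon_urls(html_text, page_url):
    """Return absolute icon URLs declared in the page's <link> tags."""
    parser = IconLinkParser()
    parser.feed(html_text)
    parser.close()
    return [urljoin(page_url, href) for href in parser.icon_hrefs]
```

Relative href values are resolved with `urljoin`, so an icon declared as `/favicon.ico` is returned as a full address on the login page's host.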
102. Extracting features from the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets.
In order to automatically identify the login page to be detected subsequently, features need to be extracted from the icons in the first icon set to obtain the feature vectors of each icon. Algorithms that can be used in the embodiment of the present invention include hash algorithms, for example the LSH algorithm (Locality-Sensitive Hashing), the SH algorithm (Spectral Hashing) and the AGH algorithm (Anchor Graph Hashing), as well as the SURF algorithm (Speeded Up Robust Features) and the SIFT algorithm (Scale-Invariant Feature Transform). These algorithms are prior art and are not described herein. It can be understood that many icon feature extraction algorithms exist, but the same algorithm must be applied to every icon so that the resulting feature vectors can be matched subsequently.
For example, when the SURF algorithm is adopted, dozens of feature vectors can be extracted from each icon image, each being a 158-dimensional SURF feature vector, and the feature vectors of the same icon are stored in an associated manner to form its standard feature vector set.
103. Acquiring the icon of the login page to be detected, and extracting features from it according to the preset algorithm to generate a corresponding second feature vector set.
When the page to be detected is determined to be a login page, the server can obtain the icon of the page to be detected and extract features from it using the same preset algorithm to generate a corresponding second feature vector set.
104. Judging the similarity between the login page to be detected and each secure login page from at least two dimensions, and assigning corresponding weights to the login page to be detected according to the judgment result of each dimension.
After the second feature vector set and the standard feature vector sets are obtained, the server may judge, from at least two dimensions, the similarity between the login page to be detected and each secure login page, and assign a corresponding weight to the login page to be detected according to the judgment result of each dimension. The number of detection dimensions is not limited here, and exemplary detection dimensions are described in detail in the following embodiments.
105. Counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold, judging that the login page to be detected is a phishing page.
The server can count the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than the third preset threshold, the login page to be detected is judged to be a phishing page.
In the embodiment of the invention, real-time anti-counterfeiting identification is performed on the login page to be detected from multiple dimensions. A server can collect the icon corresponding to each secure login page in a preset security whitelist and extract features from each icon according to a preset algorithm to generate a corresponding standard feature vector set. Taking the standard feature vector sets as references, the similarity between the login page to be detected and each secure login page can be judged from at least two dimensions, and a corresponding weight is assigned to the login page to be detected according to the judgment result of each dimension. Finally, the sum of the weights obtained by the login page to be detected is counted to judge whether the login page to be detected is a phishing page, which improves the accuracy of phishing page identification.
For convenience of understanding, a specific flow in an embodiment of the present invention is described in detail below. Referring to FIG. 2, another embodiment of a phishing page identification method based on icons in the embodiment of the present invention may include:
201. Respectively collecting the icons corresponding to all the secure login pages in the preset security whitelist to form a first icon set.
202. Extracting features from the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets.
203. Acquiring the icon of the login page to be detected, and extracting features from it according to the preset algorithm to generate a corresponding second feature vector set.
Steps 201 to 203 in the embodiment of the present invention are similar to steps 101 to 103; refer to steps 101 to 103 for details, which are not repeated here.
When the page to be detected is determined to be a login page, the icon of the page to be detected can be obtained, and features are extracted from it using the same preset algorithm to generate a corresponding second feature vector set.
204. Matching the second feature vector set with each standard feature vector set.
In the embodiment of the invention, the login page to be detected can be detected and identified from multiple dimensions. Specifically, the second feature vector set corresponding to the login page to be detected can be matched with each standard feature vector set. If a third feature vector set exists among the standard feature vector sets corresponding to the first icon set such that the number of successfully matched feature vectors between the second feature vector set and the third feature vector set is not less than a first preset threshold, a first weight is assigned to the login page to be detected, where two feature vectors are determined to be successfully matched when the similarity between them is greater than a second preset threshold. Optionally, if no such third feature vector set exists among the standard feature vector sets corresponding to the first icon set, the first weight may not be assigned, or a first weight of zero may be assigned to the login page to be detected.
Optionally, whether the similarity between two feature vectors is greater than the second preset threshold may be determined by calculating the Euclidean distance between them: if the Euclidean distance between the two feature vectors is smaller than a specific threshold, the similarity between them is determined to be greater than the second preset threshold, and the corresponding feature vectors are determined to be successfully matched. It is to be understood that, in the embodiment of the present invention, the algorithm for determining the vector similarity may be the Euclidean distance algorithm, or may also be the Manhattan distance algorithm, the Chebyshev distance algorithm, the Minkowski distance algorithm, the Mahalanobis distance algorithm, the Hamming distance algorithm, or the like, and is not limited herein.
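The distance-based matching test can be illustrated with a short sketch covering four of the listed metrics. The function names and the `max_distance` parameter (standing in for the "specific threshold" on distance) are hypothetical; vectors are assumed to be equal-length sequences of numbers.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute per-dimension difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def is_match(a, b, max_distance, distance=euclidean):
    """Two vectors match when their distance is below max_distance."""
    return distance(a, b) < max_distance
```

Passing a different function as `distance` swaps the metric without changing the matching rule, which mirrors the statement that the choice of distance algorithm is not limited.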
Further, in order to reduce the amount of calculation, in the embodiment of the present invention, the value range that each dimension of a feature vector must fall within to satisfy the matching condition may be calculated from the second preset threshold and each standard feature vector set. The similarity between two feature vectors is first screened according to this value range, and the full similarity calculation is performed only for feature vectors that fall within it.
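One way to realize this pre-screening, assuming the Euclidean distance is the similarity measure, relies on the fact that a Euclidean distance below d implies every per-dimension difference is also below d. The sketch below uses hypothetical names (`dimension_ranges`, `passes_prefilter`, `matches`) and is not the source's stated procedure.

```python
import math


def dimension_ranges(standard_vec, max_distance):
    """Per-dimension open interval a candidate must fall in to possibly match."""
    return [(x - max_distance, x + max_distance) for x in standard_vec]


def passes_prefilter(candidate, ranges):
    """Cheap screen: every coordinate must lie strictly inside its interval."""
    return all(low < x < high for x, (low, high) in zip(candidate, ranges))


def matches(candidate, standard_vec, max_distance):
    """Run the full distance computation only for candidates that pass the screen."""
    ranges = dimension_ranges(standard_vec, max_distance)
    if not passes_prefilter(candidate, ranges):
        return False
    distance = math.sqrt(sum((c - s) ** 2 for c, s in zip(candidate, standard_vec)))
    return distance < max_distance
```

In practice the ranges for the standard vectors would be computed once and reused, since the standard feature vector sets do not change between detections; the screen stops at the first out-of-range coordinate, so most non-matching vectors are rejected without the full sum of squares.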
Specifically, for example, after SURF feature vectors are obtained by using the SURF algorithm, if the number of successfully matched SURF feature vectors in the second feature vector set is not less than one third of the number of feature vectors in the third feature vector set, it can be determined that the icon of the login page to be detected is similar to the icon of a preset secure login page, and the login page to be detected needs to be further detected.
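The one-third rule in this example can be sketched as below. The sketch assumes plain numeric vectors and Euclidean matching rather than real SURF descriptors, and the names `count_matched` and `icon_sets_match` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def count_matched(second_set, third_set, max_distance):
    """Count vectors in second_set that lie within max_distance of some vector in third_set."""
    return sum(
        1
        for candidate in second_set
        if any(euclidean(candidate, ref) < max_distance for ref in third_set)
    )


def icon_sets_match(second_set, third_set, max_distance):
    """True when the matched count is at least one third of the size of third_set."""
    if not third_set:
        return False
    # Integer comparison avoids floating-point rounding around the 1/3 boundary.
    return 3 * count_matched(second_set, third_set, max_distance) >= len(third_set)
```

Here the first preset threshold is expressed relative to the size of the third feature vector set, as in the example; a fixed absolute count would work the same way with a different comparison.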
205. Judging whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and assigning a second weight to the login page to be detected according to the judgment result.
The server can acquire, through the crawler engine, the domain names of the common phishing target applications and the domain name of the login page to be detected, judge whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and assign a second weight to the login page to be detected according to the judgment result.
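A minimal sketch of the domain comparison follows, using only the standard library. It assumes that "domain name" means the full host name of the URL and that a matching domain earns a weight of zero; both are assumptions, as are the names `hostname_of` and `second_weight` and the default weight of 2.

```python
from urllib.parse import urlsplit


def hostname_of(url):
    """Lower-cased host name of a URL, or an empty string if none is present."""
    return (urlsplit(url).hostname or "").lower()


def second_weight(whitelisted_url, candidate_url, weight_if_different=2):
    """Assign the second weight: zero for the same host, a positive value otherwise."""
    if hostname_of(whitelisted_url) == hostname_of(candidate_url):
        return 0
    return weight_if_different
```

Comparing full host names is the strictest reading; a deployment might instead compare registrable domains, which would require a public suffix list and is outside this sketch.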
206. Counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold, judging that the login page to be detected is a phishing page.
The server can count the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than the third preset threshold, the login page to be detected is judged to be a phishing page. For example, if the second feature vector set of the icon of the login page to be detected matches a third feature vector set as described in step 204, a first weight of 1 is assigned to the login page to be detected; if the domain name corresponding to the third feature vector set differs from the domain name corresponding to the login page to be detected, a second weight of 2 may be assigned to it. With the third preset threshold set to 3, the sum of the weights is 3, which is not less than the threshold, so the login page to be detected is determined to be a phishing page.
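The worked example above can be reproduced with a few lines of code. The weights 1 and 2 and the threshold 3 come from the example; the function names and the zero weights for the negative outcomes are assumptions.

```python
def collect_weights(icon_matched, domain_same, first_weight=1, second_weight=2):
    """Weights earned by a login page for the icon and domain dimensions."""
    weights = []
    weights.append(first_weight if icon_matched else 0)
    # The domain check is only meaningful when a matching icon identified a target.
    weights.append(second_weight if icon_matched and not domain_same else 0)
    return weights


def is_phishing(weights, threshold=3):
    """A page is judged to be phishing when the weight sum is not less than the threshold."""
    return sum(weights) >= threshold
```

A page that reuses a whitelisted icon on a different domain scores 1 + 2 = 3 and is flagged, while the genuine page (same icon, same domain) scores 1 and is not.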
In the embodiment of the invention, a server can collect in advance the icon corresponding to each secure login page in a preset security whitelist and extract features from each icon according to a preset algorithm to generate corresponding feature vectors, which form the standard feature vector sets used as references. Real-time anti-counterfeiting identification is then performed on the login page to be detected from multiple dimensions. The feature vectors of the icon of the login page to be detected are matched with each standard feature vector set, and if a successfully matched standard feature vector set exists, a first weight can be assigned to the login page to be detected. The domain name of the secure login page corresponding to the successfully matched icon is compared with the domain name of the login page to be detected, and a second weight is assigned to the login page to be detected according to the comparison result. Finally, the sum of the weights obtained by the login page to be detected is counted; if the sum of the weights is not less than the third preset threshold, the login page to be detected is judged to be a phishing page, which improves the accuracy of phishing page identification.
On the basis of the embodiment shown in fig. 2, in order to further improve the detection accuracy, whether the page to be detected is a phishing page may be detected from more dimensions. Referring to fig. 3, another embodiment of the icon-based phishing page identification method in the embodiment of the present invention may include:
301. Respectively collecting the icons corresponding to all the safe login pages in the preset safe white list to form a first icon set.
302. Extracting features from the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets.
303. Acquiring the icon of the login page to be detected, and extracting features according to the preset algorithm to generate a corresponding second feature vector set.
304. Matching the second feature vector set with each standard feature vector set.
Steps 301 to 304 in the embodiment of the present invention are similar to steps 201 to 204; please refer to steps 201 to 204 for details, which are not repeated here.
305. Judging whether the similarity of the texture features of the icon corresponding to the login page corresponding to the third feature vector set and the icon of the login page to be detected is greater than a fourth preset threshold value, and distributing a third weight to the login page to be detected according to the judgment result.
On the basis of step 304, when a third feature vector set exists, the login page to be detected needs to be further detected. In the embodiment of the present invention, the server may collect the texture features of the icon of the login page to be detected, determine whether the similarity between the texture features of the icon corresponding to the login page corresponding to the third feature vector set and those of the icon of the login page to be detected is greater than the fourth preset threshold, and allocate a third weight to the login page to be detected according to the determination result. Optionally, if the similarity of the texture features of the icons corresponding to the third feature vector set and the second feature vector set is not greater than the fourth preset threshold, a third weight of zero may be allocated to the login page to be detected.
The specific texture feature extraction method may be the local binary pattern (LBP) algorithm, a gray level co-occurrence matrix algorithm, a gray level gradient co-occurrence matrix algorithm, a Gabor wavelet texture algorithm, or the like. These texture feature extraction methods belong to the prior art and are not described again here.
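As an illustration of the texture comparison in step 305, the following is a minimal sketch using the basic 3x3 LBP operator, one of the algorithms listed above. It assumes the icons are already available as grayscale images represented as lists of pixel rows, and it uses histogram intersection as the similarity measure; the similarity measure, the default threshold of 0.8, the weight of 1, and all function names are assumptions for illustration, not values given in this document.

```python
def lbp_histogram(gray):
    """Normalized 256-bin histogram of basic 3x3 LBP codes.

    gray is a 2-D grayscale image given as a list of equal-length rows.
    Border pixels are skipped because they lack a full 3x3 neighborhood.
    """
    height, width = len(gray), len(gray[0])
    # Neighbor offsets, clockwise starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            center = gray[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbor at least as bright as the center sets one bit of the code.
                if gray[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def third_weight(icon_a, icon_b, texture_threshold=0.8, weight=1):
    """Assign the third weight when the texture similarity exceeds the threshold."""
    similarity = histogram_intersection(lbp_histogram(icon_a), lbp_histogram(icon_b))
    return weight if similarity > texture_threshold else 0
```

Histogram intersection is only one possible choice; a chi-square distance or another histogram comparison could be substituted without changing the structure of the step.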
306. And judging whether the domain name corresponding to the third characteristic vector set is the same as the domain name corresponding to the login page to be detected or not, and distributing a second weight to the login page to be detected according to the judgment result.
The server can use the crawler engine to obtain the common phishing target application programs and the domain name corresponding to the login page to be detected, judge whether the domain name of the safe login page corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and allocate a second weight to the login page to be detected according to the judgment result.
307. And counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold value, judging that the login page to be detected is a phishing page.
The server may count the sum of the weights obtained by the login page to be detected; in this embodiment, the sum of the first weight, the second weight, and the third weight may be counted. If the sum of the weights is not less than the third preset threshold, the login page to be detected is determined to be a phishing page. The specific third preset threshold may be set reasonably according to the configuration of the actual weights, which is not limited herein.
For example, if the third feature vector set found in step 304 matches the second feature vector set of the icon of the login page to be detected, a first weight of 1 is assigned to the login page to be detected; if the domain name corresponding to the login page to be detected is not the same as the domain name corresponding to the third feature vector set, a second weight of 2 may be assigned; and if the similarity between the texture features of the icon of the login page to be detected and those of the icon corresponding to the third feature vector set is greater than the fourth preset threshold, a third weight of 1 may be assigned. With the third preset threshold set to 3, the sum of the weights is 1 + 2 + 1 = 4, which is not less than the threshold, so the login page to be detected may be determined to be a phishing page.
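The weighted decision of steps 305 to 307 can be sketched as follows, using the example weights 1, 2 and 1 and the example threshold 3 from the preceding paragraph. The class and function names are hypothetical, and the choice to add nothing when the domain names are the same or the textures are dissimilar is an assumption; the embodiment only states that weights are assigned according to the judgment results.

```python
from dataclasses import dataclass


@dataclass
class IconChecks:
    icon_matched: bool     # a third feature vector set was found (step 304)
    domain_same: bool      # domain names are the same (step 306)
    texture_similar: bool  # texture similarity exceeds its threshold (step 305)


def weight_sum(checks, first_weight=1, second_weight=2, third_weight=1):
    """Sum the weights earned by the login page to be detected."""
    if not checks.icon_matched:
        # Without a matched icon there is no third feature vector set to compare against.
        return 0
    total = first_weight
    if not checks.domain_same:
        total += second_weight  # same icon on a different domain is suspicious
    if checks.texture_similar:
        total += third_weight
    return total


def is_phishing(checks, sum_threshold=3):
    """Judge a phishing page when the weight sum is not less than the threshold."""
    return weight_sum(checks) >= sum_threshold
```

With these example values, a page whose icon matches a white-listed icon but whose domain name differs already reaches the threshold (1 + 2 = 3), while a page on the white-listed domain itself reaches at most 2 and is not flagged.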
In practical application, the page to be detected may be a login page or a non-login page, and the detection object in the embodiments shown in fig. 1 to 3 is preferably a login page. To improve the detection accuracy, the pages to be detected may be preliminarily screened to pick out the login pages to be detected. Referring to fig. 4, on the basis of the embodiments shown in fig. 1 to fig. 3, an embodiment of identifying the login page to be detected in the embodiment of the present invention may include:
401. Acquiring page information of a page to be detected, wherein the page information at least comprises a hypertext markup language (HTML) file corresponding to the page to be detected.
A login page usually contains relatively little text and often contains preset key words related to login, such as "login", "register", "forget password", "automatically login", and "remember password", or translations of these key words into other languages. A login page also often has internal links to pages such as a main page, a registration page, a password recovery page, and a partner account login page.
The server can collect page information of the page to be detected, the page information can include an HTML file of the page to be detected, and then the page text and a URL (internal link) address contained in the HTML file can be extracted from the corresponding HTML file. Optionally, the server may build a crawler engine to crawl page information of the page to be detected.
402. And extracting the page text and the URL address from the HTML file.
After the server acquires the HTML file, the text and the URL address of the page to be detected can be extracted from the HTML file. Specifically, the URL address in the file may be extracted according to the HTML syntax rule.
Optionally, the server may convert the code in the HTML file into a preset format for storage and remove the scripts and special characters in the code; after the HTML format tags are also removed, the page text of the page to be detected is obtained. Optionally, according to the distribution of line spacing and word spacing of the code in the HTML file, the portions of code with larger line spacing and larger word spacing are removed, so as to further optimize the obtained page text. It can be understood that the method for extracting the text of the page to be detected from the HTML file can be reasonably adjusted according to the actual encoding mode of the HTML file, and is not limited herein.
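Step 402 can be sketched with the HTML parser from the Python standard library. The sketch below drops script and style content, strips the format tags, and collects URL addresses from `href` and `action` attributes; which attributes count as URL addresses is an assumption, since the embodiment only says that the addresses are extracted according to the HTML syntax rules. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects visible page text and URL addresses from an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.text_parts = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            # Assumption: links and form targets are the URL addresses of interest.
            if name in ("href", "action") and value:
                self.urls.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_text_and_urls(html):
    """Return (page_text, url_list) for an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.urls
```

The returned page text and URL list are the inputs needed for the key word count and the URL-to-text ratio described in the following steps.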
403. And counting the number of preset key words contained in the page body of the HTML file.
After extracting the page text of the HTML file, the server can count the number of preset key words contained in the page text of the HTML file, wherein the preset key words can be words such as 'login', 'register', 'forget password', 'automatically login', 'remember password', and the like, and translations of the preset key words in languages of other countries. It can be understood that the preset key words can be reasonably set according to different types of languages and different login pages, and the specific setting is not limited herein.
404. Judging whether the ratio of the number of URL addresses in the HTML file to the number of bytes of the page text is larger than a fifth preset threshold value or not, and distributing a fourth weight to the page to be detected according to the judgment result;
The page text of a common login page is relatively short, and a login page contains internal links to several pages such as a main page, a registration page, a password recovery page and a partner account login page. For a given amount of page text, the more internal links the page to be detected contains, the more likely it is to be a login page. The embodiment of the invention adopts a multi-dimensional detection mode and assigns a weight to the detection result of each dimension. The server can judge whether the ratio of the number of URL addresses in the HTML file to the number of bytes of the page text is larger than a fifth preset threshold value and assign a fourth weight to the page to be detected according to the judgment result. For example, when the ratio is larger than the fifth preset threshold value, a non-zero fourth weight is assigned to the page to be detected, and when the ratio is smaller than the fifth preset threshold value, a fourth weight of zero is assigned to the page to be detected.
405. Judging whether the number of preset key words contained in the page text in the HTML file is larger than a sixth preset threshold value or not, and distributing a fifth weight to the page to be detected according to the judgment result;
The server may determine whether the number of preset key words included in the page text in the HTML file is greater than a sixth preset threshold, and allocate a fifth weight to the page to be detected according to the determination result. Optionally, if the number of preset key words included in the page text is not less than the sixth preset threshold, a fixed fifth weight may be allocated to the page to be detected; alternatively, the fifth weight may be set to increase with the number of preset key words included in the page text. The specific setting is not limited herein.
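Steps 403 to 405 can be sketched as follows. The key word list reuses the English examples given above; the threshold values, the weight values, the UTF-8 byte count, and the function names are assumptions chosen only to make the sketch runnable. The comparison uses "greater than", following the wording of step 405.

```python
LOGIN_KEY_WORDS = (
    "login",
    "register",
    "forget password",
    "automatically login",
    "remember password",
)


def count_key_words(page_text, key_words=LOGIN_KEY_WORDS):
    """Count case-insensitive occurrences of the preset key words.

    Note that a phrase containing a shorter key word (for example
    "automatically login" contains "login") is counted for both.
    """
    text = page_text.lower()
    return sum(text.count(word) for word in key_words)


def fourth_weight(url_count, page_text, ratio_threshold=0.01, weight=1):
    """Weight based on the ratio of URL addresses to bytes of page text."""
    byte_count = len(page_text.encode("utf-8"))
    if byte_count == 0:
        return 0  # no page text, so the ratio is undefined; assign zero
    return weight if url_count / byte_count > ratio_threshold else 0


def fifth_weight(key_word_count, count_threshold=2, weight=1):
    """Fixed weight when the key word count exceeds the threshold."""
    return weight if key_word_count > count_threshold else 0
```

A variant in which the fifth weight grows with the key word count, as the embodiment also allows, would replace the fixed `weight` with a function of `key_word_count`.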
406. Inputting the screenshot of the login page into a preset CNN classifier model for classification, and distributing a sixth weight to the page to be detected according to the classification result.
Optionally, in order to further improve the detection accuracy, a convolutional neural network (CNN) classifier may be introduced to detect whether the page to be detected is a login page. Specifically, the server can collect a preset number of login page images as positive samples and a preset number of non-login page images as negative samples, and input the positive samples and the negative samples into an original CNN classifier model for training to obtain a preset CNN classifier model.
Specifically, the server may extract features from the obtained positive and negative samples according to a preset algorithm, such as a hash algorithm, the SURF algorithm, or the SIFT algorithm, to generate corresponding feature vectors. After the samples are vectorized, the feature vector is recorded as X and the manually assigned classification label is recorded as Y.
The vector X and the label Y are input into a classifier model for training. For example, when the vector X and the label Y are input into a CNN classifier model, the CNN model calculates, according to a preset algorithm, the parameters required to map the vector X to the label Y, finally yielding the preset CNN model. This model provides a mapping lr from an unknown feature vector set X to the label set Y, written lr: x -> y. The algorithm principle of the specific CNN classifier model is the prior art and is not described herein.
After the preset CNN classifier model is obtained through training, the server can obtain the page screenshot of the page to be detected through the rendering engine, input the page screenshot into the preset CNN classifier model for classification, and allocate a sixth weight to the page to be detected according to the classification result. For example, if the CNN classifier classifies the page screenshot of the page to be detected as a non-login page, the sixth weight allocated to the page to be detected by the server may be zero.
407. And inputting the URL address in the HTML file into a long-short term memory network LSTM classifier model for classification, and distributing a seventh weight value to the page to be detected according to the classification result.
Optionally, in order to further improve the detection accuracy, a long short-term memory (LSTM) network classifier model may be introduced to detect whether the page to be detected is a login page. Specifically, the server may collect the URL addresses of a preset number of login pages as positive samples and the URL addresses of a preset number of non-login pages as negative samples, and input the positive samples and the negative samples into an original LSTM classifier model for training to obtain a preset LSTM classifier model.
The server can input the previously obtained URL address in the HTML file of the page to be detected into the LSTM classifier model for classification, and allocate a seventh weight to the page to be detected according to the classification result. For example, if the LSTM classifier classifies the URL address of the page to be detected as that of a non-login page, the seventh weight allocated by the server to the page to be detected may be zero.
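The embodiment does not specify how a URL address is represented before it is fed to the LSTM classifier. One common possibility, shown here purely as an assumption, is a character-level encoding in which each character is mapped to an integer index and the sequence is padded or truncated to a fixed length. The alphabet, the reserved indices, the maximum length of 64, and the function name are all hypothetical.

```python
import string

# Assumed alphabet: lowercase letters, digits, and characters commonly found in URLs.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
# Index 0 is reserved for padding and index 1 for characters outside the alphabet.
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_url(url, max_length=64):
    """Encode a URL as a fixed-length list of integer character indices."""
    indices = [CHAR_TO_INDEX.get(ch, 1) for ch in url.lower()[:max_length]]
    # Right-pad with zeros so that every encoded URL has the same length.
    return indices + [0] * (max_length - len(indices))
```

Sequences produced this way could serve as the positive and negative training samples and as the classification input, with the LSTM model itself built in whichever deep learning framework the implementer prefers.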
408. And counting the sum of the weights obtained by the page to be detected, and if the sum of the weights is not less than a seventh preset threshold value, judging that the page to be detected is a login page.
The server may count the sum of the weights obtained by the page to be detected; in this embodiment, the sum of the fourth weight, the fifth weight, the sixth weight, and the seventh weight may be counted. If the sum of the weights is not less than the seventh preset threshold, the page to be detected is determined to be a login page. The specific seventh preset threshold may be reasonably set according to the configuration of the actual weights, which is not limited herein.
In the embodiment of the invention, the page information of the page to be detected can be collected, and the page information can comprise an HTML (hypertext markup language) file of the page to be detected and a page screenshot of the page to be detected. Four-dimensional detection is carried out based on the page information of the page to be detected, and four weights are assigned to the page to be detected according to the detection result of each dimension. Finally, the sum of the weights obtained by the page to be detected is counted, and if the sum of the weights is not less than the seventh preset threshold value, the page to be detected is judged to be a login page. In this way, whether the page to be detected is a login page is detected from multiple dimensions.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the above steps do not mean the execution sequence, and the execution sequence of each step should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The foregoing embodiments describe the icon-based phishing page identification method in the embodiment of the present invention; the server in the embodiment of the present invention is described below. Referring to fig. 5, an embodiment of a server in the embodiment of the present invention may include:
a first collecting module 501, configured to respectively collect the icons corresponding to the safe login pages in a preset safe white list, so as to form a first icon set;
the calculation module 502 is used for extracting features from the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets;
the second acquisition module 503 is configured to acquire the icon of the login page to be detected, and extract features according to the preset algorithm to generate a corresponding second feature vector set;
the distribution module 504 is configured to determine, from at least two dimensions, a similarity between an icon corresponding to the second feature vector set and an icon corresponding to the standard feature vector set, and distribute a corresponding weight to the login page to be detected according to a determination result of each dimension;
the first counting module 505 is configured to count a sum of weights obtained by the login page to be detected, and if the sum of weights is not less than a fourth preset threshold, determine that the login page to be detected is a phishing page.
Optionally, as a possible implementation manner, referring to fig. 6, the allocating module 504 in the embodiment of the present invention includes:
a first allocating unit 5041, configured to allocate a first weight to the login page to be detected if a third feature vector set exists among the standard feature vector sets corresponding to the icon set such that the number of successfully matched feature vectors between the second feature vector set and the third feature vector set is not less than a first preset threshold, where two feature vectors are judged to be successfully matched when their similarity is greater than a second preset threshold;
the second allocating unit 5042 is configured to determine whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and allocate a second weight to the login page to be detected according to the determination result;
optionally, as a possible implementation manner, referring to fig. 6, the allocating module 504 in the embodiment of the present invention further includes:
a third allocating unit 5043, configured to determine whether a similarity between texture features of an icon corresponding to the login page corresponding to the third feature vector set and the icon of the login page to be detected is greater than a fourth preset threshold, and allocate a third weight to the login page to be detected according to the determination result.
Optionally, as a possible implementation manner, referring to fig. 7, the server in the embodiment of the present invention further includes:
a third collecting module 506, configured to collect page information of the page to be detected, where the page information at least includes a hypertext markup language HTML file corresponding to the page to be detected;
an extracting module 507, configured to extract the file text and the uniform resource locator (URL) addresses from the HTML file;
the second statistical module 508 is used for counting the number of preset key words contained in the text of the HTML file;
a fourth distribution module 509, configured to determine whether the ratio of the number of URL addresses in the HTML file to the number of bytes in the text of the file is greater than a fifth preset threshold, and distribute a fourth weight to the page to be detected according to the determination result;
a fifth distribution module 510, configured to determine whether the number of preset key words included in a document body in the HTML document is greater than a sixth preset threshold, and distribute a fifth weight to the page to be detected according to the determination result;
the third counting module 511 is configured to count a sum of weights obtained by the page to be detected in each detection process, and if the sum of weights is not less than a seventh preset threshold, determine that the page to be detected is a login page to be detected.
Optionally, as a possible implementation manner, the server in the embodiment of the present invention further includes:
and a sixth allocating module 512, configured to input the screenshot of the login page into a preset convolutional neural network CNN classifier model for classification, and allocate a sixth weight to the page to be detected according to the classification result.
The server apparatus in the embodiment of the present invention is described above from the perspective of the modular functional entity, and the computer apparatus in the embodiment of the present invention is described below from the perspective of hardware processing:
For convenience of description, fig. 8 shows only the portion related to the embodiment of the present invention; for specific technical details that are not disclosed, please refer to the method portion of the embodiment of the present invention. The computer device 8 is generally a computer device with a high processing capability, such as a server.
Referring to fig. 8, the computer device 8 includes: a power supply 810, a memory 820, a processor 830, a wired or wireless network interface 840, and computer programs stored in the memory and executable on the processor. The processor, when executing the computer program, implements the steps of the above-described embodiment of the icon-based phishing page identification method, such as steps 201 to 206 shown in fig. 2. Alternatively, the processor, when executing the computer program, implements the functions of each module or unit in the above-described device embodiments.
In some embodiments of the present invention, the processor is specifically configured to implement the following steps:
respectively collecting the icons corresponding to all the safe login pages in a preset safe white list to form a first icon set;
respectively extracting features from the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets;
acquiring the icon of the login page to be detected, and extracting features according to the preset algorithm to generate a corresponding second feature vector set;
judging the similarity between the login page to be detected and each safe login page from at least two dimensions according to the second characteristic vector set and the standard characteristic vector set, and distributing corresponding weight values for the login page to be detected according to the judgment result of each dimension;
and counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a fourth preset threshold value, judging that the login page to be detected is a phishing page.
Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:
if a third feature vector set exists among the standard feature vector sets corresponding to the icon set such that the number of successfully matched feature vectors between the second feature vector set and the third feature vector set is not less than a first preset threshold value, distributing a first weight value for the login page to be detected, wherein two feature vectors are judged to be successfully matched when their similarity is greater than a second preset threshold value;
and judging whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and distributing a second weight to the login page to be detected according to the judgment result.
Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:
and judging whether the similarity of the texture features of the icon corresponding to the login page corresponding to the third feature vector set and the icon of the login page to be detected is greater than a fourth preset threshold value, and distributing a third weight to the login page to be detected according to the judgment result.
Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:
acquiring page information of a page to be detected, wherein the page information at least comprises a hypertext markup language (HTML) file corresponding to the page to be detected;
extracting a file text and a Uniform Resource Locator (URL) address from an HTML file;
counting the number of preset key words contained in the file text of the HTML file;
judging whether the ratio of the number of URL addresses in the HTML file to the number of bytes in the text of the file is larger than a fifth preset threshold value or not, and distributing a fourth weight to the page to be detected according to the judgment result;
judging whether the number of preset key words contained in the file text in the HTML file is larger than a sixth preset threshold value or not, and distributing a fifth weight to the page to be detected according to the judgment result;
and counting the sum of the weights obtained by the page to be detected in each detection process, and if the sum of the weights is not less than a seventh preset threshold value, judging that the page to be detected is the login page to be detected.
Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:
inputting the screenshot of the login page into a preset Convolutional Neural Network (CNN) classifier model for classification, and distributing a sixth weight to the page to be detected according to the classification result.
The computer device 8 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. Illustratively, a computer program may be partitioned into one or more modules/units, which are stored in a memory and executed by a processor. One or more modules/units may be a series of computer program instruction segments capable of performing certain functions, the instruction segments being used to describe the execution of a computer program in a computer device.
Those skilled in the art will appreciate that the configuration shown in fig. 8 does not constitute a limitation of the computer apparatus 8, that the computer apparatus 8 may comprise more or less components than those shown, or some components may be combined, or a different arrangement of components, e.g. the computer apparatus may further comprise input-output devices, buses, etc.
The Processor may be a Central Processing Unit (CPU), another general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the overall computer device by means of various interfaces and lines.
The memory may be used to store computer programs and/or modules, and the processor implements various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the device (such as audio data, a phonebook, etc.), and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
The present invention also provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor, performs the steps of:
respectively collecting the icons corresponding to all the safe login pages in a preset safe white list to form a first icon set;
respectively extracting features from the icons in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in an associated manner to form respective standard feature vector sets;
acquiring the icon of the login page to be detected, and extracting features according to the preset algorithm to generate a corresponding second feature vector set;
judging the similarity between the login page to be detected and each safe login page from at least two dimensions according to the second characteristic vector set and the standard characteristic vector set, and distributing corresponding weight values for the login page to be detected according to the judgment result of each dimension;
and counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a fourth preset threshold value, judging that the login page to be detected is a phishing page.
Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:
if a third feature vector set exists among the standard feature vector sets corresponding to the icon set such that the number of successfully matched feature vectors between the second feature vector set and the third feature vector set is not less than a first preset threshold value, distributing a first weight value for the login page to be detected, wherein two feature vectors are judged to be successfully matched when their similarity is greater than a second preset threshold value;
and judging whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and distributing a second weight to the login page to be detected according to the judgment result.
Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:
and judging whether the similarity of the texture features of the icon corresponding to the login page corresponding to the third feature vector set and the icon of the login page to be detected is greater than a fourth preset threshold value, and distributing a third weight to the login page to be detected according to the judgment result.
Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:
acquiring page information of a page to be detected, wherein the page information at least comprises a hypertext markup language (HTML) file corresponding to the page to be detected;
extracting a file text and Uniform Resource Locator (URL) addresses from the HTML file;
counting the number of preset keywords contained in the file text of the HTML file;
judging whether the ratio of the number of URL addresses in the HTML file to the number of bytes of the file text is greater than a fifth preset threshold, and assigning a fourth weight to the page to be detected according to the judgment result;
judging whether the number of preset keywords contained in the file text of the HTML file is greater than a sixth preset threshold, and assigning a fifth weight to the page to be detected according to the judgment result;
and counting the sum of the weights obtained by the page to be detected in each detection process, and if the sum of the weights is not less than a seventh preset threshold, judging that the page to be detected is a login page to be detected.
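The login page pre-screening steps above can be sketched with Python's standard `html.parser` module. This is a minimal sketch under stated assumptions: URLs are taken from `href`, `src` and `action` attributes, the file text is the visible text outside `<script>` and `<style>` elements, and each weight is added when its ratio or count exceeds the threshold. The names `login_page_score` and `is_login_page`, the keyword list, and all threshold and weight values are hypothetical.

```python
from html.parser import HTMLParser


class _PageStats(HTMLParser):
    """Collects URL attribute values and visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.text_parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        for name, value in attrs:
            if name in ("href", "src", "action") and value:
                self.urls.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def login_page_score(html, keywords, fifth_threshold, sixth_threshold,
                     fourth_weight=1.0, fifth_weight=1.0):
    """Sum of the fourth and fifth weights obtained by the page."""
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    text = " ".join(" ".join(stats.text_parts).split())  # collapse whitespace
    text_bytes = len(text.encode("utf-8"))
    lowered = text.lower()
    keyword_hits = sum(lowered.count(k.lower()) for k in keywords)
    score = 0.0
    # Fourth weight: ratio of URL count to text byte count.
    if text_bytes and len(stats.urls) / text_bytes > fifth_threshold:
        score += fourth_weight
    # Fifth weight: number of preset keywords found in the text.
    if keyword_hits > sixth_threshold:
        score += fifth_weight
    return score


def is_login_page(html, keywords, fifth_threshold=0.05, sixth_threshold=1,
                  seventh_threshold=2.0):
    """Judge the page a login page when the weight sum reaches the seventh threshold."""
    score = login_page_score(html, keywords, fifth_threshold, sixth_threshold)
    return score >= seventh_threshold
```

With these illustrative defaults, a short form page containing several links and the words "username" and "password" reaches the seventh threshold, while a long text-only page does not.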
Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:
inputting the login page screenshot into a preset convolutional neural network (CNN) classifier model for classification, and assigning a sixth weight to the page to be detected according to the classification result.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a logical division, and other divisions are possible in an actual implementation; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A phishing page identification method based on icons, characterized by comprising the following steps:
respectively collecting the icons corresponding to the safe login pages in a preset security whitelist to form a first icon set;
extracting features from each icon in the first icon set according to a preset algorithm, and storing the feature vectors of each icon in association to form a respective standard feature vector set;
acquiring the icon of the login page to be detected, and extracting features from that icon according to the preset algorithm to generate a corresponding second feature vector set;
judging the similarity between the login page to be detected and each safe login page from at least two dimensions according to the second feature vector set and the standard feature vector sets, and assigning corresponding weights to the login page to be detected according to the judgment result of each dimension;
and counting the sum of the weights obtained by the login page to be detected, and if the sum of the weights is not less than a third preset threshold, judging that the login page to be detected is a phishing page.
2. The method according to claim 1, wherein the determining the similarity between the login page to be detected and each secure login page from at least two dimensions, and assigning the corresponding weight to the login page to be detected according to the determination result of each dimension comprises:
if a third feature vector set exists among the standard feature vector sets corresponding to the icon set, such that the number of successfully matched feature vectors between the second feature vector set and the third feature vector set is not less than a first preset threshold, assigning a first weight to the login page to be detected, wherein two feature vectors are judged to be successfully matched when their similarity is greater than a second preset threshold;
and judging whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and assigning a second weight to the login page to be detected according to the judgment result.
3. The method according to claim 2, further comprising, before counting the sum of the weights obtained by the login page to be detected:
judging whether the similarity between the texture features of the icon of the login page corresponding to the third feature vector set and the texture features of the icon of the login page to be detected is greater than a fourth preset threshold, and assigning a third weight to the login page to be detected according to the judgment result.
4. The method according to claim 3, further comprising, before acquiring the icon of the login page to be detected:
acquiring page information of a page to be detected, wherein the page information at least comprises a hypertext markup language (HTML) file corresponding to the page to be detected;
extracting a file text and a Uniform Resource Locator (URL) address from the HTML file;
counting the number of preset keywords contained in the file text of the HTML file;
judging whether the ratio of the number of URL addresses in the HTML file to the number of bytes of the file text is greater than a fifth preset threshold, and assigning a fourth weight to the page to be detected according to the judgment result;
judging whether the number of preset keywords contained in the file text of the HTML file is greater than a sixth preset threshold, and assigning a fifth weight to the page to be detected according to the judgment result;
and counting the sum of the weights obtained by the page to be detected in each detection process, and if the sum of the weights is not less than a seventh preset threshold, judging that the page to be detected is a login page to be detected.
5. The method according to claim 4, wherein the page information of the page to be detected further comprises a login page screenshot, and the method further comprises, before counting the sum of the weights obtained by the page to be detected:
inputting the login page screenshot into a preset convolutional neural network (CNN) classifier model for classification, and assigning a sixth weight to the page to be detected according to the classification result.
6. A server, comprising:
the first acquisition module is configured to respectively collect the icons corresponding to the safe login pages in a preset security whitelist to form a first icon set;
the calculation module is configured to extract features from each icon in the first icon set according to a preset algorithm, and to store the feature vectors of each icon in association to form a respective standard feature vector set;
the second acquisition module is configured to acquire the icon of the login page to be detected, and to extract features from that icon according to the preset algorithm to generate a corresponding second feature vector set;
the assignment module is configured to judge the similarity between the icon corresponding to the second feature vector set and the icons corresponding to the standard feature vector sets from at least two dimensions, and to assign corresponding weights to the login page to be detected according to the judgment result of each dimension;
and the first counting module is configured to count the sum of the weights obtained by the login page to be detected and, if the sum of the weights is not less than a third preset threshold, to judge that the login page to be detected is a phishing page.
7. The server of claim 6, wherein the assignment module comprises:
the first assignment unit is configured to assign a first weight to the login page to be detected if a third feature vector set exists among the standard feature vector sets corresponding to the icon set, such that the number of successfully matched feature vectors between the second feature vector set and the third feature vector set is not less than a first preset threshold, wherein two feature vectors are judged to be successfully matched when their similarity is greater than a second preset threshold;
and the second assignment unit is configured to judge whether the domain name corresponding to the third feature vector set is the same as the domain name corresponding to the login page to be detected, and to assign a second weight to the login page to be detected according to the judgment result.
8. The server of claim 7, further comprising:
the third acquisition module is configured to acquire page information of a page to be detected, wherein the page information at least comprises a hypertext markup language (HTML) file corresponding to the page to be detected;
the extraction module is configured to extract a file text and Uniform Resource Locator (URL) addresses from the HTML file;
the second counting module is configured to count the number of preset keywords contained in the file text of the HTML file;
the fourth assignment module is configured to judge whether the ratio of the number of URL addresses in the HTML file to the number of bytes of the file text is greater than a fifth preset threshold, and to assign a fourth weight to the page to be detected according to the judgment result;
the fifth assignment module is configured to judge whether the number of preset keywords contained in the file text of the HTML file is greater than a sixth preset threshold, and to assign a fifth weight to the page to be detected according to the judgment result;
and the third counting module is configured to count the sum of the weights obtained by the page to be detected in each detection process and, if the sum of the weights is not less than a seventh preset threshold, to judge that the page to be detected is a login page to be detected.
9. A computer device, characterized in that the computer device comprises a processor configured to implement the steps of the method according to any one of claims 1 to 5 when executing a computer program stored in a memory.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
CN201810671754.8A 2018-06-26 2018-06-26 Fishing page identification method based on icon and related equipment Pending CN110650108A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810671754.8A CN110650108A (en) 2018-06-26 2018-06-26 Fishing page identification method based on icon and related equipment

Publications (1)

Publication Number Publication Date
CN110650108A true CN110650108A (en) 2020-01-03

Family

ID=68988852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810671754.8A Pending CN110650108A (en) 2018-06-26 2018-06-26 Fishing page identification method based on icon and related equipment

Country Status (1)

Country Link
CN (1) CN110650108A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102647422A (en) * 2012-04-10 2012-08-22 中国科学院计算机网络信息中心 Phishing website detection method and device
CN102708186A (en) * 2012-05-11 2012-10-03 上海交通大学 Identification method of phishing sites
CN103281320A (en) * 2013-05-23 2013-09-04 中国科学院计算机网络信息中心 Website icon matching-based detection method for brand counterfeit websites
CN105138921A (en) * 2015-08-18 2015-12-09 中南大学 Phishing site target domain name identification method based on page feature matching
US20170126729A1 (en) * 2015-10-29 2017-05-04 Duo Security, Inc. Methods and systems for implementing a phishing assessment
CN105530251A (en) * 2015-12-14 2016-04-27 深圳市深信服电子科技有限公司 Method and device for identifying phishing website
CN105763543A (en) * 2016-02-03 2016-07-13 百度在线网络技术(北京)有限公司 Phishing site identification method and device
CN107204960A (en) * 2016-03-16 2017-09-26 阿里巴巴集团控股有限公司 Web page identification method and device, server
CN106453351A (en) * 2016-10-31 2017-02-22 重庆邮电大学 Financial fishing webpage detection method based on Web page characteristics

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463730A (en) * 2021-07-15 2022-05-10 荣耀终端有限公司 Page identification method and terminal equipment
CN114448664A (en) * 2021-12-22 2022-05-06 深信服科技股份有限公司 Phishing webpage identification method and device, computer equipment and storage medium
CN114448664B (en) * 2021-12-22 2024-01-02 深信服科技股份有限公司 Method and device for identifying phishing webpage, computer equipment and storage medium
CN114465811A (en) * 2022-03-09 2022-05-10 北京华云安信息技术有限公司 Website login determination method and device, electronic equipment and storage medium
CN114465811B (en) * 2022-03-09 2023-05-23 北京华云安信息技术有限公司 Website login determination method and device, electronic equipment and storage medium
CN115801455A (en) * 2023-01-31 2023-03-14 北京微步在线科技有限公司 Website fingerprint-based counterfeit website detection method and device
CN115801455B (en) * 2023-01-31 2023-05-26 北京微步在线科技有限公司 Method and device for detecting counterfeit website based on website fingerprint

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200103)