CN108650260B - Malicious website identification method and device - Google Patents


Info

Publication number
CN108650260B
Authority
CN
China
Prior art keywords
website
identified
identification
preset
malicious
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810438563.7A
Other languages
Chinese (zh)
Other versions
CN108650260A (en)
Inventor
李小勇
张家桦
李继蕊
苑洁
高云全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201810438563.7A
Publication of CN108650260A
Application granted
Publication of CN108650260B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the invention provide a method and a device for identifying a malicious website. Website information of a website to be identified is acquired, the website information comprising a website identification of the website to be identified. The website identification is input into a pre-trained recognition model, and an identification result of the website to be identified is determined according to the output result of the recognition model. The recognition model is obtained by training according to initial data and oversampled data: the initial data comprise website identifications of preset comparison websites, the comparison websites comprise preset malicious websites and preset non-malicious websites, and the oversampled data are obtained by processing the initial data according to a preset oversampling algorithm. This processing balances the training data, which improves the recognition precision of the recognition model and, in turn, the accuracy of malicious website identification.

Description

Malicious website identification method and device
Technical Field
The invention relates to the technical field of internet, in particular to a method and a device for identifying a malicious website.
Background
The rapid development of the internet brings convenience to people; for example, users can download various data and shop through the internet. Meanwhile, cyber crimes are becoming more and more frequent. Lawbreakers often impersonate banks, e-commerce websites or social websites to send fraudulent information to users and induce them to log in to malicious websites, and then steal the users' information, causing economic loss to the users.
To solve the above problem, the prior art generally uses a machine-learning-based method to identify malicious websites. Specifically, training data are constructed from web page features of known malicious websites and non-malicious websites, for example the ICP (Internet Content Provider) certificate number of the web page, the number of hyperlinks in the web page, the number of empty links in the web page, and whether the web page contains a form. A preset recognition model is trained on these training data, and the trained recognition model is then used to judge whether a website to be identified is a malicious website.
However, in the prior art the training data are often unbalanced, which lowers the recognition precision of the recognition model and therefore reduces the accuracy of malicious website identification.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for identifying a malicious website, which can improve the accuracy of identifying the malicious website. The specific technical scheme is as follows:
In a first aspect, to achieve the above object, an embodiment of the present invention discloses a method for identifying a malicious website, where the method includes:
acquiring website information of a website to be identified, wherein the website information comprises a website identification of the website to be identified;
inputting the website identification into a pre-trained recognition model, wherein the recognition model is obtained by training according to initial data and oversampled data, the initial data comprise website identifications of preset comparison websites, the comparison websites comprise preset malicious websites and preset non-malicious websites, and the oversampled data are obtained by processing the initial data according to a preset oversampling algorithm; and
determining the identification result of the website to be identified according to the output result of the recognition model.
Optionally, before the inputting the website identifier into a pre-trained recognition model, the method further includes:
judging whether a website with the same website identification as the website to be identified exists in the comparison website;
if the website with the same website identification as the website to be identified exists in the comparison websites, determining the identification result of the website to be identified according to the website with the same website identification as the website to be identified;
and if the website which is the same as the website identification of the website to be identified does not exist in the comparison websites, executing the step of inputting the website identification to a pre-trained identification model.
Optionally, the website information further includes a domain name of the website to be recognized, and before the step of inputting the website identifier into a pre-trained recognition model, the method further includes:
acquiring a target digital signature corresponding to the domain name of the website to be identified;
judging whether a digital signature with similarity greater than a first preset threshold value to the target digital signature exists in preset malicious digital signatures;
if the digital signature with the similarity larger than a first preset threshold value with the target digital signature exists in the preset malicious digital signature, judging the website to be identified as a malicious website;
and if no digital signature with the similarity greater than a first preset threshold value with the target digital signature exists in the preset malicious digital signatures, executing the step of inputting the website identification to a pre-trained recognition model.
Optionally, the website information further includes a web page image of the website to be recognized, and before the step of inputting the website identifier into a recognition model trained in advance, the method further includes:
acquiring a target image fingerprint corresponding to the webpage image of the website to be identified;
judging whether an image fingerprint with similarity greater than a second preset threshold value with the target image fingerprint exists in preset malicious image fingerprints;
if image fingerprints with similarity greater than a second preset threshold value with the target image fingerprints exist in preset malicious image fingerprints, judging the website to be identified as a malicious website;
and if no image fingerprint whose similarity to the target image fingerprint is greater than the second preset threshold exists in the preset malicious image fingerprints, executing the step of inputting the website identification into a pre-trained recognition model.
Optionally, the method further includes:
and sending the identification result of the website to be identified to a preset terminal so that the terminal displays the identification result of the website to be identified.
In a second aspect, in order to achieve the above object, an embodiment of the present invention discloses an apparatus for identifying a malicious website, where the apparatus includes:
an acquisition module, configured to acquire website information of a website to be identified, wherein the website information comprises a website identification of the website to be identified;
the first processing module is used for inputting the website identification to a pre-trained recognition model, wherein the recognition model is obtained by training according to initial data and oversampling data, the initial data comprises a preset website identification of a comparison website, the comparison website comprises a preset malicious website and a preset non-malicious website, and the oversampling data is obtained by processing the initial data according to a preset oversampling algorithm;
and the determining module is used for determining the identification result of the website to be identified according to the output result of the identification model.
Optionally, the apparatus further comprises:
the second processing module is used for judging whether websites with the same website identification as the website to be identified exist in the comparison websites;
if the website with the same website identification as the website to be identified exists in the comparison websites, determining the identification result of the website to be identified according to the website with the same website identification as the website to be identified;
and if the website with the same website identification as the website to be identified does not exist in the comparison websites, triggering the first processing module.
Optionally, the website information further includes a domain name of the website to be identified,
the second processing module is further configured to obtain a target digital signature corresponding to the domain name of the website to be identified;
judging whether a digital signature with similarity greater than a first preset threshold value to the target digital signature exists in preset malicious digital signatures;
if the digital signature with the similarity larger than a first preset threshold value with the target digital signature exists in the preset malicious digital signature, judging the website to be identified as a malicious website;
and if the preset malicious digital signature does not have a digital signature with the similarity to the target digital signature larger than a first preset threshold value, triggering the first processing module.
Optionally, the website information further includes a webpage image of the website to be identified,
the second processing module is further configured to acquire a target image fingerprint corresponding to the web page image of the website to be identified;
judging whether an image fingerprint with similarity greater than a second preset threshold value with the target image fingerprint exists in preset malicious image fingerprints;
if image fingerprints with similarity greater than a second preset threshold value with the target image fingerprints exist in preset malicious image fingerprints, judging the website to be identified as a malicious website;
and if no image fingerprint whose similarity to the target image fingerprint is greater than the second preset threshold exists in the preset malicious image fingerprints, triggering the first processing module.
Optionally, the apparatus further comprises:
and the sending module is used for sending the identification result of the website to be identified to a preset terminal so that the terminal displays the identification result of the website to be identified.
In another aspect of the present invention, in order to achieve the above object, an embodiment of the present invention discloses a terminal, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement any of the above method steps when executing the program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform any of the method steps described above.
In yet another aspect of the present invention, the present invention also provides a computer program product containing instructions which, when executed on a computer, cause the computer to perform any of the method steps described above.
With the method and the device for identifying a malicious website provided by the embodiments of the invention, website information of a website to be identified can be acquired, the website information comprising a website identification of the website to be identified. The website identification is input into a pre-trained recognition model, and the identification result of the website to be identified is determined according to the output result of the recognition model. The recognition model is obtained by training according to initial data and oversampled data: the initial data comprise website identifications of preset comparison websites, the comparison websites comprise preset malicious websites and preset non-malicious websites, and the oversampled data are obtained by processing the initial data according to a preset oversampling algorithm. This processing balances the training data, which improves the recognition precision of the recognition model and, in turn, the accuracy of malicious website identification.
Of course, it is not necessary for any product or method of practicing the invention to achieve all of the above-described advantages at the same time.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for identifying a malicious website according to an embodiment of the present invention;
Fig. 2 is a flowchart of an example of a method for identifying a malicious website according to an embodiment of the present invention;
Fig. 3 is a structural diagram of an apparatus for identifying malicious websites according to an embodiment of the present invention;
Fig. 4 is a structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the prior art, when a website to be recognized is recognized according to a trained recognition model, the set training data are unbalanced, so that the recognition accuracy of the recognition model is low, and the recognition accuracy of malicious websites is reduced.
In order to solve the above problem, embodiments of the present invention provide a method and an apparatus for identifying a malicious website, which may be applied to an electronic device; the electronic device may be a terminal or a server. The electronic device can acquire website information of a website to be identified, the website information comprising a website identification of the website to be identified. The electronic device then inputs the website identification into a pre-trained recognition model and determines the identification result of the website to be identified according to the output result of the recognition model. The recognition model is obtained by training according to initial data and oversampled data: the initial data comprise website identifications of preset comparison websites, the comparison websites comprise preset malicious websites and preset non-malicious websites, and the oversampled data are obtained by processing the initial data according to a preset oversampling algorithm. This processing balances the training data, which improves the recognition precision of the recognition model and, in turn, the accuracy of malicious website identification.
Referring to fig. 1, fig. 1 is a flowchart of a method for identifying a malicious website according to an embodiment of the present invention, where the method may include the following steps:
S101: acquiring the website information of the website to be identified.
The website information may include a website identifier of the website to be identified, and the website identifier may be a URL (Uniform Resource Locator) corresponding to the website to be identified.
In implementation, for a website to be identified, the electronic device may obtain a website identifier of the website to be identified.
S102: inputting the website identification into a pre-trained recognition model.
The recognition model may be trained according to initial data and oversampled data. Specifically, the recognition model may be a DBN (Deep Belief Network) model or another probabilistic generative model in the prior art. The initial data may include the website identifications of preset comparison websites; the comparison websites include preset malicious websites and preset non-malicious websites, which may be set by technicians based on experience. The oversampled data may be obtained by processing the initial data according to a preset oversampling algorithm, which may be the Borderline-SMOTE (Borderline Synthetic Minority Over-sampling Technique) algorithm or another oversampling algorithm in the prior art. Borderline-SMOTE first determines the boundary minority class data in the initial data and then generates synthetic minority class data (i.e., the oversampled data) from the boundary minority class data. Specifically, the Borderline-SMOTE algorithm can be implemented by the following steps:
The initial data (i.e., the website identifications of the comparison websites) can be denoted by T. Accordingly, the minority class data in the initial data (i.e., the website identifications of the malicious websites) can be denoted by P, the majority class data in the initial data (i.e., the website identifications of the non-malicious websites) can be denoted by N, and these sets can be expressed by formula (1):
$$T = P + N,\quad P = \{p_1, p_2, \dots, p_{pnum}\},\quad N = \{n_1, n_2, \dots, n_{nnum}\} \qquad (1)$$
where pnum denotes the number of samples in the minority class data P (which may be referred to as minority class samples), and nnum denotes the number of samples in the majority class data N (which may be referred to as majority class samples).
For each minority class sample $p_i$ ($i = 1, 2, \dots, pnum$) in the minority class data P, the m nearest neighbors of $p_i$ can be determined in the initial data T. The number of these m neighbors that belong to the majority class data N can be denoted by $m'$ ($0 \le m' \le m$).
If $m' = m$, i.e., all m neighbors of $p_i$ belong to the majority class data N, $p_i$ can be determined to be a noise sample, and no subsequent processing of $p_i$ is needed. If
$$\frac{m}{2} \le m' < m,$$
i.e., among the m neighbors of $p_i$ the number of samples belonging to the majority class data is greater than the number of samples belonging to the minority class data, $p_i$ is easily misclassified and can be taken as a boundary minority class sample. The boundary minority class samples obtained in this way form a set DANGER. If
$$0 \le m' < \frac{m}{2},$$
$p_i$ is determined to be a safe sample, and no subsequent processing of $p_i$ is needed.
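The three cases above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: each website identification is assumed to have already been encoded as a numeric feature vector (the encoding is not specified here), neighbors are ranked by Euclidean distance, the boundary case is taken to be m/2 ≤ m' < m as in the usual Borderline-SMOTE convention, and the function name `classify_minority_sample` is hypothetical.

```python
import math

def classify_minority_sample(i, data, labels, m):
    """Classify minority sample data[i] as 'noise', 'danger' or 'safe'.

    data   -- samples of T as numeric feature vectors (assumed encoding)
    labels -- 1 for minority (malicious), 0 for majority (non-malicious)
    m      -- number of nearest neighbors inspected in T
    """
    p = data[i]
    # Rank every other sample of T by Euclidean distance to p.
    ranked = sorted(
        (math.dist(p, q), lab)
        for j, (q, lab) in enumerate(zip(data, labels))
        if j != i
    )
    # m' = number of majority class samples among the m nearest neighbors.
    m_prime = sum(1 for _, lab in ranked[:m] if lab == 0)
    if m_prime == m:
        return "noise"    # every neighbor is a majority sample
    if m_prime >= m / 2:
        return "danger"   # boundary minority sample, goes into DANGER
    return "safe"         # mostly minority neighbors, left untouched
```

Only the samples reported as `"danger"` would be passed on to the synthesis step.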
The number of boundary minority class samples in the set DANGER can be denoted by dnum, which gives formula (2):
$$DANGER = \{p'_1, p'_2, \dots, p'_{dnum}\},\quad 0 \le dnum \le pnum \qquad (2)$$
For each boundary minority class sample $p'_x$ ($1 \le x \le dnum$) in the set DANGER, the k nearest neighbors of $p'_x$ can be determined in the minority class data P.
For each boundary minority class sample $p'_x$, s samples are randomly selected from its k neighbors in the minority class data P, and the distance $diff_j$ ($j = 1, 2, \dots, s$) between each selected sample and $p'_x$ is calculated. For each distance $diff_j$, a corresponding random number $r_j$ ($j = 1, 2, \dots, s$) between 0 and 1 is generated. Then s synthetic minority class samples can be generated according to formula (3):
$$synthetic_j = p'_x + r_j \times diff_j,\quad j = 1, 2, \dots, s,\quad 1 \le x \le dnum \qquad (3)$$
where $synthetic_j$ denotes a synthetic minority class sample generated from the boundary minority class sample $p'_x$, i.e., the oversampled data.
The initial data T and the oversampled data $synthetic_j$ may then be used together as training data to train the recognition model.
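As an illustration of formula (3), the following Python sketch generates synthetic samples for one boundary minority class sample. It rests on assumptions not stated in the text: samples are numeric feature vectors, neighbors are ranked by Euclidean distance, and diff_j is interpreted as the coordinate-wise difference between the selected neighbor and the boundary sample, so that each synthetic sample lies on the segment joining the two. The name `synthesize` and its parameters are hypothetical.

```python
import math
import random

def synthesize(p_x, minority, k, s, rng=None):
    """Generate up to s synthetic samples for boundary sample p_x.

    minority -- the minority class data P (feature vectors), containing p_x
    k        -- number of nearest minority neighbors considered
    s        -- number of neighbors drawn at random from those k
    """
    rng = rng or random.Random()
    # k nearest neighbors of p_x within the minority class data P.
    neighbors = sorted(
        (q for q in minority if q is not p_x),
        key=lambda q: math.dist(p_x, q),
    )[:k]
    chosen = rng.sample(neighbors, min(s, len(neighbors)))
    synthetic = []
    for q in chosen:
        r = rng.random()  # random number r_j in [0, 1)
        # diff_j, taken here as the vector from p_x to the neighbor (assumption).
        diff = [qc - pc for pc, qc in zip(p_x, q)]
        # Formula (3): synthetic_j = p'_x + r_j * diff_j
        synthetic.append(tuple(pc + r * dc for pc, dc in zip(p_x, diff)))
    return synthetic
```

The training set would then be the initial data T extended with the tuples returned for every sample in DANGER.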
In implementation, the electronic device may input a website identifier of a website to be recognized into a recognition model trained in advance, so as to recognize the website to be recognized.
S103: determining the identification result of the website to be identified according to the output result of the recognition model.
In implementation, the electronic device may determine the identification result of the website to be identified according to the output result of the recognition model. Generally, the output result of the recognition model is the probability that the website to be identified is a malicious website. Specifically, when the output result is greater than a third preset threshold, the electronic device may determine that the website to be identified is a malicious website; when the output result is less than or equal to the third preset threshold, the electronic device may determine that the website to be identified is a non-malicious website.
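The decision rule of S103 can be written as a one-line comparison. In the sketch below the function name `decide` and the default value 0.5 for the third preset threshold are assumptions made only for illustration; the text leaves the threshold value open.

```python
def decide(probability, third_threshold=0.5):
    """Map the model output (probability of being malicious) to a label.

    third_threshold is a hypothetical value for the third preset threshold.
    """
    # Strictly greater than the threshold means malicious; equality does not.
    return "malicious" if probability > third_threshold else "non-malicious"
```

Note that an output exactly equal to the threshold is classified as non-malicious, matching the "less than or equal to" branch above.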
Optionally, before the website to be recognized is recognized according to the recognition model, the website to be recognized may also be recognized directly according to the website identifier of the comparison website. Specifically, before inputting the website identifier into the pre-trained recognition model, the method may further include the following steps: judging whether websites with the same website identification as the website to be identified exist in the comparison websites; if the website which has the same website identification as the website to be identified exists in the comparison websites, determining the identification result of the website to be identified according to the website which has the same website identification as the website to be identified; if no website with the same website identification as the website to be identified exists in the comparison websites, step S102 is executed.
The electronic device may locally store a website blacklist for storing website identifications of malicious websites in comparison websites, and may locally store a website whitelist for storing website identifications of non-malicious websites in comparison websites.
In implementation, after acquiring the website identifier of the website to be identified, the electronic device may directly perform query in a local website blacklist and a local website whitelist to determine whether the website identifier of the website to be identified exists. When the electronic device queries the website identifier of the website to be identified in the website blacklist, the electronic device can determine that the website to be identified is a malicious website. When the electronic device queries the website identifier of the website to be identified in the website white list, the electronic device may determine that the website to be identified is a non-malicious website. Correspondingly, when the electronic device does not inquire the website identification of the website to be identified in the website blacklist and the website whitelist, the electronic device can input the website identification of the website to be identified to the pre-trained identification model so as to further identify the website to be identified.
As can be seen from the above, based on the method for identifying a malicious website in the embodiment of the present invention, the electronic device can directly identify the website to be identified according to the website identifier of the comparison website, and the efficiency of identifying the malicious website can be improved.
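The list lookup followed by the model fallback can be sketched as below. The names `identify`, `blacklist`, `whitelist` and `model` are hypothetical; `model` stands for any callable that wraps steps S102 and S103 and returns a label for a URL.

```python
def identify(url, blacklist, whitelist, model):
    """Check the local lists first and call the model only on a miss.

    blacklist, whitelist -- sets of website identifications (URLs)
    model                -- callable returning a label for a URL (assumed)
    """
    if url in blacklist:
        return "malicious"
    if url in whitelist:
        return "non-malicious"
    # Not a known comparison website: fall back to the recognition model.
    return model(url)
```

Storing both lists as sets keeps each lookup at roughly constant time, which is where the efficiency gain over always running the model comes from.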
Optionally, the electronic device may further identify the website to be identified according to the similarity of the domain names of the websites. Specifically, before the step of inputting the website identifier into the pre-trained recognition model, the method may further include the following steps: acquiring a target digital signature corresponding to a domain name of a website to be identified; judging whether a digital signature with similarity greater than a first preset threshold value to a target digital signature exists in preset malicious digital signatures; if the digital signature with the similarity larger than a first preset threshold value with the target digital signature exists in the preset malicious digital signature, judging the website to be identified as a malicious website; if there is no digital signature with similarity greater than the first preset threshold with the target digital signature in the preset malicious digital signatures, step S102 is executed.
The digital signature (i.e., malicious digital signature) corresponding to the domain name of the malicious website may also be stored in the local website blacklist of the electronic device, and the first preset threshold may be set by a technician according to experience.
In implementation, the electronic device may obtain a digital signature (i.e., a target digital signature) corresponding to the domain name of the website to be identified. Specifically, for the domain name of a given website, the electronic device may generate the short string set corresponding to the domain name according to the K-Shingles method. For example, for the domain name "login.taobao.com" with K = 3, the corresponding short string set is: { "log", "ogi", "gin", "in.", "n.t", ".ta", "tao", "aob", "oba", "bao", "ao.", "o.c", ".co", "com" }. The value of K can be set empirically by a technician. Then, the electronic device may hash the short string set with a hash function and extract the minimum hash (MinHash) to obtain the digital signature corresponding to the domain name. For each malicious digital signature in the website blacklist, the electronic device may calculate the similarity between that malicious digital signature and the target digital signature; specifically, the electronic device may calculate the Jaccard similarity between the two. The electronic device may then determine whether any preset malicious digital signature has a similarity to the target digital signature greater than the first preset threshold. If such a digital signature exists, the electronic device can determine that the website to be identified is a malicious website.
When the electronic device judges that no digital signature with similarity greater than a first preset threshold value to a target digital signature exists in preset malicious digital signatures, the electronic device can input the website identification of the website to be identified to a pre-trained identification model so as to further identify the website to be identified.
As can be seen from the above, based on the method for identifying a malicious website in the embodiment of the present invention, the electronic device may further identify the website to be identified according to the similarity of the domain names of the websites, so that the accuracy of identification can be improved.
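The domain-name comparison described above can be sketched as follows. The sketch assumes several details that the text does not fix: the signature is a list of minimum hash values taken over 64 seeded SHA-256 hashes, and the Jaccard similarity of two signatures is estimated as the fraction of positions at which they agree. All function names are hypothetical, and domain names are assumed to be at least K characters long.

```python
import hashlib

def shingles(domain, k=3):
    """Return the set of all length-k substrings (K-shingles) of a domain name."""
    return {domain[i:i + k] for i in range(len(domain) - k + 1)}

def _seeded_hash(text, seed):
    # One member of a family of hash functions, derived from SHA-256 (assumption).
    digest = hashlib.sha256(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(shingle_set, num_hashes=64):
    """For each hash function keep only the minimum hash over the shingle set."""
    return [min(_seeded_hash(sh, seed) for sh in shingle_set)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(set_a, set_b):
    """Exact Jaccard similarity of two shingle sets, for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)
```

A look-alike domain such as "login.taoba0.com" shares most of its shingles with "login.taobao.com", so its estimated similarity would typically be high; whether it exceeds the first preset threshold depends on the value chosen for that threshold.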
Optionally, the electronic device may further identify the website to be identified according to the similarity of the web page images. Specifically, before the step of inputting the website identification into the pre-trained recognition model, the method may further include the following steps: acquiring a target image fingerprint corresponding to a web page image of the website to be identified; judging whether an image fingerprint whose similarity to the target image fingerprint is greater than a second preset threshold exists in the preset malicious image fingerprints; if such an image fingerprint exists in the preset malicious image fingerprints, judging the website to be identified to be a malicious website; if no such image fingerprint exists in the preset malicious image fingerprints, executing step S102.
The electronic device may further store an image fingerprint (i.e., a malicious image fingerprint) corresponding to the web page image of the malicious website in the local website blacklist, and the second preset threshold may be set by a technician according to experience.
In implementation, the electronic device may acquire an image fingerprint (i.e., a target image fingerprint) corresponding to a web page image of the website to be identified. Specifically, for a given web page image, the electronic device may first take a screenshot of the web page to obtain a screenshot image, and then scale the screenshot image to a preset size, for example, 8 pixels × 8 pixels. The electronic device may calculate the average gray value of the pixels in the scaled screenshot image. For each pixel, when its gray value is greater than or equal to the average gray value, the value of the pixel may be recorded as 1; when its gray value is smaller than the average gray value, the value of the pixel may be recorded as 0. For the screenshot image, the electronic device thus obtains a 64-bit 0/1 string, i.e., the image fingerprint corresponding to the web page image. The electronic device may then judge whether the preset malicious image fingerprints include an image fingerprint whose similarity to the target image fingerprint is greater than a second preset threshold. Specifically, the electronic device may calculate the Hamming distance between the target image fingerprint and each malicious image fingerprint. A Hamming distance smaller than a fourth preset threshold indicates that the similarity between the target image fingerprint and the malicious image fingerprint is greater than the second preset threshold; a Hamming distance greater than the fourth preset threshold indicates that the similarity is less than the second preset threshold.
When the electronic equipment judges that image fingerprints with similarity greater than a second preset threshold value with the target image fingerprints exist in the preset malicious image fingerprints, the electronic equipment can judge that the website to be identified is a malicious website. When the electronic equipment judges that no image fingerprint with the similarity to the target image fingerprint larger than a second preset threshold exists in the preset malicious image fingerprints, the electronic equipment can input the website identification of the website to be identified into a pre-trained identification model so as to further identify the website to be identified.
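The image-fingerprint comparison can be sketched as below. This is an illustrative sketch only: it assumes the screenshot has already been captured, converted to grayscale, and scaled to 8 × 8 by some imaging library, so the input is a plain grid of gray values. The function names and the distance limit of 5 (standing in for the fourth preset threshold) are hypothetical.

```python
def average_hash(gray: list[list[int]]) -> str:
    """Build a 0/1 fingerprint: 1 where a pixel is >= the mean gray value, else 0."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return "".join("1" if p >= mean else "0" for p in pixels)


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length fingerprints differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def matches_malicious_image(target: str, malicious_fingerprints: list[str],
                            max_distance: int = 5) -> bool:
    """True if any malicious fingerprint lies within the distance limit.

    A small Hamming distance corresponds to a high similarity, so
    "distance < max_distance" plays the role of "similarity > threshold".
    """
    return any(hamming_distance(target, fp) < max_distance
               for fp in malicious_fingerprints)
```

Because each fingerprint is only 64 bits for an 8 × 8 grid, comparing a target against a large blacklist stays cheap even with a linear scan.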
Optionally, the identification result of the website to be identified may also be displayed. Specifically, the method may further include the following step: sending the identification result of the website to be identified to a preset terminal so that the terminal displays the identification result of the website to be identified.
Wherein the terminal comprises a display component.
In implementation, the electronic device may send the identification result of the website to be identified to a preset terminal, so that the terminal may display the identification result of the website to be identified through the display component.
Therefore, based on the malicious website identification method implemented by the invention, the electronic equipment can also send the identification result of the website to be identified to the preset terminal, so that the terminal can display the identification result of the website to be identified, and further the user experience can be improved.
Referring to fig. 2, fig. 2 is a flowchart of an example of a method for identifying a malicious website according to an embodiment of the present invention, where the method may include the following steps:
S201: Acquire the website information of the website to be identified.
The website information may include a website identifier of the website to be identified.
S202: Judge whether the comparison websites include a website whose website identifier is the same as that of the website to be identified; if so, execute S203; if not, execute S204.
S203: Determine the identification result of the website to be identified according to the website whose website identifier is the same as that of the website to be identified.
S204: Judge whether the preset malicious digital signatures include a digital signature whose similarity to the target digital signature is greater than a first preset threshold; if so, execute S205; if not, execute S206.
The website information may also include a domain name of the website to be identified, and the target digital signature is the digital signature corresponding to the domain name of the website to be identified.
S205: Judge the website to be identified to be a malicious website.
S206: Judge whether the preset malicious image fingerprints include an image fingerprint whose similarity to the target image fingerprint is greater than a second preset threshold; if so, execute S205; if not, execute S207.
The website information may also include a web page image of the website to be identified, and the target image fingerprint is the image fingerprint corresponding to the web page image of the website to be identified.
S207: Input the website identifier of the website to be identified into a pre-trained identification model.
The identification model may be obtained by training according to initial data and oversampling data. The initial data may include the website identifiers of preset comparison websites, the comparison websites may include preset malicious websites and preset non-malicious websites, and the oversampling data is obtained by processing the initial data according to a preset oversampling algorithm.
S208: Determine the identification result of the website to be identified according to the output result of the identification model.
S209: Send the identification result of the website to be identified to a preset terminal so that the terminal displays the identification result of the website to be identified.
As can be seen from the above, the method for identifying a malicious website according to the embodiment of the present invention can obtain website information of a website to be identified, where the website information includes a website identifier of the website to be identified, and inputs the website identifier into a pre-trained identification model, where the identification model is obtained by training according to initial data and over-sampling data, the initial data includes a website identifier of a preset comparison website, the comparison website includes a preset malicious website and a preset non-malicious website, the over-sampling data is obtained by processing the initial data according to a preset over-sampling algorithm, and an identification result of the website to be identified is determined according to an output result of the identification model. Based on the processing, the training data can be balanced, the recognition precision of the recognition model is improved, and the accuracy of malicious website recognition is further improved.
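The cascade of checks in steps S201 to S209 can be summarized in a short Python sketch. Everything here is a hypothetical illustration: the function and parameter names are invented, the two similarity checks and the trained model are passed in as callables rather than implemented, and the result labels are plain strings chosen for the example.

```python
from typing import Callable, Mapping, Optional

MALICIOUS = "malicious"
NON_MALICIOUS = "non-malicious"


def identify_website(
    site_id: str,
    comparison_sites: Mapping[str, str],
    signature_matches: Callable[[], bool],
    fingerprint_matches: Callable[[], bool],
    model: Callable[[str], str],
    notify: Optional[Callable[[str, str], None]] = None,
) -> str:
    """Run the checks in order and stop at the first one that decides."""
    if site_id in comparison_sites:       # S202 -> S203: exact match on identifier
        result = comparison_sites[site_id]
    elif signature_matches():             # S204 -> S205: similar domain signature
        result = MALICIOUS
    elif fingerprint_matches():           # S206 -> S205: similar page image
        result = MALICIOUS
    else:                                 # S207, S208: fall back to the model
        result = model(site_id)
    if notify is not None:                # S209: send result to the terminal
        notify(site_id, result)
    return result
```

Passing the similarity checks as zero-argument callables keeps them lazy: the image fingerprint is never computed when an earlier step has already decided the result.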
Corresponding to the embodiment of the method in fig. 1, referring to fig. 3, fig. 3 is a block diagram of an apparatus for identifying a malicious website according to an embodiment of the present invention, where the apparatus may include:
an obtaining module 301, configured to obtain website information of a website to be identified, where the website information includes a website identifier of the website to be identified;
the first processing module 302 is configured to input the website identifier into a pre-trained recognition model, where the recognition model is obtained by training according to initial data and over-sampling data, the initial data includes a website identifier of a preset comparison website, the comparison website includes a preset malicious website and a preset non-malicious website, and the over-sampling data is obtained by processing the initial data according to a preset over-sampling algorithm;
and the determining module 303 is configured to determine the recognition result of the website to be recognized according to the output result of the recognition model.
Optionally, the apparatus further comprises:
the second processing module is used for judging whether websites with the same website identification as the website to be identified exist in the comparison websites;
if the website with the same website identification as the website to be identified exists in the comparison websites, determining the identification result of the website to be identified according to the website with the same website identification as the website to be identified;
if there is no website with the same website identification as the website to be identified in the comparison websites, the first processing module 302 is triggered.
Optionally, the website information further includes a domain name of the website to be identified,
the second processing module is further configured to obtain a target digital signature corresponding to the domain name of the website to be identified;
judging whether a digital signature with similarity greater than a first preset threshold value to the target digital signature exists in preset malicious digital signatures;
if the digital signature with the similarity larger than a first preset threshold value with the target digital signature exists in the preset malicious digital signature, judging the website to be identified as a malicious website;
if there is no digital signature with similarity greater than a first preset threshold in the preset malicious digital signatures, the first processing module 302 is triggered.
Optionally, the website information further includes a webpage image of the website to be identified,
the second processing module is further configured to acquire a target image fingerprint corresponding to the web page image of the website to be identified;
judging whether an image fingerprint with similarity greater than a second preset threshold value with the target image fingerprint exists in preset malicious image fingerprints;
if image fingerprints with similarity greater than a second preset threshold value with the target image fingerprints exist in preset malicious image fingerprints, judging the website to be identified as a malicious website;
if there is no image fingerprint in the preset malicious image fingerprints whose similarity of the target image fingerprint is greater than a second preset threshold, the first processing module 302 is triggered.
Optionally, the apparatus further comprises:
and the sending module is used for sending the identification result of the website to be identified to a preset terminal so that the terminal displays the identification result of the website to be identified.
As can be seen from the above, the malicious website recognition device according to the embodiment of the present invention can obtain website information of a to-be-recognized website, where the website information includes a website identifier of the to-be-recognized website, and inputs the website identifier into a recognition model trained in advance, where the recognition model is obtained by training according to initial data and oversampled data, the initial data includes a website identifier of a preset comparison website, the comparison website includes a preset malicious website and a preset non-malicious website, the oversampled data is obtained by processing the initial data according to a preset oversampling algorithm, and a recognition result of the to-be-recognized website is determined according to an output result of the recognition model. Based on the processing, the training data can be balanced, the recognition precision of the recognition model is improved, and the accuracy of malicious website recognition is further improved.
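To illustrate how oversampling can balance the training data, the following Python sketch generates synthetic minority-class samples by interpolating between existing ones. This passage only refers to "a preset oversampling algorithm", so the interpolation scheme below is an assumption chosen for illustration. It is a simplified SMOTE-style approach that picks a random minority peer rather than a nearest neighbor, and the function name and parameters are hypothetical.

```python
import random


def oversample_minority(minority: list[list[float]], target_count: int,
                        seed: int = 0) -> list[list[float]]:
    """Return synthetic samples so that minority + synthetic reaches target_count.

    Each synthetic sample lies on the line segment between two randomly
    chosen minority samples (simplified SMOTE-style interpolation).
    """
    if not minority:
        raise ValueError("at least one minority sample is required")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    synthetic: list[list[float]] = []
    while len(minority) + len(synthetic) < target_count:
        a = rng.choice(minority)
        b = rng.choice(minority)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic
```

In a setting like the one described, the minority class would typically be the malicious websites, and `target_count` would be set to the number of non-malicious samples so that both classes contribute equally during training.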
The embodiment of the present invention further provides a terminal, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 complete mutual communication through the communication bus 404,
a memory 403 for storing a computer program;
the processor 401, when executing the program stored in the memory 403, implements the following steps:
acquiring website information of a website to be identified, wherein the website information comprises a website identification of the website to be identified;
inputting the website identification into a pre-trained identification model, wherein the identification model is obtained by training according to initial data and oversampling data, the initial data comprises a website identification of a preset comparison website, the comparison website comprises a preset malicious website and a preset non-malicious website, and the oversampling data is obtained by processing the initial data according to a preset oversampling algorithm;
and determining the identification result of the website to be identified according to the output result of the identification model.
The communication bus mentioned in the above embodiments may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The memory may include a Random Access Memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, there is also provided a computer-readable storage medium, having stored therein instructions, which when run on a computer, cause the computer to execute the method for identifying a malicious website described in any one of the above embodiments.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method for identifying malicious websites according to any one of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, wireless, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, the electronic device, the computer-readable storage medium, and the computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.

Claims (6)

1. A method for identifying a malicious website, the method comprising:
acquiring website information of a website to be identified, wherein the website information comprises a website identification of the website to be identified;
inputting the website identification into a pre-trained identification model, wherein the identification model is obtained by training according to initial data and oversampling data, the initial data comprises a website identification of a preset comparison website, the comparison website comprises a preset malicious website and a preset non-malicious website, and the oversampling data is obtained by processing the initial data according to a preset oversampling algorithm;
determining the identification result of the website to be identified according to the output result of the identification model;
the website information further includes a domain name of the website to be recognized, and before the website identification is input into a recognition model trained in advance, the method further includes:
acquiring a target digital signature corresponding to the domain name of the website to be identified;
judging whether a digital signature with similarity greater than a first preset threshold value to the target digital signature exists in preset malicious digital signatures;
if the digital signature with the similarity larger than a first preset threshold value with the target digital signature exists in the preset malicious digital signature, judging the website to be identified as a malicious website;
if no digital signature with the similarity greater than a first preset threshold value with the target digital signature exists in preset malicious digital signatures, the step of inputting the website identification into a pre-trained recognition model is executed;
the acquiring of the target digital signature corresponding to the domain name of the website to be identified includes:
generating a short character string set corresponding to the domain name of the website to be identified according to a K-shingles method;
carrying out hash processing on the short character string set according to a hash function;
taking the obtained minimum hash as a target digital signature corresponding to the domain name of the website to be identified;
the website information further comprises a webpage image of the website to be recognized, and before the website identification is input into a pre-trained recognition model, the method further comprises the following steps:
acquiring a target image fingerprint corresponding to the webpage image of the website to be identified;
judging whether an image fingerprint with similarity greater than a second preset threshold value with the target image fingerprint exists in preset malicious image fingerprints;
if image fingerprints with similarity greater than a second preset threshold value with the target image fingerprints exist in preset malicious image fingerprints, judging the website to be identified as a malicious website;
and if no image fingerprint with the similarity of the target image fingerprint larger than a second preset threshold exists in the preset malicious image fingerprints, the step of inputting the website identification into a pre-trained recognition model is executed.
2. The method of claim 1, wherein prior to said inputting said website identification into a pre-trained recognition model, said method further comprises:
judging whether a website with the same website identification as the website to be identified exists in the comparison website;
if the website with the same website identification as the website to be identified exists in the comparison websites, determining the identification result of the website to be identified according to the website with the same website identification as the website to be identified;
and if the website which is the same as the website identification of the website to be identified does not exist in the comparison websites, executing the step of inputting the website identification to a pre-trained identification model.
3. The method of claim 1, further comprising:
and sending the identification result of the website to be identified to a preset terminal so that the terminal displays the identification result of the website to be identified.
4. An apparatus for identifying malicious websites, the apparatus comprising:
an acquisition module, used for acquiring website information of a website to be identified, wherein the website information comprises a website identification of the website to be identified;
the first processing module is used for inputting the website identification to a pre-trained recognition model, wherein the recognition model is obtained by training according to initial data and oversampling data, the initial data comprises a preset website identification of a comparison website, the comparison website comprises a preset malicious website and a preset non-malicious website, and the oversampling data is obtained by processing the initial data according to a preset oversampling algorithm;
the determining module is used for determining the identification result of the website to be identified according to the output result of the identification model;
the website information further includes a domain name of the website to be identified,
the second processing module is used for acquiring a target digital signature corresponding to the domain name of the website to be identified;
judging whether a digital signature with similarity greater than a first preset threshold value to the target digital signature exists in preset malicious digital signatures;
if the digital signature with the similarity larger than a first preset threshold value with the target digital signature exists in the preset malicious digital signature, judging the website to be identified as a malicious website;
if no digital signature with the similarity greater than a first preset threshold value with the target digital signature exists in the preset malicious digital signatures, triggering the first processing module;
the second processing module is specifically configured to generate a short character string set corresponding to the domain name of the website to be identified according to a K-shingles method;
carrying out hash processing on the short character string set according to a hash function;
taking the obtained minimum hash as a target digital signature corresponding to the domain name of the website to be identified;
the website information further includes a web page image of the website to be identified,
the second processing module is further configured to acquire a target image fingerprint corresponding to the web page image of the website to be identified;
judging whether an image fingerprint with similarity greater than a second preset threshold value with the target image fingerprint exists in preset malicious image fingerprints;
if image fingerprints with similarity greater than a second preset threshold value with the target image fingerprints exist in preset malicious image fingerprints, judging the website to be identified as a malicious website;
and if no image fingerprint with the similarity of the target image fingerprint larger than a second preset threshold exists in the preset malicious image fingerprints, triggering the first processing module.
5. The apparatus according to claim 4, wherein the second processing module is further configured to determine whether there is a website with the same website identifier as the website to be identified in the comparison website;
if the website with the same website identification as the website to be identified exists in the comparison websites, determining the identification result of the website to be identified according to the website with the same website identification as the website to be identified;
and if the website with the same website identification as the website to be identified does not exist in the comparison websites, triggering the first processing module.
6. The apparatus of claim 4, further comprising:
and the sending module is used for sending the identification result of the website to be identified to a preset terminal so that the terminal displays the identification result of the website to be identified.
CN201810438563.7A 2018-05-09 2018-05-09 Malicious website identification method and device Active CN108650260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810438563.7A CN108650260B (en) 2018-05-09 2018-05-09 Malicious website identification method and device

Publications (2)

Publication Number Publication Date
CN108650260A CN108650260A (en) 2018-10-12
CN108650260B true CN108650260B (en) 2021-10-15

Family

ID=63754033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810438563.7A Active CN108650260B (en) 2018-05-09 2018-05-09 Malicious website identification method and device

Country Status (1)

Country Link
CN (1) CN108650260B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259207A (en) * 2018-11-30 2020-06-09 阿里巴巴集团控股有限公司 Short message identification method, device and equipment
CN110334262B (en) * 2019-06-06 2023-12-29 创新先进技术有限公司 Model training method and device and electronic equipment
CN110225030B (en) * 2019-06-10 2021-09-28 福州大学 Malicious domain name detection method and system based on RCNN-SPP network
CN110392056A (en) * 2019-07-24 2019-10-29 成都积微物联集团股份有限公司 A kind of the Internet of Things malware detection system and method for lightweight
CN111224941B (en) * 2019-11-19 2020-12-04 北京邮电大学 Threat type identification method and device
CN113079123B (en) * 2020-01-03 2022-11-22 中国移动通信集团广东有限公司 Malicious website detection method and device and electronic equipment
CN111259219B (en) * 2020-01-10 2023-04-21 北京金睛云华科技有限公司 Malicious webpage identification model establishment method, malicious webpage identification method and malicious webpage identification system
CN111626309A (en) * 2020-05-26 2020-09-04 北京墨云科技有限公司 Website fingerprint identification method based on deep learning
CN114124564B (en) * 2021-12-03 2023-11-28 北京天融信网络安全技术有限公司 Method and device for detecting counterfeit website, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999638A (en) * 2013-01-05 2013-03-27 南京邮电大学 Phishing website detection method excavated based on network group

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080034428A1 (en) * 2006-07-17 2008-02-07 Yahoo! Inc. Anti-phishing for client devices
CN105279272A (en) * 2015-10-30 2016-01-27 南京未来网络产业创新有限公司 Content aggregation method based on distributed web crawlers

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Phishing Detection Method Based on Borderline-Smote Deep Belief Network";Jiahua Zhang等;《International Conference on Security,Privacy and Anonymity in Computation,Communication and Storage,SpaCCS 2017》;20171209;正文第2-3节 *
"基于图像感知哈希技术的钓鱼网页检测";周国强等;《南京邮电大学学报(自然科学版)》;20130110;正文第3-4节 *
"基于图像相似性的钓鱼网站检测";卢康等;《技研学术》;20160426;摘要,正文第2-3节 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant