CN108650260B

CN108650260B - Malicious website identification method and device

Info

Publication number: CN108650260B
Application number: CN201810438563.7A
Authority: CN
Inventors: 李小勇; 张家桦; 李继蕊; 苑洁; 高云全
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2018-05-09
Filing date: 2018-05-09
Publication date: 2021-10-15
Anticipated expiration: 2038-05-09
Also published as: CN108650260A

Abstract

The embodiment of the invention provides a method and a device for identifying a malicious website, which can acquire website information of the website to be identified, wherein the website information comprises a website identification of the website to be identified, the website identification is input into a pre-trained identification model, the identification model is obtained by training according to initial data and over-sampling data, the initial data comprises a website identification of a preset comparison website, the comparison website comprises a preset malicious website and a preset non-malicious website, the over-sampling data is obtained by processing the initial data according to a preset over-sampling algorithm, and an identification result of the website to be identified is determined according to an output result of the identification model. Based on the processing, the training data can be balanced, the recognition precision of the recognition model is improved, and the accuracy of malicious website recognition is further improved.

Description

Malicious website identification method and device

Technical Field

The invention relates to the technical field of internet, in particular to a method and a device for identifying a malicious website.

Background

The rapid development of the internet brings convenience to people, for example, users can download various data through the internet and can also shop through the internet. Meanwhile, various cyber crimes are more and more frequent. Lawbreakers often impersonate banks, e-commerce or social websites to send fraud information to users, induce users to log in malicious websites, and further steal the information of the users, resulting in economic loss of the users.

In order to solve the above problems, the prior art generally identifies malicious websites with methods based on machine learning. Specifically, training data are constructed from web page features of known malicious websites and non-malicious websites, for example the ICP (Internet Content Provider) certificate number of a web page, the number of hyperlinks in the web page, the number of empty links in the web page, and whether the web page contains a form. A preset identification model is trained on these data, and the trained identification model is then used to identify a website to be identified and to judge whether it is a malicious website.

However, in the prior art, training data are often unbalanced, which may result in low recognition accuracy of the recognition model, and further reduce accuracy of malicious website recognition.

Disclosure of Invention

The embodiment of the invention aims to provide a method and a device for identifying a malicious website, which can improve the accuracy of identifying the malicious website. The specific technical scheme is as follows:

in a first aspect, to achieve the above object, an embodiment of the present invention discloses a method for identifying a malicious website, where the method includes:

acquiring website information of a website to be identified, wherein the website information comprises a website identification of the website to be identified;

inputting the website identification into a pre-trained identification model, wherein the identification model is obtained by training according to initial data and oversampling data, the initial data comprises a website identification of a preset comparison website, the comparison website comprises a preset malicious website and a preset non-malicious website, and the oversampling data is obtained by processing the initial data according to a preset oversampling algorithm;

and determining the identification result of the website to be identified according to the output result of the identification model.

Optionally, before the inputting the website identifier into a pre-trained recognition model, the method further includes:

judging whether a website with the same website identification as the website to be identified exists in the comparison website;

if the website with the same website identification as the website to be identified exists in the comparison websites, determining the identification result of the website to be identified according to the website with the same website identification as the website to be identified;

and if the website which is the same as the website identification of the website to be identified does not exist in the comparison websites, executing the step of inputting the website identification to a pre-trained identification model.

Optionally, the website information further includes a domain name of the website to be recognized, and before the step of inputting the website identifier into a pre-trained recognition model, the method further includes:

acquiring a target digital signature corresponding to the domain name of the website to be identified;

judging whether a digital signature with similarity greater than a first preset threshold value to the target digital signature exists in preset malicious digital signatures;

if the digital signature with the similarity larger than a first preset threshold value with the target digital signature exists in the preset malicious digital signature, judging the website to be identified as a malicious website;

and if no digital signature with the similarity greater than a first preset threshold value with the target digital signature exists in the preset malicious digital signatures, executing the step of inputting the website identification to a pre-trained recognition model.

Optionally, the website information further includes a web page image of the website to be recognized, and before the step of inputting the website identifier into a recognition model trained in advance, the method further includes:

acquiring a target image fingerprint corresponding to the webpage image of the website to be identified;

judging whether an image fingerprint with similarity greater than a second preset threshold value with the target image fingerprint exists in preset malicious image fingerprints;

if image fingerprints with similarity greater than a second preset threshold value with the target image fingerprints exist in preset malicious image fingerprints, judging the website to be identified as a malicious website;

and if no image fingerprint with the similarity of the target image fingerprint larger than a second preset threshold exists in the preset malicious image fingerprints, the step of inputting the website identification into a pre-trained recognition model is executed.

Optionally, the method further includes:

and sending the identification result of the website to be identified to a preset terminal so that the terminal displays the identification result of the website to be identified.

In a second aspect, in order to achieve the above object, an embodiment of the present invention discloses an apparatus for identifying a malicious website, where the apparatus includes:

the acquisition module is used for acquiring website information of a website to be identified, wherein the website information comprises a website identification of the website to be identified;

the first processing module is used for inputting the website identification to a pre-trained recognition model, wherein the recognition model is obtained by training according to initial data and oversampling data, the initial data comprises a preset website identification of a comparison website, the comparison website comprises a preset malicious website and a preset non-malicious website, and the oversampling data is obtained by processing the initial data according to a preset oversampling algorithm;

and the determining module is used for determining the identification result of the website to be identified according to the output result of the identification model.

Optionally, the apparatus further comprises:

the second processing module is used for judging whether websites with the same website identification as the website to be identified exist in the comparison websites;

and if the website with the same website identification as the website to be identified does not exist in the comparison websites, triggering the first processing module.

Optionally, the website information further includes a domain name of the website to be identified,

the second processing module is further configured to obtain a target digital signature corresponding to the domain name of the website to be identified;

and if the preset malicious digital signature does not have a digital signature with the similarity to the target digital signature larger than a first preset threshold value, triggering the first processing module.

Optionally, the website information further includes a webpage image of the website to be identified,

the second processing module is further configured to acquire a target image fingerprint corresponding to the web page image of the website to be identified;

and if no image fingerprint with the similarity of the target image fingerprint larger than a second preset threshold exists in the preset malicious image fingerprints, triggering the first processing module.

Optionally, the apparatus further comprises:

and the sending module is used for sending the identification result of the website to be identified to a preset terminal so that the terminal displays the identification result of the website to be identified.

In another aspect of the present invention, in order to achieve the above object, an embodiment of the present invention discloses a terminal, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

the memory is used for storing a computer program;

the processor is configured to implement any of the above method steps when executing the program stored in the memory.

In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform any of the method steps described above.

In yet another aspect of the present invention, the present invention also provides a computer program product containing instructions which, when executed on a computer, cause the computer to perform any of the method steps described above.

The method and the device for identifying the malicious website can acquire website information of the website to be identified, the website information comprises a website identification of the website to be identified, the website identification is input into a pre-trained identification model, the identification model is obtained by training according to initial data and over-sampling data, the initial data comprises a website identification of a preset comparison website, the comparison website comprises a preset malicious website and a preset non-malicious website, the over-sampling data is obtained by processing the initial data according to a preset over-sampling algorithm, and an identification result of the website to be identified is determined according to an output result of the identification model. Based on the processing, the training data can be balanced, the recognition precision of the recognition model is improved, and the accuracy of malicious website recognition is further improved.

Of course, it is not necessary for any product or method of practicing the invention to achieve all of the above-described advantages at the same time.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a flowchart of a method for identifying a malicious website according to an embodiment of the present invention;

Fig. 2 is a flowchart of an example of a method for identifying a malicious website according to an embodiment of the present invention;

Fig. 3 is a structural diagram of an apparatus for identifying malicious websites according to an embodiment of the present invention;

Fig. 4 is a structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the prior art, when a website to be recognized is recognized according to a trained recognition model, the set training data are unbalanced, so that the recognition accuracy of the recognition model is low, and the recognition accuracy of malicious websites is reduced.

In order to solve the above problem, embodiments of the present invention provide a method and an apparatus for identifying a malicious website, which may be applied to an electronic device, where the electronic device may be a terminal or a server. The electronic equipment can obtain website information of a website to be recognized, the website information comprises a website identification of the website to be recognized, then the electronic equipment inputs the website identification into a recognition model which is trained in advance, the recognition model is obtained by training according to initial data and oversampling data, the initial data comprises a website identification of a preset comparison website, the comparison website comprises a preset malicious website and a preset non-malicious website, the oversampling data is obtained by processing the initial data according to a preset oversampling algorithm, and a recognition result of the website to be recognized is determined according to an output result of the recognition model. Based on the processing, the training data can be balanced, the recognition precision of the recognition model is improved, and the accuracy of malicious website recognition is further improved.

Referring to fig. 1, fig. 1 is a flowchart of a method for identifying a malicious website according to an embodiment of the present invention, where the method may include the following steps:

S101: acquiring the website information of the website to be identified.

The website information may include a website identifier of the website to be identified, and the website identifier may be a URL (Uniform Resource Locator) corresponding to the website to be identified.

In implementation, for a website to be identified, the electronic device may obtain a website identifier of the website to be identified.

S102: inputting the website identification into a pre-trained recognition model.

The recognition model may be trained according to initial data and oversampled data. Specifically, the recognition model may be a DBN (Deep Belief Network) model or another probabilistic generative model in the prior art. The initial data may include website identifications of preset comparison websites; the comparison websites include preset malicious websites and preset non-malicious websites, which may be set by technicians based on experience. The oversampled data may be obtained by processing the initial data according to a preset oversampling algorithm, which may be the Borderline-SMOTE (Borderline Synthetic Minority Over-sampling Technique) algorithm or another oversampling algorithm in the prior art. Based on the Borderline-SMOTE algorithm, boundary minority class data may first be determined from the initial data, and synthetic minority class data (i.e., oversampled data) is then generated from the boundary minority class data. Specifically, the Borderline-SMOTE algorithm can be realized by the following steps:

The initial data (i.e., the website identifications of the comparison websites) can be represented by T. Accordingly, the minority class data in the initial data (i.e., the website identifications of the malicious websites) can be represented by P, and the majority class data in the initial data (i.e., the website identifications of the non-malicious websites) can be represented by N. Each class of data can be represented by formula (1):

$$T = P + N,\quad P = \{p_1, p_2, \dots, p_{pnum}\},\quad N = \{n_1, n_2, \dots, n_{nnum}\} \tag{1}$$

where pnum represents the number of samples (which may be referred to as minority class samples) in the minority class data P, and nnum represents the number of samples (which may be referred to as majority class samples) in the majority class data N.

For each minority class sample $p_i$ ($i = 1, 2, \dots, pnum$) in the minority class data P, the m nearest neighbors of $p_i$ can be determined in the initial data T. The number of these m neighbors that belong to the majority class data N can be represented by $m'$ ($0 \le m' \le m$).

If $m' = m$, that is, all m neighbors of $p_i$ belong to the majority class data N, $p_i$ can be determined to be a noise sample, and no subsequent processing of $p_i$ is needed. If $m'$ is less than m but the m neighbors of $p_i$ contain more samples belonging to the majority class data than samples belonging to the minority class data (that is, $m' > m - m'$), $p_i$ is susceptible to misclassification and may be taken as a boundary minority class sample. The set DANGER can then be obtained from the boundary minority class samples so determined. In the remaining case, $p_i$ is confirmed to be a safe sample, and no subsequent processing of $p_i$ is needed.

The number of boundary minority class samples in the set DANGER can be represented by dnum, and then equation (2) can be obtained:

$$DANGER = \{p'_1, p'_2, \dots, p'_{dnum}\},\quad 0 \le dnum \le pnum \tag{2}$$

For each boundary minority class sample $p'_x$ ($1 \le x \le dnum$) in the set DANGER, the k nearest neighbors of $p'_x$ can be determined in the minority class data P.

For each boundary minority class sample $p'_x$, s samples are randomly selected from its k neighbors in the minority class data P, and the distance $diff_j$ ($j = 1, 2, \dots, s$) between each selected sample and $p'_x$ is calculated. For each $diff_j$, a corresponding random number $r_j$ ($j = 1, 2, \dots, s$) between 0 and 1 is generated; s synthetic minority class samples can then be generated according to equation (3):

$$synthetic_j = p'_x + r_j \times diff_j,\quad j = 1, 2, \dots, s,\quad 1 \le x \le dnum \tag{3}$$

where $synthetic_j$ represents the j-th synthetic minority class sample generated for the boundary minority class sample $p'_x$, i.e., the oversampled data.

The initial data T and the oversampled data $synthetic_j$ may then be combined as the training data with which the recognition model is trained.
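The oversampling steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes each website identification has already been converted into a numeric feature vector (the text above does not say how), uses Euclidean distance, and treats a tie between majority and minority neighbors as a safe sample. The names `k_nearest` and `borderline_smote` are hypothetical.

```python
import math
import random


def k_nearest(index, points, count):
    """Return indices of the `count` points closest to points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:count]


def borderline_smote(P, N, m=5, k=5, s=2, seed=0):
    """Return (DANGER set, synthetic minority samples).

    P, N: lists of numeric feature tuples (minority / majority class).
    Assumption: features are already numeric; the URL encoding is out of scope here.
    """
    rng = random.Random(seed)
    T = P + N  # minority samples occupy indices 0..len(P)-1 of T
    danger = []
    for i in range(len(P)):
        neighbours = k_nearest(i, T, m)
        m_prime = sum(1 for j in neighbours if j >= len(P))  # majority neighbours
        if m_prime == len(neighbours):
            continue  # noise sample: every neighbour is a majority sample
        if m_prime > len(neighbours) - m_prime:
            danger.append(i)  # boundary minority sample
        # otherwise: safe sample, nothing to do

    synthetic = []
    for i in danger:
        candidates = k_nearest(i, P, k)  # k nearest neighbours within P
        for j in rng.sample(candidates, min(s, len(candidates))):
            diff = [q - p for p, q in zip(P[i], P[j])]
            r = rng.random()  # random number in [0, 1)
            synthetic.append(tuple(p + r * d for p, d in zip(P[i], diff)))
    return [P[i] for i in danger], synthetic
```

The training set would then be `P + N + synthetic`, with the synthetic samples labeled as minority class. Each synthetic sample lies on the line segment between a boundary sample and one of its minority class neighbors, which is why the new samples concentrate near the class boundary.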

In implementation, the electronic device may input a website identifier of a website to be recognized into a recognition model trained in advance, so as to recognize the website to be recognized.

S103: determining the identification result of the website to be identified according to the output result of the identification model.

In implementation, the electronic device may determine the recognition result of the website to be recognized according to the output result of the recognition model. Generally, the output result of the recognition model is the probability that the website to be recognized is a malicious website. Specifically, when the output result of the recognition model is greater than a third preset threshold, the electronic device may determine that the website to be recognized is a malicious website; when the output result is less than or equal to the third preset threshold, the electronic device may determine that the website to be recognized is a non-malicious website.
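The decision rule in S103 amounts to a single comparison. The sketch below assumes the model output is a probability in [0, 1]; the function name `decide` and the default threshold of 0.5 are illustrative assumptions, since the text leaves the third preset threshold unspecified.

```python
def decide(malicious_probability, third_threshold=0.5):
    """Map the model's output probability to an identification result.

    third_threshold is a hypothetical value for the "third preset threshold".
    """
    if malicious_probability > third_threshold:
        return "malicious"
    return "non-malicious"  # output less than or equal to the threshold
```

Note that an output exactly equal to the threshold is treated as non-malicious, matching the "less than or equal to" wording above.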

Optionally, before the website to be recognized is recognized according to the recognition model, the website to be recognized may also be recognized directly according to the website identifier of the comparison website. Specifically, before inputting the website identifier into the pre-trained recognition model, the method may further include the following steps: judging whether websites with the same website identification as the website to be identified exist in the comparison websites; if the website which has the same website identification as the website to be identified exists in the comparison websites, determining the identification result of the website to be identified according to the website which has the same website identification as the website to be identified; if no website with the same website identification as the website to be identified exists in the comparison websites, step S102 is executed.

The electronic device may locally store a website blacklist for storing website identifications of malicious websites in comparison websites, and may locally store a website whitelist for storing website identifications of non-malicious websites in comparison websites.

In implementation, after acquiring the website identifier of the website to be identified, the electronic device may directly perform query in a local website blacklist and a local website whitelist to determine whether the website identifier of the website to be identified exists. When the electronic device queries the website identifier of the website to be identified in the website blacklist, the electronic device can determine that the website to be identified is a malicious website. When the electronic device queries the website identifier of the website to be identified in the website white list, the electronic device may determine that the website to be identified is a non-malicious website. Correspondingly, when the electronic device does not inquire the website identification of the website to be identified in the website blacklist and the website whitelist, the electronic device can input the website identification of the website to be identified to the pre-trained identification model so as to further identify the website to be identified.

As can be seen from the above, based on the method for identifying a malicious website in the embodiment of the present invention, the electronic device can directly identify the website to be identified according to the website identifier of the comparison website, and the efficiency of identifying the malicious website can be improved.
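A minimal sketch of the list lookup described above follows. It assumes the blacklist and whitelist are held in memory as Python sets of URL strings; the function name `lookup_known_site` and the example URLs are hypothetical.

```python
from typing import Optional, Set


def lookup_known_site(url: str, blacklist: Set[str], whitelist: Set[str]) -> Optional[str]:
    """Return a result if the URL is a known comparison website, else None.

    None means the caller should fall through to the recognition model (S102).
    """
    if url in blacklist:
        return "malicious"
    if url in whitelist:
        return "non-malicious"
    return None


# Illustrative data only; these URLs are placeholders.
blacklist = {"http://phish.example.com/login"}
whitelist = {"https://www.example.org/"}
result = lookup_known_site("https://unknown.example.net/", blacklist, whitelist)
```

Because set membership is a constant-time operation on average, this check is cheap compared with running the model, which is the efficiency gain the paragraph above refers to.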

Optionally, the electronic device may further identify the website to be identified according to the similarity of the domain names of the websites. Specifically, before the step of inputting the website identifier into the pre-trained recognition model, the method may further include the following steps: acquiring a target digital signature corresponding to a domain name of a website to be identified; judging whether a digital signature with similarity greater than a first preset threshold value to a target digital signature exists in preset malicious digital signatures; if the digital signature with the similarity larger than a first preset threshold value with the target digital signature exists in the preset malicious digital signature, judging the website to be identified as a malicious website; if there is no digital signature with similarity greater than the first preset threshold with the target digital signature in the preset malicious digital signatures, step S102 is executed.

The digital signature (i.e., malicious digital signature) corresponding to the domain name of the malicious website may also be stored in the local website blacklist of the electronic device, and the first preset threshold may be set by a technician according to experience.

In implementation, the electronic device may obtain a digital signature (i.e., a target digital signature) corresponding to the domain name of the website to be identified. Specifically, for the domain name of a given website, the electronic device may generate the set of short strings corresponding to the domain name according to the K-Shingles method. For example, for the domain name "login.taobao.com" and K = 3, the set of short strings is: { "log", "ogi", "gin", "in.", "n.t", ".ta", "tao", "aob", "oba", "bao", "ao.", "o.c", ".co", "com" }. The value of K can be set empirically by a skilled person. Then, the electronic device may hash the set of short strings with a hash function and extract the minimum hash to obtain the digital signature corresponding to the domain name.

For each malicious digital signature in the website blacklist, the electronic device may calculate the similarity between the malicious digital signature and the target digital signature; specifically, the electronic device may calculate the Jaccard similarity between the two. Then, the electronic device may determine whether a digital signature with similarity to the target digital signature greater than a first preset threshold exists in the preset malicious digital signatures.

When the electronic device judges that such a digital signature exists in the preset malicious digital signatures, the electronic device can judge that the website to be identified is a malicious website. When the electronic device judges that no digital signature with similarity to the target digital signature greater than the first preset threshold exists in the preset malicious digital signatures, the electronic device can input the website identification of the website to be identified to the pre-trained identification model so as to further identify the website to be identified.

As can be seen from the above, based on the method for identifying a malicious website in the embodiment of the present invention, the electronic device may further identify the website to be identified according to the similarity of the domain names of the websites, so that the accuracy of identification can be improved.
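The domain-name signature comparison can be sketched as follows. This is an illustration under assumptions rather than the patented procedure: it uses salted SHA-1 from the standard library as the hash family, 64 hash functions, and estimates the Jaccard similarity as the fraction of matching signature positions. All function names are hypothetical.

```python
import hashlib


def shingles(domain, k=3):
    """Return the set of length-k substrings (k-shingles) of a domain name."""
    if len(domain) < k:
        return {domain}  # assumption: a short domain is its own single shingle
    return {domain[i:i + k] for i in range(len(domain) - k + 1)}


def _salted_hash(text, seed):
    """One member of a hash family: SHA-1 of the seed-prefixed text, as an integer."""
    digest = hashlib.sha1(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(domain, k=3, num_hashes=64):
    """Signature = the minimum hash of the shingle set under each hash function."""
    items = shingles(domain, k)
    return [min(_salted_hash(x, seed) for x in items) for seed in range(num_hashes)]


def signature_similarity(sig_a, sig_b):
    """Fraction of equal positions, a MinHash estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(set_a, set_b):
    """Exact Jaccard similarity of two shingle sets, for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)
```

A target domain would be flagged when `signature_similarity` against any stored malicious signature exceeds the first preset threshold. Storing fixed-length signatures instead of full shingle sets keeps the blacklist compact, at the cost of the similarity being an estimate.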

Optionally, the electronic device may further identify the website to be identified according to the similarity of the web page images. Specifically, before the step of inputting the website identifier into the pre-trained recognition model, the method may further include the following steps: acquiring a target image fingerprint corresponding to a webpage image of a website to be identified; judging whether an image fingerprint with similarity greater than a second preset threshold value with the target image fingerprint exists in the preset malicious image fingerprints; if image fingerprints with similarity greater than a second preset threshold value with the target image fingerprints exist in the preset malicious image fingerprints, judging the website to be identified as a malicious website; if there is no image fingerprint with the similarity of the target image fingerprint larger than the second preset threshold in the preset malicious image fingerprints, executing step S102.

The electronic device may further store an image fingerprint (i.e., a malicious image fingerprint) corresponding to the web page image of the malicious website in the local website blacklist, and the second preset threshold may be set by a technician according to experience.

In implementation, the electronic device may acquire an image fingerprint (i.e., a target image fingerprint) corresponding to the web page image of the website to be identified. Specifically, for a given web page, the electronic device may first take a screenshot of the web page to obtain a screenshot image, and then scale the screenshot image to a preset size, for example, 8 pixels × 8 pixels. The electronic device may calculate the average gray value of the pixels in the scaled screenshot image. For each pixel, when the gray value of the pixel is greater than or equal to the average gray value, the value of the pixel may be recorded as 1; when the gray value of the pixel is smaller than the average gray value, the value of the pixel may be recorded as 0. For the screenshot image, the electronic device thus obtains a 64-bit 0/1 string, i.e., the image fingerprint corresponding to the web page image.

The electronic device can then judge whether an image fingerprint with similarity to the target image fingerprint greater than a second preset threshold exists in the preset malicious image fingerprints. Specifically, the electronic device may calculate the Hamming distance between the target image fingerprint and each malicious image fingerprint. A Hamming distance between the target image fingerprint and a malicious image fingerprint that is smaller than a fourth preset threshold indicates that the similarity between the two is greater than the second preset threshold; a Hamming distance greater than the fourth preset threshold indicates that the similarity between the two is less than the second preset threshold.

When the electronic device judges that an image fingerprint with similarity to the target image fingerprint greater than the second preset threshold exists in the preset malicious image fingerprints, the electronic device can judge that the website to be identified is a malicious website. When the electronic device judges that no such image fingerprint exists in the preset malicious image fingerprints, the electronic device can input the website identification of the website to be identified into the pre-trained identification model so as to further identify the website to be identified.
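The fingerprint computation and comparison can be sketched in a few lines. The sketch assumes the screenshot has already been captured, converted to grayscale, and scaled to 8 × 8, so the input is a flat list of 64 gray values; the function names and the example value 5 for the fourth preset threshold are hypothetical.

```python
def average_hash(gray_pixels):
    """Build the 0/1 fingerprint: 1 where a pixel is >= the mean gray value, else 0.

    gray_pixels: gray values of the scaled screenshot, row by row (64 for 8 x 8).
    """
    mean = sum(gray_pixels) / len(gray_pixels)
    return "".join("1" if value >= mean else "0" for value in gray_pixels)


def hamming_distance(fp_a, fp_b):
    """Number of positions at which two equal-length fingerprints differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(a != b for a, b in zip(fp_a, fp_b))


def matches_malicious(target_fp, malicious_fps, fourth_threshold=5):
    """True if any stored fingerprint is within the Hamming-distance threshold.

    fourth_threshold is an illustrative value for the "fourth preset threshold".
    """
    return any(hamming_distance(target_fp, fp) < fourth_threshold
               for fp in malicious_fps)
```

A small Hamming distance means the two pages look alike at a coarse level, which is how a phishing page that visually copies a known malicious template can be caught even when its URL is new.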

Optionally, the method may further display the identification result of the website to be identified. Specifically, the method may further include the steps of: and sending the identification result of the website to be identified to a preset terminal so that the terminal displays the identification result of the website to be identified.

Wherein the terminal comprises a display component.

In implementation, the electronic device may send the identification result of the website to be identified to a preset terminal, so that the terminal may display the identification result of the website to be identified through the display component.

Therefore, based on the malicious website identification method provided by the embodiment of the invention, the electronic device can also send the identification result of the website to be identified to the preset terminal, so that the terminal can display the identification result to the user, which can improve the user experience.

Referring to fig. 2, fig. 2 is a flowchart of an example of a method for identifying a malicious website according to an embodiment of the present invention, where the method may include the following steps:

S201: acquiring the website information of the website to be identified.

The website information may include a website identifier of the website to be identified.

S202: judging whether the comparison websites include a website with the same website identification as the website to be identified; if so, executing S203, and if not, executing S204.

S203: determining the identification result of the website to be identified according to the comparison website that has the same website identification as the website to be identified.

S204: judging whether a digital signature with the similarity to the target digital signature larger than a first preset threshold exists in the preset malicious digital signatures, if so, executing S205, and if not, executing S206.

The website information can also comprise a domain name of the website to be identified, and the target digital signature is a digital signature corresponding to the domain name of the website to be identified.

S205: judging that the website to be identified is a malicious website.

S206: judging whether an image fingerprint with a similarity to the target image fingerprint greater than a second preset threshold exists in the preset malicious image fingerprints; if so, executing S205, and if not, executing S207.

The website information can also comprise a webpage image of the website to be identified, and the target image fingerprint is an image fingerprint corresponding to the webpage image of the website to be identified.

S207: inputting the website identification of the website to be identified into a pre-trained identification model.

The identification model can be obtained by training according to initial data and oversampling data, the initial data can include a website identifier of a preset comparison website, the comparison website can include a preset malicious website and a preset non-malicious website, and the oversampling data is obtained by processing the initial data according to a preset oversampling algorithm.

S208: determining the identification result of the website to be identified according to the output result of the identification model.

S209: sending the identification result of the website to be identified to a preset terminal, so that the terminal displays the identification result of the website to be identified.
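
The order of checks in steps S201 to S208 can be sketched as a simple cascade. This is only an illustrative outline: the `WebsiteInfo` type, the `identify` function, and the three callables are hypothetical names, and the signature check, fingerprint check, and model are passed in as black boxes rather than implemented. Step S209 (sending the result to a terminal) is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class WebsiteInfo:
    """Website information acquired in S201 (field names are assumptions)."""
    identifier: str       # website identification
    domain: str           # domain name, used for the digital signature
    page_image: object    # web page image, used for the image fingerprint


def identify(
    info: WebsiteInfo,
    comparison_labels: Dict[str, bool],
    signature_matches: Callable[[WebsiteInfo], bool],
    fingerprint_matches: Callable[[WebsiteInfo], bool],
    model: Callable[[str], bool],
) -> bool:
    """Return True if the website is judged malicious, False otherwise."""
    # S202/S203: a known comparison website decides the result directly.
    if info.identifier in comparison_labels:
        return comparison_labels[info.identifier]
    # S204/S205: a similar malicious digital signature means malicious.
    if signature_matches(info):
        return True
    # S206/S205: a similar malicious image fingerprint means malicious.
    if fingerprint_matches(info):
        return True
    # S207/S208: otherwise the pre-trained identification model decides.
    return model(info.identifier)
```

Arranged this way, the identification model is consulted only for websites that none of the earlier checks has already decided.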

As can be seen from the above, the method for identifying a malicious website according to the embodiment of the present invention can acquire website information of a website to be identified, where the website information includes a website identification of the website to be identified; input the website identification into a pre-trained identification model; and determine an identification result of the website to be identified according to an output result of the identification model. The identification model is obtained by training according to initial data and over-sampling data, where the initial data includes website identifications of preset comparison websites, the comparison websites include preset malicious websites and preset non-malicious websites, and the over-sampling data is obtained by processing the initial data according to a preset over-sampling algorithm. Based on this processing, the training data can be balanced, the recognition precision of the identification model is improved, and the accuracy of malicious website identification is further improved.
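
To illustrate how over-sampling data can balance the training data, the following minimal sketch duplicates randomly chosen minority-class samples until both classes are the same size. The passage above does not name the preset over-sampling algorithm, so plain random over-sampling is used here purely as a stand-in; the function name, the `(identifier, label)` tuple layout, and the label convention (1 for malicious, 0 for non-malicious) are assumptions.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, int]  # (website identification, label); assumed layout


def random_oversample(initial: List[Sample], seed: int = 0) -> List[Sample]:
    """Return extra samples that bring every class up to the largest class.

    Assumption: random duplication stands in for the unspecified
    "preset over-sampling algorithm".
    """
    rng = random.Random(seed)
    by_label: Dict[int, List[Sample]] = {}
    for sample in initial:
        by_label.setdefault(sample[1], []).append(sample)
    target = max(len(group) for group in by_label.values())
    extra: List[Sample] = []
    for group in by_label.values():
        # Draw with replacement until this class reaches the target size.
        extra.extend(rng.choice(group) for _ in range(target - len(group)))
    return extra
```

The training set would then be the initial data plus the returned over-sampling data, for example `training = initial + random_oversample(initial)`.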

Corresponding to the embodiment of the method in fig. 1, referring to fig. 3, fig. 3 is a block diagram of an apparatus for identifying a malicious website according to an embodiment of the present invention, where the apparatus may include:

an obtaining module 301, configured to obtain website information of a website to be identified, where the website information includes a website identifier of the website to be identified;

the first processing module 302 is configured to input the website identifier into a pre-trained recognition model, where the recognition model is obtained by training according to initial data and over-sampling data, the initial data includes a website identifier of a preset comparison website, the comparison website includes a preset malicious website and a preset non-malicious website, and the over-sampling data is obtained by processing the initial data according to a preset over-sampling algorithm;

and the determining module 303 is configured to determine the recognition result of the website to be recognized according to the output result of the recognition model.

Optionally, the apparatus further comprises:

if there is no website with the same website identification as the website to be identified in the comparison websites, the first processing module 302 is triggered.

if there is no digital signature in the preset malicious digital signatures whose similarity to the target digital signature is greater than a first preset threshold, the first processing module 302 is triggered.

if there is no image fingerprint in the preset malicious image fingerprints whose similarity to the target image fingerprint is greater than a second preset threshold, the first processing module 302 is triggered.

Optionally, the apparatus further comprises:

As can be seen from the above, the malicious website identification device according to the embodiment of the present invention can acquire website information of a website to be identified, where the website information includes a website identification of the website to be identified; input the website identification into a pre-trained identification model; and determine an identification result of the website to be identified according to an output result of the identification model. The identification model is obtained by training according to initial data and over-sampling data, where the initial data includes website identifications of preset comparison websites, the comparison websites include preset malicious websites and preset non-malicious websites, and the over-sampling data is obtained by processing the initial data according to a preset over-sampling algorithm. Based on this processing, the training data can be balanced, the recognition precision of the identification model is improved, and the accuracy of malicious website identification is further improved.

The embodiment of the present invention further provides a terminal, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 communicate with one another through the communication bus 404,

a memory 403 for storing a computer program;

the processor 401, when executing the program stored in the memory 403, implements the following steps:

The communication bus mentioned in the above embodiments may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the terminal and other equipment.

The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; the processor may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In another embodiment of the present invention, there is also provided a computer-readable storage medium, having stored therein instructions, which when run on a computer, cause the computer to execute the method for identifying a malicious website described in any one of the above embodiments.

In yet another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method for identifying malicious websites according to any one of the above embodiments.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any usable medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, the electronic device, the computer-readable storage medium, and the computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.

Claims

1. A method for identifying a malicious website, the method comprising:

determining the identification result of the website to be identified according to the output result of the identification model;

the website information further includes a domain name of the website to be recognized, and before the website identification is input into a recognition model trained in advance, the method further includes:

if no digital signature with the similarity greater than a first preset threshold value with the target digital signature exists in preset malicious digital signatures, the step of inputting the website identification into a pre-trained recognition model is executed;

the acquiring of the target digital signature corresponding to the domain name of the website to be identified includes:

generating a short character string set corresponding to the domain name of the website to be identified according to a K-shingles method;

carrying out hash processing on the short character string set according to a hash function;

taking the obtained minimum hash as a target digital signature corresponding to the domain name of the website to be identified;

the website information further comprises a webpage image of the website to be recognized, and before the website identification is input into a pre-trained recognition model, the method further comprises the following steps:

2. The method of claim 1, wherein prior to said inputting said website identification into a pre-trained recognition model, said method further comprises:

3. The method of claim 1, further comprising:

4. An apparatus for identifying malicious websites, the apparatus comprising:

the determining module is used for determining the identification result of the website to be identified according to the output result of the identification model;

the website information further includes a domain name of the website to be identified,

the second processing module is used for acquiring a target digital signature corresponding to the domain name of the website to be identified;

if no digital signature with the similarity greater than a first preset threshold value with the target digital signature exists in the preset malicious digital signatures, triggering the first processing module;

the second processing module is specifically configured to generate a short character string set corresponding to the domain name of the website to be identified according to a K-shingles method;

the website information further includes a web page image of the website to be identified,

5. The apparatus according to claim 4, wherein the second processing module is further configured to determine whether there is a website with the same website identifier as the website to be identified in the comparison website;

6. The apparatus of claim 4, further comprising: