CN116015772A - Malicious website processing method, device, equipment and storage medium - Google Patents

Info

Publication number
CN116015772A
Authority
CN
China
Prior art keywords
webpage
screenshot
target
tag sequence
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211590350.9A
Other languages
Chinese (zh)
Inventor
王晓伟
马庆贺
高磊
杨真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Secxun Technology Co ltd
Original Assignee
Shenzhen Secxun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Secxun Technology Co ltd
Priority to CN202211590350.9A
Publication of CN116015772A

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to the technical field of data security, and discloses a method, a device, equipment and a storage medium for processing a malicious website, wherein the method comprises the following steps: obtaining a webpage text, a webpage tag sequence and a webpage screenshot according to a malicious website to be processed; identifying the characteristic information through a target webpage image classification model; identifying the webpage text through the target text classification model, and identifying the webpage tag sequence through the target tag sequence classification model; determining the category of the malicious website to be processed according to the identification result and the category of the webpage screenshot; determining a target website processing strategy according to the category of the malicious website to be processed, and processing the malicious website to be processed according to the target website processing strategy; by the method, the malicious websites to be processed are processed according to the target website processing strategy determined by the category, so that the efficiency and the accuracy of processing the malicious websites can be effectively improved.

Description

Malicious website processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data security technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing a malicious website.
Background
The internet brings convenience to people but also brings harm, such as fraud. As internet technology continues to change, elderly users who are increasingly exposed to the network, and teenagers who use it for long periods, suffer property and psychological losses from fraud. One way of committing fraud is through malicious websites, for example loan scams, order-brushing scams, pig-butchering scams, and scams that impersonate public security, procuratorate and court authorities. Merely recognizing malicious websites is far from sufficient; how to handle them is what matters. At present, the commonly used related technology is the firewall: the firewall intercepts malicious websites and forbids access to them. However, the makers of malicious websites can build new malicious websites according to the working principle of the firewall, so the firewall's resistance drops rapidly and the efficiency and accuracy of processing malicious websites are low.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The invention mainly aims to provide a malicious website processing method, device, equipment and storage medium, and aims to solve the technical problems that the efficiency and accuracy of processing malicious websites are low in the prior art.
In order to achieve the above object, the present invention provides a method for processing a malicious website, where the method for processing a malicious website includes the following steps:
obtaining a webpage text, a webpage tag sequence and a webpage screenshot according to a malicious website to be processed;
extracting characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot;
when the content of the webpage text is less than a preset text content threshold and the content of the webpage tag sequence is less than a preset tag sequence threshold, identifying the webpage text through a target text classification model, and identifying the webpage tag sequence through a target tag sequence classification model;
determining the category of the malicious website to be processed according to the identification result and the category of the webpage screenshot;
determining a target website processing strategy according to the category of the malicious website to be processed, and processing the malicious website to be processed according to the target website processing strategy.
Optionally, the obtaining the webpage text, the webpage tag sequence and the webpage screenshot according to the malicious website to be processed includes:
acquiring a malicious website to be processed, and accessing the malicious website to be processed on a virtual machine through a target HTTP GET command to obtain malicious website source code and malicious website content;
analyzing the malicious website source code to obtain webpage text and webpage tag data;
obtaining a corresponding webpage tag sequence according to the webpage tag data;
and capturing the malicious website content through a target operation browser to obtain a webpage screenshot.
Optionally, the extracting feature information of the webpage screenshot, identifying the feature information through a target webpage image classification model, and obtaining the class of the webpage screenshot includes:
detecting the webpage screenshot to obtain a webpage screenshot shape;
adjusting the screenshot shape of the webpage according to a preset fixed image shape;
obtaining a corresponding screenshot pixel value according to the webpage screenshot after the shape adjustment;
when a screenshot pixel value is located in a preset pixel interval, carrying out mean value calculation on the screenshot pixel value to obtain a current screenshot pixel mean value, and carrying out variance calculation on the screenshot pixel value to obtain a current screenshot pixel variance;
and when the current screenshot pixel mean value is a preset mean value threshold value and the current screenshot pixel variance is a preset variance threshold value, extracting characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot.
Optionally, when the current screenshot pixel mean value is a preset mean value threshold and the current screenshot pixel variance is a preset variance threshold, extracting feature information of the webpage screenshot, and identifying the feature information through a target webpage image classification model to obtain a class of the webpage screenshot, including:
when the current screenshot pixel mean value is a preset mean value threshold value and the current screenshot pixel variance is a preset variance threshold value, extracting features of the webpage screenshot through a ResNet network to obtain feature information of each scale;
fusing the characteristic information of each scale through a MaxPooling network layer to obtain multi-scale characteristic information;
identifying the multi-scale characteristic information through a full-connection layer of a target webpage image classification model to obtain probability values of various categories to which the webpage screenshot belongs;
and extracting the maximum probability value in the probability values of the various categories, and taking the category corresponding to the maximum probability value as the category of the webpage screenshot.
Optionally, when the content of the web page text is less than a preset text content threshold and the content of the web page tag sequence is less than a preset tag sequence threshold, identifying the web page text through a target text classification model, and identifying the web page tag sequence through a target tag sequence classification model includes:
detecting the webpage text, and obtaining corresponding text content according to a webpage text detection result;
detecting the webpage tag sequence, and obtaining corresponding tag sequence content according to a tag sequence detection result;
when the content of the webpage text is less than a preset text content threshold value, judging whether the content of the webpage tag sequence is less than a preset tag sequence threshold value or not;
and when the content of the webpage tag sequence is less than a preset tag sequence threshold value, identifying the webpage text through a target text classification model, and identifying the webpage tag sequence through the target tag sequence classification model.
Optionally, when the content of the web page tag sequence is less than a preset tag sequence threshold, identifying the web page text through a target text classification model, and identifying the web page tag sequence through a target tag sequence classification model includes:
when the content of the webpage tag sequence is less than a preset tag sequence threshold value, performing word meaning analysis on the webpage text to obtain individual words;
counting the occurrence frequency of each word, and screening out the words whose frequency is greater than a preset frequency threshold value;
constructing a corresponding vocabulary from the words obtained by screening, and constructing a word embedding matrix according to the vocabulary;
obtaining a word vector list according to the webpage tag sequence and the word embedding matrix;
inquiring a word vector corresponding to the webpage text according to the word embedding matrix, and identifying the word vector through a target text classification model;
aggregating the word vectors and the word vector list through a global pooling layer and a weight connection layer to obtain target word vector features;
and identifying the target word vector features through a target tag sequence classification model.
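The vocabulary and word embedding steps above can be illustrated with a minimal sketch. It assumes the webpage text has already been split into words, uses randomly initialised embedding rows, and reserves index 0 for unknown words; the function names, the `<unk>` entry and the sample words are hypothetical and do not come from the application.

```python
from collections import Counter
import random


def build_vocabulary(tokens, min_freq):
    """Keep words whose frequency exceeds min_freq and map each to an index."""
    counts = Counter(tokens)
    kept = sorted(word for word, n in counts.items() if n > min_freq)
    # Index 0 is reserved for unknown words (an assumption of this sketch).
    vocab = {"<unk>": 0}
    for i, word in enumerate(kept):
        vocab[word] = i + 1
    return vocab


def build_embedding_matrix(vocab, dim, seed=0):
    """Create one randomly initialised row per vocabulary entry."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]


def lookup(tokens, vocab, matrix):
    """Return the word vector of each token, falling back to <unk>."""
    return [matrix[vocab.get(token, 0)] for token in tokens]


tokens = "loan fast loan cash loan fast approve".split()
vocab = build_vocabulary(tokens, min_freq=1)
matrix = build_embedding_matrix(vocab, dim=4)
vectors = lookup(tokens, vocab, matrix)
```

In a trained model the embedding rows would be learned rather than random; the sketch only shows how the frequency threshold shapes the vocabulary and how words are mapped to vectors.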
Optionally, the determining a target website processing policy according to the category of the malicious website to be processed, and processing the malicious website to be processed according to the target website processing policy includes:
selecting a target website processing strategy from a target malicious website processing strategy set according to the category of the malicious website to be processed;
intercepting the malicious website to be processed, and acquiring a uniform resource locator of the malicious website to be processed;
obtaining a corresponding uniform resource locator segment according to the domain name information of the uniform resource locator;
inserting a barrier character at a preset position of the uniform resource locator segment according to a target website processing strategy, and calculating a hash value of the uniform resource locator segment;
and storing the hash value of the uniform resource locator segment into a malicious website block chain.
In addition, in order to achieve the above object, the present invention further provides a processing device for a malicious website, where the processing device for a malicious website includes:
the acquisition module is used for acquiring a webpage text, a webpage tag sequence and a webpage screenshot according to the malicious website to be processed;
the extraction module is used for extracting the characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot;
the identification module is used for identifying the webpage text through the target text classification model and identifying the webpage tag sequence through the target tag sequence classification model when the content of the webpage text is less than a preset text content threshold and the content of the webpage tag sequence is less than a preset tag sequence threshold;
the determining module is used for determining the category of the malicious website to be processed according to the identification result and the category of the webpage screenshot;
and the processing module is used for determining a target website processing strategy according to the category of the malicious website to be processed and processing the malicious website to be processed according to the target website processing strategy.
In addition, in order to achieve the above object, the present invention further provides a malicious website processing device, where the malicious website processing device includes: a memory, a processor, and a malicious website processing program stored on the memory and executable on the processor, wherein the malicious website processing program is configured to implement the malicious website processing method described above.
In addition, in order to achieve the above object, the present invention further provides a storage medium, where a processing program of a malicious website is stored, where the processing program of the malicious website is executed by a processor to implement the processing method of the malicious website as described above.
According to the malicious website processing method, a webpage text, a webpage tag sequence and a webpage screenshot are obtained according to the malicious website to be processed; extracting characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot; when the content of the webpage text is less than a preset text content threshold and the content of the webpage tag sequence is less than a preset tag sequence threshold, identifying the webpage text through a target text classification model, and identifying the webpage tag sequence through a target tag sequence classification model; determining the category of the malicious website to be processed according to the identification result and the category of the webpage screenshot; determining a target website processing strategy according to the category of the malicious website to be processed, and processing the malicious website to be processed according to the target website processing strategy; by the method, the malicious websites to be processed are processed according to the target website processing strategy determined by the category, so that the efficiency and the accuracy of processing the malicious websites can be effectively improved.
Drawings
FIG. 1 is a schematic structural diagram of a processing device for malicious websites of a hardware running environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of a method for processing a malicious website according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of a method for processing a malicious website according to the present invention;
FIG. 4 is a schematic functional module diagram of a first embodiment of a malicious website processing apparatus according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic diagram of a processing device structure of a malicious website of a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the processing device of the malicious website may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, a memory 1005. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a Wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The Memory 1005 may be a high-speed random access Memory (Random Access Memory, RAM) Memory or a stable nonvolatile Memory (NVM), such as a disk Memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the architecture shown in fig. 1 does not constitute a limitation of the processing device of malicious web sites, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and a processing program of a malicious web site.
In the processing device of the malicious website shown in fig. 1, the network interface 1004 is mainly used for performing data communication with a workstation of the network integration platform; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the malicious website processing device of the present invention may be disposed in the malicious website processing device, where the malicious website processing device invokes, through the processor 1001, a processing program of a malicious website stored in the memory 1005, and executes a processing method of a malicious website provided by the embodiment of the present invention.
Based on the hardware structure, the embodiment of the method for processing the malicious website is provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a method for processing a malicious website according to the present invention.
In a first embodiment, the method for processing a malicious website includes the following steps:
Step S10, obtaining a webpage text, a webpage tag sequence and a webpage screenshot according to the malicious website to be processed.
It should be noted that the execution body of this embodiment is the processing device of the malicious website, or another device that can implement the same or similar functions, such as a website processor. This embodiment is not limited in this respect and is described taking the website processor as an example.
It should be understood that web page text refers to the text content of a web page generated by accessing a malicious web site to be processed on a virtual machine, a web page tag sequence refers to a tag sequence of the generated web page, the web page tag may be a web page HTML tag, and web page screenshots refer to screenshots of web page content including, but not limited to, web page text and web page pictures.
Further, step S10 includes: acquiring a malicious website to be processed, and accessing the malicious website to be processed on a virtual machine through a target HTTP GET command to obtain malicious website source code and malicious website content; analyzing the malicious website source code to obtain webpage text and webpage tag data; obtaining a corresponding webpage tag sequence according to the webpage tag data; and capturing the malicious website content through a target operation browser to obtain a webpage screenshot.
It can be understood that, after the malicious website to be processed is obtained, and in order to prevent it from attacking or intruding into the device, this embodiment accesses it from a virtual machine through a target HTTP GET command to obtain the malicious website source code and the malicious website content. The malicious website source code is the source code of the webpage corresponding to the malicious website to be processed, and the webpage tag data is the tag data of that webpage, located at the two ends of the source code. A corresponding webpage tag sequence is then obtained from the webpage tag data, and a screenshot of the malicious website content is captured through the target operation browser to obtain the webpage screenshot. The target operation browser may be a browser operated through Selenium.
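The parsing part of step S10 can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the page source is already available as a string (fetching inside a virtual machine and taking the screenshot are omitted), the tag sequence is taken to be the ordered list of opening tags, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects visible text and the ordered sequence of opening tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only fragments.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(source):
    """Return (webpage text, webpage tag sequence) for an HTML source string."""
    parser = PageExtractor()
    parser.feed(source)
    parser.close()
    return " ".join(parser.text_parts), parser.tags


html = ("<html><body><h1>Fast loan</h1><script>x=1</script>"
        "<p>Apply now</p></body></html>")
text, tags = extract(html)
```

Here `text` holds the visible text and `tags` the tag sequence that the later classification steps would consume.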
Step S20, extracting the characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot.
It can be understood that the feature information is information that uniquely identifies a webpage screenshot and may be a webpage screenshot identification field. The target webpage image classification model is a model for classifying webpage images; it is obtained by taking a model pre-trained on the ImageNet dataset and fine-tuning it on a webpage screenshot dataset through transfer learning. Compared with a general image classification model, the target webpage image classification model is deeper and connects its internal residual blocks with skip connections, which alleviates the vanishing-gradient problem caused by the increased depth.
Further, step S20 includes: detecting the webpage screenshot to obtain a webpage screenshot shape; adjusting the screenshot shape of the webpage according to a preset fixed image shape; obtaining a corresponding screenshot pixel value according to the webpage screenshot after the shape adjustment; when a screenshot pixel value is located in a preset pixel interval, carrying out mean value calculation on the screenshot pixel value to obtain a current screenshot pixel mean value, and carrying out variance calculation on the screenshot pixel value to obtain a current screenshot pixel variance; and when the current screenshot pixel mean value is a preset mean value threshold value and the current screenshot pixel variance is a preset variance threshold value, extracting characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot.
It should be understood that, after the shape of the webpage screenshot is obtained, the screenshot is adjusted to the preset fixed image shape. It is then judged whether the screenshot pixel values of the reshaped screenshot lie within the preset pixel interval, which is [0,1]; if not, the pixel values are scaled down proportionally into that interval. The current screenshot pixel mean value and the current screenshot pixel variance are then calculated, and it is judged whether the mean equals the preset mean value threshold and the variance equals the preset variance threshold; if not, the mean is normalized to the preset mean value threshold and the variance is normalized to the preset variance threshold. The preset mean value threshold is 0 and the preset variance threshold is 1. The class of the webpage screenshot is then identified through the target webpage image classification model.
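The preprocessing described above (fixed shape, pixel interval [0,1], mean 0 and variance 1) can be sketched in plain Python. The sketch assumes a grayscale screenshot stored as a 2-D list with raw values in 0 to 255, uses nearest-neighbour resizing, and picks a tiny 2×2 target shape for readability; all of these choices and the function names are assumptions, not details from the application.

```python
from statistics import fmean, pstdev


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list of pixel values."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def to_unit_interval(img, max_value=255):
    """Scale pixels into [0, 1] only when they fall outside that interval."""
    flat = [p for row in img for p in row]
    if min(flat) >= 0 and max(flat) <= 1:
        return img
    return [[p / max_value for p in row] for row in img]


def standardize(img, eps=1e-8):
    """Shift and scale pixels so that their mean is 0 and variance is 1."""
    flat = [p for row in img for p in row]
    mean, std = fmean(flat), pstdev(flat)
    return [[(p - mean) / (std + eps) for p in row] for row in img]


def preprocess(img, shape=(2, 2)):
    return standardize(to_unit_interval(resize_nearest(img, *shape)))


raw = [[0, 64, 128, 255],
       [255, 128, 64, 0],
       [0, 0, 255, 255],
       [255, 255, 0, 0]]
out = preprocess(raw)
flat = [p for row in out for p in row]
```

After `preprocess`, the values in `flat` have a mean of approximately 0 and a population variance of approximately 1, which is the condition the text places on the input of the image classification model.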
Further, when the current screenshot pixel mean value is a preset mean value threshold value and the current screenshot pixel variance is a preset variance threshold value, extracting feature information of the webpage screenshot, and identifying the feature information through a target webpage image classification model to obtain a class of the webpage screenshot, including: when the current screenshot pixel mean value is a preset mean value threshold value and the current screenshot pixel variance is a preset variance threshold value, extracting features of the webpage screenshot through a ResNet network to obtain feature information of each scale; fusing the characteristic information of each scale through a MaxPooling network layer to obtain multi-scale characteristic information; identifying the multi-scale characteristic information through a full-connection layer of a target webpage image classification model to obtain probability values of various categories to which the webpage screenshot belongs; and extracting the maximum probability value in the probability values of the various categories, and taking the category corresponding to the maximum probability value as the category of the webpage screenshot.
It will be appreciated that, after a webpage screenshot satisfying the condition is obtained, feature extraction is performed through a ResNet network, which contains network layers of different scales. Referring to fig. 3, the ResNet network includes, but is not limited to, (7×7 conv, 64, /2), (3×3 conv, 64), (3×3 conv, 128, /2), (3×3 conv, 128), (3×3 conv, 256, /2) and (3×3 conv, 512) layers. The feature information of each scale extracted by the ResNet network is fused into multi-scale feature information through the MaxPooling network layer. The multi-scale feature information is then identified through the full-connection layer of the target webpage image classification model, which outputs the probability value of each class to which the webpage screenshot may belong, and the class corresponding to the maximum probability value is taken as the class of the webpage screenshot. For example, if the probability value of class 1 is 60% and the probability value of class 3 is the largest at 95%, class 3 is taken as the class of the webpage screenshot.
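The classification head described above can be sketched without a deep learning framework. The sketch assumes that fusion means applying global max pooling to each scale's feature map and concatenating the results, and that a softmax turns the full-connection layer's outputs into probabilities; the application does not spell out either detail. The feature maps, weights and class labels are made-up toy values.

```python
import math


def global_max_pool(feature_map):
    """Reduce a list of channels (each a 2-D list) to one maximum per channel."""
    return [max(max(row) for row in channel) for channel in feature_map]


def fuse_scales(feature_maps):
    """Pool every scale, then concatenate the pooled vectors."""
    fused = []
    for feature_map in feature_maps:
        fused.extend(global_max_pool(feature_map))
    return fused


def fully_connected(x, weights, bias):
    """Plain linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature_maps, weights, bias, labels):
    """Return the label with the maximum probability and all probabilities."""
    probs = softmax(fully_connected(fuse_scales(feature_maps), weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


scale_a = [[[0.1, 0.9], [0.3, 0.2]], [[0.5, 0.4], [0.0, 0.1]]]  # 2 channels
scale_b = [[[0.7]]]                                              # 1 channel
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 2]]
bias = [0.0, 0.0, 0.0]
labels = ["loan", "lottery", "normal"]  # hypothetical class names
label, probs = classify([scale_a, scale_b], weights, bias, labels)
```

The final step, choosing the index of the largest probability, is the part that corresponds directly to taking the class with the maximum probability value as the class of the webpage screenshot.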
Step S30, when the content of the webpage text is less than a preset text content threshold and the content of the webpage tag sequence is less than a preset tag sequence threshold, identifying the webpage text through a target text classification model, and identifying the webpage tag sequence through a target tag sequence classification model.
It should be understood that, after the webpage text and the webpage tag sequence are obtained, it is judged whether the content of the webpage text is less than the preset text content threshold and the content of the webpage tag sequence is less than the preset tag sequence threshold. If so, the webpage text and the webpage tag sequence both contain too little content; in that case the webpage text is identified through the target text classification model and the webpage tag sequence is identified through the target tag sequence classification model. Both models are trained with the TextCNN deep learning algorithm, and the text categories used to train the target text classification model include pornography, lottery, loan, order brushing, ETC fraud, impersonation of public security, procuratorate and court authorities, and normal legitimate content.
Step S40, determining the category of the malicious website to be processed according to the identification result and the category of the webpage screenshot.
It can be understood that after the recognition results of the webpage text and the webpage tag sequence are obtained, the category of the malicious website to be processed is comprehensively considered and determined by combining the category of the webpage screenshot.
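The text does not state how the three results are combined. Purely as an illustration, the sketch below assumes a majority vote over the available categories, falling back to the screenshot category on a tie; this rule and the function name `fuse_categories` are assumptions of the sketch, not the method of the application.

```python
from collections import Counter


def fuse_categories(screenshot_cat, text_cat=None, tag_cat=None):
    """Majority vote over the available categories (hypothetical rule)."""
    votes = [c for c in (screenshot_cat, text_cat, tag_cat) if c is not None]
    counts = Counter(votes)
    top, top_n = counts.most_common(1)[0]
    # On a tie, prefer the screenshot category (arbitrary choice of this sketch).
    if list(counts.values()).count(top_n) > 1:
        return screenshot_cat
    return top
```

For instance, `fuse_categories("loan", "lottery", "lottery")` yields `"lottery"`, while three different categories fall back to the screenshot category. A weighted combination of the classifiers' probability values would be an equally plausible reading of "comprehensively considered".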
Step S50, determining a target website processing strategy according to the category of the malicious website to be processed, and processing the malicious website to be processed according to the target website processing strategy.
It should be understood that the target website processing policy refers to a policy for processing malicious websites, and because the processing policies of malicious websites of different types are different, when the type of the malicious website to be processed is obtained, the most appropriate website processing policy is determined according to the type of the malicious website to be processed, and then the malicious website to be processed is processed through the target website processing policy.
Further, step S50 includes: selecting a target website processing strategy from a target malicious website processing strategy set according to the category of the malicious website to be processed; intercepting the malicious website to be processed, and acquiring a uniform resource locator of the malicious website to be processed; obtaining a corresponding uniform resource locator segment according to the domain name information of the uniform resource locator; inserting a barrier character at a preset position of the uniform resource locator segment according to a target website processing strategy, and calculating a hash value of the uniform resource locator segment; and storing the hash value of the uniform resource locator segment into a malicious website block chain.
It can be understood that, after the target website processing strategy best suited to the category of the malicious website to be processed is selected, the malicious website to be processed is intercepted, that is, it is not accessed. A corresponding uniform resource locator segment is then obtained according to the domain name information of its uniform resource locator, and a barrier character is inserted at a preset position of the segment so that the whole malicious website to be processed is in an invalid state. The hash value of the uniform resource locator segment is then stored in the malicious website block chain, so that when other users encounter the malicious website to be processed, a malicious label pops up automatically, preventing their devices from being damaged by it.
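The URL handling in step S50 can be sketched with the standard library. Several points are assumptions of the sketch: the uniform resource locator segment is taken to be the host name, the barrier character is `#`, the hash is SHA-256 computed over the original segment (the text does not say whether the barrier character is included), and the block chain is reduced to a minimal in-memory hash chain. The class `HashChain` and the sample URL are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def url_segment(url):
    """Return the host part of the URL as the segment to process."""
    return urlsplit(url).hostname or ""


def insert_barrier(segment, position, barrier="#"):
    """Insert the barrier character at a preset position of the segment."""
    position = min(position, len(segment))
    return segment[:position] + barrier + segment[position:]


def segment_hash(segment):
    return hashlib.sha256(segment.encode("utf-8")).hexdigest()


class HashChain:
    """Minimal append-only chain: each block commits to the previous one."""

    def __init__(self):
        self.blocks = []

    def append(self, value):
        prev = self.blocks[-1]["block_hash"] if self.blocks else "0" * 64
        block_hash = hashlib.sha256((prev + value).encode("utf-8")).hexdigest()
        self.blocks.append({"prev": prev, "value": value,
                            "block_hash": block_hash})

    def contains(self, value):
        return any(block["value"] == value for block in self.blocks)


url = "http://malicious.example/login?x=1"
segment = url_segment(url)
blocked = insert_barrier(segment, position=4)
chain = HashChain()
chain.append(segment_hash(segment))
```

When another user later encounters a URL, hashing its segment and calling `chain.contains(...)` would indicate whether a malicious label should be shown.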
According to the embodiment, a webpage text, a webpage tag sequence and a webpage screenshot are obtained according to a malicious website to be processed; extracting characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot; when the content of the webpage text is less than a preset text content threshold and the content of the webpage tag sequence is less than a preset tag sequence threshold, identifying the webpage text through a target text classification model, and identifying the webpage tag sequence through a target tag sequence classification model; determining the category of the malicious website to be processed according to the identification result and the category of the webpage screenshot; determining a target website processing strategy according to the category of the malicious website to be processed, and processing the malicious website to be processed according to the target website processing strategy; by the method, the malicious websites to be processed are processed according to the target website processing strategy determined by the category, so that the efficiency and the accuracy of processing the malicious websites can be effectively improved.
In an embodiment, as shown in fig. 3, a second embodiment of the method for processing a malicious website according to the present invention is provided based on the first embodiment, where the step S30 includes:
Step S301, detecting the web page text, and obtaining corresponding text content according to the web page text detection result.
It should be understood that text content refers to the content of the web page text, including but not limited to web page text and web page pictures. Specifically, after the web page text is obtained, it is detected to obtain the corresponding text content.
Step S302, detecting the webpage tag sequence, and obtaining corresponding tag sequence content according to a tag sequence detection result.
It can be understood that the tag sequence content refers to the content of the web page tag sequence. Specifically, after the web page tag sequence is obtained, it is detected to obtain the corresponding tag sequence content.
Step S303, when the content of the web page text is less than a preset text content threshold, determining whether the content of the web page tag sequence is less than a preset tag sequence threshold.
It should be understood that after the content of the web page text is obtained, it is determined whether the content of the web page text is less than the preset text content threshold; if so, it is further determined whether the content of the web page tag sequence is less than the preset tag sequence threshold.
Step S304, when the content of the webpage tag sequence is less than the preset tag sequence threshold, identifying the webpage text through the target text classification model, and identifying the webpage tag sequence through the target tag sequence classification model.
It can be understood that when the content of the web page tag sequence is determined to be less than the preset tag sequence threshold, both the web page tag sequence and the web page text contain too little content. In this case, the web page text is identified by the target text classification model, and the web page tag sequence is identified by the target tag sequence classification model.
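The two-stage check of steps S303 and S304 can be sketched as below. The sketch assumes that the amount of content is measured as the character count of the text and the number of tags in the sequence, and the threshold values 200 and 50 are placeholders; the embodiment does not specify either.

```python
def should_run_sequence_models(text_content: str, tag_sequence: list,
                               text_threshold: int = 200,
                               tag_threshold: int = 50) -> bool:
    """Return True only when both the text and the tag sequence are below threshold."""
    # Step S303: the tag sequence is examined only if the text is short.
    if len(text_content) >= text_threshold:
        return False
    # Step S304 precondition: the tag sequence must also be short.
    return len(tag_sequence) < tag_threshold
```

When the function returns True, the text classification model and the tag sequence classification model would be applied to the page.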
Further, step S304 includes: when the content of the webpage tag sequence is less than the preset tag sequence threshold, performing word meaning analysis on the webpage text to obtain the individual words; counting the occurrence frequency of each word, and screening out the words whose frequency is greater than a preset frequency threshold; constructing a corresponding vocabulary from the words obtained by screening, and constructing a word embedding matrix according to the vocabulary; obtaining a word vector list according to the webpage tag sequence and the word embedding matrix; querying the word vectors corresponding to the webpage text according to the word embedding matrix, and identifying the word vectors through the target text classification model; fusing the word vectors and the word vector list through a global pooling layer and a weight connection layer to obtain target word vector features; and identifying the target word vector features through the target tag sequence classification model.
It should be understood that when the content of the web page tag sequence is determined to be less than the preset tag sequence threshold, the web page text is divided at word granularity to obtain the individual words, and the occurrence frequency of each word is counted. For each word, it is judged whether its frequency is greater than the preset frequency threshold; the words that pass are used to construct a corresponding vocabulary, and a word embedding matrix is constructed from the vocabulary. The word vector corresponding to any word can be queried through the word embedding matrix, and the word vector characterizes the features of the word in each dimension. The word vectors are identified through the target text classification model, the word vectors and the word vector list are then fused through the global pooling layer and the weight connection layer, and the fused target word vector features are identified through the target tag sequence classification model.
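The vocabulary, embedding lookup and pooling steps described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the embedding matrix is randomly initialised rather than trained, index 0 is reserved for words outside the vocabulary, global average pooling is used as the pooling operation, and the weight connection layer is replaced by a simple concatenation. All function names are hypothetical.

```python
import random
from collections import Counter


def build_vocab(words, min_freq):
    """Keep words whose frequency exceeds min_freq; index 0 is reserved for unknown words."""
    counts = Counter(words)
    kept = sorted(w for w, c in counts.items() if c > min_freq)
    return {w: i + 1 for i, w in enumerate(kept)}


def build_embedding_matrix(vocab, dim, seed=0):
    """One row per vocabulary entry plus row 0 for unknown words (random values, assumed)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(len(vocab) + 1)]


def lookup(tokens, vocab, matrix):
    """Map each token to its word vector via the embedding matrix."""
    return [matrix[vocab.get(t, 0)] for t in tokens]


def global_mean_pool(vectors):
    """Global average pooling: collapse a list of vectors into a single vector."""
    if not vectors:
        raise ValueError("cannot pool an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def fuse(text_vectors, tag_vectors):
    """Pool each vector list and concatenate the results into one feature vector."""
    return global_mean_pool(text_vectors) + global_mean_pool(tag_vectors)
```

The fused vector plays the role of the target word vector features that would be passed to the tag sequence classification model; a trained model would learn the embedding rows and the fusion weights instead of using random values and concatenation.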
According to the embodiment, the webpage text is detected, and corresponding text content is obtained according to a webpage text detection result; detecting the webpage tag sequence, and obtaining corresponding tag sequence content according to a tag sequence detection result; when the content of the webpage text is less than a preset text content threshold value, judging whether the content of the webpage tag sequence is less than a preset tag sequence threshold value or not; when the content of the webpage tag sequence is less than a preset tag sequence threshold value, identifying the webpage text through a target text classification model, and identifying the webpage tag sequence through a target tag sequence classification model; through the method, the webpage text and the webpage tag sequence are detected respectively, whether the condition that the content of the webpage text is less than the preset text content threshold and the content of the webpage tag sequence is less than the preset tag sequence threshold is judged, if yes, the webpage text is identified through the target text classification model, and the webpage tag sequence is identified through the target tag sequence classification model, so that the accuracy of identifying the webpage text and the webpage tag sequence can be improved effectively.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a processing program of the malicious website, and the processing program of the malicious website realizes the steps of the processing method of the malicious website when being executed by a processor.
Because the storage medium adopts all the technical schemes of all the embodiments, the storage medium has at least all the beneficial effects brought by the technical schemes of the embodiments, and the description is omitted here.
In addition, referring to fig. 4, an embodiment of the present invention further provides a processing apparatus for a malicious website, where the processing apparatus for a malicious website includes:
The obtaining module 10 is configured to obtain a webpage text, a webpage tag sequence and a webpage screenshot according to a malicious website to be processed.
The extracting module 20 is configured to extract the characteristic information of the webpage screenshot, and identify the characteristic information through a target webpage image classification model to obtain the category of the webpage screenshot.
The identifying module 30 is configured to identify the webpage text through a target text classification model and identify the webpage tag sequence through a target tag sequence classification model when the content of the webpage text is less than a preset text content threshold and the content of the webpage tag sequence is less than a preset tag sequence threshold.
The determining module 40 is configured to determine the category of the malicious website to be processed according to the identification result and the category of the webpage screenshot.
The processing module 50 is configured to determine a target website processing policy according to the category of the malicious website to be processed, and process the malicious website to be processed according to the target website processing policy.
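The way the five modules hand results to one another can be sketched as a small pipeline. The class name, the field names and the callable signatures below are hypothetical and chosen only for illustration; each callable stands in for one module and would wrap the corresponding model or processing step in practice.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class MaliciousSiteProcessor:
    obtain: Callable[[str], Tuple[str, list, bytes]]   # obtaining module 10
    classify_screenshot: Callable[[bytes], str]        # extracting module 20
    identify: Callable[[str, list], Optional[str]]     # identifying module 30
    determine: Callable[[Optional[str], str], str]     # determining module 40
    process: Callable[[str, str], str]                 # processing module 50

    def run(self, url: str) -> str:
        """Run the modules in order and return the processing result."""
        text, tags, screenshot = self.obtain(url)
        screenshot_category = self.classify_screenshot(screenshot)
        recognition = self.identify(text, tags)
        category = self.determine(recognition, screenshot_category)
        return self.process(url, category)
```

Passing the modules in as callables keeps the sketch independent of any particular model or browser tooling, so each stage can be replaced or stubbed out separately.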
According to the embodiment, a webpage text, a webpage tag sequence and a webpage screenshot are obtained according to a malicious website to be processed; extracting characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot; when the content of the webpage text is less than a preset text content threshold and the content of the webpage tag sequence is less than a preset tag sequence threshold, identifying the webpage text through a target text classification model, and identifying the webpage tag sequence through a target tag sequence classification model; determining the category of the malicious website to be processed according to the identification result and the category of the webpage screenshot; determining a target website processing strategy according to the category of the malicious website to be processed, and processing the malicious website to be processed according to the target website processing strategy; by the method, the malicious websites to be processed are processed according to the target website processing strategy determined by the category, so that the efficiency and the accuracy of processing the malicious websites can be effectively improved.
It should be noted that the above-described working procedure is merely illustrative, and does not limit the scope of the present invention, and in practical application, a person skilled in the art may select part or all of them according to actual needs to achieve the purpose of the embodiment, which is not limited herein.
In addition, technical details not described in detail in this embodiment may refer to the method for processing a malicious website provided in any embodiment of the present invention, which is not described herein again.
Other embodiments of the malicious website processing apparatus or the implementation method thereof according to the present invention may refer to the above method embodiments, and are not repeated herein.
Furthermore, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, and may of course also be implemented by hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, an integrated platform workstation, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process transformation made using the contents of this description, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of protection of the present invention.

Claims (10)

1. A method for processing a malicious website, characterized by comprising the following steps:
obtaining a webpage text, a webpage tag sequence and a webpage screenshot according to a malicious website to be processed;
extracting characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot;
when the content of the webpage text is less than a preset text content threshold and the content of the webpage tag sequence is less than a preset tag sequence threshold, identifying the webpage text through a target text classification model, and identifying the webpage tag sequence through a target tag sequence classification model;
determining the category of the malicious website to be processed according to the identification result and the category of the webpage screenshot;
determining a target website processing strategy according to the category of the malicious website to be processed, and processing the malicious website to be processed according to the target website processing strategy.
2. The method for processing a malicious website as set forth in claim 1, wherein the obtaining a web page text, a web page tag sequence and a web page screenshot according to the malicious website to be processed includes:
acquiring a malicious website to be processed, and accessing the malicious website to be processed on a virtual machine through a target HTTP GET command to obtain malicious website source codes and malicious website contents;
analyzing the malicious website source code to obtain webpage text and webpage label data;
obtaining a corresponding webpage label sequence according to the webpage label data;
and capturing the malicious website content through a target operation browser to obtain a webpage screenshot.
3. The method for processing a malicious web site according to claim 1, wherein the extracting the feature information of the web page screenshot, and identifying the feature information through a target web page image classification model, to obtain the category of the web page screenshot comprises:
detecting the webpage screenshot to obtain a webpage screenshot shape;
adjusting the screenshot shape of the webpage according to a preset fixed image shape;
obtaining a corresponding screenshot pixel value according to the webpage screenshot after the shape adjustment;
when a screenshot pixel value is located in a preset pixel interval, carrying out mean value calculation on the screenshot pixel value to obtain a current screenshot pixel mean value, and carrying out variance calculation on the screenshot pixel value to obtain a current screenshot pixel variance;
and when the current screenshot pixel mean value is a preset mean value threshold value and the current screenshot pixel variance is a preset variance threshold value, extracting characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot.
4. The method for processing a malicious web site according to claim 3, wherein when the current screenshot pixel mean is a preset mean threshold and the current screenshot pixel variance is a preset variance threshold, extracting feature information of the webpage screenshot, and identifying the feature information through a target webpage image classification model to obtain a class of the webpage screenshot, includes:
when the current screenshot pixel mean value is a preset mean value threshold value and the current screenshot pixel variance is a preset variance threshold value, extracting features of the webpage screenshot through a ResNet network to obtain feature information of each scale;
fusing the characteristic information of each scale through a MaxPooling network layer to obtain multi-scale characteristic information;
identifying the multi-scale characteristic information through a full-connection layer of a target webpage image classification model to obtain probability values of various categories to which the webpage screenshot belongs;
and extracting the maximum probability value in the probability values of the various categories, and taking the category corresponding to the maximum probability value as the category of the webpage screenshot.
5. The method for processing a malicious web site according to claim 1, wherein when the content of the web text is less than a preset text content threshold and the content of the web tag sequence is less than a preset tag sequence threshold, identifying the web text by a target text classification model, and identifying the web tag sequence by a target tag sequence classification model, comprises:
detecting the webpage text, and obtaining corresponding text content according to a webpage text detection result;
detecting the webpage tag sequence, and obtaining corresponding tag sequence content according to a tag sequence detection result;
when the content of the webpage text is less than a preset text content threshold value, judging whether the content of the webpage tag sequence is less than a preset tag sequence threshold value or not;
and when the content of the webpage tag sequence is less than a preset tag sequence threshold value, identifying the webpage text through a target text classification model, and identifying the webpage tag sequence through the target tag sequence classification model.
6. The method for processing a malicious web site according to claim 5, wherein when the content of the web tag sequence is less than a preset tag sequence threshold, identifying the web page text by the target text classification model, and identifying the web page tag sequence by the target tag sequence classification model, comprises:
when the content of the webpage tag sequence is less than a preset tag sequence threshold value, performing word meaning analysis on the webpage text to obtain each vocabulary;
counting the occurrence frequency of each vocabulary, and screening the vocabulary with the frequency larger than a preset frequency threshold value from each vocabulary;
constructing a corresponding vocabulary according to the frequency obtained by screening, and constructing a word embedding matrix according to the vocabulary;
obtaining a word vector list according to the webpage tag sequence and the word embedding matrix;
inquiring a word vector corresponding to the webpage text according to the word embedding matrix, and identifying the word vector through a target text classification model;
the word vectors and the word vector list are converged through a global pooling layer and a weight connection layer to obtain target word vector characteristics;
and identifying the target word vector features through a target tag sequence classification model.
7. The method for processing a malicious website according to any one of claims 1 to 6, wherein determining a target website processing policy according to the category of the malicious website to be processed, and processing the malicious website to be processed according to the target website processing policy, includes:
selecting a target website processing strategy from a target malicious website processing strategy set according to the category of the malicious website to be processed;
intercepting the malicious website to be processed, and acquiring a uniform resource locator of the malicious website to be processed;
obtaining a corresponding uniform resource locator segment according to the domain name information of the uniform resource locator;
inserting a barrier character at a preset position of the uniform resource locator segment according to a target website processing strategy, and calculating a hash value of the uniform resource locator segment;
and storing the hash value of the uniform resource locator segment into a malicious website block chain.
8. A malicious website processing device, characterized by comprising:
the acquisition module is used for acquiring a webpage text, a webpage tag sequence and a webpage screenshot according to the malicious website to be processed;
the extraction module is used for extracting the characteristic information of the webpage screenshot, and identifying the characteristic information through a target webpage image classification model to obtain the class of the webpage screenshot;
the identification module is used for identifying the webpage text through the target text classification model and identifying the webpage tag sequence through the target tag sequence classification model when the content of the webpage text is less than a preset text content threshold and the content of the webpage tag sequence is less than a preset tag sequence threshold;
the determining module is used for determining the category of the malicious website to be processed according to the identification result and the category of the webpage screenshot;
and the processing module is used for determining a target website processing strategy according to the category of the malicious website to be processed and processing the malicious website to be processed according to the target website processing strategy.
9. A malicious web site processing apparatus, wherein the malicious web site processing apparatus includes: a memory, a processor, and a malicious web site processing program stored on the memory and executable on the processor, the malicious web site processing program being configured to implement the method for processing a malicious web site according to any one of claims 1 to 7.
10. A storage medium, wherein a processing program of a malicious web site is stored on the storage medium, and when the processing program of the malicious web site is executed by a processor, the processing method of the malicious web site according to any one of claims 1 to 7 is implemented.
CN202211590350.9A 2022-12-12 2022-12-12 Malicious website processing method, device, equipment and storage medium Pending CN116015772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211590350.9A CN116015772A (en) 2022-12-12 2022-12-12 Malicious website processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211590350.9A CN116015772A (en) 2022-12-12 2022-12-12 Malicious website processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116015772A (en) 2023-04-25

Family

ID=86018315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211590350.9A Pending CN116015772A (en) 2022-12-12 2022-12-12 Malicious website processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116015772A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861412A (en) * 2023-06-26 2023-10-10 深圳市赛凌伟业科技有限公司 Information security analysis method and system based on big data

Similar Documents

Publication Publication Date Title
Khan et al. Defending malicious script attacks using machine learning classifiers
CN110413908B (en) Method and device for classifying uniform resource locators based on website content
CN104077396B (en) Method and device for detecting phishing website
US8438386B2 (en) System and method for developing a risk profile for an internet service
US8806622B2 (en) Fraudulent page detection
CN109274632B (en) Website identification method and device
US20080162449A1 (en) Dynamic page similarity measurement
Mishra et al. SMS phishing and mitigation approaches
CN109922065B (en) Quick identification method for malicious website
CN112685739B (en) Malicious code detection method, data interaction method and related equipment
CN108134784A (en) web page classification method and device, storage medium and electronic equipment
US20230040895A1 (en) System and method for developing a risk profile for an internet service
CN111401416A (en) Abnormal website identification method and device and abnormal countermeasure identification method
US8484742B2 (en) Rendered image collection of potentially malicious web pages
CN109831459B (en) Method, device, storage medium and terminal equipment for secure access
CN104143008A (en) Method and device for detecting phishing webpage based on picture matching
CN110474889A (en) One kind being based on the recognition methods of web graph target fishing website and device
CN104239582A (en) Method and device for identifying phishing webpage based on feature vector model
CN107896225A (en) Fishing website decision method, server and storage medium
CN116015772A (en) Malicious website processing method, device, equipment and storage medium
CN113949526A (en) Access control method and device, storage medium and electronic equipment
CN108270754B (en) Detection method and device for phishing website
CN111967503A (en) Method for constructing multi-type abnormal webpage classification model and abnormal webpage detection method
CN114124448B (en) Cross-site script attack recognition method based on machine learning
CN115001763B (en) Phishing website attack detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination