CN117614644A - Malicious website identification method, electronic equipment and storage medium - Google Patents

Malicious website identification method, electronic equipment and storage medium

Info

Publication number
CN117614644A
Authority
CN
China
Prior art keywords
model
text
identification
malicious
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311370072.0A
Other languages
Chinese (zh)
Inventor
张帆
王龙
周辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Information Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202311370072.0A priority Critical patent/CN117614644A/en
Publication of CN117614644A publication Critical patent/CN117614644A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses a malicious website identification method, electronic equipment and a storage medium. The method comprises the following steps: acquiring pictures in a webpage; converting each picture into text based on a picture-text conversion model, the text being used to describe the picture; identifying the text based on a malicious website identification model to obtain an identification result; and determining, based on the identification result, whether the website corresponding to the webpage is a malicious website. The present application thereby improves the accuracy of malicious website identification.

Description

Malicious website identification method, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of network security technologies, and in particular, to a malicious website identification method, an electronic device, and a storage medium.
Background
With the rapid development of IT (Internet Technology) and the growing diversity of network attacks, network security problems are becoming more serious. At present, in addition to traditional network security problems such as computer viruses and network threats, newer problems such as click fraud, web page tampering, SQL (Structured Query Language) injection attacks, website backdoors, spam, and phishing are attracting increasing attention.
In practice, a deep learning model identifies and classifies a URL (Uniform Resource Locator) using features such as the text sequences on web pages and the URL's lexical features, number of domain names, and number of special characters, in order to determine whether the URL is a malicious website. However, mining such a large amount of data inevitably picks up noise data, which to some extent reduces the accuracy of the model's identification and classification.
Therefore, in practical application, a scheme capable of improving accuracy of model identification and classification is needed.
Disclosure of Invention
The main purpose of the application is to provide a malicious website identification method, electronic equipment and storage medium, and aims to solve the technical problem of low accuracy of model identification and classification.
In order to achieve the above purpose, the present application provides a malicious website identification method, which includes the following steps:
acquiring pictures in a webpage;
converting the picture into text based on a picture-text conversion model; the text is used for describing the picture;
identifying the text based on a malicious website identification model to obtain an identification result;
and determining whether the website corresponding to the webpage is a malicious website or not based on the identification result.
Illustratively, before the identifying the text based on the malicious website identification model to obtain the identification result, the method includes:
acquiring a BERT model, and acquiring an image-text training data set and a title training data set; the BERT model is a pre-trained model; the image-text training data set and the title training data set both comprise URL labels;
merging the image-text training data set and the title training data set based on the URL label;
and performing transfer learning on the BERT model based on the merged target training data set to obtain the malicious website identification model.
Illustratively, the performing transfer learning on the BERT model based on the merged target training data set to obtain the malicious website identification model includes:
adjusting the learning rate of the BERT model, and changing the focal loss setting of the BERT model's loss function from its default value False to True;
and performing transfer learning on the BERT model with the adjusted parameters based on the merged target training data set to obtain the malicious website identification model.
Illustratively, before the converting the picture into text based on the picture-text conversion model, the method includes:
acquiring a BLIP model and acquiring a title training data set; the BLIP model is a pre-trained model;
and performing transfer learning on the BLIP model based on the title training data set to obtain the picture-text conversion model.
Illustratively, the performing transfer learning on the BLIP model based on the title training data set to obtain the picture-text conversion model includes:
initializing parameters of an output layer of the BLIP model, and adjusting the random seed of the BLIP model;
and performing transfer learning on the BLIP model with the adjusted parameters based on the title training data set to obtain the picture-text conversion model.
Illustratively, the obtaining the picture in the webpage includes:
acquiring links of pictures in a webpage based on a crawler technology;
and acquiring the picture based on the link.
Illustratively, before the identifying the text based on the malicious website identification model to obtain the identification result, the method includes:
acquiring a title in the webpage;
the identifying the text based on the malicious website identification model to obtain the identification result includes:
identifying the text and the title based on the malicious website identification model to obtain the identification result.
Illustratively, to achieve the above object, the present application further provides a malicious website identification device, where the malicious website identification device includes:
the first acquisition module is used for acquiring pictures in the webpage;
the conversion module is used for converting the picture into text based on a picture-text conversion model; the text is used for describing the picture;
the identification module is used for identifying the text based on the malicious website identification model to obtain an identification result;
and the determining module is used for determining whether the website corresponding to the webpage is a malicious website or not based on the identification result.
Illustratively, to achieve the above object, the present application further provides an electronic device, including: a memory, a processor, and a malicious website identification program stored on the memory and executable on the processor, wherein the malicious website identification program is configured to implement the steps of the malicious website identification method described above.
Illustratively, to achieve the above object, the present application further provides a computer-readable storage medium having stored thereon a malicious website identification program which, when executed by a processor, implements the steps of the malicious website identification method described above.
To address the low accuracy of model identification and classification, when a model is used to identify whether a website is malicious, the pictures in the webpage are first converted into text by a picture-text conversion model, where the text describes the picture. The text is then identified by a malicious website identification model to obtain an identification result, and the identification result is used to determine whether the website corresponding to the webpage is a malicious website. By contrast, existing practice has to mine a large amount of data that contains noise data, so the accuracy of model identification and classification is low. Because pictures reflect the characteristics of a website most intuitively, the embodiments of the present application avoid mining noise data, and the accuracy of the malicious website identification model improves when it identifies and classifies based on the text corresponding to the pictures.
Drawings
FIG. 1 is a flowchart illustrating an embodiment of a malicious website identification method according to the present application;
FIG. 2 is a schematic block diagram of a malicious website identification apparatus according to the present application;
FIG. 3 is a schematic structural diagram of a hardware running environment according to an embodiment of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
For a better understanding of the embodiments of the present application, the following briefly describes the embodiments of the present application:
The applicant's research found that the web pages of most malicious websites contain a large number of pictures, most of which relate to the category of the website, so pictures reflect the characteristics of malicious websites most intuitively.
To address the low accuracy of model identification and classification, when a model is used to identify whether a website is malicious, the pictures in the webpage are first converted into text by a picture-text conversion model, where the text describes the picture. The text is then identified by a malicious website identification model to obtain an identification result, and the identification result is used to determine whether the website corresponding to the webpage is a malicious website. By contrast, existing practice has to mine a large amount of data that contains noise data, so the accuracy of model identification and classification is low. Because pictures reflect the characteristics of a website most intuitively, the embodiments of the present application avoid mining noise data, and the accuracy of the malicious website identification model improves when it identifies and classifies based on the text corresponding to the pictures.
For a better understanding of the embodiments of the present application, some terms involved in the embodiments of the present application are briefly described below:
Title training data set: its data sources include, but are not limited to, network security technology competitions, the Alexa website ranking, and web search collection. A crawler can be used to crawl the content of the title tag (i.e., the title) of each website; data for websites that do not respond is cleaned out, and then the URL of each website and the content of its title tag are stored in a CSV file in the form shown in Table 1:
TABLE 1
url | title | class
Website URL | Content of the website's title tag | Category
The contents of the file are randomly divided according to a given ratio, such as 9:1 or 7:3, to obtain a title training data set for training and a title test data set for testing. Taking 9:1 as an example, 90% of the contents of the file are randomly selected as the title training data set and placed in the train1.csv file, and the remaining 10% are used as the title test data set and placed in the test1.csv file.
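The random division described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `split_rows`, the fixed shuffle seed, and the sample rows are assumptions introduced here.

```python
import random

def split_rows(rows, train_fraction=0.9, seed=42):
    """Shuffle rows reproducibly and split them into (train, test).

    train_fraction=0.9 corresponds to the 9:1 ratio; 0.7 would give 7:3.
    The seed is an assumption added here so the split is repeatable.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical rows in the (url, title, class) layout of Table 1.
rows = [(f"http://site{i}.example", f"title {i}", i % 2) for i in range(10)]
train_rows, test_rows = split_rows(rows, train_fraction=0.9)
```

Each part can then be written to its own CSV file with the standard `csv` module.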
Image-text training data set: its data sources may also include, but are not limited to, network security technology competitions, the Alexa website ranking, and web search collection. A crawler can be used to crawl the links of the pictures on each web page; data for websites that do not respond is cleaned out, the pictures are then downloaded through the links and annotated manually, i.e., text describing the content of each picture is written by hand, and finally the result is stored in a CSV file in the form shown in Table 2:
TABLE 2
url | images | text
Website URL | Picture | Description of the picture content
The contents of the file are randomly divided according to a given ratio, such as 9:1 or 7:3, to obtain an image-text training data set for training and an image-text test data set for testing. Taking 9:1 as an example, 90% of the contents of the file are randomly selected as the image-text training data set and placed in the train2.csv file, and the remaining 10% are used as the image-text test data set and placed in the test2.csv file.
Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of a malicious website identification method.
The present application provides an embodiment of a malicious website identification method. It should be noted that although a logical sequence is shown in the flowchart, in some cases the steps may be performed in an order different from that shown or described herein. The malicious website identification method comprises the following steps:
Step S110, obtaining pictures in a webpage;
Step S120, converting the picture into text based on a picture-text conversion model; the text is used for describing the picture;
Step S130, identifying the text based on a malicious website identification model to obtain an identification result;
Step S140, determining whether the website corresponding to the webpage is a malicious website based on the identification result.
Each step is described below:
Step S110, obtaining pictures in the webpage.
A web page typically contains a large number of pictures, such as on-site photos accompanying a report on a news web page, or price line charts on a financial web page.
Illustratively, the obtaining the picture in the webpage includes: acquiring links of pictures in the webpage based on a crawler technology; and acquiring the picture based on the link.
After the link is acquired, the picture resource can be acquired from the server corresponding to the link.
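As a rough sketch of the link-collection part of step S110, the following uses only the Python standard library to pull picture links out of a page's HTML. The class and function names, the sample HTML, and the choice to read only `img` tags' `src` attributes are assumptions made for illustration; the application does not specify which crawler is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageLinkParser(HTMLParser):
    """Collects absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, src))

def extract_image_links(html, base_url):
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page content; a real crawler would fetch this over HTTP.
page = ('<html><body><img src="/a.png">'
        '<img src="http://cdn.example/b.jpg"><img alt="no src"></body></html>')
links = extract_image_links(page, "http://site.example/index.html")
```

Each collected link could then be downloaded, for example with `urllib.request.urlopen`, to obtain the picture resource from the corresponding server.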
Step S120, converting the picture into text based on a picture-text conversion model; the text is used to describe the picture.
The picture-text conversion model is obtained by performing transfer learning on a BLIP model. BLIP is a VLP (Vision-Language Pre-training) framework in the multimodal field and mainly comprises the following four modules: Image Encoder (ViT), Text Encoder (BERT), Image-grounded Text Encoder (variant BERT), and Image-grounded Text Decoder (variant BERT).
The modules are described in turn below:
Image Encoder (ViT): used to extract image features in a forward pass, i.e., it encodes a picture as an image embedding.
Text Encoder (BERT): a standard BERT, activated by the ITC (Image-Text Contrastive Loss) objective function, which in the embodiments of the present application aims to align the feature spaces of the Image Encoder Transformer and the Text Encoder Transformer. The BERT model is a language pre-training model proposed by the Google AI (Artificial Intelligence) research institute.
Image-grounded Text Encoder (variant BERT): based on the structure of the standard BERT, it introduces visual features by inserting a Cross Attention (CA) module between Bi Self-Att and Feed Forward. It is activated using ITM (Image-Text Matching Loss) as the objective function. ITM is a binary classification task that predicts whether an image-text pair is positive (matched) or negative (unmatched), so as to learn a multimodal representation of the image and text and to refine the fine-grained alignment between vision and language.
Image-grounded Text Decoder (variant BERT): obtained by replacing Bi Self-Att in the structure of the Image-grounded Text Encoder with Causal Self-Att. In the embodiments of the present application it is activated using the Language Modeling (LM) loss as the objective function, whose goal is to generate a textual description of a given image.
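The difference between bidirectional and causal self-attention mentioned above comes down to which positions each token may attend to. The sketch below builds the two attention masks as plain lists; it illustrates the general technique only, and the function names are assumptions, not part of BLIP's code.

```python
def bidirectional_mask(n):
    """Every token may attend to every other token (encoder style)."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Token i may attend only to tokens 0..i (decoder style),
    which is what allows text to be generated left to right."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

bi = bidirectional_mask(3)
causal = causal_mask(3)
```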
Illustratively, before the converting the picture into text based on the picture-text conversion model, the method includes: acquiring a BLIP model and acquiring a title training data set; the BLIP model is a pre-trained model; and performing transfer learning on the BLIP model based on the title training data set to obtain the picture-text conversion model.
Because the open-source BLIP model is widely used for generating text from pictures, produces descriptions in English, and is a large AI model, training it from scratch would consume a large amount of resources. Since the embodiments of the present application only perform classification and identification in the malicious website domain, transfer learning is performed on the BLIP model using the title training data set. This is a fine-tuning process, which speeds up training of the resulting model. The pre-trained BLIP model, i.e., the source model, can be obtained by downloading from https:// gitsub.
Illustratively, the performing transfer learning on the BLIP model based on the title training data set to obtain the picture-text conversion model includes: initializing parameters of an output layer of the BLIP model, and adjusting the random seed of the BLIP model; and performing transfer learning on the BLIP model with the adjusted parameters based on the title training data set to obtain the picture-text conversion model.
It should be noted that transfer learning requires creating a new neural network model, i.e., a target model, because the output layer of the source model is closely tied to the labels of the source data set, whereas the output layer of the target model needs to be closely tied to the labels of the title training data set. The target model therefore does not copy the output layer of the source model or its parameters, but copies all other model designs and their parameters from the source model.
It should be noted that the model parameters are assumed to contain knowledge learned on the source data set, and that this knowledge is assumed to apply to the title training data set as well. The embodiments of the present application train the copied target model on the title training data set, so that the parameters of the output layer are learned from scratch on the title training data set, while the parameters of the remaining layers are fine-tuned starting from the parameters of the source model.
When training the target model on the title training data set, the parameter seed, i.e., the random seed, is adjusted, for example from the default value 42 to 52 or 54. The random seed is the initial condition used to generate the random numbers for the weights.
It should be noted that, because the output layer of the target model is rebuilt, the model parameters in the output layer need to be randomly initialized. This randomness can be used to evaluate the robustness of the target model, and the parameter value is increased with the aim of making the model more robust. In addition, the parameter world_size, i.e., the number of distributed processes, is adjusted, for example from the default value 1 to 3 or 4, which can speed up training of the target model. Through the fine-tuning process described above, a picture-text conversion model suitable for generating text from pictures is obtained.
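The parameter handling described above (copy everything except the output layer, then randomly initialize the output layer from a seed) can be sketched with plain dictionaries. This is a toy illustration under stated assumptions: the layer names, the Gaussian initialization with standard deviation 0.02, and the function `build_target_params` are hypothetical and do not reflect BLIP's actual parameter layout.

```python
import random

def build_target_params(source_params, output_layer, output_size, seed=42):
    """Copy all source layers except the output layer, then randomly
    initialize a fresh output layer from the given seed."""
    rng = random.Random(seed)
    target = {
        name: list(values)
        for name, values in source_params.items()
        if name != output_layer
    }
    # The output layer is rebuilt, so its weights start from random values.
    target[output_layer] = [rng.gauss(0.0, 0.02) for _ in range(output_size)]
    return target

# Hypothetical source model parameters.
source = {
    "encoder.layer0": [0.1, -0.2],
    "encoder.layer1": [0.3, 0.4],
    "output": [1.0, 2.0, 3.0],
}
target_a = build_target_params(source, "output", output_size=2, seed=42)
target_b = build_target_params(source, "output", output_size=2, seed=52)
```

Changing the seed changes only the initial output-layer weights; the copied layers are identical across seeds, which is why comparing runs with different seeds gives some indication of robustness.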
Illustratively, before the identifying the text based on the malicious website identification model to obtain the identification result, the method includes: acquiring a BERT model, and acquiring an image-text training data set and a title training data set; the BERT model is a pre-trained model; the image-text training data set and the title training data set both comprise URL labels; merging the image-text training data set and the title training data set based on the URL label; and performing transfer learning on the BERT model based on the merged target training data set to obtain the malicious website identification model.
The BERT model is a typical bidirectional encoding model whose overall architecture is built by stacking Transformer Encoder blocks.
It should be noted that, like the BLIP model, the BERT model is a large AI model, so the malicious website identification model can likewise be obtained through transfer learning and fine-tuning.
It should be noted that, in the step of merging the image-text training data set and the title training data set based on the URL label, the embodiments of the present application append the content of the text column in train2.csv to the content of the title column in train1.csv for rows that share the same URL label, and the result is used as the merged target training data set for training the target model.
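A minimal sketch of this merge, assuming the column names of Tables 1 and 2 and joining the pieces with a single space (the separator is an assumption; the application does not state one). The in-memory CSV strings stand in for the two training files.

```python
import csv
import io

def merge_by_url(title_rows, image_text_rows):
    """Append each picture description to the title of the row with the same URL."""
    texts = {}
    for row in image_text_rows:
        texts.setdefault(row["url"], []).append(row["text"])
    merged = []
    for row in title_rows:
        parts = [row["title"]] + texts.get(row["url"], [])
        merged.append(
            {"url": row["url"], "title": " ".join(parts), "class": row["class"]}
        )
    return merged

# Hypothetical file contents in the Table 1 and Table 2 layouts.
title_csv = (
    "url,title,class\n"
    "http://a.example,Win a prize,1\n"
    "http://b.example,Daily news,0\n"
)
image_csv = (
    "url,images,text\n"
    "http://a.example,a1.png,a slot machine with coins\n"
)
title_rows = list(csv.DictReader(io.StringIO(title_csv)))
image_rows = list(csv.DictReader(io.StringIO(image_csv)))
merged = merge_by_url(title_rows, image_rows)
```

Rows without any picture description keep their title unchanged, so websites with no usable pictures still contribute a training sample.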
Illustratively, the performing transfer learning on the BERT model based on the merged target training data set to obtain the malicious website identification model includes: adjusting the learning rate of the BERT model, and changing the focal loss setting of the BERT model's loss function from its default value False to True; and performing transfer learning on the BERT model with the adjusted parameters based on the merged target training data set to obtain the malicious website identification model.
For the BERT model, the parameters that are adjusted are max_seq_length (maximum total input sequence length), learning_rate (learning rate), and focal_loss (loss function).
For max_seq_length, the default value 128 may be adjusted to 256, 512, etc. Because the titles of some malicious websites are long, and the model input becomes longer still once the picture descriptions are added, the maximum total input sequence length needs to be increased so that the extracted feature information is more complete.
For learning_rate, the default value 5e-5 may be adjusted to 1e-5, 2e-5, etc., so that the network can descend into a better region of the loss space, thereby improving the accuracy of the model.
For focal_loss, the default value False may be changed to True, because a large portion of the content of the web pages of many malicious websites is pornography-related, which leads to class imbalance in the target training data set. Enabling the focal loss therefore improves the robustness and accuracy of the model. The loss function is given by Formula (1):
$$\mathrm{FL}(p) = -\alpha (1-p)^{\gamma}\, y \log(p) - (1-\alpha)\, p^{\gamma} (1-y) \log(1-p) \qquad (1)$$
Here p is the predicted probability that the sample belongs to the positive class (y = 1); y is the label value of the sample; α and γ are hyperparameters, with α = 0.25 and γ = 2.
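The loss above can be written out directly for a single sample. The sketch below is an illustration only: the function name and the small `eps` clamp (added to avoid taking the logarithm of zero) are assumptions, while the default α and γ follow the values stated above.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one sample.

    p: predicted probability of the positive class, y: label (0 or 1).
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
    positive_term = -alpha * (1.0 - p) ** gamma * y * math.log(p)
    negative_term = -(1.0 - alpha) * p ** gamma * (1 - y) * math.log(1.0 - p)
    return positive_term + negative_term

easy = focal_loss(0.9, 1)  # confidently correct positive sample
hard = focal_loss(0.6, 1)  # less confident positive sample
```

The factor (1 - p)^γ shrinks the loss of well-classified samples, so training focuses on hard samples; with γ = 0 and α = 0.5 the expression reduces to half the usual binary cross-entropy.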
It can be appreciated that through the foregoing fine tuning process, a malicious website identification model can be obtained.
It should be noted that fine-tuning the two models, BLIP and BERT, requires very few resources, and the models obtained by fine-tuning have advantages such as a simple structure, strong robustness, and strong generalization ability. Experimental results show that the classification performance is superior to that of prior research.
Step S130, identifying the text based on the malicious website identification model to obtain an identification result.
It can be understood that the identification result is the probability that the website is a malicious website. The malicious website identification model can be used to identify the text alone, or to identify the text together with the title.
Illustratively, before the identifying the text based on the malicious website identification model to obtain the identification result, the method includes: acquiring a title in the webpage; the identifying the text based on the malicious website identification model to obtain the identification result includes: identifying the text and the title based on the malicious website identification model to obtain the identification result.
In practical applications, whether a malicious website is determined mainly by analyzing character features in the URL.
The method can be concretely divided into the following three aspects:
1. the classified malicious websites are identified using conventional methods. For example, the detection method of malicious websites based on manually designed feature extraction rules is mainly based on static features contained in the URLs, such as vocabulary features, domain name number, special character number and the like, and dynamic features, such as WHIOS information, whether the URLs are valid or not, and the like. Also, for example, using a blacklist approach to identify categorized malicious web sites, it is almost impossible to maintain an exhaustive list of malicious URLs, especially every day that new URLs are generated. An attacker may use confusion techniques to modify URLs to be "legitimate" to evade blacklists or confuse users. Thus, the blacklist approach has serious limitations, and the blacklist can be easily bypassed, and the blacklist cannot be used to predict new URLs.
2. Machine learning is used to identify categorized malicious web sites. For propagating quick and diverse malicious URLs, in order to improve the universality of malicious URL detection, a machine learning technology is used for identifying and classifying malicious websites, and because the malicious websites or online decoy webpages have certain characteristics different from benign websites, the machine learning can be effectively processed. Although machine learning methods have generalization capability, one potential drawback for malicious URL detection is its resource-intensive nature, especially when extracting features that are not trivial and computationally expensive, machine learning is typically shallow learning, generalizing complex problems is weak, and the false positive rate is high. For example, the feature space selection and evaluation for the feature extraction method specifically includes a feature selection method based on random forest, a feature selection algorithm based on Filter (feature subset with high correlation, low redundancy and internal dependence is selected), an intrusion detection method (feature dimension reduction is performed to obtain the optimal low-dimensional representation of data) using a method based on support vector machine as feature subset discrimination criterion, a deep belief network and multi-class SVM (Support Vector Machine ) in combination.
3. A deep learning method is used to identify categorized malicious web sites. Namely, selection and research of a deep learning-based method are performed, for example, character features of URLs are obtained by utilizing the capability of identifying local features by CNN (Convolutional Neural Network) and the capability of learning text sequences by LSTM (Long Short-Term Memory network), complementary features of CNN and LSTM are captured, and then the URLs are detected in parallel by using the convolutional neural network based on attention and the Long-Short Term Memory network MWCL, so that feature information of a plurality of layers of learning URLs is obtained, and the like. Furthermore, convolutional neural networks have been applied to the characters and words of URL strings, in particular to learn how to embed in a co-optimized framework. However, this approach requires a large amount of data to enable the end-to-end approach to work in its very nature, and has no feature selection and deep feature mining capabilities. In addition, URL strings and DNS (Domain Name System ) strings are mapped to deep neural network online detection schemes in the form of character-level vectors using natural language processing methods. But information on the character sequence cannot be acquired, and thus malicious URL information and good detection effect cannot be sufficiently obtained.
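To make the character-level approaches in item 1 above concrete, the following sketch computes a few static lexical features of a URL and performs an exact-match blacklist lookup. The particular feature set, the function names, and the example URLs are assumptions chosen for illustration; in particular, "number of domain names" is interpreted here as the number of dot-separated labels in the host name.

```python
from urllib.parse import urlparse

def lexical_features(url):
    """Handcrafted static features computed from the URL string alone."""
    host = urlparse(url).hostname or ""
    return {
        "url_length": len(url),
        "num_host_labels": len([part for part in host.split(".") if part]),
        "num_special_chars": sum(1 for c in url if not c.isalnum()),
        "num_digits": sum(1 for c in url if c.isdigit()),
    }

def in_blacklist(url, blacklist):
    """Exact-match lookup; any change to the URL string evades it."""
    return url in blacklist

features = lexical_features("http://login.example.com/a?x=1")
blacklist = {"http://bad.example/login"}
listed = in_blacklist("http://bad.example/login", blacklist)
evaded = in_blacklist("http://bad.example/login?x=1", blacklist)
```

The last two lines show the limitation noted above: a trivially modified URL no longer matches the blacklist entry.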
In practical applications, therefore, existing approaches target character-type information, whereas the embodiments of the present application start from the title and the picture. That is, approaching the problem from the multi-modal field, feature extraction for a URL is no longer limited to character-type information and can also use the pictures on the page the URL points to. It should be noted that the embodiments of the present application need neither to analyze static and dynamic features of the URL nor to perform feature selection and deep feature mining on the URL. Instead, a BLIP model generates text from the pictures on the web page corresponding to the URL, and this text is used as a feature describing the URL; the content of the page title is added as a further feature expression of the URL. The features obtained in this way are multi-level and multi-directional, which effectively avoids the problem that obfuscation techniques modify the URL so that the feature information it presents appears "legitimate".
Step S140, determining whether the web address corresponding to the web page is a malicious web address based on the identification result.
It can be understood that when the recognition result is greater than a preset probability threshold, the website corresponding to the webpage can be determined to be a malicious website; otherwise, it can be determined to be a non-malicious website. The preset probability threshold may be set as required and is not specifically limited in the embodiments of the present application.
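As an illustration only, the determination in step S140 can be sketched in Python as follows. The function name `is_malicious` and the default threshold of 0.5 are assumptions made for this sketch; the application leaves the threshold to be set as required.

```python
def is_malicious(recognition_result: float, threshold: float = 0.5) -> bool:
    """Decide whether a website is malicious from the model's output.

    recognition_result: probability-like score from the identification model.
    threshold: preset probability threshold (0.5 is an assumed example value).
    """
    if not 0.0 <= recognition_result <= 1.0:
        raise ValueError("recognition result is expected to lie in [0, 1]")
    # Strictly greater than the threshold means malicious; otherwise non-malicious.
    return recognition_result > threshold
```

A score exactly equal to the threshold is treated as non-malicious here, which follows the "greater than" wording of the step.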
In addition, the application also provides a malicious website identification device, which comprises:
a first obtaining module 20, configured to obtain a picture in a web page;
a conversion module 21, configured to convert the picture into text based on a graphics-text conversion model; the text is used for describing the picture;
the recognition module 22 is configured to recognize the text based on a malicious website recognition model, so as to obtain a recognition result;
the determining module 23 is configured to determine whether the website corresponding to the web page is a malicious website based on the identification result.
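The cooperation of the four modules can be sketched in Python as follows. This is a minimal sketch, not the implementation of the application: the class name, the field names, the choice to join the texts of several pictures with spaces, and the default threshold of 0.5 are all assumptions, and the three callables stand in for the crawler, the image-text conversion model, and the malicious website identification model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MaliciousWebsiteIdentifier:
    # Stand-in for the first obtaining module 20: URL -> pictures (raw bytes).
    acquire_pictures: Callable[[str], List[bytes]]
    # Stand-in for the conversion module 21: picture -> descriptive text.
    picture_to_text: Callable[[bytes], str]
    # Stand-in for the recognition module 22: text -> recognition result.
    recognize: Callable[[str], float]
    # Assumed example value for the preset probability threshold.
    threshold: float = 0.5

    def identify(self, url: str) -> bool:
        pictures = self.acquire_pictures(url)
        # Assumption: texts of several pictures are concatenated into one input.
        text = " ".join(self.picture_to_text(p) for p in pictures)
        score = self.recognize(text)
        # Role of the determining module 23: compare with the threshold.
        return score > self.threshold
```

Passing the models in as callables keeps the sketch independent of any particular model library, so each module can be replaced or tested in isolation.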
Illustratively, the malicious website identification apparatus further includes:
the second acquisition module is used for acquiring a BERT model and acquiring an image-text training data set and a title training data set; the BERT model is a pre-trained model; the image-text training data set and the title training data set both comprise URL labels;
the merging module is used for merging the image-text training data set and the title training data set based on the URL labels;
and the first training module is used for performing transfer learning on the BERT model based on the merged target training data set to obtain the malicious website identification model.
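The merging performed by the merging module can be illustrated with the following Python sketch. It assumes that each row of the image-text training data set is a `(url, picture_text)` pair, that each row of the title training data set is a `(url, title)` pair, and that only URLs present in both data sets are kept; the application does not specify these details, and the function name `merge_by_url` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def merge_by_url(
    image_text_rows: Iterable[Tuple[str, str]],
    title_rows: Iterable[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Join the two training data sets on their shared URL label."""
    # A page may contain several pictures, so collect all texts per URL.
    texts_by_url = defaultdict(list)
    for url, picture_text in image_text_rows:
        texts_by_url[url].append(picture_text)

    merged = []
    for url, title in title_rows:
        if url in texts_by_url:  # assumed inner join: keep URLs found in both sets
            merged.append(
                {"url": url, "title": title, "text": " ".join(texts_by_url[url])}
            )
    return merged
```

Each merged record then carries both the title and the picture-derived text for one URL, which is the form a target training data set for the transfer learning step would need.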
Illustratively, the first training module is specifically configured to:
adjusting the learning rate of the BERT model, and changing the default value of the loss function of the BERT model from False to True;
and performing transfer learning on the parameter-adjusted BERT model based on the merged target training data set to obtain the malicious website identification model.
Illustratively, the malicious website identification apparatus further includes:
a third acquisition module for acquiring a BLIP model and acquiring a title training data set; the BLIP model is a pre-trained model;
and the second training module is used for performing transfer learning on the BLIP model based on the title training data set to obtain the graphics-text conversion model.
Illustratively, the second training module is specifically configured to:
initializing the parameters of the output layer of the BLIP model, and adjusting the random seed of the BLIP model;
and performing transfer learning on the parameter-adjusted BLIP model based on the title training data set to obtain the graphics-text conversion model.
Illustratively, the first acquiring module 20 is specifically configured to:
acquiring links of pictures in a webpage based on a crawler technology;
and acquiring the picture based on the link.
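The first of these two steps can be sketched with the Python standard library as follows. This is only one possible way of collecting picture links; the class and function names are hypothetical, and the sketch assumes that pictures are referenced through the `src` attribute of `img` tags.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect absolute links of all <img> tags found in a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, src))


def extract_image_links(html: str, base_url: str) -> List[str]:
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Each returned link could then be downloaded, for example with `urllib.request.urlopen`, to acquire the picture itself; the download is left out of the sketch so that it runs without network access.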
Illustratively, the malicious website identification apparatus further includes:
a fourth obtaining module, configured to obtain a title in the web page;
the identification module 22 is specifically configured to:
identifying the text and the title based on a malicious website identification model to obtain an identification result.
The specific implementation manner of the malicious website identification device is basically the same as that of each embodiment of the malicious website identification method, and is not repeated here.
In addition, the application also provides electronic equipment. Fig. 3 is a schematic structural diagram of the hardware running environment according to an embodiment of the present application.
As shown in fig. 3, the electronic device may include a processor 301, a communication interface 302, a memory 303, and a communication bus 304, where the processor 301, the communication interface 302, and the memory 303 communicate with each other through the communication bus 304; the memory 303 is used to store a computer program; and the processor 301 is configured to implement the steps of the malicious website identification method when executing the program stored in the memory 303.
The communication bus 304 mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 304 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, the bus is represented in the figure by a single bold line, but this does not mean that there is only one bus or only one type of bus.
The communication interface 302 is used for communication between the above electronic device and other devices.
The memory 303 may include a Random Access Memory (RAM) or may include a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory 303 may also be at least one memory device located remotely from the aforementioned processor 301.
The processor 301 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The specific implementation manner of the electronic device is basically the same as that of each embodiment of the malicious website identification method, and is not repeated here.
In addition, the embodiment of the application also provides a computer readable storage medium, wherein a malicious website identification program is stored on the computer readable storage medium, and the malicious website identification program realizes the steps of the malicious website identification method when being executed by a processor.
The specific implementation manner of the computer readable storage medium is basically the same as the above embodiments of the malicious website identification method, and will not be described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for description and do not represent the relative merits of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disk), including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present application.
In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance. It will also be understood that, although the terms "first," "second," etc. may be used in this document to describe various elements in some embodiments of the present application, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first table may be named a second table, and similarly, a second table may be named a first table without departing from the scope of the various described embodiments. The first table and the second table are both tables, but they are not the same table.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (10)

1. A malicious website identification method, characterized by comprising the following steps:
acquiring pictures in a webpage;
converting the picture into text based on a picture-text conversion model; the text is used for describing the picture;
identifying the text based on a malicious website identification model to obtain an identification result;
and determining whether the website corresponding to the webpage is a malicious website or not based on the identification result.
2. The malicious website identification method as set forth in claim 1, wherein before the identifying the text based on the malicious website identification model to obtain the identification result, the method further comprises:
acquiring a BERT model, and acquiring an image-text training data set and a title training data set; the BERT model is a pre-trained model; the image-text training data set and the title training data set both comprise URL labels;
merging the image-text training data set and the title training data set based on the URL labels;
and performing transfer learning on the BERT model based on the merged target training data set to obtain the malicious website identification model.
3. The malicious website identification method as set forth in claim 2, wherein the performing transfer learning on the BERT model based on the merged target training data set to obtain the malicious website identification model comprises:
adjusting the learning rate of the BERT model, and changing the default value of the loss function of the BERT model from False to True;
and performing transfer learning on the parameter-adjusted BERT model based on the merged target training data set to obtain the malicious website identification model.
4. The malicious website identification method as set forth in claim 1, wherein before the converting the picture into text based on the picture-text conversion model, the method further comprises:
acquiring a BLIP model and acquiring a title training data set; the BLIP model is a pre-trained model;
and performing transfer learning on the BLIP model based on the title training data set to obtain the picture-text conversion model.
5. The malicious website identification method according to claim 4, wherein the performing transfer learning on the BLIP model based on the title training data set to obtain the picture-text conversion model comprises:
initializing the parameters of the output layer of the BLIP model, and adjusting the random seed of the BLIP model;
and performing transfer learning on the parameter-adjusted BLIP model based on the title training data set to obtain the picture-text conversion model.
6. The malicious website identification method as set forth in claim 1, wherein the obtaining a picture in a web page comprises:
acquiring links of pictures in a webpage based on a crawler technology;
and acquiring the picture based on the link.
7. The malicious website identification method as set forth in claim 1, wherein before the identifying the text based on the malicious website identification model to obtain the identification result, the method further comprises:
acquiring a title in the webpage;
the identifying the text based on the malicious website identification model to obtain the identification result comprises:
and identifying the text and the title based on a malicious website identification model to obtain an identification result.
8. A malicious web site identification device, characterized in that the malicious web site identification device comprises:
the first acquisition module is used for acquiring pictures in the webpage;
the conversion module is used for converting the picture into text based on a picture-text conversion model; the text is used for describing the picture;
the identification module is used for identifying the text based on the malicious website identification model to obtain an identification result;
and the determining module is used for determining whether the website corresponding to the webpage is a malicious website or not based on the identification result.
9. An electronic device, the electronic device comprising: a memory, a processor and a malicious web site identification program stored on the memory and executable on the processor, the malicious web site identification program being configured to implement the steps of the malicious web site identification method of any one of claims 1 to 7.
10. A computer readable storage medium, wherein a malicious web address identification program is stored on the computer readable storage medium, which when executed by a processor, implements the steps of the malicious web address identification method according to any one of claims 1 to 7.
CN202311370072.0A 2023-10-20 2023-10-20 Malicious website identification method, electronic equipment and storage medium Pending CN117614644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311370072.0A CN117614644A (en) 2023-10-20 2023-10-20 Malicious website identification method, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311370072.0A CN117614644A (en) 2023-10-20 2023-10-20 Malicious website identification method, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117614644A true CN117614644A (en) 2024-02-27

Family

ID=89955038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311370072.0A Pending CN117614644A (en) 2023-10-20 2023-10-20 Malicious website identification method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117614644A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117853492A (en) * 2024-03-08 2024-04-09 厦门微亚智能科技股份有限公司 Intelligent industrial defect detection method and system based on fusion model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination