CN111198995A - Malicious webpage identification method - Google Patents
- Publication number
- CN111198995A (application CN202010012212.7A)
- Authority
- CN
- China
- Prior art keywords
- layer
- malicious
- embedding
- url link
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a malicious webpage identification method comprising the following steps: step 1, acquiring a malicious webpage data set and obtaining a training set and a test set through data preprocessing; step 2, obtaining character-level embeddings for the training set and the test set using a Char-CNN model; step 3, constructing a BiLSTM-Attention neural network model; step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings; step 5, verifying the BiLSTM-Attention neural network model trained in step 4 using the test set, its character-level embeddings, and static word embeddings; and step 6, once the verification in step 5 has passed, using the trained BiLSTM-Attention neural network model to identify malicious webpages in the webpage data accessed by a user. The method uses an attention-based bidirectional long short-term memory (BiLSTM) recurrent neural network together with a combination of character-level embeddings and static word embeddings to identify malicious webpages.
Description
Technical Field
The invention relates to the technical field of internet security, in particular to a malicious webpage identification method.
Background
With the development of the internet industry in recent years, networks have become an indispensable part of people's lives. At the same time, however, malicious criminal activity on the internet is also increasing. Using malicious webpages to carry out phishing attacks, spread spam advertisements, and lure users into downloading malware are the main forms of internet crime. According to the "Global Chinese Phishing Website Status Statistical Analysis Report (2016)" and recent reports of the Anti-Phishing Alliance of China, China is the country affected by the largest proportion of malicious webpages, and the number of malicious webpages is growing rapidly year by year.
The traditional way to identify malicious webpages is blacklist-based identification, which is also the method most widely used in industry today. The blacklist technique maintains a list of malicious domain names: if an accessed domain name is not in the list, the browser treats it as a normal domain name; if it is in the list, the browser treats it as a malicious domain name. The advantages are that the technique is simple to implement and that confirmed malicious webpages are identified accurately. The disadvantages are that it cannot identify malicious domain names that have not been seen before and that technicians must maintain the list continuously.
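The blacklist lookup described above can be sketched in a few lines of Python. This is only an illustration: the function name `is_blacklisted` and the example domains are hypothetical and do not come from the patent.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist contents, for illustration only.
MALICIOUS_DOMAINS = {"bad.example", "phish.example"}


def is_blacklisted(url, blacklist=MALICIOUS_DOMAINS):
    """Return True when the URL's host name appears in the blacklist."""
    # urlsplit only recognises the host when a scheme or '//' prefix is present.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlsplit(url).hostname or ""
    return host in blacklist
```

A domain that has never been reported is simply absent from the set, so the lookup returns False for it, which is the weakness the paragraph points out.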
With the development of machine learning in recent years, more and more people have applied machine learning to malicious webpage detection. In this approach, features such as the URL length, whether the URL is an https link, and the domain name length are extracted manually from the URL link; the webpage content is examined using honeypot technology to detect whether malicious scripts exist, whether pictures on the website are illegal pictures, and so on; and the webpages are then classified with machine learning algorithms such as SVM and random forest. However, this method depends heavily on network security experts: someone who is very familiar with malicious webpages must extract features from the malicious webpage data set by hand, and the quality of the final classification result is strongly affected by these manually extracted features.
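As a rough illustration of the manual feature extraction described above, the sketch below computes the three URL features the text names. The function name `handcrafted_features` and the dictionary keys are assumptions made for this example; a real system would add many more features and pass them to a classifier such as an SVM or a random forest.

```python
from urllib.parse import urlsplit


def handcrafted_features(url):
    """Return the three URL features named in the text as a dict."""
    # Add a '//' prefix so urlsplit can find the host in scheme-less URLs.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "is_https": parts.scheme == "https",
        "domain_length": len(host),
    }
```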
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the problems above, a malicious webpage identification method is provided that classifies URL links directly using character-level embeddings and a bidirectional long short-term memory recurrent neural network (BiLSTM), thereby identifying malicious webpages.
The technical scheme adopted by the invention is as follows:
A malicious webpage identification method comprises the following steps:
step 1, acquiring a malicious webpage data set, and obtaining a training set and a test set of malicious webpages through data preprocessing;
step 2, obtaining character-level embeddings for the training set and the test set using a Char-CNN model;
step 3, constructing a BiLSTM-Attention neural network model;
step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings;
step 5, verifying the BiLSTM-Attention neural network model trained in step 4 using the test set, its character-level embeddings, and static word embeddings;
and step 6, once the verification in step 5 has passed, using the trained BiLSTM-Attention neural network model to identify malicious webpages in the webpage data accessed by the user.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
The invention uses an attention-based bidirectional long short-term memory recurrent neural network together with a combination of character-level embeddings and static word embeddings to identify malicious webpages. Compared with traditional malicious webpage identification methods, the method of the invention has the following advantages:
1. no personnel are required to maintain a domain name blacklist;
2. no special network security personnel are required to design features;
3. the identification rate of newly appeared malicious web pages is high;
4. the method is suitable for identifying the malicious web pages appearing at the mobile terminal.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flowchart of a malicious web page identification method according to the present invention.
FIG. 2 is a schematic structural diagram of the BiLSTM-Attention neural network model constructed by the present invention.
FIG. 3 is a schematic diagram of using the trained BiLSTM-Attention neural network model to identify malicious webpages in the webpage data accessed by a user.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a method for identifying a malicious web page includes the following steps:
step 1, acquiring a malicious webpage data set, and obtaining a training set and a test set of malicious webpages through data preprocessing;
specifically, the method comprises the following steps:
step 1.1, removing samples from the malicious webpage data set whose URL link or label is missing, and then performing word segmentation; English text is normally segmented on spaces, but a URL link is a special kind of English text that contains no spaces, so this embodiment uses the Python wordninja module to segment the URL links in the malicious webpage data set, and all symbols in the URL links are retained;
step 1.2, URL links contain many abbreviated words, so preprocessing operations such as stemming and lemmatization are also needed; this embodiment uses the PorterStemmer module and the WordNetLemmatizer module in the Python NLTK package to stem and lemmatize the URL links in the malicious webpage data set;
step 1.3, to avoid mixed upper-case and lower-case letters, this embodiment converts all letters of the URL links in the malicious webpage data set to lower case or upper case (preferably lower case, using the Python lower() method), completing the normalization operation;
and step 1.4, dividing the malicious webpage data set processed in steps 1.1 to 1.3 into a training set and a test set in a ratio of 7:3 or 8:2 (preferably 8:2).
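Steps 1.1 to 1.4 can be sketched with the Python standard library alone. Note the substitutions: a simple regular-expression tokenizer stands in for the segmentation module, so it splits only at symbol boundaries and not inside run-together words, and the stemming and lemmatization of step 1.2 are omitted. The names `preprocess`, `train_test_split`, and `TOKEN_RE` are hypothetical.

```python
import random
import re

# Runs of letters/digits become one token; every other non-space symbol is kept
# as its own token, so all symbols in the URL are retained.
TOKEN_RE = re.compile(r"[a-z0-9]+|[^a-z0-9\s]")


def preprocess(samples):
    """Drop incomplete samples, lowercase each URL, and split it into tokens.

    `samples` is a list of (url, label) pairs; label 0 = normal, 1 = malicious.
    """
    cleaned = []
    for url, label in samples:
        if not url or label is None:            # step 1.1: missing URL or label
            continue
        tokens = TOKEN_RE.findall(url.lower())  # steps 1.1 and 1.3, simplified
        cleaned.append((tokens, label))
    return cleaned


def train_test_split(data, train_ratio=0.8, seed=0):
    """Step 1.4: shuffle and split, 8:2 by default."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed keeps the split reproducible between runs; passing `train_ratio=0.7` gives the 7:3 alternative.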
Step 2, acquiring character-level embedding of a training set and a test set by using a Char-CNN model;
specifically, the method comprises the following steps:
step 2.1, constructing a character table:
The character table consists of the lowercase letters abcdefghijklmnopqrstuvwxyz, the digits 0123456789, and punctuation marks and symbols such as - ; | ! : ' / \ _ @ # $ % & + < > ( ) [ ] { }, 69 characters in total. These characters are encoded using one-hot encoding, and an all-zero vector is added to handle characters that are not in the character table, forming a character table of 70 entries; each character is represented as a one-hot vector, e.g., a = [1, 0, 0, …, 0], having a dimension of 50.
Step 2.2, representing the training set or the test set using the one-hot vectors of the character table, and then feeding it into the Char-CNN model for training to obtain the corresponding character-level embeddings, for example: a = [0.2324124, 0.2124244, 0.5252411, …].
The Char-CNN model is a neural network model with 6 convolutional layers.
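The one-hot encoding of step 2.1 can be sketched as follows. The alphabet used here is a shortened, illustrative one (letters, digits, and a handful of symbols), not the full 69-character table, and the names `ALPHABET`, `one_hot`, and `encode` are assumptions for this example. The Char-CNN itself is not shown.

```python
# Assumed, shortened alphabet: 26 letters, 10 digits and a few symbols.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-;!:/_$%&[]{}"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(ch):
    """Return the one-hot vector for `ch`, or all zeros if it is not in the table."""
    vec = [0] * len(ALPHABET)
    idx = CHAR_INDEX.get(ch)
    if idx is not None:
        vec[idx] = 1
    return vec


def encode(text):
    """Encode a string as a list of one-hot vectors, one per character."""
    return [one_hot(ch) for ch in text]
```

Characters outside the table map to the all-zero vector, which mirrors the extra entry the text adds for unknown characters.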
Step 3, constructing a BiLSTM-Attention neural network model; the BiLSTM model is a bidirectional LSTM (bidirectional long short-term memory recurrent neural network). The BiLSTM model processes the input-layer data in both the forward and the backward direction, so it captures context information in a sentence better than a unidirectional LSTM; an attention layer is then added on top of the BiLSTM model to form the BiLSTM-Attention neural network model.
As shown in fig. 2, specifically:
step 3.1, constructing an input layer, wherein the input layer is used for inputting the malicious webpage data set preprocessed in step 1, for example [www.dark moon.com];
step 3.2, constructing an embedding layer, wherein the embedding layer utilizes character-level embedding of the malicious webpage data set and static word embedding to replace words in the malicious webpage data set to obtain an embedded representation of each url link in the malicious webpage data set;
step 3.3, constructing an LSTM layer, wherein the LSTM layer comprises two layers, one a forward propagation layer and the other a backward propagation layer; each LSTM layer includes a forget gate, an input gate, an output gate, and a cell state, wherein:
(1) updating the forget gate output:
$f_t = \sigma(w_f h_{t-1} + U_f x_t + b_f)$;
where $h_{t-1}$ denotes the history information, $x_t$ denotes the new information flowing into the cell, and $b_f$ is a bias term;
(2) updating the two parts of the input gate output:
$i_t = \sigma(w_i h_{t-1} + U_i x_t + b_i)$;
$a_t = \tanh(w_a h_{t-1} + U_a x_t + b_a)$;
(3) updating the cell state:
$C_t = C_{t-1} \odot f_t + i_t \odot a_t$;
(4) updating the two parts of the output gate output:
$o_t = \sigma(w_o h_{t-1} + U_o x_t + b_o)$;
$h_t = o_t \odot \tanh(C_t)$;
(5) the prediction output at the current sequence index:
$y_t = \sigma(V h_t + c)$;
wherein $w_f, U_f, b_f, w_i, U_i, w_a, U_a, w_o, U_o$ are parameters of the BiLSTM-Attention neural network model to be trained, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication;
step 3.4, constructing an attention layer, wherein the attention layer computes a weight for every time step and outputs the weighted result over all time steps as a feature vector;
step 3.5, constructing an output layer, wherein the output layer is a fully connected layer that takes the output of the attention layer as its input and processes it with a softmax classifier to obtain the classification result.
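The gate updates (1) to (4) of step 3.3 can be traced in a small pure-Python sketch. It is meant only to make the equations concrete: the parameter layout (a dict mapping each gate name to a weight, input-weight, and bias triple) and the names `lstm_step` and `bilstm` are assumptions, and a practical model would use a deep learning framework instead. The `bilstm` helper adds the forward and backward outputs at each time step, which is how the description later combines the two directions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, h_prev, u, x, b):
    """Compute w @ h_prev + u @ x + b for list-of-lists matrices."""
    return [
        sum(wr[j] * h_prev[j] for j in range(len(h_prev)))
        + sum(ur[k] * x[k] for k in range(len(x)))
        + br
        for wr, ur, br in zip(w, u, b)
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step following equations (1) to (4).

    `p` maps each gate name ("f", "i", "a", "o") to a (w, U, b) triple.
    """
    f = [sigmoid(z) for z in affine(p["f"][0], h_prev, p["f"][1], x, p["f"][2])]
    i = [sigmoid(z) for z in affine(p["i"][0], h_prev, p["i"][1], x, p["i"][2])]
    a = [math.tanh(z) for z in affine(p["a"][0], h_prev, p["a"][1], x, p["a"][2])]
    o = [sigmoid(z) for z in affine(p["o"][0], h_prev, p["o"][1], x, p["o"][2])]
    c = [cp * ft + it * at for cp, ft, it, at in zip(c_prev, f, i, a)]  # cell state
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]                    # hidden state
    return h, c


def bilstm(xs, p_fwd, p_bwd, hidden):
    """Run forward and backward passes and add their outputs per time step."""
    def run(seq, p):
        h, c, out = [0.0] * hidden, [0.0] * hidden, []
        for x in seq:
            h, c = lstm_step(x, h, c, p)
            out.append(h)
        return out

    fwd = run(xs, p_fwd)
    bwd = run(xs[::-1], p_bwd)[::-1]  # re-align backward outputs with time
    return [[a + b for a, b in zip(hf, hb)] for hf, hb in zip(fwd, bwd)]
```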
Step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings; the static word embeddings may be the GloVe static word vectors trained by Stanford University, with a dimension of 50.
Specifically, the method comprises the following steps:
step 4.1, constructing a text dictionary, represented by vectors, from all words in the URL link texts of the training set;
step 4.2, comparing the entries of the text dictionary with the static word embeddings one by one: if the static word embeddings contain a word of the text dictionary, the word is replaced by its static word vector; if not, the word is replaced by its character-level embeddings; this yields the vector representation of each URL link in the training set. That is, each $W_i$ in a training-set URL link $S = (W_1, W_2, \ldots, W_n)$ is mapped to $w_i$, where $S$ denotes a URL link, $W_i$ denotes a word (or character) in the URL link, and $w_i$ is a vector, i.e., an embedding, with a dimension of 50. The entire training set is replaced in this way. The vector representation obtained for each URL link is a two-dimensional matrix in which each column is a word vector or a character vector;
step 4.3, feeding the vector representation of each URL link in the training set into the forward propagation layer and the backward propagation layer of the LSTM layer; the forward propagation layer and the backward propagation layer together extract the language information carried by the input URL link vectors; the outputs of the forward propagation layer and the backward propagation layer at the same time step are added to obtain the semantic feature vector of each URL link, which is then passed to the attention layer;
step 4.4, the attention layer receives the semantic feature vector of each URL link, computes a weight for every time step, and outputs the weighted result over all time steps as a feature vector, using the following formulas:
$U_t = V \tanh(w_1 h + b_w)$;
$a_t = \mathrm{softmax}(U_t)$;
$c_t = \sum a_t h$;
where $h$ is the semantic feature vector in each URL link, $w_1$ is a parameter vector, and $b_w$ is a bias term; $U_t$ is the hidden-layer representation of the neural network; $a_t$ is the weight matrix obtained by normalizing $U_t$ with the softmax function; the weight matrix $a_t$ is then used in a weighted sum with the semantic feature vector $h$ to obtain a text vector $c_t$ that contains the important information in the URL link; finally, the text vector $c_t$ is passed to the output layer;
step 4.5, the output layer processes the text vector $c_t$ using a softmax function, as follows:
$y = \mathrm{softmax}(w_j c_t + b_j)$
where $y$ is the output of the model, with 0 representing a normal URL link and 1 representing a malicious URL link; $w_j$ is the weight coefficient matrix from the attention layer to the output layer, to be trained; and $b_j$ is the corresponding bias term, to be trained.
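The attention pooling of step 4.4 and the softmax output of step 4.5 can be sketched in pure Python. Two simplifications are assumed here and are not stated in the text: the score is computed per time step with `w1` as a vector and `v` as a scalar, and the output layer has one row of weights per class. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hs, w1, bw, v):
    """Return (weights, context) for a sequence of hidden vectors `hs`.

    Scores follow U_t = v * tanh(w1 . h_t + bw); treating w1 as a vector and
    v as a scalar is an assumption made to keep the sketch small.
    """
    scores = [v * math.tanh(sum(w * x for w, x in zip(w1, h)) + bw) for h in hs]
    weights = softmax(scores)            # one weight per time step
    dim = len(hs[0])
    context = [sum(a * h[d] for a, h in zip(weights, hs)) for d in range(dim)]
    return weights, context


def classify(context, wj, bj):
    """Output layer: softmax(wj @ context + bj) over the classes {0, 1}."""
    logits = [sum(w * x for w, x in zip(row, context)) + b for row, b in zip(wj, bj)]
    return softmax(logits)
```

The predicted class is the index of the larger probability returned by `classify`, with 0 read as normal and 1 as malicious.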
Because malicious webpage identification is a binary classification problem, the loss function used by the output layer is the binary cross-entropy loss, which serves as the indicator of whether the model has converged: once the loss stabilizes, the model has converged and training is complete. The formula is as follows:
$\mathrm{loss}(y_t \mid y_p) = -\left(y_t \log(y_p) + (1 - y_t)\log(1 - y_p)\right)$
where the label $y$ of a sample $x$ in the training set takes values in $\{0, 1\}$ for the binary problem, $y_t$ is the true label of a given sample, and $y_p$ is the predicted probability that the sample's label $y_t$ is 1; the loss curve is then plotted with the Python matplotlib package, and whether the loss has stabilized is judged by whether the curve has levelled off.
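A minimal sketch of the binary cross-entropy loss follows, averaged over a batch. The `has_plateaued` helper is a hypothetical numeric stand-in for inspecting the plotted loss curve by eye; its window and tolerance are arbitrary illustrative values, not taken from the patent.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for yt, yp in zip(y_true, y_prob):
        yp = min(max(yp, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(yt * math.log(yp) + (1 - yt) * math.log(1.0 - yp))
    return total / len(y_true)


def has_plateaued(losses, window=3, tol=1e-3):
    """Return True when the last `window` losses vary by less than `tol`."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) < tol
```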
Step 5, verifying the BiLSTM-Attention neural network model trained in step 4 using the test set, its character-level embeddings, and static word embeddings;
specifically, the method comprises the following steps:
step 5.1, feeding the test set, its character-level embeddings, and the static word embeddings into the trained BiLSTM-Attention neural network model to obtain a classification result for each URL link, wherein 0 represents a normal URL link and 1 represents a malicious URL link;
and step 5.2, comparing the classification result of each URL link with its annotated label (i.e., the label 0 or 1 assigned to each URL link in the data set); if they agree, pred is incremented by 1; finally, acc = pred / (number of URL links in the test set) is computed, where acc is the accuracy of the trained BiLSTM-Attention neural network model in identifying malicious webpages; verification passes when the accuracy meets the requirement.
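The accuracy computation of step 5.2 amounts to counting matches and dividing by the test-set size. A small sketch, with the function name `accuracy` chosen for this example:

```python
def accuracy(predicted, labels):
    """Return the fraction of predictions that match the annotated labels."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must have the same length")
    pred = 0  # number of correct classifications, as in step 5.2
    for p, y in zip(predicted, labels):
        if p == y:
            pred += 1
    return pred / len(labels)
```

The threshold that counts as "meeting the requirement" is not given in the text, so it is left to the caller.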
Step 6, once the verification in step 5 has passed, using the trained BiLSTM-Attention neural network model to identify malicious webpages in the webpage data accessed by the user. As shown in fig. 3, specifically: the webpage data accessed by the user is processed as in step 1 and step 2 and then fed to the input layer of the BiLSTM-Attention neural network model; after the embedding layer has performed the replacement using the combination of character-level embeddings and static word embeddings, the data passes through the LSTM layer, the attention layer, and the output layer in sequence, and the classification result is output; if the result is a normal URL link, access is allowed, and if it is a malicious URL link, access is denied.
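The embedding replacement and the final allow/deny decision can be sketched as below. This is one possible reading of the text, under stated assumptions: a word found in the static table uses its word vector, and any other word is expanded into one vector per character. The names `embed_tokens` and `decide`, and the `model` callable that returns 0 or 1, are hypothetical placeholders for the trained network.

```python
def embed_tokens(tokens, static_vectors, char_vectors, dim):
    """Map each token to vectors: static word vector first, characters otherwise.

    For a word missing from `static_vectors`, each of its characters is looked
    up in `char_vectors`; unknown characters fall back to an all-zero vector.
    """
    zero = [0.0] * dim
    vectors = []
    for tok in tokens:
        if tok in static_vectors:
            vectors.append(static_vectors[tok])
        else:
            vectors.extend(char_vectors.get(ch, zero) for ch in tok)
    return vectors


def decide(url_tokens, model, static_vectors, char_vectors, dim):
    """Return "allow" for class 0 (normal) and "deny" for class 1 (malicious)."""
    label = model(embed_tokens(url_tokens, static_vectors, char_vectors, dim))
    return "deny" if label == 1 else "allow"
```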
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (8)
1. A malicious webpage identification method is characterized by comprising the following steps:
step 1, acquiring a malicious webpage data set, and obtaining a training set and a test set of malicious webpages through data preprocessing;
step 2, acquiring character-level embedding of a training set and a test set by using a Char-CNN model;
step 3, constructing a BiLSTM-Attention neural network model;
step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings;
step 5, verifying the BiLSTM-Attention neural network model trained in step 4 using the test set, its character-level embeddings, and static word embeddings;
and step 6, once the verification in step 5 has passed, using the trained BiLSTM-Attention neural network model to identify malicious webpages in the webpage data accessed by the user.
2. The malicious webpage identification method according to claim 1, wherein the method of step 1 is:
step 1.1, removing samples from the malicious webpage data set whose URL link or label is missing, performing word segmentation on the URL links in the malicious webpage data set using the Python wordninja module, and retaining all symbols in the URL links;
step 1.2, stemming and lemmatizing the URL links in the malicious webpage data set using the PorterStemmer module and the WordNetLemmatizer module in the Python NLTK package;
step 1.3, converting all letters of the URL links in the malicious webpage data set to lower case or upper case (lower case using the Python lower() method), completing the normalization operation;
and step 1.4, dividing the malicious webpage data set processed in steps 1.1 to 1.3 into a training set and a test set in a ratio of 7:3 or 8:2.
3. The malicious webpage identification method according to claim 1, wherein the method of step 2 is:
step 2.1, constructing a character table:
the character table consists of the lowercase letters abcdefghijklmnopqrstuvwxyz, the digits 0123456789, and punctuation marks and symbols such as - ; | ! : ' / \ _ @ # $ % & + < > ( ) [ ] { }, 69 characters in total; these characters are encoded using one-hot encoding, and an all-zero vector is added to handle characters that are not in the character table, forming a character table of 70 entries, each represented as a one-hot vector;
and step 2.2, representing the training set or the test set using the one-hot vectors of the character table, and feeding it into the Char-CNN model for training to obtain the corresponding character-level embeddings.
4. The malicious webpage identification method according to claim 1 or 3, wherein the Char-CNN model is a neural network model with 6 convolutional layers.
5. The malicious webpage identification method according to claim 1, wherein the method of step 3 is:
step 3.1, constructing an input layer, wherein the input layer is used for inputting the malicious webpage data set preprocessed by the data in the step 1;
step 3.2, constructing an embedding layer, wherein the embedding layer utilizes character-level embedding of the malicious webpage data set and static word embedding to replace words in the malicious webpage data set to obtain an embedded representation of each url link in the malicious webpage data set;
step 3.3, constructing an LSTM layer, wherein the LSTM layer comprises two layers, one a forward propagation layer and the other a backward propagation layer; each LSTM layer includes a forget gate, an input gate, an output gate, and a cell state, wherein:
(1) updating the forget gate output:
$f_t = \sigma(w_f h_{t-1} + U_f x_t + b_f)$;
where $h_{t-1}$ denotes the history information, $x_t$ denotes the new information flowing into the cell, and $b_f$ is a bias term;
(2) updating the two parts of the input gate output:
$i_t = \sigma(w_i h_{t-1} + U_i x_t + b_i)$;
$a_t = \tanh(w_a h_{t-1} + U_a x_t + b_a)$;
(3) updating the cell state:
$C_t = C_{t-1} \odot f_t + i_t \odot a_t$;
(4) updating the two parts of the output gate output:
$o_t = \sigma(w_o h_{t-1} + U_o x_t + b_o)$;
$h_t = o_t \odot \tanh(C_t)$;
(5) the prediction output at the current sequence index:
$y_t = \sigma(V h_t + c)$;
wherein $w_f, U_f, b_f, w_i, U_i, w_a, U_a, w_o, U_o$ are parameters of the BiLSTM-Attention neural network model to be trained, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication;
step 3.4, constructing an attention layer, wherein the attention layer computes a weight for every time step and outputs the weighted result over all time steps as a feature vector;
and step 3.5, constructing an output layer, wherein the output layer is a fully connected layer that takes the output of the attention layer as its input and processes it with a softmax classifier to obtain the classification result.
6. The malicious webpage identification method according to claim 5, wherein the method of step 4 is:
step 4.1, constructing a text dictionary, represented by vectors, from all words in the URL link texts of the training set;
step 4.2, comparing the entries of the text dictionary with the static word embeddings one by one: if the static word embeddings contain a word of the text dictionary, the word is replaced by its static word vector; if not, the word is replaced by its character-level embeddings, thereby obtaining the vector representation of each URL link in the training set;
step 4.3, feeding the vector representation of each URL link in the training set into the forward propagation layer and the backward propagation layer of the LSTM layer; the forward propagation layer and the backward propagation layer together extract the language information carried by the input URL link vectors; the outputs of the forward propagation layer and the backward propagation layer at the same time step are added to obtain the semantic feature vector of each URL link, which is then passed to the attention layer;
step 4.4, the attention layer receives the semantic feature vector of each URL link and performs the calculation using the following formulas:
$U_t = V \tanh(w_1 h + b_w)$;
$a_t = \mathrm{softmax}(U_t)$;
$c_t = \sum a_t h$;
where $h$ is the semantic feature vector in each URL link, $w_1$ is a parameter vector, and $b_w$ is a bias term; $U_t$ is the hidden-layer representation of the neural network; $a_t$ is the weight matrix obtained by normalizing $U_t$ with the softmax function; the weight matrix $a_t$ is then used in a weighted sum with the semantic feature vector $h$ to obtain a text vector $c_t$ that contains the important information in the URL link; finally, the text vector $c_t$ is passed to the output layer;
step 4.5, the output layer processes the text vector $c_t$ using a softmax function, as follows:
$y = \mathrm{softmax}(w_j c_t + b_j)$
where $y$ is the output of the model, with 0 representing a normal URL link and 1 representing a malicious URL link; $w_j$ is the weight coefficient matrix from the attention layer to the output layer, to be trained; and $b_j$ is the corresponding bias term, to be trained.
7. The malicious webpage identification method according to claim 6, wherein the loss function used by the output layer is the binary cross-entropy loss function, with the following formula:
$\mathrm{loss}(y_t \mid y_p) = -\left(y_t \log(y_p) + (1 - y_t)\log(1 - y_p)\right)$
where the label $y$ of a sample $x$ in the training set takes values in $\{0, 1\}$ for the binary problem, $y_t$ is the true label of a given sample, and $y_p$ is the predicted probability that the sample's label $y_t$ is 1; the loss curve is then plotted with the Python matplotlib package, and whether the loss has stabilized is judged by whether the curve has levelled off.
8. The malicious webpage identification method according to claim 6, wherein the method of step 5 is:
step 5.1, feeding the test set, its character-level embeddings, and the static word embeddings into the trained BiLSTM-Attention neural network model to obtain a classification result for each URL link, wherein 0 represents a normal URL link and 1 represents a malicious URL link;
and step 5.2, comparing the classification result of each URL link with its annotated label; if they agree, pred is incremented by 1; finally, acc = pred / (number of URL links in the test set) is computed, where acc is the accuracy of the trained BiLSTM-Attention neural network model in identifying malicious webpages; verification passes when the accuracy meets the requirement.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010012212.7A CN111198995B (en) | 2020-01-07 | 2020-01-07 | Malicious webpage identification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010012212.7A CN111198995B (en) | 2020-01-07 | 2020-01-07 | Malicious webpage identification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111198995A | 2020-05-26 |
CN111198995B | 2023-03-24 |
Family
ID=70744746
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010012212.7A Active CN111198995B (en) | 2020-01-07 | 2020-01-07 | Malicious webpage identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111198995B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108667816A (en) * | 2018-04-19 | 2018-10-16 | 重庆邮电大学 | Method and system for detecting and locating network anomalies |
CN109194635A (en) * | 2018-08-22 | 2019-01-11 | 杭州安恒信息技术股份有限公司 | Malicious URL identification method and device based on natural language processing and deep learning |
US20190057200A1 (en) * | 2017-08-16 | 2019-02-21 | Biocatch Ltd. | System, apparatus, and method of collecting and processing data in electronic devices |
US20190075123A1 (en) * | 2017-09-06 | 2019-03-07 | Rank Software Inc. | Systems and methods for cyber intrusion detection and prevention |
CN109617909A (en) * | 2019-01-07 | 2019-04-12 | 福州大学 | Malicious domain name detection method based on SMOTE and BI-LSTM network |
CN110233849A (en) * | 2019-06-20 | 2019-09-13 | 电子科技大学 | Method and system for network security situation analysis |
CN110365691A (en) * | 2019-07-22 | 2019-10-22 | 云南财经大学 | Phishing website identification method and device based on deep learning |
EP3561708A1 (en) * | 2018-04-26 | 2019-10-30 | Wipro Limited | Method and device for classifying uniform resource locators based on content in corresponding websites |
Non-Patent Citations (3)
Title |
---|
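SARA A. ALTHUBITI: "LSTM for Anomaly-Based Network Intrusion Detection", 2018 28th International Telecommunication Networks and Applications Conference (ITNAC) * |
QIU Yaoyao et al.: "Malicious JavaScript code detection method based on semantic analysis", Journal of Sichuan University (Natural Science Edition) * |
WEI Xu et al.: "Research on malicious webpage identification based on feature fusion and machine learning", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) * |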
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111475626A (en) * | 2020-06-22 | 2020-07-31 | 上海冰鉴信息科技有限公司 | Structured partitioning method and device for referee document |
CN111538929A (en) * | 2020-07-08 | 2020-08-14 | 腾讯科技(深圳)有限公司 | Network link identification method and device, storage medium and electronic equipment |
CN112541476A (en) * | 2020-12-24 | 2021-03-23 | 西安交通大学 | Malicious webpage identification method based on semantic feature extraction |
CN112541476B (en) * | 2020-12-24 | 2023-09-29 | 西安交通大学 | Malicious webpage identification method based on semantic feature extraction |
CN112632549B (en) * | 2021-01-06 | 2022-07-12 | 四川大学 | Web attack detection method based on context analysis |
CN112632549A (en) * | 2021-01-06 | 2021-04-09 | 四川大学 | Web attack detection method based on context analysis |
US11727077B2 (en) * | 2021-02-05 | 2023-08-15 | Microsoft Technology Licensing, Llc | Inferring information about a webpage based upon a uniform resource locator of the webpage |
US20220253502A1 (en) * | 2021-02-05 | 2022-08-11 | Microsoft Technology Licensing, Llc | Inferring information about a webpage based upon a uniform resource locator of the webpage |
CN113037729A (en) * | 2021-02-27 | 2021-06-25 | 中国人民解放军战略支援部队信息工程大学 | Deep learning-based phishing webpage hierarchical detection method and system |
CN113051500A (en) * | 2021-03-25 | 2021-06-29 | 武汉大学 | Phishing website identification method and system fusing multi-source data |
CN113051500B (en) * | 2021-03-25 | 2022-08-16 | 武汉大学 | Phishing website identification method and system fusing multi-source data |
CN113315789B (en) * | 2021-07-29 | 2021-10-15 | 中南大学 | Web attack detection method and system based on multi-level combined network |
CN113315789A (en) * | 2021-07-29 | 2021-08-27 | 中南大学 | Web attack detection method and system based on multi-level combined network |
CN113794689A (en) * | 2021-08-20 | 2021-12-14 | 浙江网安信创电子技术有限公司 | Malicious domain name detection method based on TCN |
CN113946677A (en) * | 2021-09-14 | 2022-01-18 | 中北大学 | Event identification and classification method based on bidirectional cyclic neural network and attention mechanism |
CN114553555A (en) * | 2022-02-24 | 2022-05-27 | 北京字节跳动网络技术有限公司 | Malicious website identification method and device, storage medium and electronic equipment |
CN114553555B (en) * | 2022-02-24 | 2023-11-07 | 抖音视界有限公司 | Malicious website identification method and device, storage medium and electronic equipment |
CN115242484A (en) * | 2022-07-19 | 2022-10-25 | 深圳大学 | DGA domain name detection model and method based on gated convolution and LSTM |
CN117235532A (en) * | 2023-11-09 | 2023-12-15 | 西南民族大学 | Training and detecting method for malicious website detection model based on M-Bert |
CN117235532B (en) * | 2023-11-09 | 2024-01-26 | 西南民族大学 | Training and detecting method for malicious website detection model based on M-Bert |
Also Published As
Publication number | Publication date |
---|---|
CN111198995B (en) | 2023-03-24 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN111198995B (en) | Malicious webpage identification method | |
CN111371806B (en) | Web attack detection method and device | |
CN110019839B (en) | Medical knowledge graph construction method and system based on neural network and remote supervision | |
CN110351301B (en) | HTTP request double-layer progressive anomaly detection method | |
CN112347367B (en) | Information service providing method, apparatus, electronic device and storage medium | |
CN111291195B (en) | Data processing method, device, terminal and readable storage medium | |
CN106296195A (en) | A kind of Risk Identification Method and device | |
CN109376240A (en) | A kind of text analyzing method and terminal | |
CN108536756A (en) | Mood sorting technique and system based on bilingual information | |
CN111078978A (en) | Web credit website entity identification method and system based on website text content | |
CN110717325A (en) | Text emotion analysis method and device, electronic equipment and storage medium | |
CN112541476A (en) | Malicious webpage identification method based on semantic feature extraction | |
CN103593431A (en) | Internet public opinion analyzing method and device | |
CN113469298B (en) | Model training method and resource recommendation method | |
CN107341143A (en) | A kind of sentence continuity determination methods and device and electronic equipment | |
CN109933648B (en) | Real user comment distinguishing method and device | |
CN112559747A (en) | Event classification processing method and device, electronic equipment and storage medium | |
CN110287314A (en) | Long text credibility evaluation method and system based on Unsupervised clustering | |
CN107357895A (en) | A kind of processing method of the text representation based on bag of words | |
CN111680506A (en) | External key mapping method and device of database table, electronic equipment and storage medium | |
CN111966832A (en) | Evaluation object extraction method and device and electronic equipment | |
CN113204956B (en) | Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device | |
CN111680120B (en) | News category detection method and system | |
CN111859979A (en) | Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium | |
CN113051607B (en) | Privacy policy information extraction method | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||