CN111198995B - Malicious webpage identification method - Google Patents


Info

Publication number
CN111198995B
Authority
CN
China
Prior art keywords
layer
malicious
embedding
url link
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010012212.7A
Other languages
Chinese (zh)
Other versions
CN111198995A (en)
Inventor
廖永建
王勇
王栋
吴宇
梁艺宽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010012212.7A priority Critical patent/CN111198995B/en
Publication of CN111198995A publication Critical patent/CN111198995A/en
Application granted granted Critical
Publication of CN111198995B publication Critical patent/CN111198995B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a malicious webpage identification method comprising the following steps: step 1, acquiring a malicious webpage data set and obtaining a training set and a test set of malicious webpages through data preprocessing; step 2, obtaining character-level embeddings of the training set and the test set using a Char-CNN model; step 3, constructing a BiLSTM-Attention neural network model; step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings; step 5, verifying the BiLSTM-Attention neural network model trained in step 4 using the test set, its character-level embeddings, and static word embeddings; and step 6, after the verification in step 5 passes, using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by a user. The method uses an attention-based bidirectional long short-term memory recurrent neural network and combines character-level embedding with static word embedding to identify malicious webpages.

Description

Malicious webpage identification method
Technical Field
The invention relates to the technical field of internet security, in particular to a malicious webpage identification method.
Background
With the development of the internet industry in recent years, networks have become an indispensable part of people's lives. At the same time, however, malicious criminal activity on the internet is also increasing. Using malicious webpages to carry out phishing attacks, spread spam advertisements, and lure users into downloading malicious software are the main forms of internet crime. According to the "Global Chinese Phishing Website Status Statistical Analysis Report (2016)" and recent reports of the anti-phishing alliance, China is the country that suffers the largest proportion of malicious webpages, and the number of malicious webpages is growing rapidly year by year.
The traditional method for identifying malicious webpages is generally based on blacklists, and it is also the method most widely used in industry today. A blacklist is a maintained list of malicious domain names: if the accessed domain name is not in the list, the browser treats it as a normal domain name, and if it is in the list, the browser treats it as a malicious domain name. The advantages are that the technique is simple to implement and accurately identifies confirmed malicious webpages. The disadvantages are that it cannot identify malicious domain names that have not appeared before and that technicians must continually maintain the list of malicious domain names.
With the development of machine learning in recent years, more and more people have applied it to malicious webpage detection. In this approach, features such as url length, whether the url is an https link, and domain name length are manually extracted from the url link; the webpage content is inspected using honeypot technology to detect whether malicious scripts exist, whether pictures on the website are illegal, and so on; and the webpages are then classified with machine learning algorithms such as SVM and random forest. However, this approach depends heavily on network security experts: someone very familiar with malicious webpages must perform manual feature extraction on the malicious webpage data set, and the quality of the final classification result is strongly influenced by the manually extracted features.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems, a malicious webpage identification method is provided that classifies url links directly using character-level embedding and a bidirectional long short-term memory recurrent neural network (BiLSTM), thereby identifying malicious webpages.
The technical scheme adopted by the invention is as follows:
A malicious webpage identification method comprises the following steps:
step 1, acquiring a malicious webpage data set, and obtaining a training set and a test set of malicious webpages through data preprocessing;
step 2, obtaining character-level embeddings of the training set and the test set using a Char-CNN model;
step 3, constructing a BiLSTM-Attention neural network model;
step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings;
step 5, verifying the BiLSTM-Attention neural network model trained in step 4 using the test set, its character-level embeddings, and static word embeddings;
and step 6, after the verification in step 5 passes, using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by the user.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
The invention uses an attention-based bidirectional long short-term memory recurrent neural network and combines character-level embedding with static word embedding to identify malicious webpages. Compared with traditional malicious webpage identification methods, the method of the invention has the following advantages:
1. no personnel are required to maintain a domain name blacklist;
2. no dedicated network security personnel are required to design features;
3. the identification rate for newly appearing malicious webpages is high;
4. the method is suitable for identifying malicious webpages appearing on mobile terminals.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flowchart of a malicious web page identification method according to the present invention.
FIG. 2 is a schematic structural diagram of the BiLSTM-Attention neural network model constructed by the present invention.
FIG. 3 is a schematic diagram of using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by a user.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a method for identifying a malicious web page includes the following steps:
step 1, acquiring a malicious webpage data set, and obtaining a training set and a test set of malicious webpages through data preprocessing;
specifically, the method comprises the following steps:
step 1.1, removing samples with a missing url link or a missing label from the malicious webpage data set, and then performing word segmentation; English text is normally segmented on spaces, but a url link is a special kind of English text without spaces, so this embodiment uses the python wordninja module to segment the url links in the malicious webpage data set, keeping all symbols in the url links;
step 1.2, url links contain many abbreviated words, so preprocessing operations such as stemming and lemmatization are required. This embodiment uses the PorterStemmer module and the WordNetLemmatizer module in the python NLTK package to stem and lemmatize the url links in the malicious webpage data set;
step 1.3, to avoid mixed upper-case and lower-case letters, this embodiment uses the python lower() method to convert all letters of the url links in the malicious webpage data set to lower case or upper case (preferably lower case), completing the normalization operation;
and step 1.4, dividing the malicious webpage data set processed in steps 1.1 to 1.3 into a training set and a test set in a ratio of 7:3 or 8:2 (preferably 8:2).
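The preprocessing in steps 1.1 to 1.4 can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the embodiment itself: a regular-expression split stands in for the wordninja segmentation, stemming and lemmatization with NLTK are omitted, and the names `tokenize_url`, `clean_and_split`, and the sample urls are hypothetical.

```python
import random
import re

def tokenize_url(url):
    """Lowercase a url, then split it into letter runs, digit runs, and single symbols.

    Assumption: this regex split is a stand-in for wordninja; it keeps every symbol.
    """
    return re.findall(r"[a-z]+|[0-9]+|[^a-z0-9\s]", url.lower())

def clean_and_split(samples, train_ratio=0.8, seed=0):
    """Drop samples with a missing url or label, tokenize, shuffle, and split."""
    cleaned = [
        (tokenize_url(url), label)
        for url, label in samples
        if url and label is not None
    ]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(cleaned)
    cut = int(len(cleaned) * train_ratio)
    return cleaned[:cut], cleaned[cut:]

# Hypothetical samples: (url, label) with 0 = normal, 1 = malicious.
samples = [
    ("http://Example.com/login", 0),
    ("", 1),                                   # missing url, dropped
    ("http://bad.example/a1", None),           # missing label, dropped
    ("https://shop.example.org/item?id=42", 0),
    ("http://paypa1-secure.example.net/verify", 1),
    ("http://news.example.com/2020/01", 0),
    ("http://free-gift.example.biz/win", 1),
]
train_set, test_set = clean_and_split(samples)  # 8:2 split of the 5 valid samples
```

With `train_ratio=0.7` the same function would produce the 7:3 split mentioned in step 1.4.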
Step 2, acquiring character-level embedding of a training set and a test set by using a Char-CNN model;
specifically, the method comprises the following steps:
step 2.1, constructing a character table:
abcdefghijklmnopqrstuvwxyz, 0123456789, and punctuation and symbol characters such as - ; | ! ? : ' " / \ _ # $ % & + = [ ] { }, 69 characters in total, are encoded using one-hot; an all-zero vector is then added to handle characters not in the character table, forming a character table of 70 entries. Each entry of the character table is represented as a one-hot vector, for example a = [1,0,0,0 …], having a dimension of 50.
Step 2.2, the training set or the test set is represented using the one-hot vectors of the character table and then input into the Char-CNN model for training to obtain the corresponding character-level embeddings, for example: a = [0.2324124, 0.2124244, 0.5252411, …].
Wherein, the Char-CNN model is a neural network model of 6 convolutional layers.
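The one-hot character encoding of step 2.1 can be illustrated with the short sketch below. The alphabet used here is an assumption made for illustration (letters, digits, and a handful of url symbols) and is not the exact 69-character table of the embodiment; `one_hot` and `encode` are hypothetical helper names. Characters outside the table map to the all-zero vector, as described above.

```python
import string

# Assumed alphabet for illustration only; not the exact table of the embodiment.
ALPHABET = string.ascii_lowercase + string.digits + "-;|!?:'\"/\\_#$%&+=[]{}"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot(ch):
    """Return the one-hot vector of ch, or an all-zero vector if ch is not in the table."""
    vec = [0] * len(ALPHABET)
    idx = CHAR_INDEX.get(ch)
    if idx is not None:
        vec[idx] = 1
    return vec

def encode(text):
    """Encode a string as a list of one-hot vectors, one per character (lowercased first)."""
    return [one_hot(ch) for ch in text.lower()]

matrix = encode("a1~")  # '~' is not in the assumed alphabet, so its row is all zeros
```

The resulting matrix (one row per character) is the kind of input a character-level CNN would consume.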
Step 3, constructing a BiLSTM-Attention neural network model; the BiLSTM model is a bidirectional LSTM (bidirectional long short-term memory recurrent neural network) that processes the input-layer data in both the forward and the backward direction, so it captures context information in a sentence better than a unidirectional LSTM; an attention layer is then added on top of the BiLSTM model to form the BiLSTM-Attention neural network model.
As shown in fig. 2, specifically:
step 3.1, constructing an input layer, wherein the input layer is used for inputting the malicious webpage data set preprocessed by the data in the step 1; for example [ www.dark moon. Com ];
step 3.2, constructing an embedding layer, wherein the embedding layer utilizes character-level embedding of the malicious webpage data set and static word embedding to replace words in the malicious webpage data set to obtain an embedded representation of each url link in the malicious webpage data set;
step 3.3, constructing an LSTM layer, wherein the LSTM layer comprises two layers, one layer is a forward propagation layer, and the other layer is a backward propagation layer; each LSTM layer includes a forgetting gate, an input gate, an output gate, and a cell state, wherein,
(1) Update the forget gate output: $f_t = \sigma(w_f h_{t-1} + U_f x_t + b_f)$, where $h_{t-1}$ denotes history information, $x_t$ denotes new information flowing into the cell, and $b_f$ is a bias term;
(2) Update the two parts of the input gate output:
$i_t = \sigma(w_i h_{t-1} + U_i x_t + b_i)$;
$a_t = \tanh(w_a h_{t-1} + U_a x_t + b_a)$;
(3) Update the cell state:
$C_t = C_{t-1} f_t + i_t a_t$;
(4) Update the two parts of the output gate output:
$o_t = \sigma(w_o h_{t-1} + U_o x_t + b_o)$;
$h_t = o_t \tanh(C_t)$;
(5) Prediction output at the current sequence index:
$y_t = \sigma(V h_t + c)$;
wherein $w_f, U_f, b_f, w_i, U_i, w_a, U_a, w_o, U_o$ are the parameters that the BiLSTM-Attention neural network model needs to train, and $\sigma$ is the sigmoid function;
step 3.4, constructing an attention layer, wherein the attention layer computes a weight for every time step and then outputs the weighted sum over all time steps as a feature vector;
step 3.5, constructing an output layer, wherein the output layer is a fully connected layer that takes the output of the attention layer as its input and processes it with a softmax classifier to obtain the classification result.
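The gate updates of step 3.3 can be traced with a single LSTM time step written in plain Python. This is a sketch under stated assumptions rather than the embodiment's implementation: matrices are stored as lists of rows, the products in the cell-state and hidden-state updates are taken element-wise (the usual LSTM convention), and `lstm_step`, `affine`, and the `params` layout are hypothetical names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(w, u, b, h_prev, x):
    """Compute w @ h_prev + u @ x + b, with w and u stored as lists of rows."""
    return [
        sum(wi * hi for wi, hi in zip(w_row, h_prev))
        + sum(ui * xi for ui, xi in zip(u_row, x))
        + b_k
        for w_row, u_row, b_k in zip(w, u, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'a', 'o' to a (w, U, b) triple."""
    f = [sigmoid(z) for z in affine(*params["f"], h_prev, x)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], h_prev, x)]    # input gate
    a = [math.tanh(z) for z in affine(*params["a"], h_prev, x)]  # candidate values
    o = [sigmoid(z) for z in affine(*params["o"], h_prev, x)]    # output gate
    # Cell state: keep part of the old state, add part of the candidate (element-wise).
    c = [cp * fk + ik * ak for cp, fk, ik, ak in zip(c_prev, f, i, a)]
    # Hidden state: gated tanh of the new cell state (element-wise).
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Toy check with hidden size 2, input size 3, and all-zero parameters:
# every sigmoid gate equals 0.5 and the candidate equals 0.
zero = ([[0.0, 0.0], [0.0, 0.0]], [[0.0] * 3, [0.0] * 3], [0.0, 0.0])
params = {name: zero for name in ("f", "i", "a", "o")}
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

With zero parameters the new cell state is simply half of the previous one, which makes the role of the forget gate easy to see.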
Step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings; the static word embeddings may be the 50-dimensional GloVe static word vectors trained by Stanford University.
Specifically, the method comprises the following steps:
step 4.1, constructing a vector-represented text dictionary from all words in the url link texts of the training set;
step 4.2, comparing the constructed text dictionary against the static word embeddings entry by entry: if the static word embeddings contain a word of the text dictionary, the word is replaced by its static word vector; if they do not, the word is replaced by its character-level embedding, thereby obtaining the vector representation of each url link in the training set. That is, each w_i in a training-set url link S = (w_1, w_2, …, w_n), whose elements may be words or characters, is mapped to a vector W_i. Here S denotes a url link, w_i denotes a word in the url link, and W_i is its vector (embedding) with a dimension of 50. The whole training set is replaced in this way, so each url link is represented as a two-dimensional matrix in which each column is a word vector or a character vector.
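The lookup with fallback in step 4.2 can be sketched as below. The text does not say how a word-sized vector is derived from character-level embeddings, so averaging the character vectors is an assumption made purely for illustration; `embed_url`, `word_vectors`, and `char_vectors` are hypothetical names, and the toy vectors have dimension 2 instead of 50. The sketch stores one vector per row, whereas the text describes one vector per column.

```python
def embed_url(tokens, word_vectors, char_vectors, dim):
    """Map each token to a vector: the static word vector if known, else a character-based one."""
    rows = []
    for tok in tokens:
        if tok in word_vectors:
            rows.append(list(word_vectors[tok]))
            continue
        # Fallback (assumption): average the character-level vectors of the token.
        known = [char_vectors[ch] for ch in tok if ch in char_vectors]
        if known:
            rows.append([sum(col) / len(known) for col in zip(*known)])
        else:
            rows.append([0.0] * dim)  # nothing known about this token
    return rows

# Toy vectors of dimension 2 (the text uses dimension 50).
word_vectors = {"login": [1.0, 0.0]}
char_vectors = {"a": [0.0, 2.0], "b": [2.0, 0.0]}
matrix = embed_url(["login", "ab", "??"], word_vectors, char_vectors, dim=2)
```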
step 4.3, inputting the vector representation of each url link in the training set into the forward propagation layer and the backward propagation layer of the LSTM layer; the forward and backward propagation layers together extract the language information carried by the input url link vectors; the outputs of the forward and backward propagation layers at the same time step are added to obtain the semantic feature vector of each url link, which is then passed to the attention layer;
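The way step 4.3 combines the two directions (adding the forward and backward outputs at the same time step) can be sketched as follows. To keep the example short, a toy recurrent update `toy_step` stands in for a real LSTM cell; that substitution, and the names `run_direction` and `bidirectional`, are assumptions for illustration.

```python
def run_direction(inputs, step, h0):
    """Apply step(x, h) along the sequence and collect the hidden state at each position."""
    h, states = h0, []
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states

def bidirectional(inputs, fwd_step, bwd_step, h0):
    """Run one pass left-to-right and one right-to-left, then add them position by position."""
    fwd = run_direction(inputs, fwd_step, h0)
    bwd = run_direction(inputs[::-1], bwd_step, h0)[::-1]  # re-align to input order
    return [[f + b for f, b in zip(hf, hb)] for hf, hb in zip(fwd, bwd)]

def toy_step(x, h):
    """Toy recurrent update standing in for an LSTM cell (assumption)."""
    return [0.5 * hv + xv for hv, xv in zip(h, x)]

features = bidirectional([[1.0], [2.0], [4.0]], toy_step, toy_step, [0.0])
```

Each output position now depends on the tokens both before and after it, which is the property the text attributes to the BiLSTM.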
step 4.4, the attention layer receives the semantic feature vector of each url link, first computes the weights of all time steps, and then outputs the weighted sum over all time steps as a feature vector, using the following formulas:
$U_t = V \tanh(w_1 h + b_w)$;
$a_t = \mathrm{softmax}(U_t)$;
$c_t = \sum a_t h$;
where $h$ is the semantic feature vector of each url link, $w_1$ is a parameter vector, and $b_w$ is a bias term; $U_t$ is the hidden-layer representation of the neural network; $a_t$ is the weight matrix obtained by normalizing $U_t$ with the softmax function; the weight matrix $a_t$ is then used in a weighted sum with the semantic feature vector $h$ to obtain the text vector $c_t$ containing the important information of the url link; finally, the text vector $c_t$ is passed to the output layer;
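The attention computation of step 4.4 can be sketched in plain Python as follows. The shapes are assumptions: `w1` is treated as a matrix applied to each time step's hidden vector, `v` as a vector that reduces the result to one score per time step, and the function name `attention_pool` is hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hs, w1, bw, v):
    """hs: one hidden vector per time step. Returns (weights a, pooled vector c)."""
    scores = []
    for h in hs:
        hidden = [
            math.tanh(sum(wk * hk for wk, hk in zip(row, h)) + b)
            for row, b in zip(w1, bw)
        ]
        scores.append(sum(vk * uk for vk, uk in zip(v, hidden)))  # score U_t
    a = softmax(scores)                                           # weights a_t
    c = [sum(at * h[k] for at, h in zip(a, hs)) for k in range(len(hs[0]))]
    return a, c

# With all-zero parameters every score is 0, so the weights are uniform
# and the pooled vector is the plain average of the hidden vectors.
hs = [[1.0, 0.0], [3.0, 2.0]]
a, c = attention_pool(hs, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [0.0, 0.0])
```

After training, the weights would no longer be uniform, and time steps carrying more informative tokens would contribute more to the pooled vector.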
step 4.5, the output layer processes the text vector $c_t$ with a softmax function, as follows:
$y = \mathrm{softmax}(w_j c_t + b_j)$
where $y$ is the output of the model, 0 denoting a normal url link and 1 denoting a malicious url link; $w_j$ is the weight coefficient matrix to be trained from the attention layer to the output layer; and $b_j$ is the corresponding bias term to be trained.
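The output layer of step 4.5 amounts to a fully connected layer followed by softmax over two classes. The sketch below is illustrative only; `classify` is a hypothetical name and the weights are hand-picked toy values rather than trained parameters.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(c, w, b):
    """Fully connected layer plus softmax; returns (label, class probabilities)."""
    logits = [sum(wk * ck for wk, ck in zip(row, c)) + bk for row, bk in zip(w, b)]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)  # 0 = normal, 1 = malicious
    return label, probs

# Toy weights (assumption): an identity matrix, so the larger input component wins.
w, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
label_a, probs_a = classify([1.0, -1.0], w, b)
label_b, probs_b = classify([-1.0, 1.0], w, b)
```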
Because malicious webpage identification is a binary classification problem, the loss function used by the output layer is the binary cross-entropy loss, and the loss function serves as the indicator of whether the model has converged. When the loss stabilizes, the model has converged and training is complete. The formula is as follows:
$\mathrm{loss}(y_t \mid y_p) = -\left(y_t \log(y_p) + (1 - y_t)\log(1 - y_p)\right)$
where y is the label corresponding to sample x in the training set, taking values in {0,1} for this binary classification problem; $y_t$ is the true label of a sample and $y_p$ is the predicted probability that the sample's label is 1. A loss curve is then plotted with the python matplotlib package, and the loss is judged to be stable when the curve levels off.
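The binary cross-entropy formula can be evaluated directly, as in the sketch below. The clipping constant `eps` is an addition for numerical safety (it avoids log(0)) and is not part of the formula in the text; the function names are hypothetical.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Loss for one sample: y_true is 0 or 1, y_prob is the predicted P(label = 1)."""
    p = min(max(y_prob, eps), 1.0 - eps)  # clip to avoid log(0); an added safeguard
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def mean_loss(labels, probs):
    """Average loss over a batch, the quantity one would plot per epoch."""
    return sum(binary_cross_entropy(y, p) for y, p in zip(labels, probs)) / len(labels)

confident_right = binary_cross_entropy(1, 0.9)
confident_wrong = binary_cross_entropy(1, 0.1)
batch_loss = mean_loss([1, 0], [0.5, 0.5])
```

A confident correct prediction yields a small loss and a confident wrong one a large loss, which is why a flattening loss curve is read as convergence.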
Step 5, verifying the BiLSTM-Attention neural network model trained in step 4 using the test set, its character-level embeddings, and static word embeddings;
specifically, the method comprises the following steps:
step 5.1, inputting the test set, together with its character-level embeddings and the static word embeddings, into the trained BiLSTM-Attention neural network model to obtain a classification result for each url link, where 0 denotes a normal url link and 1 denotes a malicious url link;
and step 5.2, comparing the classification result of each url link with its annotated label (that is, the label 0 or 1 assigned to each url link in the data set); if the two agree, the counter pred is incremented by 1; finally, acc = pred / (number of urls in the test set) is computed, where acc is the accuracy of the trained BiLSTM-Attention neural network model in identifying malicious webpages; the verification passes when the accuracy meets the requirement.
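The accuracy computation of step 5.2 is a simple count, sketched below. The function name `accuracy` and the toy predictions are hypothetical.

```python
def accuracy(predictions, labels):
    """acc = pred / number of urls, where pred counts predictions that match the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("the test set is empty")
    pred = sum(1 for p, y in zip(predictions, labels) if p == y)
    return pred / len(labels)

# Four test urls, three classified correctly.
acc = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
```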
Step 6, after the verification in step 5 passes, using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by the user. As shown in fig. 3, specifically: the webpage data accessed by the user is processed as in step 1 and step 2 and then fed to the input layer of the BiLSTM-Attention neural network model; after the embedding layer performs the replacement using character-level embedding combined with static word embedding, the data passes through the LSTM layer, the attention layer, and the output layer in sequence, and the classification result is output; if the result is a normal url link, access is allowed, and if it is a malicious url link, access is denied.
The above description is intended to be illustrative of the preferred embodiment of the present invention and should not be taken as limiting the invention, but rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (7)

1. A malicious webpage identification method is characterized by comprising the following steps:
step 1, acquiring a malicious webpage data set, and obtaining a training set and a test set of malicious webpages through data preprocessing;
step 2, acquiring character-level embedding of a training set and a test set by using a Char-CNN model;
step 3, constructing a BiLSTM-Attention neural network model;
step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings;
step 5, verifying the BiLSTM-Attention neural network model trained in step 4 using the test set, its character-level embeddings, and static word embeddings;
step 6, after the verification in step 5 passes, using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by the user;
the method of the step 3 comprises the following steps:
step 3.1, constructing an input layer, wherein the input layer is used for inputting the malicious webpage data set preprocessed by the data in the step 1;
step 3.2, constructing an embedding layer, wherein the embedding layer utilizes character-level embedding of the malicious webpage data set and static word embedding to replace words in the malicious webpage data set to obtain an embedded representation of each url link in the malicious webpage data set;
step 3.3, constructing an LSTM layer, wherein the LSTM layer comprises two layers, one layer is a forward propagation layer, and the other layer is a backward propagation layer; each LSTM layer includes a forgetting gate, an input gate, an output gate, and a cell state, wherein,
(1) Update the forget gate output: $f_t = \sigma(w_f h_{t-1} + U_f x_t + b_f)$, where $h_{t-1}$ denotes history information, $x_t$ denotes new information flowing into the cell, and $b_f$ is a bias term;
(2) Update the two parts of the input gate output:
$i_t = \sigma(w_i h_{t-1} + U_i x_t + b_i)$;
$a_t = \tanh(w_a h_{t-1} + U_a x_t + b_a)$;
(3) Update the cell state:
$C_t = C_{t-1} f_t + i_t a_t$;
(4) Update the two parts of the output gate output:
$o_t = \sigma(w_o h_{t-1} + U_o x_t + b_o)$;
$h_t = o_t \tanh(C_t)$;
(5) Prediction output at the current sequence index:
$y_t = \sigma(V h_t + c)$;
wherein $w_f, U_f, b_f, w_i, U_i, w_a, U_a, w_o, U_o$ are the parameters that the BiLSTM-Attention neural network model needs to train, and $\sigma$ is the sigmoid function;
step 3.4, constructing an attention layer, wherein the attention layer computes a weight for every time step and then outputs the weighted sum over all time steps as a feature vector;
and step 3.5, constructing an output layer, wherein the output layer is a fully connected layer that takes the output of the attention layer as its input and processes it with a softmax classifier to obtain the classification result.
2. The malicious webpage identification method according to claim 1, wherein the method of step 1 is:
step 1.1, removing samples with a missing url link or a missing label from the malicious webpage data set, performing word segmentation on the url links in the malicious webpage data set using the python wordninja module, and keeping all symbols in the url links;
step 1.2, performing stemming and lemmatization on the url links in the malicious webpage data set using the PorterStemmer module and the WordNetLemmatizer module in the python NLTK package;
step 1.3, using the python lower() method to convert all letters of the url links in the malicious webpage data set to lower case or upper case, completing the normalization operation;
step 1.4, dividing the malicious webpage data set processed in steps 1.1 to 1.3 into a training set and a test set in a ratio of 7:3 or 8:2.
3. The malicious webpage identification method according to claim 1, wherein the method of step 2 is:
step 2.1, constructing a character table:
abcdefghijklmnopqrstuvwxyz, 0123456789, and punctuation and symbol characters such as - ; | ! ? : ' " / \ _ # $ % & + = [ ] { }, 69 characters in total, are encoded using one-hot, and an all-zero vector is then used to handle characters not in the character table, forming a character table of 70 entries, each represented as a one-hot vector;
and step 2.2, representing the training set or the test set using the one-hot vectors of the character table and inputting it into the Char-CNN model for training to obtain the corresponding character-level embeddings.
4. The malicious webpage identification method according to claim 1 or 3, wherein the Char-CNN model is a neural network model of 6 convolutional layers.
5. The malicious webpage identification method according to claim 4, wherein the method of step 4 is:
step 4.1, constructing a text dictionary represented by vectors by all words in url link texts in a training set;
step 4.2, comparing the constructed text dictionary against the static word embeddings entry by entry: if the static word embeddings contain a word of the text dictionary, the word is replaced by its static word vector, and if they do not, the word is replaced by its character-level embedding, thereby obtaining the vector representation of each url link in the training set;
step 4.3, inputting the vector representation of each url link in the training set into the forward propagation layer and the backward propagation layer of the LSTM layer; the forward and backward propagation layers together extract the language information carried by the input url link vectors; the outputs of the forward and backward propagation layers at the same time step are added to obtain the semantic feature vector of each url link, which is then passed to the attention layer;
step 4.4, the attention layer receives the semantic feature vector of each url link and performs the calculation using the following formulas:
$U_t = V \tanh(w_1 h + b_w)$;
$a_t = \mathrm{softmax}(U_t)$;
$c_t = \sum a_t h$;
where $h$ is the semantic feature vector of each url link, $w_1$ is a parameter vector, and $b_w$ is a bias term; $U_t$ is the hidden-layer representation of the neural network; $a_t$ is the weight matrix obtained by normalizing $U_t$ with the softmax function; the weight matrix $a_t$ is then used in a weighted sum with the semantic feature vector $h$ to obtain the text vector $c_t$ containing the important information of the url link; finally, the text vector $c_t$ is passed to the output layer;
step 4.5, the output layer processes the text vector $c_t$ with a softmax function, as follows:
$y = \mathrm{softmax}(w_j c_t + b_j)$
where $y$ is the output of the model, 0 denoting a normal url link and 1 denoting a malicious url link; $w_j$ is the weight coefficient matrix to be trained from the attention layer to the output layer; and $b_j$ is the corresponding bias term to be trained.
6. The malicious webpage identification method according to claim 5, wherein the loss function adopted by the output layer is a binary cross-entropy loss function, and the formula is as follows:
$\mathrm{loss}(y_t \mid y_p) = -\left(y_t \log(y_p) + (1 - y_t)\log(1 - y_p)\right)$
where y is the label corresponding to sample x in the training set, taking values in {0,1} for this binary classification problem; $y_t$ is the true label of a sample and $y_p$ is the predicted probability that the sample's label is 1; a loss curve is then plotted with the python matplotlib package, and the loss is judged to be stable when the curve levels off.
7. The malicious webpage identification method according to claim 5, wherein the method of step 5 is:
step 5.1, inputting the test set, together with its character-level embeddings and the static word embeddings, into the trained BiLSTM-Attention neural network model to obtain a classification result for each url link, where 0 denotes a normal url link and 1 denotes a malicious url link;
and step 5.2, comparing the classification result of each url link with its annotated label; if the two agree, the counter pred is incremented by 1; finally, acc = pred / (number of urls in the test set) is computed, where acc is the accuracy of the trained BiLSTM-Attention neural network model in identifying malicious webpages, and the verification passes when the accuracy meets the requirement.
CN202010012212.7A 2020-01-07 2020-01-07 Malicious webpage identification method Active CN111198995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010012212.7A CN111198995B (en) 2020-01-07 2020-01-07 Malicious webpage identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010012212.7A CN111198995B (en) 2020-01-07 2020-01-07 Malicious webpage identification method

Publications (2)

Publication Number Publication Date
CN111198995A CN111198995A (en) 2020-05-26
CN111198995B true CN111198995B (en) 2023-03-24

Family

ID=70744746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010012212.7A Active CN111198995B (en) 2020-01-07 2020-01-07 Malicious webpage identification method

Country Status (1)

Country Link
CN (1) CN111198995B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475626A (en) * 2020-06-22 2020-07-31 上海冰鉴信息科技有限公司 Structured partitioning method and device for referee document
CN111538929B (en) * 2020-07-08 2020-12-18 腾讯科技(深圳)有限公司 Network link identification method and device, storage medium and electronic equipment
CN112541476B (en) * 2020-12-24 2023-09-29 西安交通大学 Malicious webpage identification method based on semantic feature extraction
CN112632549B (en) * 2021-01-06 2022-07-12 四川大学 Web attack detection method based on context analysis
US11727077B2 (en) * 2021-02-05 2023-08-15 Microsoft Technology Licensing, Llc Inferring information about a webpage based upon a uniform resource locator of the webpage
CN113037729B (en) * 2021-02-27 2022-11-18 中国人民解放军战略支援部队信息工程大学 Deep learning-based phishing webpage hierarchical detection method and system
CN113051500B (en) * 2021-03-25 2022-08-16 武汉大学 Phishing website identification method and system fusing multi-source data
CN113315789B (en) * 2021-07-29 2021-10-15 中南大学 Web attack detection method and system based on multi-level combined network
CN113794689A (en) * 2021-08-20 2021-12-14 浙江网安信创电子技术有限公司 Malicious domain name detection method based on TCN
CN114553555B (en) * 2022-02-24 2023-11-07 抖音视界有限公司 Malicious website identification method and device, storage medium and electronic equipment
CN115242484A (en) * 2022-07-19 2022-10-25 深圳大学 DGA domain name detection model and method based on gated convolution sum LSTM
CN117235532B (en) * 2023-11-09 2024-01-26 西南民族大学 Training and detecting method for malicious website detection model based on M-Bert

Citations (6)

Publication number Priority date Publication date Assignee Title
CN108667816A (en) * 2018-04-19 2018-10-16 重庆邮电大学 A kind of the detection localization method and system of Network Abnormal
CN109194635A (en) * 2018-08-22 2019-01-11 杭州安恒信息技术股份有限公司 Malice URL recognition methods and device based on natural language processing and deep learning
CN109617909A (en) * 2019-01-07 2019-04-12 福州大学 A kind of malice domain name detection method based on SMOTE and BI-LSTM network
CN110233849A (en) * 2019-06-20 2019-09-13 电子科技大学 The method and system of network safety situation analysis
CN110365691A (en) * 2019-07-22 2019-10-22 云南财经大学 Fishing website method of discrimination and device based on deep learning
EP3561708A1 (en) * 2018-04-26 2019-10-30 Wipro Limited Method and device for classifying uniform resource locators based on content in corresponding websites

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20190057200A1 (en) * 2017-08-16 2019-02-21 Biocatch Ltd. System, apparatus, and method of collecting and processing data in electronic devices
US10812504B2 (en) * 2017-09-06 2020-10-20 1262214 B.C. Unlimited Liability Company Systems and methods for cyber intrusion detection and prevention

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
CN108667816A (en) * 2018-04-19 2018-10-16 重庆邮电大学 A kind of the detection localization method and system of Network Abnormal
EP3561708A1 (en) * 2018-04-26 2019-10-30 Wipro Limited Method and device for classifying uniform resource locators based on content in corresponding websites
CN109194635A (en) * 2018-08-22 2019-01-11 杭州安恒信息技术股份有限公司 Malice URL recognition methods and device based on natural language processing and deep learning
CN109617909A (en) * 2019-01-07 2019-04-12 福州大学 A kind of malice domain name detection method based on SMOTE and BI-LSTM network
CN110233849A (en) * 2019-06-20 2019-09-13 电子科技大学 The method and system of network safety situation analysis
CN110365691A (en) * 2019-07-22 2019-10-22 云南财经大学 Fishing website method of discrimination and device based on deep learning

Non-Patent Citations (3)

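Title
"LSTM for Anomaly-Based Network Intrusion Detection"; Sara A. Althubiti; 2018 28th International Telecommunication Networks and Applications Conference (ITNAC); 20190117; full text *
"Research on Malicious Webpage Identification Based on Feature Fusion and Machine Learning"; Wei Xu et al.; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition); 20191209; full text *
"Malicious JavaScript Code Detection Method Based on Semantic Analysis"; Qiu Yaoyao et al.; Journal of Sichuan University (Natural Science Edition); 20190426; full text *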

Also Published As

Publication number Publication date
CN111198995A (en) 2020-05-26

Similar Documents

Publication Publication Date Title
CN111198995B (en) Malicious webpage identification method
CN111371806B (en) Web attack detection method and device
CN108959270B (en) Entity linking method based on deep learning
CN108376151A (en) Question classification method, device, computer equipment and storage medium
US20230385409A1 (en) Unstructured text classification
CN112347367A (en) Information service providing method, information service providing device, electronic equipment and storage medium
CN106296195A (en) A kind of Risk Identification Method and device
CN109376240A (en) A kind of text analyzing method and terminal
CN112541476B (en) Malicious webpage identification method based on semantic feature extraction
CN109271627A (en) Text analyzing method, apparatus, computer equipment and storage medium
CN111291195A (en) Data processing method, device, terminal and readable storage medium
CN111078978A (en) Web credit website entity identification method and system based on website text content
CN103593431A (en) Internet public opinion analyzing method and device
CN110162624B (en) Text processing method and device and related equipment
CN109918648B (en) Rumor depth detection method based on dynamic sliding window feature score
CN108090099B (en) Text processing method and device
Sheshikala et al. Natural language processing and machine learning classifier used for detecting the author of the sentence
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN113032525A (en) False news detection method and device, electronic equipment and storage medium
Al-Alyan et al. Robust URL Phishing Detection Based on Deep Learning.
CN112528294A (en) Vulnerability matching method and device, computer equipment and readable storage medium
CN109933648B (en) Real user comment distinguishing method and device
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN111680120B (en) News category detection method and system
Wibowo et al. Detection of Fake News and Hoaxes on Information from Web Scraping using Classifier Methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant