CN112541476B

CN112541476B - Malicious webpage identification method based on semantic feature extraction

Info

Publication number: CN112541476B
Application number: CN202011554458.3A
Authority: CN
Inventors: 李志雄; 林宜雄
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2023-09-29
Anticipated expiration: 2040-12-24
Also published as: CN112541476A

Abstract

The invention discloses a malicious webpage identification method based on semantic feature extraction, comprising the following steps: S1, acquiring the webpage source code; S2, preprocessing the webpage source code, which comprises: S2-1, extracting the text and images in the webpage; and S2-2, recognizing the text in the images extracted in S2-1; and S3, processing the text extracted in S2-1 and S2-2 through a BiLSTM-CNN neural network to identify the webpage and judge whether it is a legitimate webpage or a malicious webpage. The method can be applied in the field of webpage security, takes into account a variety of scenarios in which webpage content is deliberately deformed to evade detection, and achieves a better recognition effect and higher recognition accuracy than traditional methods in the automatic identification of malicious webpages.

Description

Malicious webpage identification method based on semantic feature extraction

Technical Field

The invention relates to the fields of natural language processing and network security, and in particular to a malicious webpage identification method based on semantic feature extraction.

Background

With the progress of internet technology, the number of network users keeps growing, and companies and institutions are racing to build portal websites. Users receive all kinds of up-to-date information through search engines. Among the many web pages, besides those with healthy content and safe sites, a significant portion are malicious. Malicious web pages take various forms: some serve harmful information to users through pornographic or other objectionable pictures; some exploit legal loopholes to run illegal lottery activities online; some publish false information to lure users into schemes such as fake-order brushing, causing them economic losses. According to analysis reports based on internet security monitoring data, the top three categories of malicious internet programs are rogue behavior, illegal content and information theft. The search engine is the entrance through which users obtain information; if it cannot effectively identify malicious web pages, it poses a great threat to users' privacy and property security.

Disclosure of Invention

In order to solve the problems in the prior art, the invention aims to provide a malicious webpage identification method based on semantic feature extraction. The method can be applied in the field of webpage security, takes into account a variety of scenarios in which webpage content is deliberately deformed to evade detection, and achieves a better recognition effect and higher recognition accuracy than traditional methods in the automatic identification of malicious webpages.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A malicious webpage identification method based on semantic feature extraction comprises the following steps:

S1, acquiring a webpage source code;

S2, preprocessing the webpage source code, wherein the preprocessing comprises the following steps:

S2-1, extracting the text and images in the webpage;

S2-2, recognizing the text in the images extracted from the webpage in S2-1;

S3, processing the text extracted in S2-1 and S2-2 through a BiLSTM-CNN neural network to identify the webpage and judge whether it is a legitimate webpage or a malicious webpage.

Preferably, in S2-1, when extracting the text in the webpage, Unicode characters are subjected to escape processing, the text is extracted with an html parser, and the webpage DOM tree is reconstructed with BeautifulSoup.

Preferably, when the webpage DOM tree is reconstructed with BeautifulSoup, a depth-first traversal is performed on the DOM tree and each non-leaf node in the tree is visited; if a node is a text tag, the node is removed from the DOM tree using the soup.delete() method. After the traversal, a new DOM tree is obtained, and the new DOM tree is serialized a second time to generate a new html text.

Preferably, in S2-2, the text in the image is recognized by an OCR method, and the extracted image is segmented so that its size meets the length and width limits of the OCR interface.

Preferably, the extracted image is segmented based on PhantomJS, and overlong pictures are segmented with the Canny algorithm; an overlong picture is a picture larger than 4 MB after base64 encoding, or a picture whose longest side exceeds 4096 px.

Preferably, when the Canny algorithm is used for image segmentation, noise is first removed by filtering, and the magnitude and direction of the gradient are then calculated with the Sobel operator, the formula being as follows:

Non-maximum suppression is then applied to the calculated result, and edges are connected using double thresholds to complete the image segmentation.
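The formula referred to in this passage is not reproduced in the text. As a hedged reconstruction, assuming the standard Sobel formulation (the kernels and the symbols $I$, $G_x$, $G_y$, $G$ and $\theta$ are assumptions and are not taken from the patent), the gradient of an image $I$ is usually written as:

```latex
G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * I,
\qquad
G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} * I

G = \sqrt{G_x^{2} + G_y^{2}},
\qquad
\theta = \arctan\!\left(\frac{G_y}{G_x}\right)
```

Here $*$ denotes 2-D convolution, $G$ is the gradient magnitude and $\theta$ the gradient direction used by the subsequent non-maximum suppression step.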

Preferably, S2 further includes S2-3, which comprises: based on the word2vec word vector method, truncating the overlong text obtained in S2-1 and S2-2 to obtain the sentences containing bad keywords and the sentences adjacent to them, giving the text to be detected; an overlong text is a text of more than 800 characters;

in S3, the text to be detected obtained in S2-3 is processed through the BiLSTM-CNN neural network.

Preferably, S2-3 comprises the steps of:

S2-3-1, obtaining a manually verified bad keyword set E(w);

S2-3-2, performing word segmentation on the text to be detected to obtain its vocabulary set D(w); mapping the words in E(w) and D(w) to vectors through a word-to-word-vector model to obtain a bad keyword vector set Ve(w) and a to-be-detected keyword vector set Vd(w);

S2-3-3, measuring the similarity between Ve(w) and Vd(w) by Euclidean distance, and obtaining the first preset number of nearest keywords whose inter-vector similarity exceeds a preset threshold;

S2-3-4, collecting the sentences in the text to be detected, selecting the sentences containing the nearest keywords obtained in S2-3-3, adding the sentences immediately before and after them to a sentence set, and obtaining the final text to be detected after de-duplication.

Preferably, in S2-3-2, when D(w) contains words for which no corresponding word vector can be found in E(w), the words are represented by a preset symbol, or are stripped from D(w).

Preferably, in the BiLSTM-CNN neural network, the CNN layer uses three convolution kernel sizes, 3*3, 4*4 and 5*5, and the number of convolution kernels is 128.

The invention has the following beneficial effects:

According to the method, data preprocessing is performed on the webpage source code so that both the text in the webpage and the text in its images are extracted; the extracted text is then processed by the BiLSTM-CNN neural network to judge whether the webpage is a legitimate webpage or a malicious webpage. The invention therefore inspects webpage content more comprehensively and improves identification accuracy, overcoming the defect of the prior art, which detects only the characters on the webpage and does not recognize characters embedded in images, so that malicious webpages are missed.

Furthermore, when the text in the webpage is extracted, Unicode characters are subjected to escape processing, so that potential sensitive words are fully mined.

Further, overlong text is truncated to the sentences containing bad keywords and the sentences adjacent to them to obtain the text to be detected; this preliminary screening reduces the computation of the BiLSTM-CNN neural network and improves efficiency.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of the structure of BiLSTM-CNN neural network employed in the present invention.

Detailed Description

The invention will be further described with reference to the drawings and examples.

Referring to fig. 1, the malicious webpage identification method based on semantic feature extraction of the present invention includes the following steps:

1) Malicious webpage data are crawled from the PhishTank website using crawler technology, and legitimate Chinese webpage data are crawled from the Alexa ranking list.

2) The webpage data are preprocessed, comprising the following steps:

2-1) Unicode characters in the webpage are subjected to escape processing, text elements in the webpage are extracted with Google's open-source html parser, and the webpage DOM tree is reconstructed with BeautifulSoup.

2-2) The generated image is segmented based on PhantomJS; overlong pictures (i.e. pictures larger than 4 MB after base64 encoding, or pictures whose longest side exceeds 4096 px) are segmented with the Canny algorithm, and the text content of the image is recognized by OCR.

2-3) Based on word2vec word vectors, the overlong text obtained in steps 2-1) and 2-2) is truncated to obtain the sentences containing bad keywords and the sentences adjacent to them, giving the text to be detected.

3) The text to be detected obtained in step 2) is processed, legitimate webpages and malicious webpages are labeled, and the processed data set is divided in the ratio 8:1:1 into a training data set, a test data set and a verification data set;

4) Model training: selecting BiLSTM-CNN neural network as a model, defining a Loss function Loss (w), and training the prediction model by using the training data set obtained in the step 3).

5) Model evaluation: the model is evaluated using the validation dataset obtained in step 3).

Innovation points of the invention

1) The invention extracts text semantics from the rich elements of the web page as web page features, and can therefore handle cases where web page content has been tampered with.

2) In step 2-1), the invention handles the case where a malicious web page replaces sensitive characters with Unicode encodings in the original html: these encodings are subjected to escape processing so that potential sensitive words are fully mined.

3) In step 2-2), in consideration of the recognition efficiency of the OCR interface, the screenshot is not taken directly from the original web page content. First, a DOM tree is generated in memory with the open-source Python library BeautifulSoup; a depth-first traversal of the DOM tree removes every visited child node that is a < text > tag or another text tag; the DOM tree is then serialized a second time to obtain an html document. This processing removes the text data that originally existed in the web page in the form of tags, which reduces the number of characters to be recognized after the screenshot and improves recognition efficiency.

4) In step 2-2), an oversized screenshot is further segmented to meet the length and width limits of the OCR interface. First, noise is removed by filtering; then the magnitude and direction of the gradient are calculated with the Sobel operator; non-maximum suppression is applied to the calculated result; and edges are then connected using double thresholds.

5) In step 2-3), overlong text to be detected is truncated by key paragraph rather than at a fixed position. The similarity between the words of the text to be detected and the words in the lexicon is measured by the distance between word vectors instead of by traditional character-based matching.

Unlike traditional identification methods based on URL features or web page structural features, the method judges web pages by the semantics of their content, and so avoids missing pages with bad content that result from hijacked and tampered sites or that escape URL-based screening. To counter the interference technique in which some pornographic web pages display sensitive text as pictures, the method provides a strategy of web page screenshots and OCR recognition based on DOM tree reconstruction, designed with the efficiency of the character recognition interface fully in mind. To address the shortcoming that some web page content is overlong and traditional approaches cannot selectively extract the key paragraphs, the method extracts key paragraphs from long text based on word vector similarity. As for the detection algorithm, the method uses a BiLSTM+CNN network structure to fully extract the local features and global semantics of the text content, and has higher recognition precision than the original TextCNN and FastText networks.

The invention aims to provide a malicious webpage identification method based on a long short-term memory network that effectively classifies and identifies webpage content. In view of the problems of existing methods, the method extracts adversarially deformed text by DOM tree reconstruction and webpage screenshots, and extracts text rendered as pictures using OCR; it combines the long short-term memory network with a convolutional neural network to extract the global semantics and local features of the text and judge the legitimacy of the webpage content, which enhances the accuracy and robustness of recognition. In addition, for overlong text, keywords are located by word vector distance and used to extract the key paragraphs, which improves the quality of the truncated text and hence the recognition accuracy.

Examples

The malicious webpage identification method based on semantic feature extraction of the embodiment comprises the following steps:

1) Collection of data:

The webpage data fall into two categories. The first category is legitimate web pages, whose main source is the 5000 top-ranked Chinese web pages in the Alexa comprehensive website traffic ranking. The second category is malicious web pages, whose main sources are the database of the malicious web page statistics system PhishTank and the malicious sample data from the 2017 Chinese network security countermeasures (4632 samples).

2) Pretreatment of data:

2-1) Using Python's regular expression matching module, the Unicode escape codes that may be contained in the webpage are unescaped to obtain the Chinese characters corresponding to the code points.
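As an illustration of this step, the following minimal sketch decodes `\uXXXX` escapes with a regular expression and HTML character references with the standard library. The function name `unescape_unicode` and the choice of escape syntaxes are assumptions made for the example; the patent does not specify which encodings are handled.

```python
import html
import re

# Matches JavaScript/JSON-style escapes such as \u8d4c (exactly four hex digits).
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text: str) -> str:
    """Turn \\uXXXX escapes and HTML character references into real characters."""
    # Replace each \uXXXX escape with the character at that code point.
    text = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # html.unescape handles numeric (&#36172; / &#x8d4c;) and named references.
    return html.unescape(text)


sample = r"<p>\u8d4c\u535a &#x5f69;&#31080;</p>"
print(unescape_unicode(sample))
```

The sketch does not join surrogate pairs written as two consecutive `\uXXXX` escapes; a production version would need to handle that case.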

2-2) The Chinese text in the webpage is extracted with Gumbo, Google's open-source html parser.

2-3) The static html text is turned into a structured DOM tree in memory using the open-source library BeautifulSoup; the DOM tree contains all the information of the original html text. The DOM tree is traversed depth first and each non-leaf node in the tree is visited; if a node is a text tag, such as < text >, the node is removed from the DOM tree using the soup.delete() method. After the traversal, the DOM tree no longer contains text elements, and it is serialized a second time to generate a new html text that no longer contains text elements.
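The effect of this step can be sketched without third-party dependencies. The example below is not the BeautifulSoup tree traversal described above: it is a streaming stand-in, built on the standard-library `html.parser`, that re-emits every tag and drops the text between tags. The class name `TextStripper`, the function `strip_text_nodes`, and the decision to keep `<script>` and `<style>` contents (so that the rendered layout survives) are assumptions for illustration. With BeautifulSoup itself, nodes are normally removed with `Tag.decompose()` or `extract()`.

```python
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Re-serialize HTML, keeping tags but dropping visible text nodes."""

    RAW_TEXT_TAGS = {"script", "style"}  # assumption: keep CSS/JS so layout survives

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.raw_tag = None  # set while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.RAW_TEXT_TAGS:
            self.raw_tag = tag
        # Emit the start tag exactly as it appeared in the source.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.raw_tag:
            self.raw_tag = None
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is kept only inside <script>/<style>; all other text is dropped.
        if self.raw_tag is not None:
            self.parts.append(data)


def strip_text_nodes(html_text: str) -> str:
    parser = TextStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


print(strip_text_nodes('<div><p>hello</p><img src="a.png"></div>'))
```

Comments and doctype declarations are also dropped by this sketch, since the corresponding handlers are not overridden.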

2-4) The open-source library PhantomJS is used to take a screenshot of the static html generated in step 2-3), producing a screenshot in png format. The length of the screenshot and the storage size of the image are calculated, and overlong screenshots are segmented. The Canny algorithm is chosen for segmentation and proceeds in four steps: first, noise is removed by filtering; second, the magnitude and direction of the gradient are calculated, in this embodiment with the Sobel operator, using the following formula:

third, non-maximum suppression is performed and suitable division points are determined; fourth, edges are connected using double thresholds to complete the image cropping. Character recognition is then performed on the segmented webpage screenshots with the open-source pyocr library, and the corresponding text content is output.
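The size check and the choice of division points can be sketched as follows. The 4 MB and 4096 px limits come from the text. Everything else is an assumption made for illustration: the patent does not say how division points are derived from the Canny result, so this sketch assumes a per-row count of edge pixels (`row_edge_counts`, computed elsewhere from a Canny edge map) and cuts at the row with the fewest edges, so that lines of text are less likely to be split. The function names are hypothetical.

```python
import base64

MAX_BYTES = 4 * 1024 * 1024  # 4 MB after base64 encoding (limit stated in the text)
MAX_SIDE = 4096              # longest side in pixels (limit stated in the text)


def is_overlong(image_bytes: bytes, width: int, height: int) -> bool:
    """True if the picture exceeds either limit and must be segmented."""
    encoded_len = len(base64.b64encode(image_bytes))
    return encoded_len > MAX_BYTES or max(width, height) > MAX_SIDE


def choose_cut_rows(row_edge_counts, max_height=MAX_SIDE):
    """Pick cut rows so that every slice is at most max_height rows tall.

    Within each window the row with the fewest edge pixels is chosen;
    ties are broken in favour of the lower row (the taller slice).
    """
    cuts, start, total = [], 0, len(row_edge_counts)
    while total - start > max_height:
        # Search the second half of the window so slices never get too short.
        lo = start + max(1, max_height // 2)
        hi = start + max_height
        cut = min(range(lo, hi + 1), key=lambda r: (row_edge_counts[r], -r))
        cuts.append(cut)
        start = cut
    return cuts


# Toy example: a 10-row image with an edge-free row at index 6.
counts = [5, 5, 5, 5, 5, 5, 0, 5, 5, 5]
print(choose_cut_rows(counts, max_height=8))
```

Each returned row index marks the top of the next slice, so slice `i` covers the rows from the previous cut up to, but not including, the current one.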

2-5) The text output in step 2-4) is truncated. First, the full set of malicious webpage samples is gathered, and the text extracted from the webpages is segmented with the Chinese word segmentation tool jieba. The segmentation results are stored in a list, and the frequency of the words in the list is counted. The 2500 most frequent words are taken for manual verification, yielding 932 sensitive words covering lottery, pornography, gore and illegal firearms. In addition, 565 further sensitive words are identified by manually reviewing 500 malicious webpages. In total, 1497 sensitive words are obtained.
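The frequency-counting part of this step might look like the sketch below. It assumes the pages have already been segmented into word lists (the text uses jieba for this; segmentation is left out so the example runs with the standard library alone), and the function name `top_candidates` is hypothetical. The manual verification that reduces the candidates to the final sensitive-word list is, by nature, not automated here.

```python
from collections import Counter


def top_candidates(tokenized_pages, n=2500):
    """Return the n most frequent words across all segmented pages."""
    counts = Counter(token for page in tokenized_pages for token in page)
    return [word for word, _ in counts.most_common(n)]


# Toy input: each inner list is one page after word segmentation.
pages = [["lottery", "win", "lottery"], ["lottery", "bonus", "win"]]
print(top_candidates(pages, n=2))
```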

Word vectors corresponding to the sensitive words are obtained with the CBOW algorithm. The specific flow of CBOW is as follows:

Context(w) denotes the context of the word w. At the input layer of the CBOW model, the word vectors corresponding to the words adjacent to a given word are input into the model; these vectors are summed at the projection layer as follows:

Because the intermediate hidden layer used in other neural network models is removed, the CBOW model predicts the target word directly from the context at the input layer:

The optimization objective of the model is:

For each given training sample (Context(w), w), Context(w) is the input and w is the output. When w is the positive sample, the other words in the vocabulary are negative samples, so each word in the vocabulary satisfies the following condition:

As can be seen from equation 2.10 above, optimizing the model in fact requires maximizing

while

The overall optimization objective can be expressed as

where σ denotes the sigmoid function, a common activation function in neural networks. Its output is the probability that CBOW predicts the word u as a positive sample, and one minus its output is the probability that CBOW predicts the word u as a negative sample. When u is a positive sample, a larger output indicates better model prediction; when u is a negative sample, a smaller output indicates better model prediction. Equation (6) is optimized with a stochastic gradient ascent algorithm. Finally, the word vector representations of the words in the corpus are obtained.
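The equations of this derivation are not reproduced in the text. As a hedged sketch only, the passage appears to follow the standard CBOW formulation with positive and negative samples. Under that assumption, and with all symbols below ($v(\cdot)$, $x_w$, $\theta^u$, $L^w(u)$, $\mathrm{NEG}(w)$, $g(w)$, the corpus $C$ and the window size $c$) introduced here for illustration rather than taken from the patent, the steps can be written as:

```latex
% Projection layer: sum of the 2c context word vectors
x_w = \sum_{i=1}^{2c} v\big(\mathrm{Context}(w)_i\big)

% General objective: log-likelihood of each word given its context
\mathcal{L} = \sum_{w \in C} \log p\big(w \mid \mathrm{Context}(w)\big)

% Label of a word u for the training sample (Context(w), w)
L^w(u) = \begin{cases} 1, & u = w \\ 0, & u \neq w \end{cases}

% Quantity to maximize for one sample
g(w) = \prod_{u \in \{w\} \cup \mathrm{NEG}(w)} p\big(u \mid \mathrm{Context}(w)\big)

% Probability assigned to u, positive or negative
p\big(u \mid \mathrm{Context}(w)\big)
  = \big[\sigma(x_w^{\top}\theta^{u})\big]^{L^w(u)}
    \cdot \big[1 - \sigma(x_w^{\top}\theta^{u})\big]^{1 - L^w(u)}

% Overall objective over the corpus
G = \log \prod_{w \in C} g(w)
  = \sum_{w \in C} \sum_{u \in \{w\} \cup \mathrm{NEG}(w)}
    \Big\{ L^w(u) \log \sigma(x_w^{\top}\theta^{u})
         + \big[1 - L^w(u)\big] \log\big[1 - \sigma(x_w^{\top}\theta^{u})\big] \Big\}
```

In this notation $\sigma(x_w^{\top}\theta^{u})$ is the probability that $u$ is predicted as a positive sample and $1 - \sigma(x_w^{\top}\theta^{u})$ the probability that it is predicted as a negative sample; $\mathrm{NEG}(w)$ stands for the negative samples of $w$, which the text describes as the other words in the vocabulary.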

Traditional keyword matching is strictly character-based, which makes it too inflexible to recognize words that are expressed differently but mean the same thing. For a synonym set such as R = [{"teacher", "teacher"}, {"restaurant", "restaurant"}] (each pair being two different Chinese words with the same meaning), keyword discovery based only on character matching would depend heavily on the keyword lexicon. However, the information on bad web pages is updated very quickly, and sensitive words can be expressed in various obscure ways, so traditional character matching cannot achieve a good effect. Because the CBOW algorithm yields word vector representations of the words in malicious web pages, this embodiment measures the similarity of two words by the distance between their word vectors, and thereby locates approximate keywords in the text to be detected within a certain threshold. The specific steps are as follows:

a) Let the manually verified bad keyword set be E(w).

b) Segment the text to be detected with the Chinese word segmentation tool jieba to obtain the vocabulary set D(w) of the text to be detected. Map the words in E(w) and D(w) to vectors through the word-to-word-vector model trained with the CBOW algorithm. Note that, limited by the corpus, not every word in D(w) is guaranteed to have a corresponding word vector; such words are uniformly represented by the symbol < UNK >, or are stripped from the set D(w) directly.

c) This yields the bad keyword vector set Ve(w) and the to-be-detected keyword vector set Vd(w).

d) Measure the similarity between Ve(w) and Vd(w) by the Euclidean distance

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where x and y each denote an n-dimensional vector, and x_i and y_i denote the values of the vectors in the i-th dimension.

e) Take the first k nearest keywords whose inter-vector similarity exceeds the threshold T.

f) Collect the sentences in the original text, select the sentences containing the first k nearest keywords obtained in step e), add the sentences immediately before and after them to a sentence set Sen(s), and de-duplicate to obtain the final text to be detected, Text-detect.
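Steps d) to f) can be sketched as below, under several stated assumptions. Word vectors are supplied as plain dictionaries rather than coming from a trained CBOW model. Words without a vector are simply absent from `doc_vecs`, which corresponds to the "strip" option in step b). The threshold is applied as a maximum Euclidean distance `max_dist`, since a smaller distance means a closer word; the text's threshold T may be defined differently. Sentences are split on common Chinese and Western sentence-ending punctuation. All function and parameter names are hypothetical.

```python
import math
import re


def nearest_keywords(bad_vecs, doc_vecs, k, max_dist):
    """Return up to k document words closest to any bad keyword."""
    scored = []
    for word, vec in doc_vecs.items():
        # Distance from this word to its closest bad keyword.
        dist = min(math.dist(vec, bad) for bad in bad_vecs.values())
        if dist <= max_dist:
            scored.append((dist, word))
    return [word for _, word in sorted(scored)[:k]]


def select_sentences(text, keywords):
    """Keep sentences containing a keyword plus their immediate neighbours."""
    sentences = [s.strip() for s in re.findall(r"[^。！？.!?]+[。！？.!?]?", text)]
    sentences = [s for s in sentences if s]
    keep = set()  # a set of indices, so overlapping neighbours are de-duplicated
    for i, sentence in enumerate(sentences):
        if any(kw in sentence for kw in keywords):
            keep.update(j for j in (i - 1, i, i + 1) if 0 <= j < len(sentences))
    return [sentences[i] for i in sorted(keep)]


bad = {"lottery": (1.0, 0.0)}
doc = {"lotto": (0.9, 0.1), "weather": (-1.0, 0.5), "raffle": (0.8, 0.0)}
hits = nearest_keywords(bad, doc, k=2, max_dist=0.5)
page = "Weather is nice. Play lotto now. Win big prizes. Contact us today. Bye."
print(select_sentences(page, hits))
```

Selecting by sentence index rather than by sentence text keeps the original order and avoids merging two identical sentences that occur at different positions; whether the patent de-duplicates by position or by content is not stated.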

Step 3), the data obtained in step 2) (the final text to be detected, Text-detect) are processed, legitimate web pages and malicious web pages are labeled, and the processed data set is divided in the ratio 8:1:1 into a training data set, a test data set and a verification data set;
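A minimal sketch of the 8:1:1 division follows. The shuffle, the fixed seed and the function name `split_dataset` are assumptions; the text gives only the ratio and the order of the three subsets.

```python
import random


def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle and split samples into (train, test, validation) by the given ratio."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]  # remainder, absorbs rounding
    return train, test, validation


train, test, validation = split_dataset(range(100))
print(len(train), len(test), len(validation))
```

If the two classes are imbalanced, a stratified split (splitting each class separately) would be a natural refinement; the text does not say whether this was done.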

Step 4), model training: a BiLSTM-CNN neural network is selected as the model and a loss function Loss(w) is defined. The cross-entropy loss is chosen as the loss function, and during training it gives the loss value between the prediction result and the real label, calculated as:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

where H(P, Q) represents the loss value between the prediction result and the real label, P(x) represents the real distribution of the sample, and Q(x) represents the distribution predicted by the model. The prediction model is trained with the training data set obtained in step 3). The model structure is shown in FIG. 2:

The word vector matrix generated by word embedding is first fed into a stacked LSTM, with sigmoid as the activation function. The CNN layer uses three convolution kernel sizes, 3*3, 4*4 and 5*5, and the number of convolution kernels is 128; ReLU is the activation function and global average pooling is the pooling function. The Dropout parameter is set to 0.5. Finally, the output layer uses Softmax as its activation function, mapping the vector to values in (0, 1) that represent the probability of a malicious web page.
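One possible reading of this architecture is sketched below in PyTorch. Several points are assumptions rather than statements of the patent: the kernel sizes are interpreted as 1-D windows of 3, 4 and 5 tokens with 128 filters each; the embedding size, LSTM hidden size and number of stacked layers are placeholders; and PyTorch's `nn.LSTM` uses its built-in gate activations rather than a configurable one. The class name `BiLSTMCNN` is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTMCNN(nn.Module):
    """Embedding -> stacked BiLSTM -> parallel Conv1d branches -> pooling -> softmax."""

    def __init__(self, vocab_size, embed_dim=128, hidden=64, n_filters=128,
                 kernel_sizes=(3, 4, 5), n_classes=2, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # One convolution branch per kernel size, each with n_filters filters.
        self.convs = nn.ModuleList(
            [nn.Conv1d(2 * hidden, n_filters, k, padding=k // 2) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq, embed_dim)
        x, _ = self.lstm(x)         # (batch, seq, 2 * hidden)
        x = x.transpose(1, 2)       # (batch, 2 * hidden, seq) for Conv1d
        # ReLU, then global average pooling over the sequence dimension.
        pooled = [F.relu(conv(x)).mean(dim=2) for conv in self.convs]
        x = self.dropout(torch.cat(pooled, dim=1))
        return torch.softmax(self.fc(x), dim=1)  # class probabilities


model = BiLSTMCNN(vocab_size=100).eval()
probs = model(torch.randint(1, 100, (2, 20)))  # 2 documents of 20 token ids
print(probs.shape)
```

When training with `nn.CrossEntropyLoss`, the raw output of `self.fc` should be passed to the loss instead of the softmax output, because that loss applies log-softmax internally.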

Step 5), model evaluation: the model is evaluated using the validation dataset obtained in step 3).

The experiment used precision, recall and F1-score as evaluation criteria. Precision is the proportion of samples predicted as positive that are actually positive; the higher the precision, the better the model distinguishes negative samples. Recall is the proportion of actual positive samples that are correctly predicted as positive; the higher the recall, the better the model recognizes positive samples.
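These three metrics can be computed from the confusion counts as in the short sketch below, where `tp`, `fp` and `fn` are the numbers of true positives, false positives and false negatives. The function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


print(precision_recall_f1(tp=8, fp=2, fn=2))
```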

Model quality was evaluated by precision and recall. The comparison experiments were:

A. Fixed truncation, no OCR, TextCNN

B. Fixed truncation, no OCR, FastText

C. Fixed truncation, OCR, BiLSTM-CNN

D. Keyword-based truncation, OCR, BiLSTM-CNN

The experimental results are shown in table 1:

TABLE 1

As can be seen from Table 1, the FastText-based model has the lowest detection precision and recall, mainly because the FastText network structure is simple and word vectors are merely summed and averaged. On this data set, the precision of the TextCNN model is 3 percentage points higher than that of the FastText model. Because BiLSTM-CNN extracts sequential semantic features, it has stronger global semantic extraction capability than TextCNN. The present method, built on the BiLSTM-CNN model, additionally performs OCR on the pictures in the web page to extract richer web page text information and tunes the parameters; it achieves a 1% improvement over the traditional detection models, with precision reaching 92.28% and recall reaching 84.68%, so that malicious web pages can be recognized more accurately.

The invention uses web pages collected from the malicious web page database PhishTank and the ranking website Alexa as its data set. To extract the rich text information in a web page, Unicode transcoding is first applied to the web page text to restore encoded sensitive information. A DOM tree is then built from the web page source code, reconstructed to remove redundant text information, and serialized a second time. A screenshot is taken of the serialized web page source, the screenshot is segmented with the Canny algorithm, and OCR is applied to the resulting pictures to obtain the sensitive text that was deformed to evade detection. For overlong text, the method truncates by key paragraph based on word vector distance, which gives better results than the original fixed truncation. The method uses a BiLSTM-CNN neural network to extract text features, fully learning the semantic and local features of the web page text, and has higher recognition accuracy than the original TextCNN and FastText networks. Moreover, identification based on web page semantic features can cope with tampered web page content, a scenario that traditional URL-feature-based identification of malicious web pages cannot handle. The method can be applied in the field of webpage security, takes into account a variety of scenarios in which webpage content is deliberately deformed to evade detection, and achieves a better recognition effect than traditional methods in the automatic identification of malicious web pages.

Claims

1. A malicious webpage identification method based on semantic feature extraction, characterized by comprising the following steps:

S1, acquiring a webpage source code;

S2-1, extracting the text and images in the webpage;

S2-2, recognizing the text in the images extracted from the webpage in S2-1;

S3, processing the text extracted in S2-1 and S2-2 through a BiLSTM-CNN neural network to identify the webpage and judge whether it is a legitimate webpage or a malicious webpage;

in S2-1, when extracting the text in the webpage, Unicode characters are subjected to escape processing, the text is extracted with an html parser, and the webpage DOM tree is reconstructed with BeautifulSoup;

when the webpage DOM tree is reconstructed with BeautifulSoup, a depth-first traversal is performed on the DOM tree and each non-leaf node in the tree is visited; if a node is a text tag, the node is removed from the DOM tree using the soup.delete() method; after the DOM tree has been traversed, a new DOM tree is obtained, and the new DOM tree is serialized a second time to generate a new html text;

in S2-2, the text in the image is recognized by an OCR method, and the extracted image is segmented so that its size meets the length and width limits of the OCR interface;

the extracted image is segmented based on PhantomJS, and overlong pictures are segmented with the Canny algorithm;

S2 further comprises S2-3, and S2-3 comprises: based on the word2vec word vector method, truncating the overlong text obtained in S2-1 and S2-2 to obtain the sentences containing bad keywords and the sentences adjacent to them, giving the text to be detected;

in S3, the text to be detected obtained in S2-3 is processed through the BiLSTM-CNN neural network;

S2-3 comprises the following steps:

S2-3-1, obtaining a manually verified bad keyword set E(w);

S2-3-2, performing word segmentation on the text to be detected to obtain its vocabulary set D(w); mapping the words in E(w) and D(w) to vectors through a word-to-word-vector model to obtain a bad keyword vector set Ve(w) and a to-be-detected keyword vector set Vd(w);

S2-3-3, measuring the similarity between Ve(w) and Vd(w) by Euclidean distance, and obtaining the first preset number of nearest keywords whose inter-vector similarity exceeds a preset threshold;

S2-3-4, collecting the sentences in the text to be detected, selecting the sentences containing the nearest keywords obtained in S2-3-3, adding the sentences immediately before and after them to a sentence set, and obtaining the final text to be detected after de-duplication;

in S2-3-2, when D(w) contains words for which no corresponding word vector can be found in E(w), the words are represented by a preset symbol, or are stripped from D(w).

2. The malicious webpage identification method based on semantic feature extraction as claimed in claim 1, wherein, when image segmentation is performed with the Canny algorithm, noise is first removed by filtering, and the magnitude and direction of the gradient are then calculated with the Sobel operator, the formula being as follows:

3. The malicious webpage identification method based on semantic feature extraction as claimed in claim 1, wherein, in the BiLSTM-CNN neural network, the CNN layer uses three convolution kernel sizes, 3*3, 4*4 and 5*5, and the number of convolution kernels is 128.