CN114169432A

CN114169432A - Cross-site scripting attack identification method based on deep learning

Info

Publication number: CN114169432A
Application number: CN202111482584.7A
Authority: CN
Inventors: 任玉坤; 何召阳; 何晓刚; 刘兵; 董昊辰; 李克萌
Original assignee: Nanjing Mowang Yunrui Technology Co ltd
Current assignee: Nanjing Mowang Yunrui Technology Co ltd
Priority date: 2021-12-06
Filing date: 2021-12-06
Publication date: 2022-03-11
Anticipated expiration: 2041-12-06
Also published as: CN114169432B

Abstract

The invention discloses a cross-site scripting (XSS) attack identification method based on deep learning, comprising the following steps: S1, web page data collection: constructing a target range (a deliberately vulnerable test environment) containing cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks by means of a scanner and manual penetration testing, and labeling the related data by class; S2, data feature extraction: cleaning the related data, segmenting the related data into words with a word segmenter, and outputting the segmented feature data; S3, data set construction: constructing a data set with a Word2Vec model; S4, neural network training: using an LSTM neural network to detect cross-site scripting attacks and output a detection result; S5, Softmax layer classification: applying the Softmax function to the detection result to obtain the probability distribution of cross-site scripting attacks over all target information, and judging whether the related data is a cross-site scripting attack. The invention can effectively improve the efficiency of cross-site scripting attack identification and improve security.

Description

Cross-site scripting attack identification method based on deep learning

Technical Field

The invention relates to the technical field of network data security, and in particular to a cross-site scripting attack identification method based on deep learning.

Background

Computer network technology is developing rapidly, and cybercrime is increasing day by day. Cybercrime mainly takes two forms: illegally obtaining system data, and rendering a system unable to provide service. In terms of illegally obtaining system data, cross-site scripting (XSS) is a very typical means of maliciously stealing information by exploiting website vulnerabilities. Unlike most attacks, which involve only an attacker and a victim, a cross-site scripting vulnerability involves three parties: an attacker, a client, and a website. This undoubtedly increases the difficulty of both attacking and defending against cross-site scripting vulnerabilities.

Traditional detection relies on two approaches: manual dynamic detection and static detection. The first, dynamic detection, starts from black-box testing and combines it with penetration-attack techniques to detect XSS vulnerabilities. Current dynamic detection methods either use real XSS attack code or use a web crawler to crawl and analyze the target web pages. However, the time overhead of a web crawler is very large, the crawled pages are not guaranteed to cover every page of the website, the attack code stored in the database cannot cover all attack scenarios, and the access overhead places very high demands on the server. The second, static detection, designs a filter in the browser based on HTML5 and CORS attribute rules to detect XSS attacks, and provides a system that determines whether an intercepted request has malicious intent. It follows that traditional cross-site scripting detection usually requires a great deal of time and effort to extract features from attack data, and must be combined with a certain amount of experience to achieve good results. It depends heavily on personnel, the final result is affected by uneven skill levels among personnel, and the cost in server resources is also high.

Disclosure of Invention

The invention aims to solve the technical problem that, in the prior art, cross-site scripting attacks cannot be prevented effectively and in time when people use Internet data, and provides a cross-site scripting attack identification method based on deep learning that can effectively improve the efficiency of cross-site scripting attack identification and improve security.

The invention provides a cross-site scripting attack identification method based on deep learning, which comprises the following steps:

S1, web page data collection: constructing a target range containing cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks by means of a scanner and manual penetration testing, and labeling the related data by class;

S2, data feature extraction: cleaning the related data, segmenting the related data into words with a word segmenter, and outputting the segmented feature data;

S3, data set construction: constructing a data set with a Word2Vec model;

S4, neural network training: using an LSTM neural network to detect cross-site scripting attacks and output a detection result;

S5, Softmax layer classification: applying the Softmax function to the detection result to obtain the probability distribution of cross-site scripting attacks over all target information, and judging whether the related data is a cross-site scripting attack.
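
Step S5 can be illustrated with a minimal sketch of the Softmax operation applied to two raw scores. The class order in `LABELS` and the example scores are assumptions made for illustration; the text does not specify how the LSTM output is encoded.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    # Subtracting the maximum avoids overflow and does not change the result.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical class order, assumed for illustration only.
LABELS = ("non-xss", "xss")


def classify(scores):
    """Return the most probable label together with the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Example scores standing in for the detection result of step S4.
label, probs = classify([0.3, 2.1])
print(label, probs)
```

The decision rule here is a plain arg-max over the two class probabilities; a deployment might instead compare the attack probability against a tuned threshold.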

LSTM is an improved model of the recurrent neural network (RNN) whose main feature is the ability to handle sequence features, that is, inputs whose earlier and later elements are related. The LSTM neural network overcomes the RNN's inability to handle long-distance dependencies, has become the most popular recurrent neural network at present, and has achieved good results in fields such as image processing and text classification. The design idea of the LSTM neural network is to add a state C, called the cell state, to the hidden layer of the RNN to store long-term information, and to unroll the modified RNN along the time sequence.
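
The role of the state C can be sketched with one LSTM time step written in plain Python. This follows the standard textbook LSTM formulation (forget, input, candidate, and output gates), which the text does not spell out; the parameter layout, the helper `toy_params`, and the untrained toy weights are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Compute weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to (weights, bias)."""
    z = h_prev + x  # concatenate previous hidden state and current input
    f = [sigmoid(a) for a in affine(*params["forget"], z)]
    i = [sigmoid(a) for a in affine(*params["input"], z)]
    g = [math.tanh(a) for a in affine(*params["candidate"], z)]
    o = [sigmoid(a) for a in affine(*params["output"], z)]
    # Cell state C: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(sequence, hidden_size, params):
    """Unroll the cell over a sequence, starting from zero states."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h, c


def toy_params(hidden_size, input_size, value=0.1):
    """Hypothetical untrained weights, identical for every gate."""
    rows = [[value] * (hidden_size + input_size) for _ in range(hidden_size)]
    bias = [0.0] * hidden_size
    return {name: (rows, bias)
            for name in ("forget", "input", "candidate", "output")}


params = toy_params(hidden_size=2, input_size=1)
h, c = run_lstm([[1.0], [0.5], [-0.2]], 2, params)
print(h, c)
```

The loop in `run_lstm` is the unrolling along the time sequence: the same cell is applied at each step, and C carries information forward between steps.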

Preferably, in the deep learning-based cross-site scripting attack identification method of the invention, the related data comprise request parameters, the request method, response content, and the response status.

Preferably, in the deep learning-based cross-site scripting attack identification method of the invention, the classification labels comprise a cross-site scripting attack class and a non-cross-site scripting attack class.
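
One way to hold a labeled record with the four kinds of related data and the two classes is sketched below. The class name `Sample`, the field names, the integer label encoding, and the example values are all hypothetical; the text names the data items but not their representation.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical integer encoding of the two classification labels.
XSS = 1
NOT_XSS = 0


@dataclass
class Sample:
    """One collected request/response pair with its class label."""
    request_params: Dict[str, str]
    request_method: str
    response_content: Optional[str]
    response_status: Optional[int]
    label: int

    def __post_init__(self):
        # Reject labels outside the two classes.
        if self.label not in (XSS, NOT_XSS):
            raise ValueError("label must be XSS or NOT_XSS")


samples = [
    Sample({"q": "<script>alert(1)</script>"}, "GET", "<html>...</html>", 200, XSS),
    Sample({"q": "weather"}, "GET", "<html>...</html>", 200, NOT_XSS),
]
print(sum(s.label == XSS for s in samples), "attack sample(s)")
```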

Preferably, in the deep learning-based cross-site scripting attack identification method of the invention, the scanner is wvs or appscan.

Preferably, in the deep learning-based cross-site scripting attack identification method of the invention, step S2 further comprises the following steps:

S21, removing records containing missing values from the related data using a downsampling algorithm;

S22, removing the tags from the response content using XPath and keeping only the page content;

S23, separating the request parameters, the IP address, and the port number using the methods of the urlparse package;

S24, evaluating the relationship between the feature variables and the class of the related data using the Pearson correlation coefficient, and removing data irrelevant to the final class;

S25, segmenting the related data into words with a word segmenter, and outputting the segmented feature data.
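
Steps S21, S23, and S25 can be sketched with the Python standard library. Several points are assumptions: the downsampling algorithm of S21 is not specified in the text, so the sketch simply filters out incomplete records; `urllib.parse` stands in for the urlparse package; the regular-expression tokenizer stands in for a real word segmenter; and the record keys are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlparse


def drop_incomplete(records, required=("url", "method", "body", "status")):
    """S21 (simplified): keep records whose required fields are all present."""
    return [r for r in records
            if all(r.get(k) not in (None, "") for k in required)]


def split_url(url):
    """S23: separate the host (IP address), port, and request parameters."""
    parts = urlparse(url)
    return {
        "host": parts.hostname,
        "port": parts.port,
        # parse_qs also decodes percent-encoded parameter values.
        "params": parse_qs(parts.query, keep_blank_values=True),
    }


# Identifiers, digit runs, and single punctuation characters become tokens.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]")


def tokenize(text):
    """S25 (stand-in): split a payload into lowercase tokens."""
    return TOKEN.findall(text.lower())


records = [
    {"url": "http://192.168.1.10:8080/search?q=%3Cscript%3Ealert(1)%3C/script%3E",
     "method": "GET", "body": "<html></html>", "status": 200},
    {"url": "http://192.168.1.10:8080/about",
     "method": "GET", "body": None, "status": 200},
]

for record in drop_incomplete(records):
    info = split_url(record["url"])
    for values in info["params"].values():
        for value in values:
            print(info["host"], info["port"], tokenize(value))
```

Keeping punctuation such as `<`, `>`, and `(` as separate tokens is a deliberate choice in this sketch, because these characters carry much of the signal in script payloads.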

XPath, the XML Path Language, is a language for locating parts of an XML document. XPath is based on the tree structure of XML and provides the ability to find nodes in that tree.
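
As a minimal sketch of locating nodes with XPath and keeping only the page content, the following uses the limited XPath subset supported by Python's `xml.etree.ElementTree`. It assumes the response is well-formed markup; real HTML responses are often not, and would need a tolerant HTML parser. The function name `page_text` and the list of skipped tags are assumptions.

```python
import xml.etree.ElementTree as ET


def page_text(markup, skip=("script", "style")):
    """Return the visible text of a well-formed document, without tags."""
    root = ET.fromstring(markup)
    for tag in skip:
        # ".//tag" is an XPath expression selecting all descendants named tag.
        for element in root.findall(f".//{tag}"):
            tail = element.tail   # text that follows the element
            element.clear()       # drops children, attributes, text, and tail
            element.tail = tail   # restore the following text
    # Join the remaining text nodes and normalize whitespace.
    return " ".join(" ".join(root.itertext()).split())


markup = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello <b>world</b></p>"
    "<script>alert(1)</script><p>Bye</p></body></html>"
)
print(page_text(markup))
```

Text nodes are joined with a space, so words split across inline tags are separated; that is acceptable when the output goes on to word segmentation, but it is a simplification.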

The Pearson correlation coefficient measures how closely two data sets fall on a straight line, that is, the strength of the linear relationship between interval-scale variables.
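
The coefficient is the covariance of the two variables divided by the product of their standard deviations, and it ranges from -1 to 1. A minimal sketch of computing it and using it to discard features unrelated to the class label follows; the threshold of 0.1, the feature names, and the choice to treat a constant column as uncorrelated are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, labels, threshold=0.1):
    """Keep feature names whose |r| with the labels reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, labels)) >= threshold]


# Toy feature columns (hypothetical) and class labels (1 = attack).
columns = {
    "has_script_tag": [1, 1, 0, 0],
    "length_parity": [0, 1, 0, 1],
}
labels = [1, 1, 0, 0]
print(select_features(columns, labels))
```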

Preferably, in the deep learning-based cross-site scripting attack identification method of the invention, the word segmenter is the jieba word segmenter or the nltk word segmenter.

Preferably, the Word2Vec model comprises a continuous bag-of-words (CBOW) model and a Skip-gram model. The word vectors produced by the Word2Vec model not only represent words as distributed vectors but also capture the similarity between words, so that similar words can be found by algebraic operations on the word vectors.
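
The difference between the two models lies in the prediction task: the continuous bag-of-words model predicts a word from its context, while the Skip-gram model predicts the context from a word. A minimal sketch of how training pairs could be generated for each is shown below; the function names and the window size are assumptions, and no actual training is performed.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context word) pair per neighbor in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """CBOW: one (context words, target) pair per position."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            pairs.append((tuple(context), target))
    return pairs


tokens = ["<", "script", ">"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```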

Preferably, in the deep learning-based cross-site scripting attack identification method of the invention, the Skip-gram model consists of an input layer, a projection layer, and an output layer.
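
The three layers can be sketched as a forward pass in plain Python: the input layer selects a word by index (equivalent to a one-hot vector), the projection layer looks up that word's vector, and the output layer scores every vocabulary word and normalizes the scores with Softmax. The class name `SkipGram`, the random initialization, and the vector dimension are assumptions; training (for example with negative sampling) is omitted.

```python
import math
import random


def build_vocab(tokens):
    """Map each distinct token to an integer index."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}


class SkipGram:
    """Untrained Skip-gram network: input -> projection -> output."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # Projection weights: one word vector per vocabulary entry.
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                     for _ in range(vocab_size)]
        # Output weights: one scoring vector per vocabulary entry.
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def project(self, center_idx):
        # A one-hot input times w_in reduces to a row lookup.
        return self.w_in[center_idx]

    def predict(self, center_idx):
        """Probability of each vocabulary word appearing in the context."""
        hidden = self.project(center_idx)
        scores = [sum(h * w for h, w in zip(hidden, row)) for row in self.w_out]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]


vocab = build_vocab(["<", "script", ">", "alert"])
model = SkipGram(vocab_size=len(vocab), dim=3)
probs = model.predict(vocab["script"])
print(probs)
```

After training, the rows of `w_in` are the word vectors that would be used to build the data set.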

The invention has the following advantages:

(1) the way cross-site scripts are identified is more flexible and diverse;

(2) the dependence on personnel is reduced;

(3) the identification result does not depend on the experience of the personnel involved;

(4) the identification accuracy is greatly improved;

(5) the efficiency of cross-site script identification is further improved.

Drawings

FIG. 1 is a flow chart of a cross-site scripting attack identification method based on deep learning;

FIG. 2 is a data feature extraction flow chart of a cross-site scripting attack identification method based on deep learning.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

Example 1

As shown in FIG. 1, a cross-site scripting attack identification method based on deep learning comprises the following steps:

S1, web page data collection: constructing a target range containing cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks by means of scanners such as wvs and appscan and manual penetration testing, and labeling the related data by class; the related data comprise request parameters, the request method, response content, and the response status; the classification labels comprise a cross-site scripting attack class and a non-cross-site scripting attack class;

S2, data feature extraction: cleaning the related data, segmenting the related data into words with a word segmenter, and outputting the segmented feature data; as shown in FIG. 2, this step comprises the following steps:

S25, segmenting the related data into words with a word segmenter, and outputting the segmented feature data;

S3, data set construction: constructing a data set with a Word2Vec model; Word2Vec provides two models for training word vectors, the continuous bag-of-words model and the Skip-gram model; the Skip-gram model, which consists of an input layer, a projection layer, and an output layer, is used to construct the data set;

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A cross-site scripting attack identification method based on deep learning, characterized by comprising the following steps:

S1, web page data collection: constructing a target range containing cross-site scripting vulnerabilities, collecting related data containing cross-site scripting attacks by means of a scanner and manual penetration testing, and labeling the related data by class;

S5, Softmax layer classification: applying the Softmax function to the detection result to obtain the probability distribution of cross-site scripting attacks over all target information, and judging whether the related data is a cross-site scripting attack.

2. The method for identifying cross-site scripting attack based on deep learning of claim 1, wherein: the related data comprises request parameters, request methods, response contents and response states.

3. The method for identifying cross-site scripting attack based on deep learning of claim 1, wherein: the classification labels comprise a cross-site scripting attack class and a non-cross-site scripting attack class.

4. The method for identifying cross-site scripting attack based on deep learning of claim 1, wherein: the scanner is wvs or appscan.

5. The deep learning-based cross-site scripting attack identification method according to claim 2, characterized in that: step S2 further includes the steps of:

and S25, performing word segmentation on the related data through a word segmentation device, and outputting feature data after word segmentation.

6. The deep learning-based cross-site scripting attack identification method according to claim 5, characterized in that: the word segmentation device is a jieba word segmentation device or an nltk word segmentation device.

7. The method for identifying cross-site scripting attack based on deep learning of claim 1, wherein: the Word2Vec model includes a continuous bag-of-words model and a Skip-gram model.

8. The method for identifying cross-site scripting attack based on deep learning of claim 1, wherein: the Skip-gram model is composed of an input layer, a projection layer and an output layer.