CN112580050A - XSS intrusion identification method based on semantic analysis and vectorization big data - Google Patents


Info

Publication number
CN112580050A
CN112580050A (application CN202011567690.0A)
Authority
CN
China
Prior art keywords
word
data
vectorization
big data
recognition rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011567690.0A
Other languages
Chinese (zh)
Inventor
张海军
陈映辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiaying University
Original Assignee
Jiaying University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiaying University filed Critical Jiaying University
Priority to CN202011567690.0A priority Critical patent/CN112580050A/en
Publication of CN112580050A publication Critical patent/CN112580050A/en
Pending legal-status Critical Current

Classifications

    • G06F 21/562 Static detection (security arrangements for protecting computers against unauthorised activity; detecting local intrusion; computer malware detection or handling)
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30 Semantic analysis
    • G06N 3/045 Combinations of networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods


Abstract

The invention discloses an XSS intrusion identification method based on semantic analysis and vectorized big data. First, data preprocessing such as data acquisition, data cleaning, data sampling and feature extraction is carried out with natural-language-processing methods. Second, a neural-network-based word-vectorization algorithm realizes word vectorization and yields word-vector big data. Third, security-protection detection is realized with deep-neural-network intelligent detection algorithms of different depths. Finally, different hyper-parameters are designed and the models are trained, yielding the maximum recognition rate, the minimum recognition rate, the mean, variance and standard deviation of the recognition rate, and curves of the recognition-rate change process, the loss-error change process and the cosine distance between word-vector samples. The results show that the proposed XSS intrusion identification method has a high recognition rate, good stability and excellent overall performance.

Description

XSS intrusion identification method based on semantic analysis and vectorization big data
Technical Field
The invention belongs to the technical field of intrusion identification detection, and particularly relates to an XSS intrusion identification method based on semantic analysis and vectorization big data.
Background
In recent years, big data technology has generated massive volumes of data while the cyberspace security situation has grown increasingly severe. Attacks on WEB applications have become the dominant form of attack, and Cross-Site Scripting (XSS) is the most common among them. The traditional detection method extracts the features of a sample and searches a virus signature library for matching features in order to identify the virus. This signature-based approach has clear limitations: building and maintaining the rule base consumes manpower and material resources, it is suited to detecting known viruses but struggles with new ones, and its detection efficiency degrades greatly on big data.
With the continuous development of machine learning, the strong adaptivity and self-learning capability of deep learning networks have made them a mainstream trend in network security monitoring; they can detect attack behaviors with unknown signatures and thereby raise the detection rate.
Therefore, how to provide a more advanced intrusion identification method for XSS attacks, compensating for the deficiencies of conventional algorithms when facing big data, is a problem to be solved urgently.
Disclosure of Invention
In view of these problems, the invention aims to provide an XSS intrusion recognition method based on semantic analysis and vectorized big data, which exploits the strong adaptivity and self-learning capability of deep learning networks, designs a deep neural network algorithm to realize security-protection detection, and achieves intelligent detection by training the model on big data.
In order to realize the purpose of the invention, the technical solution of the invention is as follows:
an XSS intrusion identification method based on semantic analysis and vectorization big data is characterized by comprising the following steps:
Step 1: acquiring the data to be detected, and performing text cutting, cleaning, word segmentation, part-of-speech tagging and stop-word removal to obtain preprocessed data;
Step 2: constructing a word-vectorization model based on the continuous bag-of-words model (CBOW), and mapping the preprocessed data into distributed word vectors with the word-vectorization algorithm;
Step 3: counting the word-vectorized samples to obtain a positive-sample data set and a negative-sample data set, merging the two data sets into a word-vectorized big data sample set, and randomly dividing it into a training set and a test set at a 7:3 ratio;
Step 4: inputting the samples of the word-vectorized big data sample set into deep neural networks (DNN) of different depths for training, and determining the hyper-parameters of each DNN;
Step 5: collecting HTTP request data in real time, performing attack detection on the requests, and identifying intrusion attack behaviors.
Compared with the prior art, the method has the following beneficial effects:
The invention uses natural-language-processing methods for data preprocessing such as data acquisition, data cleaning, data sampling and feature extraction; designs a neural-network-based word-vectorization algorithm that realizes word vectorization and yields word-vector big data; adjusts the hyper-parameters of deep neural networks (DNN) of different depths and intelligently monitors XSS attacks with a DNN detection algorithm. The experimental results show that the proposed detection method has a high recognition rate, good stability and excellent overall performance.
Drawings
FIG. 1 is a schematic diagram of intrusion intelligent detection based on semantic context analysis and machine learning;
FIG. 2 is a graph of the recognition rate obtained from 20 experiments on a class I big dataset based on different learning rates μ;
FIG. 3 is a graph of the recognition rate obtained from 20 experiments on a class II big dataset based on different learning rates μ;
FIG. 4 is a graph of the recognition rate obtained from 20 experiments on a large class I dataset based on different BatchSize;
FIG. 5 is a graph of the recognition rate obtained from 20 experiments on a class II large dataset based on different BatchSize;
FIG. 6 is a graph of the recognition rate obtained from 20 experiments on a class I big dataset based on the addition of an embedding layer;
FIG. 7 is a graph of the recognition rate obtained from 20 experiments on a class II big dataset based on the addition of an embedding layer;
FIG. 8 is a bar graph of the recognition rate mean based on different learning rates μ for class I and II large data sets;
FIG. 9 is a bar graph of standard deviation based on different learning rates μ for class I and II large data sets;
FIG. 10 is a bar graph of the recognition rate mean based on different BatchSize for class I and II large datasets;
FIG. 11 is a bar graph of standard deviation based on different BatchSize for class I and II large datasets;
FIG. 12 is a bar graph of the recognition rate mean based on the addition of an embedding layer for class I and II large data sets;
FIG. 13 is a bar graph of the standard deviation based on the added embedding layer for class I and II large datasets;
FIG. 14 is a graph of loss error variation;
FIG. 15 is a cosine distance change graph.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the following further describes the technical solution of the present invention with reference to the drawings and the embodiments.
1. Big data processing and modeling
Web intrusion detection is essentially the analysis of the corpus big data of access traffic. First, natural language processing is performed and the data are processed and modeled; second, word vectorization maps the processed data into a vector space, converting an attack message into a matrix analogous to image pixels, i.e. turning a character-string sample into a vector of fixed dimension; third, numerical features are extracted from the word vectors and analysed; finally, model training, numerical analysis, user-behavior analysis, network-traffic analysis and fraud detection are realized. The process is shown in FIG. 1.
1.1 corpus big data acquisition
The experimental data comprise two parts. First, positive-sample big data (with attack behaviors), crawled from the website http://xssed.com/ with a crawler tool to form payload data. Second, negative-sample big data (normal network requests), collected from two sources in order to embody both specificity and universality: one part comes from the access logs of the unit's network center from May to December, and the other part is obtained from various network platforms through a web crawler; all of it is unprocessed corpus big data.
1.2 big data processing
Big-data corpus processing is realized with the neural-network-based word-vectorization (Word2vec) tool, specifically the continuous bag-of-words model (CBOW): text cutting, cleaning, word segmentation, part-of-speech tagging, stop-word removal and word vectorization are carried out, and one-hot encoded word vectors are mapped into distributed word vectors. This reduces the dimensionality and the sparsity, and the relevance between any two words can be obtained from the Euclidean distance or the cosine of the angle between their vectors. The specific processing is as follows:
Firstly, the data set is traversed, all numbers are replaced with "0", and the prefix variants http:/, https:/ and HTTPS are normalized to "http://"; word segmentation is then performed according to the HTML tags, JavaScript function bodies, "http://" and the parameter rules; a vocabulary is constructed from the log documents, and the words are one-hot encoded;
Secondly, the word-vectorization model is constructed and the samples are input to obtain distributed word vectors;
Thirdly, the positive-sample word set is counted, a lexicon is formed from the words with the highest frequencies, and several iterations are performed.
Because each record occupies a different number of characters, the maximum length is taken as the standard and shorter records are padded with -1. When the labels of the data set are designed, one-hot encoding is used: a positive-sample label (an attack sample) is represented by 1 and a negative-sample label (a normal network request) by 0.
Finally, through the above processing, a positive-sample data set of 40,637 records and two negative-sample data sets of 105,912 and 200,129 records are obtained; the quantities are large and the computational complexity is high, i.e. big data.
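The normalization, segmentation and padding rules of this section can be sketched as follows; the patent does not list the concrete regular expressions, so the patterns below are illustrative assumptions:

```python
import re

def normalize(request: str) -> str:
    """Normalize a raw request string as in section 1.2 (a sketch)."""
    s = request.lower()
    # Collapse the HTTP/HTTPS prefix variants to one canonical form.
    s = re.sub(r'https?:/+', 'http://', s)
    # Replace every digit run with '0'.
    s = re.sub(r'\d+', '0', s)
    return s

def tokenize(request: str) -> list:
    """Segment by HTML tags, JavaScript-style function calls, 'http://'
    and parameter-like tokens; this exact rule set is a guess."""
    s = normalize(request)
    return re.findall(r'<\w+>?|</\w+>|\w+\(|http://|[\w.]+', s)

def pad(tokens, max_len, fill=-1):
    """Right-pad a token-id sequence to the longest sample length with -1."""
    return tokens + [fill] * (max_len - len(tokens))

print(tokenize('<script>alert(123)</script>'))
```

The one-hot labels of the text (1 for an attack sample, 0 for a normal request) would then be attached per padded record.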
2. Algorithm implementation
2.1 Word vectorization algorithm design
The word-vector model is implemented with CBOW, i.e. the probability of the current word is predicted from the known context words. The log-likelihood function to be maximized is therefore

    L = Σ_{w∈C} log p(w | Context(w))        (1)

where w is a word in the corpus C.
The conditional probability of w is first calculated with the Hierarchical Softmax method:

    p(w | Context(w)) = Π_{j=2}^{l^w} p(d_j^w | x_w, θ_{j-1}^w)        (2)

where p^w denotes the path from the root of the Huffman tree to w, l^w the number of nodes on that path, p_j^w the j-th node of the path, d^w the Huffman code of w, d_j^w ∈ {0, 1} the code bit corresponding to the j-th node of the path, and θ_{j-1}^w the parameter vector of the corresponding non-leaf node; x_w is the sum of the context word vectors of w.
Each factor on the right-hand side of (2) is a logistic regression:

    p(d_j^w | x_w, θ_{j-1}^w) = s(x_w^T θ_{j-1}^w)        if d_j^w = 0,
    p(d_j^w | x_w, θ_{j-1}^w) = 1 - s(x_w^T θ_{j-1}^w)    if d_j^w = 1.

Since d_j^w takes only the values 0 and 1, this can be written as

    p(d_j^w | x_w, θ_{j-1}^w) = [s(x_w^T θ_{j-1}^w)]^{1-d_j^w} · [1 - s(x_w^T θ_{j-1}^w)]^{d_j^w}        (3)

Substituting (2) and (3) into (1) gives

    L = Σ_{w∈C} Σ_{j=2}^{l^w} { (1-d_j^w) log s(x_w^T θ_{j-1}^w) + d_j^w log[1 - s(x_w^T θ_{j-1}^w)] }        (4)

Each term of (4), written L(w, j), is maximized separately; its gradient with respect to θ_{j-1}^w is

    ∂L(w, j)/∂θ_{j-1}^w = [1 - d_j^w - s(x_w^T θ_{j-1}^w)] x_w        (5)

where s(x) is the sigmoid function, with s'(x) = s(x)[1 - s(x)]. The iterative update of θ_{j-1}^w is therefore

    θ_{j-1}^w := θ_{j-1}^w + η [1 - d_j^w - s(x_w^T θ_{j-1}^w)] x_w

where η is the learning rate.
Since x_w and θ_{j-1}^w enter (4) symmetrically, the partial derivative with respect to x_w is

    ∂L(w, j)/∂x_w = [1 - d_j^w - s(x_w^T θ_{j-1}^w)] θ_{j-1}^w

Because x_w is the sum of the context word vectors, the whole update value is applied to the word vector of each context word during processing, which gives the word-vector processing model:

    v(w~) := v(w~) + η Σ_{j=2}^{l^w} ∂L(w, j)/∂x_w,    for every w~ ∈ Context(w)

where v(w~) denotes a context word vector.
Based on the established model, the original corpus is taken as input and word vectorization is realized.
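A minimal runnable sketch of CBOW training on these lines is given below. To stay short it uses a plain softmax output instead of the Hierarchical Softmax derivation above, but it keeps the characteristic CBOW structure: x_w is formed from the context vectors, and the gradient with respect to x_w is spread back over the word vector of every context word. Corpus, dimensions and hyper-parameters are toy assumptions:

```python
import numpy as np

def train_cbow(corpus, dim=16, window=2, lr=0.05, epochs=200, seed=0):
    """Minimal CBOW trainer: predict the centre word from the averaged
    context vectors.  A plain softmax output replaces the Hierarchical
    Softmax of the text, purely to keep the sketch short."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    Win = rng.normal(0.0, 0.1, (V, dim))   # context word vectors v(w~)
    Wout = rng.normal(0.0, 0.1, (dim, V))  # output parameters (theta)
    for _ in range(epochs):
        for sent in corpus:
            for t, w in enumerate(sent):
                ctx = [idx[sent[j]]
                       for j in range(max(0, t - window),
                                      min(len(sent), t + window + 1))
                       if j != t]
                if not ctx:
                    continue
                x = Win[ctx].mean(axis=0)        # x_w: averaged context
                scores = x @ Wout
                p = np.exp(scores - scores.max())
                p /= p.sum()
                p[idx[w]] -= 1.0                 # softmax cross-entropy gradient
                gx = Wout @ p                    # gradient w.r.t. x_w
                Wout -= lr * np.outer(x, p)
                Win[ctx] -= lr * gx / len(ctx)   # spread update over the context
    return Win, idx
```

After training, the relevance of two words can be read off the cosine of the angle between their rows of `Win`, as the text notes.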
2.2 Deep Neural Network (DNN)
For the classification learning of big data, deep neural network algorithms have higher recognition rates, stronger robustness and better generalization than traditional neural networks or other ML algorithms. The invention realizes security-protection detection through a deep neural network algorithm and intelligent detection through big-data model training.
Firstly, the mean square error is designed as

    J(W, b; x, y) = (1/2) ||h_{W,b}(x) - y||^2        (6)

where W denotes the weights, b the biases, (x, y) a training sample, h_{W,b}(x) the final output, and f(·) the activation function.
Secondly, to find the optimal parameters, gradient descent is used to minimize this function. The partial derivatives are organized as the "residual" of each unit, written δ_i^(l), where z_i^(l) is the weighted input and a_i^(l) = f(z_i^(l)) the activation of unit i in layer l. The residual of an output-layer unit (layer n_l) is

    δ_i^(n_l) = -(y_i - a_i^(n_l)) f'(z_i^(n_l))        (7)

Thirdly, the residuals of the layers l = n_l - 1, n_l - 2, ..., 2 are solved backwards; for example, the residual of a unit of layer l = n_l - 1 is

    δ_i^(n_l - 1) = ( Σ_{j=1}^{s_{n_l}} W_{ji}^(n_l - 1) δ_j^(n_l) ) f'(z_i^(n_l - 1))

Replacing n_l - 1 and n_l by l and l + 1, the general recursion is obtained:

    δ_i^(l) = ( Σ_{j=1}^{s_{l+1}} W_{ji}^(l) δ_j^(l+1) ) f'(z_i^(l))        (8)

The residual of every unit can be solved from this formula, and the partial derivatives with respect to the weights and biases follow:

    ∂J/∂W_{ij}^(l) = a_j^(l) δ_i^(l+1),    ∂J/∂b_i^(l) = δ_i^(l+1)        (9)

The change process of the weights is then

    W_{ij}^(l) := W_{ij}^(l) - α a_j^(l) δ_i^(l+1)

and the change process of the bias terms is

    b_i^(l) := b_i^(l) - α δ_i^(l+1)

where α is the learning rate. In this way DNN learning and training are realized.
Examples
1. Experimental data
Section 1.2 yielded a positive-sample data set of 40,637 records and two negative-sample data sets of 105,912 and 200,129 records; the quantities are large and the computational complexity is high, i.e. big data. To improve the training effect, each of the two negative-sample sets is combined with the positive-sample set, and each combined set is split 7:3 into a training set and a test set; the results are recorded as big data sets I and II.
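The 7:3 construction of data sets I and II can be sketched as below; the sample contents and the seed are placeholders:

```python
import random

def split_7_3(positives, negatives, seed=42):
    """Merge positive (label 1) and negative (label 0) samples, then
    randomly split 7:3 into training and test sets, as in the
    construction of big data sets I and II."""
    data = [(x, 1) for x in positives] + [(x, 0) for x in negatives]
    random.Random(seed).shuffle(data)
    cut = int(0.7 * len(data))
    return data[:cut], data[cut:]

train, test = split_7_3(range(70), range(30))
```

Applying this once with the 105,912-record negative set and once with the 200,129-record set gives the two data sets used below.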
2. Procedure of experiment
In order to prove the effectiveness of the algorithm, DNNs of different depths (3, 4, 5, 6 and 7 layers) are first constructed; secondly, different hyper-parameters are designed, such as the sample block size (BatchSize), the learning rate μ, and the number of neurons per layer; finally, the word-vector big data set samples are input for training and testing. To verify the stability of the system, 20 experiments are performed on each type of data.
3. Results and analysis of the experiments
(1) Different hyper-parameters are designed for each deep DNN, for example learning rates μ of 0.001, 0.01 and 0.1; the recognition rates obtained from 20 experiments on the class I big data set are shown in Table 1.
TABLE 1 recognition rates of deep DNN for class I big data sets obtained from 20 experiments based on different learning rates μ
As can be seen from Table 1, at learning rates of 0.001 and 0.01 the recognition rates are very high: the lowest reaches 0.9889 and the highest 0.9966, and the recognition rate increases with the number of training iterations and finally stabilizes. At a learning rate of 0.1 the recognition rate is very low, about 0.2783 on average, because the learning rate is too large: the gradient descends too fast during training and overshoots the optimum. In contrast, a small learning rate attains the global optimum or comes close to it. The curves of the high recognition rates are shown in FIG. 2.
(2) The recognition rates obtained from 20 experiments on the class II big data set with different learning rates are shown in Table 2.
TABLE 2 Recognition rates obtained in 20 experiments with deep DNN on class II big data sets based on different learning rates μ
It can be seen that at learning rates of 0.001 and 0.01 the recognition rates are very high: the lowest is 0.9877 and the highest 0.9990, and the recognition rate increases with training and finally stabilizes. At a learning rate of 0.1 the recognition rate, although it increases, is comparatively low, about 0.8311 on average; the curves of the high recognition rates are shown in FIG. 3.
(3) Different hyper-parameters are designed for each deep DNN, such as sample block sizes (BatchSize) of 50, 100 and 500; the recognition rates obtained from 20 experiments on the class I big data set are shown in Table 3.
TABLE 3 recognition rates obtained by deep DNN on the class I big data set in 20 experiments based on different BatchSize
As can be seen from Table 3, the recognition rate is very high for every BatchSize: the lowest is 0.9895 and the highest 0.9968, and the recognition rate increases with training and finally stabilizes. Of the three BatchSize settings, the middle value of 100 works best; the recognition-rate curves of the three settings are shown in FIG. 4.
(4) The recognition rates obtained from 20 experiments on the class II big data set with different BatchSize are shown in Table 4.
TABLE 4 recognition rates for deep DNN on class II big data sets in 20 experiments based on different BatchSize
As can be seen from Table 4, the recognition rate is also very high for every BatchSize: the lowest is 0.9877 and the highest 0.9991, and the recognition rate increases with training and finally stabilizes. The recognition effect of the three BatchSize settings is almost the same, with the middle value of 100 performing best; their recognition-rate curves are shown in FIG. 5.
(5) To further verify the corresponding characteristics of the system, a Dropout (Dp) layer, a BatchNormalization (Bn) layer and noise (N) layers are embedded in the DNN. The Dp layer improves generalization and prevents overfitting; the Bn layer keeps the data distribution consistent and prevents gradient dispersion, so the activation values are amplified; the N layer generates noise values that disturb the recognition result. Two embedded designs are therefore tested: embedding Dp and Bn layers, and embedding two N layers. The recognition rates from 20 experiments on the class I big data set are shown in Table 5.
TABLE 5 recognition rates obtained by 20 experiments with deep DNN on class I big data sets based on the addition of an embedding layer
As can be seen from Table 5, embedding the Dp and Bn layers yields a high recognition rate, 0.989335 on average, while embedding the N layers yields a lower one, only 0.721650 on average; this demonstrates the effect of each added embedding layer. The recognition rates remain essentially unchanged as training proceeds; their curves are shown in FIG. 6.
the results obtained with 20 more experiments on the class ii large dataset, again based on DNN with the embedded layer, are shown in table 6.
TABLE 6 recognition rates obtained by 20 experiments with deep DNN on class II big data sets based on the addition of an embedding layer
As can be seen from Table 6, embedding the Dp and Bn layers yields a high recognition rate, 0.996400 on average, while embedding the N layers yields a lower one, 0.831425 on average; this again demonstrates the effect of each added embedding layer. The recognition rates remain essentially unchanged as training proceeds; their curves are shown in FIG. 7.
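A rough sketch of the three embedded layer types as forward transforms, under the usual definitions (inverted dropout, per-feature batch normalization, additive Gaussian noise); the patent does not give their parameters, so the rates and scales here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, rate=0.5, train=True):
    """Inverted dropout: randomly zero activations and rescale the rest,
    improving generalization at training time; identity at test time."""
    if not train:
        return a
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

def batch_norm(a, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance
    per feature, keeping the data distribution consistent across layers."""
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

def noise(a, scale=1.0):
    """Additive Gaussian noise layer, used in the text to deliberately
    disturb the recognition result."""
    return a + rng.normal(0.0, scale, a.shape)
```

The Dp/Bn pair is the design that preserves the high recognition rate in Tables 5 and 6, while the N layers degrade it.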
Further, through experiments, the average recognition rate of each deep DNN for the class i large data set based on different learning rates was about 99.4385%, the variance was about 0.000001, and the standard deviation was about 0.001246, as shown in table 7.
TABLE 7 mean recognition rate, variance and standard deviation of each deep DNN pair class I big data set based on different learning rates μ
Similarly, the average recognition rate of each deep DNN for the class ii large dataset based on different learning rates was found to be about 99.7710%, the variance was found to be about 0.000005, and the standard deviation was found to be about 0.002329, as shown in table 8.
TABLE 8 mean recognition rate, variance and standard deviation of each deep DNN for class II big data set based on different learning rates μ
Through experiments, the average recognition rate of each deep DNN to the class i big data set based on different BatchSize is about 99.5655%, the variance is 0.000001, and the standard deviation is about 0.001129, as shown in table 9.
TABLE 9 mean discrimination, variance and standard deviation of each deep DNN against class I big data set based on different BatchSize
Similarly, the average recognition rate of each deep DNN for the class ii big data set based on different BatchSize was about 99.8200%, the variance was about 0.000001, and the standard deviation was about 0.001231, as shown in table 10.
TABLE 10 mean discrimination, variance and standard deviation of each deep DNN against class II big data set based on different BatchSize
Through experiments, the average recognition rate of each deep DNN for the class i big data set based on the embedding layer is about 98.9335%, the variance is about 0.000010, and the standard deviation is about 0.000226, as shown in table 11.
TABLE 11 mean recognition rate, variance and standard deviation of deep DNN pairs for class I big data set based on embedding layer
Similarly, the average recognition rate of each deep DNN for the class ii large dataset based on the embedded layer was found to be about 99.6400%, the variance was found to be about 0.000001, and the standard deviation was found to be about 0.000934, as shown in table 12.
TABLE 12 mean recognition rate, variance and standard deviation of each deep DNN pair of class II big data set based on the addition of the embedding layer
The bar graph of the mean recognition rate for the class I and II big data sets based on different learning rates μ is shown in FIG. 8, and the bar graph of the standard deviation in FIG. 9.
The mean bar graph of the recognition rate based on different BatchSize for the class I and II large data sets is shown in FIG. 10 and the standard deviation bar graph is shown in FIG. 11.
The mean bar graph of the recognition rate based on the embedded layer for the class i and ii large datasets is shown in fig. 12 and the standard deviation bar graph is shown in fig. 13.
In order to describe the recognition process of the system, the loss-function curve is obtained, as shown in FIG. 14. It can be seen that the loss error decreases continuously and stabilizes as training proceeds, consistent with the recognition rate, which increases continuously and stabilizes.
Similarly, the cosine-distance curve of the word-vector samples is obtained, as shown in FIG. 15. It can be seen that the cosine distance decreases continuously and stabilizes as training proceeds, reflecting that the correlation among the word-vector samples grows stronger and stronger, consistent with the continuously increasing recognition rate.
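The cosine distance tracked in FIG. 15 is simply one minus the cosine of the angle between two word vectors; a minimal helper:

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; small values mean the two word-vector
    samples are strongly correlated."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Identical directions give a distance of 0 and orthogonal directions a distance of 1, so the falling curve indicates increasingly correlated samples.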
Those details not described in this specification are within the knowledge of those skilled in the art. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described therein or substitute equivalents for some of their features; any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within its scope of protection.

Claims (5)

1. An XSS intrusion identification method based on semantic analysis and vectorization big data is characterized by comprising the following steps:
step 1: acquiring the data to be detected, and performing text cutting, cleaning, word segmentation, part-of-speech tagging and stop-word removal to obtain preprocessed data;
step 2: constructing a word vectorization model based on the continuous bag-of-words model CBOW, and mapping the preprocessed data into distributed word vectors with a word vectorization algorithm;
step 3: counting the word vectorization samples to obtain a positive sample data set and a negative sample data set, merging the two data sets into a word vectorization big data sample set, and randomly dividing the big data sample set into a training set and a test set at a ratio of 7:3;
step 4: inputting the samples of the word vectorization big data sample set into deep neural networks DNN of different depths for training, and determining the hyper-parameters of each deep neural network DNN;
step 5: collecting data of HTTP requests in real time, performing attack detection on the HTTP requests, and identifying intrusion attack behaviors.
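As an illustration of step 3's dataset assembly, the sketch below merges a positive and a negative set of word-vectorized samples and randomly splits them 7:3; the sample counts, labels and vector dimension are hypothetical placeholders, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(42)
# hypothetical word-vectorized samples: each row is one request's averaged word vector
pos = rng.normal(size=(70, 16))    # positive (XSS) samples, label 1
neg = rng.normal(size=(130, 16))   # negative (benign) samples, label 0
X = np.vstack([pos, neg])          # merged word-vectorization big data sample set
y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

perm = rng.permutation(len(X))     # random shuffle before splitting
cut = int(0.7 * len(X))            # 7:3 train/test ratio (step 3)
train_idx, test_idx = perm[:cut], perm[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(len(X_train), len(X_test))   # 140 60
```

The shuffle-then-cut split keeps the positive/negative mix random in both subsets, matching the claim's "randomly dividing" wording.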
2. The XSS intrusion identification method based on semantic analysis and vectorized big data according to claim 1, wherein the specific operation steps of the step 1 comprise:
step 11: traversing the data set, replacing every number with '0', and replacing the protocol prefixes http/, https/ and their variants with the uniform token 'http://';
step 12: performing word segmentation according to the html tags, the JavaScript function bodies, 'http://' and the parameter rules;
step 13: constructing a vocabulary based on the log documents, and then one-hot encoding the words to obtain the processed sample data.
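A minimal sketch of the normalization and segmentation of steps 11–12. The concrete regular expressions are one plausible reading of the claim's rules (digits collapsed to '0', protocol prefixes unified, splitting on html tags and parameter punctuation), not the patent's exact implementation.

```python
import re

def preprocess(payload: str) -> list[str]:
    """Normalize and tokenize one request string (claim 2, steps 11-12)."""
    s = payload.lower()
    s = re.sub(r"\d+", "0", s)              # step 11: every number -> '0'
    s = re.sub(r"https?:/+", "http://", s)  # step 11: unify http/https prefixes
    # step 12: split on html tags, 'http://', words, and parameter punctuation
    return re.findall(r"</?\w+|http://|\w+|[<>()=;'\"/]", s)

print(preprocess("<script>alert(123)</script>?id=HTTPS://evil.com"))
```

Step 13 would then map each distinct token of the resulting vocabulary to a one-hot index before the CBOW model of claim 3 is trained.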
3. The XSS intrusion recognition method according to claim 2, wherein the specific operation steps of the word vectorization algorithm in step 2 comprise:
S21: setting the maximum log-likelihood function of the word vectorization model as:

$$\mathcal{L} = \sum_{w \in C} \log p\big(w \mid Context(w)\big) \qquad (1)$$

wherein w is a word in the corpus C;
the conditional probability of w is first calculated using the Hierarchical Softmax method as follows:

$$p\big(w \mid Context(w)\big) = \prod_{j=2}^{l^w} p\big(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w\big)$$

wherein $p^w$ represents the path from the root node to w, $l^w$ represents the number of nodes on the path, $p_j^w$ ($j = 1, \ldots, l^w$) represents each node in the path, $d^w \in \{0,1\}^{l^w-1}$ represents the encoding of w, $d_j^w$ represents the code corresponding to the j-th node in the path, and $\theta_j^w$ represents the parameter vector corresponding to a non-leaf node on the path;
S22: deriving from formula (1) the update formula of the context word vectors:

$$v(\widetilde{w}) := v(\widetilde{w}) + \eta \sum_{j=2}^{l^w} \frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}_w}, \quad \widetilde{w} \in Context(w) \qquad (7)$$

wherein $v(\widetilde{w})$ represents a context word vector and $\eta$ is the learning rate;
S23: inputting the sample data into formula (7) to obtain the word vectors of the data.
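The CBOW update of claim 3 can be sketched as follows. To keep the example short, a plain softmax output stands in for the Hierarchical Softmax of the claim, but the context-vector update mirrors formula (7): each $v(\widetilde{w})$ moves along the gradient with respect to the averaged input $\mathbf{x}_w$. The toy corpus and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = [["http://", "0", "<script", "alert", "(", "0", ")"],
          ["get", "index", "html", "http://", "0"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D, eta = len(vocab), 8, 0.2
Vin = rng.normal(0, 0.1, (V, D))    # context (input) vectors v(w~)
Vout = rng.normal(0, 0.1, (V, D))   # output-side vectors

def prob(ctx, tgt):
    x = Vin[ctx].mean(axis=0)       # x_w: mean of the context vectors
    z = Vout @ x
    p = np.exp(z - z.max())
    return (p / p.sum())[tgt]

def train_pair(ctx, tgt):
    x = Vin[ctx].mean(axis=0)
    z = Vout @ x
    p = np.exp(z - z.max()); p /= p.sum()
    g = p.copy(); g[tgt] -= 1.0     # gradient of -log p(w|Context(w)) w.r.t. z
    dx = Vout.T @ g                 # gradient w.r.t. x_w
    Vout[:] -= eta * np.outer(g, x)
    Vin[ctx] -= eta * dx / len(ctx) # formula (7)-style update of each v(w~)

pairs = [([idx[c] for j, c in enumerate(s) if j != i and abs(j - i) <= 2], idx[w])
         for s in corpus for i, w in enumerate(s)]
p0 = prob(*pairs[0])
for _ in range(100):
    for ctx, tgt in pairs:
        train_pair(ctx, tgt)
print(prob(*pairs[0]) > p0)         # likelihood of the observed word rises
```

In practice a library implementation such as gensim's `Word2Vec` with `sg=0` (CBOW) and `hs=1` (hierarchical softmax) would be used instead of this hand-rolled loop.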
4. The XSS intrusion identification method based on semantic analysis and vectorized big data according to claim 3, wherein the specific operation steps of the step 4 comprise:
S41: defining the mean square error of the deep neural network DNN as:

$$J(W, b; x, y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2$$

S42: in order to find the optimal parameters, a gradient descent method is used to minimize this function; the partial derivative associated with each unit is defined as the "residual" of that unit and recorded as $\delta_i^{(l)}$;
the residual of an output-layer unit is:

$$\delta_i^{(n_l)} = -\big(y_i - a_i^{(n_l)}\big)\, f'\big(z_i^{(n_l)}\big)$$

S43: solving, for $l = n_l - 1, n_l - 2, \ldots, 2$, the residual of each unit of the respective layer; e.g. for layer $l = n_l - 1$ the residual solving formula of each unit is:

$$\delta_i^{(n_l-1)} = \left( \sum_{j} W_{ji}^{(n_l-1)} \delta_j^{(n_l)} \right) f'\big(z_i^{(n_l-1)}\big)$$

wherein W represents the weights, b represents the biases, (x, y) represents a training sample, $h_{W,b}(x)$ represents the final output, and $f(\cdot)$ represents the activation function;
S44: replacing $n_l - 1$ and $n_l$ in the above formula with $l$ and $l + 1$, the following can be obtained:

$$\delta_i^{(l)} = \left( \sum_{j} W_{ji}^{(l)} \delta_j^{(l+1)} \right) f'\big(z_i^{(l)}\big)$$

The residual of each unit can be solved by this formula, and the partial derivatives with respect to the weights and biases follow:

$$\frac{\partial J}{\partial W_{ij}^{(l)}} = a_j^{(l)} \delta_i^{(l+1)}, \qquad \frac{\partial J}{\partial b_i^{(l)}} = \delta_i^{(l+1)}$$

S45: the update of the weights is obtained accordingly:

$$W_{ij}^{(l)} := W_{ij}^{(l)} - \alpha \frac{\partial J}{\partial W_{ij}^{(l)}}$$

and the update of the bias terms is:

$$b_i^{(l)} := b_i^{(l)} - \alpha \frac{\partial J}{\partial b_i^{(l)}}$$

S46: comparing the network output with the big data sample values until the mean square error of the network training meets the requirement, thereby determining the hyper-parameters of the network.
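The residual recursion of steps S42–S44 can be checked numerically: the sketch below back-propagates the residuals $\delta^{(l)}$ through a tiny network and compares one analytic weight gradient against a central finite difference of the mean square error of S41. The layer widths, tanh activation and random data are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 5, 3, 2]                        # layer widths; last layer is n_l
W = [rng.normal(0, 0.5, (sizes[i + 1], sizes[i])) for i in range(3)]
b = [np.zeros(sizes[i + 1]) for i in range(3)]
f, fprime = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2
x, y = rng.normal(size=4), np.array([0.5, -0.5])

def forward(Ws):
    a, zs, acts = x, [], [x]
    for Wl, bl in zip(Ws, b):
        z = Wl @ a + bl
        zs.append(z)
        a = f(z)
        acts.append(a)
    return zs, acts

zs, acts = forward(W)
delta = -(y - acts[-1]) * fprime(zs[-1])    # S42: output-layer residual
deltas = [delta]
for l in (1, 2):                            # S43/S44: delta^(l) = (W^(l)T delta^(l+1)) f'(z^(l))
    delta = (W[-l].T @ delta) * fprime(zs[-l - 1])
    deltas.insert(0, delta)
gW = [np.outer(deltas[l], acts[l]) for l in range(3)]  # dJ/dW^(l) = delta^(l+1) a^(l)T

def J(Ws):                                  # S41: mean square error
    _, a2 = forward(Ws)
    return 0.5 * np.sum((a2[-1] - y) ** 2)

eps = 1e-6                                  # central-difference check of one entry
Wp, Wm = [w.copy() for w in W], [w.copy() for w in W]
Wp[1][0, 0] += eps; Wm[1][0, 0] -= eps
num = (J(Wp) - J(Wm)) / (2 * eps)
print(abs(num - gW[1][0, 0]) < 1e-6)        # analytic and numeric gradients agree
```

Agreement of the two gradients confirms that the residual formulas of S43–S44 yield the partial derivatives that the S45 gradient-descent updates consume.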
5. The XSS intrusion identification method according to claim 4, wherein the hyper-parameters comprise the batch size, the learning rate, and the number of neurons in each layer.
CN202011567690.0A 2020-12-25 2020-12-25 XSS intrusion identification method based on semantic analysis and vectorization big data Pending CN112580050A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011567690.0A CN112580050A (en) 2020-12-25 2020-12-25 XSS intrusion identification method based on semantic analysis and vectorization big data


Publications (1)

Publication Number Publication Date
CN112580050A true CN112580050A (en) 2021-03-30

Family

ID=75139887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011567690.0A Pending CN112580050A (en) 2020-12-25 2020-12-25 XSS intrusion identification method based on semantic analysis and vectorization big data

Country Status (1)

Country Link
CN (1) CN112580050A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312891A (en) * 2021-04-22 2021-08-27 北京墨云科技有限公司 Automatic payload generation method, device and system based on generative model
CN113536678A (en) * 2021-07-19 2021-10-22 中国人民解放军国防科技大学 XSS risk analysis method and device based on Bayesian network and STRIDE model
CN114169432A (en) * 2021-12-06 2022-03-11 南京墨网云瑞科技有限公司 Cross-site scripting attack identification method based on deep learning
CN114844696A (en) * 2022-04-28 2022-08-02 西安交通大学 Network intrusion dynamic monitoring method, system, equipment and readable storage medium based on risk pool minimization
CN116186698A (en) * 2022-12-16 2023-05-30 广东技术师范大学 Machine learning-based secure data processing method, medium and equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG HAIJUN et al.: "Intelligent Detection of Big Data Script Attacks Based on Image-like Processing and Vectorization", Computer Engineering *
ZHANG HAIJUN et al.: "Research on Image-like Processing for Intelligent Detection of Big Data XSS Intrusions", Computer Applications and Software *
ZHANG HAIJUN et al.: "Intelligent Detection of Cross-site Scripting Attacks in Big Data via Semantic Analysis and Vectorization", Journal of Shandong University (Engineering Science) *


Similar Documents

Publication Publication Date Title
CN112580050A (en) XSS intrusion identification method based on semantic analysis and vectorization big data
Wang et al. HAST-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection
Paula et al. Deep learning anomaly detection as support fraud investigation in brazilian exports and anti-money laundering
Li et al. A hybrid malicious code detection method based on deep learning
Fan et al. Robust deep auto-encoding Gaussian process regression for unsupervised anomaly detection
CN113596007B (en) Vulnerability attack detection method and device based on deep learning
US11699160B2 (en) Method, use thereof, computer program product and system for fraud detection
CN112887325B (en) Telecommunication network fraud crime fraud identification method based on network flow
Guo et al. Self-trained prediction model and novel anomaly score mechanism for video anomaly detection
CN112784010A (en) Chinese sentence similarity calculation method based on multi-model nonlinear fusion
Li et al. Fast similarity search via optimal sparse lifting
Khoshraftar et al. Dynamic graph embedding via lstm history tracking
Yin et al. Intrusion detection for capsule networks based on dual routing mechanism
CN112671703A (en) Cross-site scripting attack detection method based on improved fastText
Çetin et al. A comprehensive review on data preprocessing techniques in data analysis
Omar et al. Text-defend: Detecting adversarial examples using local outlier factor
Wang et al. An improved deep learning based intrusion detection method
CN115982722B (en) Vulnerability classification detection method based on decision tree
CN110245666B (en) Multi-target interval value fuzzy clustering image segmentation method based on dual-membership-degree driving
CN111107082A (en) Immune intrusion detection method based on deep belief network
Panday et al. A metaheuristic autoencoder deep learning model for intrusion detector system
CN115187266B (en) Credit card fraud detection method and system based on memory variation self-coding model
CN115242539B (en) Network attack detection method and device for power grid information system based on feature fusion
Li et al. An LSTM based cross-site scripting attack detection scheme for Cloud Computing environments
CN112651422A (en) Time-space sensing network flow abnormal behavior detection method and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210330