CN112580050A - XSS intrusion identification method based on semantic analysis and vectorization big data - Google Patents
- Publication number
- CN112580050A (application CN202011567690.0A)
- Authority
- CN
- China
- Prior art keywords
- word
- data
- vectorization
- big data
- recognition rate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an XSS intrusion identification method based on semantic analysis and vectorized big data. First, data preprocessing (data acquisition, data cleaning, data sampling and feature extraction) is performed with natural language processing methods. Second, a neural-network-based word vectorization algorithm maps words to vectors, yielding word-vector big data. Third, security protection detection is realized with deep neural network intelligent detection algorithms of different depths. Finally, different hyper-parameters are designed and the model is trained, producing results such as the maximum recognition rate, minimum recognition rate, recognition rate mean, variance and standard deviation, together with curves of the recognition-rate change process, the loss-error change process, and the cosine distance of word-vector samples. The results prove that the proposed XSS intrusion identification method achieves a high recognition rate, good stability and excellent overall performance.
Description
Technical Field
The invention belongs to the technical field of intrusion identification detection, and particularly relates to an XSS intrusion identification method based on semantic analysis and vectorization big data.
Background
In recent years, with the development of big data technology, massive amounts of data are generated while the cyberspace situation grows increasingly severe. Attacks based on WEB applications have become the dominant attack type, with Cross-Site Scripting (XSS) the most common example. The traditional detection method extracts features from a sample and searches the existing virus signature library for a match in order to identify the virus. However, this signature-based method has the following limitations: building and maintaining the rule base consumes manpower and material resources; it is suitable for detecting known viruses but struggles to detect new ones; and in big data security scenarios its detection efficiency is greatly affected.
With the continuous development of machine learning, the strong adaptivity and self-learning capability of deep learning networks have become a mainstream trend in network security monitoring; such networks can detect attack behaviors with unknown signatures, thereby improving the detection rate.
Therefore, how to provide a more advanced intrusion identification method for XSS attacks, making up for the deficiencies of conventional algorithms when facing big data, is a problem that urgently needs to be solved.
Disclosure of Invention
Aiming at the existing problems, the invention aims to provide an XSS intrusion recognition method based on semantic analysis and vectorization big data, which designs a deep neural network algorithm to realize safety protection detection by utilizing the strong adaptivity and self-learning capability of a deep learning network, and realizes intelligent detection by a big data training model.
In order to realize the purpose of the invention, the technical solution of the invention is as follows:
an XSS intrusion identification method based on semantic analysis and vectorization big data is characterized by comprising the following steps:
step 1: acquiring data to be detected, and performing text cutting, cleaning, word segmentation, part-of-speech tagging and stop-word removal to obtain preprocessed data;
step 2: constructing a word vectorization model based on the continuous bag-of-words model CBOW, and mapping the preprocessed data into distributed word vectors with the word vectorization algorithm;
step 3: counting word vectorization samples to obtain a positive sample data set and a negative sample data set, merging the two data sets to obtain a word vectorization big data sample set, and randomly dividing the big data sample set into a training set and a testing set in a 7:3 ratio;
step 4: inputting samples of the word vectorization big data sample set into deep neural networks DNN with different depths for training, and determining the hyper-parameters of each deep neural network DNN;
step 5: collecting data of HTTP requests in real time, carrying out attack detection on the HTTP requests, and identifying intrusion attack behaviors.
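The five steps can be sketched as a minimal end-to-end pipeline skeleton. All function bodies below are illustrative stand-ins (step 4, the DNN training, is omitted; the tokenization and the detection threshold are toy assumptions, not the patent's actual algorithms):

```python
def preprocess(raw):               # step 1: cut, clean, segment (toy version)
    return raw.lower().split()

def vectorize(tokens, vocab):      # step 2: map tokens to (toy) integer ids
    return [vocab.setdefault(t, len(vocab)) for t in tokens]

def split(samples, ratio=0.7):     # step 3: 7:3 train/test division
    cut = int(len(samples) * ratio)
    return samples[:cut], samples[cut:]

def detect(vector, threshold):     # step 5: flag suspicious requests (toy rule)
    return max(vector) > threshold

vocab = {}
requests = ["GET /index.html", "GET /?q=<script>alert(1)</script>"]
data = [vectorize(preprocess(r), vocab) for r in requests]
train, test = split(data)
flags = [detect(v, threshold=3) for v in data]
```

A real implementation would replace `detect` with the trained DNN classifier of step 4.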
Compared with the prior art, the method has the following beneficial effects:
the invention uses natural language processing method to carry out data preprocessing such as data acquisition, data cleaning, data sampling, feature extraction and the like; a word vectorization algorithm based on a neural network is designed, and word vectorization is realized to obtain word vector big data; carrying out hyper-parameter adjustment by utilizing deep neural networks DNN with different depths, and intelligently monitoring XSS attack by utilizing a deep neural network DNN detection algorithm; the experimental result shows that the detection method provided by the invention has the advantages of high recognition rate, good stability, excellent overall performance and the like.
Drawings
FIG. 1 is a schematic diagram of intrusion intelligent detection based on semantic context analysis and machine learning;
FIG. 2 is a graph of the recognition rate obtained from 20 experiments with a large class I dataset based on different learning rates μ;
FIG. 3 is a graph of the recognition rate obtained from 20 experiments on a class II big dataset based on different learning rates μ;
FIG. 4 is a graph of the recognition rate obtained from 20 experiments on a large class I dataset based on different BatchSize;
FIG. 5 is a graph of the recognition rate obtained from 20 experiments on a class II large dataset based on different BatchSize;
FIG. 6 is a graph of the recognition rate obtained from 20 experiments on a class I big dataset based on the addition of an embedding layer;
FIG. 7 is a graph of the recognition rate obtained from 20 experiments on a class II big dataset based on the addition of an embedding layer;
FIG. 8 is a bar graph of the recognition rate mean based on different learning rates μ for class I and II large data sets;
FIG. 9 is a bar graph of standard deviation based on different learning rates μ for class I and II large data sets;
FIG. 10 is a bar graph of the recognition rate mean based on different BatchSize for class I and II large datasets;
FIG. 11 is a bar graph of standard deviation based on different BatchSize for class I and II large datasets;
FIG. 12 is a bar graph of the recognition rate mean based on the addition of an embedding layer for class I and II large data sets;
FIG. 13 is a bar graph based on standard deviation of an embedded layer for class I and II large datasets;
FIG. 14 is a graph of loss error variation;
fig. 15 is a cosine distance change graph.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the following further describes the technical solution of the present invention with reference to the drawings and the embodiments.
1. Big data processing and modeling
The Web intrusion detection is essentially to analyze the corpus big data of the access flow. Firstly, natural language processing is carried out, and data are processed and modeled; secondly, performing word vectorization, mapping the processed data to a vector space, converting the attack message into a matrix similar to image data, namely pixels, and converting a character string sequence sample into a vector with a certain dimension value; thirdly, carrying out numerical feature extraction and analysis on the word vectors; finally, model training, numerical analysis, user behavior analysis, network traffic analysis and fraud detection are achieved, and the process is shown in the attached drawing 1.
1.1 corpus big data acquisition
The experimental data included: (1) positive sample big data (with attack behaviors), crawled with a crawler tool from the website http://xssed.com/ to form payload data; (2) negative sample big data (normal network requests), collected in two parts to reflect both specificity and universality: one part comes from the access logs of our unit's network center from May to December, and the other part was obtained from various network platforms through a web crawler, all of which is unprocessed corpus big data.
1.2 big data processing
Big data corpus processing is realized with the neural-network-based word vectorization (Word2vec) tool, specifically the continuous bag-of-words model (CBOW): text cutting, cleaning, word segmentation, part-of-speech tagging, stop-word removal and word vectorization are carried out, mapping one-hot encoded word vectors into distributed word vectors. This reduces dimensionality and sparsity, while the relevance between any two words can be obtained from the Euclidean distance or the cosine of the angle between their vectors. The specific processing steps are as follows:
firstly, traversing the data set, replacing all numbers with '0' and normalizing the various http/https prefix forms to 'http://', then performing word segmentation according to html tags, JavaScript function bodies, 'http://' and parameter rules; constructing a vocabulary based on the log documents, and then one-hot encoding the words;
secondly, constructing a word vectorization model, inputting a sample, and obtaining a distributed word vector;
thirdly, counting a positive sample word set, forming a word bank by the words with the highest word frequency, and then performing multiple iterations;
because each piece of data occupies a different number of characters, the maximum length is taken as the standard and shorter samples are padded with -1; when designing labels for the data set, one-hot encoding is used: a positive sample label (an attack sample) is represented by 1, and a negative sample label (a normal network request) by 0.
Finally, through the above processing, a positive sample data set of 40637 samples and two negative sample data sets of 105912 and 200129 samples are obtained; given their quantity and the computational complexity involved, they constitute big data.
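The preprocessing and labeling rules above (digit replacement, http-prefix normalization, rule-based segmentation, -1 padding, 0/1 labels) can be sketched as follows. The tokenization regex is an assumption for illustration, since the patent does not disclose its exact segmentation rules:

```python
import re

def preprocess(payload, max_len=8):
    """Tokenize one request string per the rules above (a sketch;
    the exact tag/function regexes are assumptions)."""
    s = re.sub(r"\d+", "0", payload)                     # all numbers -> '0'
    s = re.sub(r"https?:/+", "http:// ", s, flags=re.I)  # normalize prefixes
    # split on html tags, JS function openers, words, parameter delimiters
    tokens = re.findall(r"</?\w+|\w+\(|[\w.]+|[=&?;<>]", s)
    tokens += [-1] * (max_len - len(tokens))             # pad with -1
    return tokens[:max_len]

label = {"attack": 1, "normal": 0}                       # 0/1 sample labels
```

For example, `preprocess("<script>alert(123)</script>")` yields the tokens `<script`, `>`, `alert(`, `0`, `</script`, `>` followed by two `-1` pads.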
2. Algorithm implementation
2.1 word vectorization Algorithm design
The word vector model, i.e., predicting the probability of the current word from its known context words, is implemented with CBOW. The training objective is thus to maximize the log-likelihood function:

$$\mathcal{L} = \sum_{w \in C} \log p(w \mid \mathrm{Context}(w)) \tag{1}$$

wherein $w$ is a word in the corpus $C$;

The conditional probability of $w$ is first calculated using the Hierarchical Softmax method as follows:

$$p(w \mid \mathrm{Context}(w)) = \prod_{j=2}^{l^w} p\left(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w\right)$$

wherein $p^w$ represents the path from the root of the Huffman tree to $w$; $l^w$ represents the number of nodes on the path; $p_j^w$ represents the $j$-th node in the path; $d_j^w \in \{0,1\}$ represents the Huffman code corresponding to the $j$-th node in the path; and $\theta_j^w$ represents the parameter vector corresponding to the $j$-th non-leaf node on the path;

Each term on the right side of the above formula is a logistic regression, so that:

$$p\left(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w\right) =
\begin{cases}
\sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right), & d_j^w = 0,\\
1-\sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right), & d_j^w = 1.
\end{cases}$$

Since $d_j^w$ takes only the values 0 and 1, the above formula can be expressed as:

$$p\left(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w\right) =
\left[\sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right)\right]^{1-d_j^w}
\left[1-\sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right)\right]^{d_j^w} \tag{2}$$

Substituting formula (2) into formula (1) gives:

$$\mathcal{L} = \sum_{w \in C}\sum_{j=2}^{l^w}
\left\{\left(1-d_j^w\right)\log \sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right)
+ d_j^w \log\left[1-\sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right)\right]\right\} \tag{3}$$

Each term of the above formula can be written as:

$$\mathcal{L}(w,j) = \left(1-d_j^w\right)\log \sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right)
+ d_j^w \log\left[1-\sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right)\right] \tag{4}$$

Maximizing each term (4) with respect to $\theta_{j-1}^w$ yields the gradient:

$$\frac{\partial \mathcal{L}(w,j)}{\partial \theta_{j-1}^w}
= \left[1-d_j^w-\sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right)\right]\mathbf{x}_w \tag{5}$$

where $\sigma(x)$ is the sigmoid function, with $\sigma'(x) = \sigma(x)\left[1-\sigma(x)\right]$; substituting into formula (5) gives the gradient-ascent update:

$$\theta_{j-1}^w := \theta_{j-1}^w
+ \eta\left[1-d_j^w-\sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right)\right]\mathbf{x}_w \tag{6}$$

wherein $\eta$ is the learning rate;

Since $\mathbf{x}_w$ is the sum of the context word vectors, the whole update value is applied to the word vector of each context word during processing, giving the word-vector update model:

$$v(\tilde{w}) := v(\tilde{w})
+ \eta\sum_{j=2}^{l^w}\left[1-d_j^w-\sigma\left(\mathbf{x}_w^{\top}\theta_{j-1}^w\right)\right]\theta_{j-1}^w,
\qquad \tilde{w} \in \mathrm{Context}(w) \tag{7}$$

wherein $v(\tilde{w})$ represents a context word vector.

Based on the established model, the original corpus is used as input, and word vectorization processing can be realized.
2.2 Deep Neural Network (Deep Neural Network, DNN)
For the classification learning of big data, compared with the traditional neural network or other ML algorithms, the deep neural network algorithm has the advantages of higher recognition rate, stronger robustness, better generalization and the like. The invention realizes safety protection detection through a deep neural network algorithm and realizes intelligent detection through a big data training model.
Firstly, the mean square error for a training sample $(x, y)$ is designed as:

$$J(W,b;x,y) = \tfrac{1}{2}\left\lVert h_{W,b}(x) - y \right\rVert^2 \tag{8}$$

wherein $W$ represents the weights, $b$ represents the biases, $(x, y)$ represents a training sample, and $h_{W,b}(x)$ represents the final output;

Secondly, in order to find the optimal parameters, the gradient descent method is used to minimize this function; the partial derivatives are organized through the "residual" of each unit, recorded as $\delta_i^{(l)}$. The residual of an output-layer unit (layer $n_l$) is:

$$\delta_i^{(n_l)} = -\left(y_i - a_i^{(n_l)}\right) f'\left(z_i^{(n_l)}\right) \tag{9}$$

wherein $f(\cdot)$ represents the activation function, $z_i^{(l)}$ the weighted input of unit $i$ in layer $l$, and $a_i^{(l)} = f\left(z_i^{(l)}\right)$ its activation;

Thirdly, the residuals of the units of layers $l = n_l-1, n_l-2, \ldots, 2$ are solved backwards; for example, the residual formula for the units of layer $n_l-1$ is:

$$\delta_i^{(n_l-1)} = \left(\sum_{j} W_{ji}^{(n_l-1)}\,\delta_j^{(n_l)}\right) f'\left(z_i^{(n_l-1)}\right)$$

Replacing the relationship between $n_l-1$ and $n_l$ in the above formula with that between $l$ and $l+1$ gives:

$$\delta_i^{(l)} = \left(\sum_{j} W_{ji}^{(l)}\,\delta_j^{(l+1)}\right) f'\left(z_i^{(l)}\right) \tag{10}$$

The residual of each unit can be solved by this formula, and the partial derivatives with respect to the weights and biases follow:

$$\frac{\partial J}{\partial W_{ij}^{(l)}} = a_j^{(l)}\,\delta_i^{(l+1)},
\qquad \frac{\partial J}{\partial b_i^{(l)}} = \delta_i^{(l+1)} \tag{11}$$

The change process of the weights is then:

$$W_{ij}^{(l)} := W_{ij}^{(l)} - \alpha\, a_j^{(l)}\,\delta_i^{(l+1)} \tag{12}$$

and the change process of the bias terms is:

$$b_i^{(l)} := b_i^{(l)} - \alpha\, \delta_i^{(l+1)} \tag{13}$$

wherein $\alpha$ is the learning rate.
thus, DNN learning and training can be achieved.
Examples
1. Experimental data
A positive sample data set of 40637 samples and two negative sample data sets of 105912 and 200129 samples were obtained in the big data processing of Section 1.2; given their quantity and computational complexity, they constitute big data. To improve the training effect, each of the two negative sample sets is combined with the positive sample set, and each combined set is divided 7:3 into training and test sets, recorded as big data sets I and II.
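The random 7:3 division into training and test sets can be sketched as follows (the fixed seed is an assumption, added only for reproducibility):

```python
import random

def split_7_3(samples, seed=42):
    """Randomly divide samples into a 70% training set and 30% test set."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(samples) * 0.7)
    train = [samples[i] for i in idx[:cut]]
    test = [samples[i] for i in idx[cut:]]
    return train, test
```

Applied to the merged positive and negative sample sets, this yields the training and test portions of big data sets I and II.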
2. Procedure of experiment
In order to prove the effectiveness of the algorithm, DNNs of different depths (3, 4, 5, 6 and 7 layers) were first constructed; secondly, different hyper-parameters were designed, such as the sample block size (BatchSize), the learning rate μ and the number of neurons in each layer; finally, word-vector big data set samples were input for training and testing. To verify the stability of the system, 20 experiments were performed for each type of data.
3. Results and analysis of the experiments
(1) Different hyper-parameters were designed for each deep DNN; for example, with learning rates μ of 0.001, 0.01 and 0.1, the recognition rates obtained from 20 experiments on the class I big data set are shown in Table 1.
TABLE 1 recognition rates of deep DNN for class I big data sets obtained from 20 experiments based on different learning rates μ
As can be seen from Table 1, when the learning rate is 0.001 or 0.01 the recognition rate is very high: the lowest reaches 0.9889 and the highest 0.9966, and the recognition rate increases with the number of training iterations and finally stabilizes. When the learning rate is 0.1 the recognition rate is very low, about 0.2783 on average, because too large a learning rate makes gradient descent move too fast during training and overshoot the optimum; by contrast, a small learning rate can reach the global optimum or near-optimum. The curves for the high recognition rates are shown in FIG. 2.
(2) the recognition rates obtained by performing 20 experiments on the class ii big data set based on different learning rates are shown in table 2.
TABLE 2 recognition rates obtained by 20 experiments with deep DNN on class II big data sets based on different learning rates, μ
It can be seen that when the learning rate is 0.001 or 0.01 the recognition rate is very high: the lowest is 0.9877 and the highest 0.9990, and the recognition rate increases with the number of training iterations and finally stabilizes. Again, when the learning rate is 0.1 the recognition rate is relatively lower, about 0.8311 on average, although it does increase. The curves for the high recognition rates are shown in FIG. 3.
(3) Different hyper-parameters were designed based on each deep DNN, such as sample block size (BatchSize) of 50, 100 and 500, and the class i big data set was subjected to 20 experiments to obtain the recognition rate, as shown in table 3.
TABLE 3 recognition rates obtained by deep DNN on the class I big data set in 20 experiments based on different BatchSize
As can be seen from Table 3, the recognition rate is very high for all BatchSize values: the lowest is 0.9895 and the highest 0.9968, and the recognition rate increases with the number of training iterations and finally stabilizes. Of the three BatchSize settings, the middle value of 100 works best; their recognition rate curves are shown in FIG. 4.
(4) The recognition rates obtained from 20 experiments on the class II big data set based on different BatchSize values are shown in Table 4.
TABLE 4 recognition rates for deep DNN on class II big data sets in 20 experiments based on different BatchSize
As can be seen from Table 4, the recognition rate is also very high for all BatchSize values: the lowest is 0.9877 and the highest 0.9991, and the recognition rate increases with the number of training iterations and finally stabilizes. The recognition effect is almost the same for the three BatchSize values, with the middle value of 100 slightly the best; their recognition rate curves are shown in FIG. 5.
(5) In order to further verify the corresponding characteristics of the system, a Dropout (Dp) layer, a BatchNormalization (Bn) layer and several noise (N) layers were embedded in the DNN. The Dp layer improves generalization and prevents overfitting; the Bn layer keeps the data distribution consistent and prevents gradient dispersion, thereby amplifying the activation values; the N layers generate noise values that disturb the recognition result. Two embedding designs were therefore tested: embedding the Dp and Bn layers, and embedding two N layers. The recognition rates from 20 experiments on the class I big data set are shown in Table 5.
TABLE 5 recognition rates obtained by 20 experiments with deep DNN on class I big data sets based on the addition of an embedding layer
As can be seen from Table 5, embedding the Dp and Bn layers yields a high recognition rate, 0.989335 on average, while embedding the two N layers yields a lower recognition rate, only 0.721650 on average, which respectively demonstrates the effect of each added embedding layer. The recognition rates remain basically unchanged as training proceeds; their curves are shown in FIG. 6.
the results obtained with 20 more experiments on the class ii large dataset, again based on DNN with the embedded layer, are shown in table 6.
TABLE 6 recognition rates obtained by 20 experiments with deep DNN on class II big data sets based on the addition of an embedding layer
As can be seen from Table 6, embedding the Dp and Bn layers yields a high recognition rate, 0.996400 on average, while embedding the two N layers yields a lower recognition rate, 0.831425 on average, which respectively demonstrates the effect of each added embedding layer. The recognition rates remain basically unchanged as training proceeds; their curves are shown in FIG. 7.
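The roles of the embedded Dp and Bn layers can be illustrated with minimal stand-alone sketches of dropout and batch normalization; these are generic textbook versions, not the patent's DNN implementation:

```python
import random

def dropout(xs, p=0.5, training=True, seed=0):
    """Inverted dropout: zero each unit with probability p during
    training and scale survivors by 1/(1-p); identity at test time."""
    if not training:
        return list(xs)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else x / (1 - p) for x in xs]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature of a batch to zero mean, unit variance."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    var = [sum((row[j] - means[j]) ** 2 for row in batch) / n
           for j in range(dim)]
    return [[(row[j] - means[j]) / (var[j] + eps) ** 0.5
             for j in range(dim)] for row in batch]
```

Dropout drops activations only in training mode, which is what improves generalization; batch normalization re-centers each feature, which is what keeps the data distribution consistent across layers.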
Further, through experiments, the average recognition rate of each deep DNN for the class i large data set based on different learning rates was about 99.4385%, the variance was about 0.000001, and the standard deviation was about 0.001246, as shown in table 7.
TABLE 7 mean recognition rate, variance and standard deviation of each deep DNN pair class I big data set based on different learning rates μ
Similarly, the average recognition rate of each deep DNN for the class ii large dataset based on different learning rates was found to be about 99.7710%, the variance was found to be about 0.000005, and the standard deviation was found to be about 0.002329, as shown in table 8.
TABLE 8 mean recognition rate, variance and standard deviation of each deep DNN for class II big data set based on different learning rates μ
Through experiments, the average recognition rate of each deep DNN to the class i big data set based on different BatchSize is about 99.5655%, the variance is 0.000001, and the standard deviation is about 0.001129, as shown in table 9.
TABLE 9 mean recognition rate, variance and standard deviation of each deep DNN for the class I big data set based on different BatchSize
Similarly, the average recognition rate of each deep DNN for the class ii big data set based on different BatchSize was about 99.8200%, the variance was about 0.000001, and the standard deviation was about 0.001231, as shown in table 10.
TABLE 10 mean recognition rate, variance and standard deviation of each deep DNN for the class II big data set based on different BatchSize
Through experiments, the average recognition rate of each deep DNN for the class i big data set based on the embedding layer is about 98.9335%, the variance is about 0.000010, and the standard deviation is about 0.000226, as shown in table 11.
TABLE 11 mean recognition rate, variance and standard deviation of deep DNN pairs for class I big data set based on embedding layer
Similarly, the average recognition rate of each deep DNN for the class ii large dataset based on the embedded layer was found to be about 99.6400%, the variance was found to be about 0.000001, and the standard deviation was found to be about 0.000934, as shown in table 12.
TABLE 12 mean recognition rate, variance and standard deviation of each deep DNN pair of class II big data set based on the addition of the embedding layer
The bar graph of the mean recognition rate based on different learning rates μ for the class I and II big data sets is shown in FIG. 8, and the bar graph of the standard deviation in FIG. 9.
The mean bar graph of the recognition rate based on different BatchSize for the class I and II large data sets is shown in FIG. 10 and the standard deviation bar graph is shown in FIG. 11.
The mean bar graph of the recognition rate based on the embedded layer for the class i and ii large datasets is shown in fig. 12 and the standard deviation bar graph is shown in fig. 13.
In order to describe the recognition change process of the system, the loss function curve was obtained, as shown in FIG. 14. It can be seen that the loss error decreases continuously as training proceeds and tends to be stable, consistent with the recognition rate, which continuously increases and then stabilizes.
Similarly, the cosine distance curve of the word vector samples was obtained, as shown in FIG. 15. It can be seen that the cosine distance decreases continuously as training proceeds and tends to be stable, reflecting that the correlation between word vector samples grows stronger and stronger, consistent with the continuously increasing recognition rate.
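The cosine distance between two word-vector samples, the quantity tracked in FIG. 15, can be computed with a minimal sketch (distance = 1 minus cosine similarity):

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 for identical directions."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)
```

A decreasing cosine distance between related word vectors corresponds to an increasing cosine similarity, i.e. stronger correlation.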
Those not described in detail in this specification are within the skill of the art. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and modifications of the invention can be made, and equivalents of some features of the invention can be substituted, and any changes, equivalents, improvements and the like, which fall within the spirit and principle of the invention, are intended to be included within the scope of the invention.
Claims (5)
1. An XSS intrusion identification method based on semantic analysis and vectorization big data is characterized by comprising the following steps:
step 1: acquiring the data to be detected, and performing text cutting, cleaning, word segmentation, part-of-speech tagging and stop-word removal to obtain preprocessed data;
step 2: constructing a word vectorization model realized based on a continuous bag-of-words model CBOW, and mapping the preprocessed data into distributed word vectors by adopting a word vectorization algorithm;
step 3: counting the word vectorization samples to obtain a positive sample data set and a negative sample data set, merging the two data sets into a word vectorization big data sample set, and randomly dividing the big data sample set into a training set and a test set at a ratio of 7:3;
step 4: inputting samples in the word vectorization big data sample set into deep neural networks DNN of different depths for training, and determining the hyper-parameters of each deep neural network DNN;
step 5: collecting data of HTTP requests in real time, carrying out attack detection on the HTTP requests, and identifying intrusion attack behaviors.
2. The XSS intrusion identification method based on semantic analysis and vectorized big data according to claim 1, wherein the specific operation steps of the step 1 comprise:
step 11: traversing the data set, replacing each number with '0', and replacing the scheme variants http://, HTTP://, https:// and HTTPS:// with the single form 'http://';
step 12: performing word segmentation according to HTML tags, JavaScript function bodies, the 'http://' prefix and parameter rules;
step 13: constructing a vocabulary based on the log documents, and then one-hot encoding the words to obtain the processed sample data.
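The preprocessing of steps 11-13 can be sketched as follows (the regular expressions and the sample payload are illustrative assumptions, not the patent's exact rules):

```python
import re

def preprocess(payload: str) -> list[str]:
    """Rough sketch of steps 11-12: normalize, then tokenize a payload."""
    text = payload.lower()
    # Step 11: normalize URL schemes and replace every digit run with '0'.
    text = re.sub(r'https?://', 'http://', text)
    text = re.sub(r'\d+', '0', text)
    # Step 12: segment by HTML tags, JavaScript function calls, URLs,
    # bare words and parameter punctuation (illustrative pattern).
    return re.findall(r'<[a-z/][^>]*>|[a-z_]\w*\(|http://|[\w.]+|[<>=&;?"\'()]', text)

tokens = preprocess('<script>alert(123)</script>?id=456')
# Step 13: build a vocabulary and one-hot encode each token.
vocab = {t: i for i, t in enumerate(sorted(set(tokens)))}
one_hot = [[1 if vocab[t] == i else 0 for i in range(len(vocab))] for t in tokens]
print(tokens)
```

Number replacement collapses `alert(123)` and `alert(456)` into the same token sequence, which shrinks the vocabulary before one-hot encoding.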
3. The XSS intrusion recognition method according to claim 2, wherein the specific operation steps of the word vector algorithm in step 2 include:
s21: setting a maximum log-likelihood function of the word vectorization model, with the formula:

L = Σ_{w∈C} log p(w | Context(w))　　(1)

wherein w is a word in the corpus C;

the conditional probability of w is first calculated using the Hierarchical Softmax method:

p(w | Context(w)) = Π_{j=2}^{l^w} p(d_j^w | x_w, θ_{j−1}^w)

wherein p^w represents the path from the root node to w, l^w represents the number of nodes on the path, p_j^w represents the j-th node on the path, d^w represents the encoding of w, d_j^w represents the code corresponding to the j-th node on the path, and θ_j^w represents the parameter vector corresponding to the j-th non-leaf node on the path;

s22: deriving from formula (1), the update formula of the context word vector is:

v(w̃) := v(w̃) + η Σ_{j=2}^{l^w} ∂L(w, j)/∂x_w　　(7)

wherein v(w̃) represents a context word vector, x_w the sum over the context word vectors, and η the learning rate;

s23: the sample data is input into formula (7), and the word vectors of the data can be obtained.
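A minimal numpy sketch of the CBOW training loop of S21-S23, assuming a toy corpus; for brevity it uses a plain softmax output layer in place of the claim's Hierarchical Softmax tree, so the per-node path probabilities are replaced by a single normalized distribution:

```python
import numpy as np

# Toy pre-tokenized payload corpus (illustrative only, not the patent's data).
corpus = [['<script>', 'alert(', '0', ')', '</script>'],
          ['<img', 'onerror', '=', 'alert(', '0', ')', '>']]
vocab = sorted({t for s in corpus for t in s})
idx = {t: i for i, t in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.1

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # context word vectors v(w~)
W_out = rng.normal(scale=0.1, size=(D, V))  # output parameters (theta in the claim)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

losses = []
for epoch in range(50):
    total = 0.0
    for sent in corpus:
        for pos, w in enumerate(sent):
            ctx = [idx[sent[j]]
                   for j in range(max(0, pos - window), min(len(sent), pos + window + 1))
                   if j != pos]
            x = W_in[ctx].mean(axis=0)        # CBOW: average the context vectors
            p = softmax(x @ W_out)            # predict the centre word
            total += -np.log(p[idx[w]])       # negative log-likelihood, cf. (1)
            grad = p.copy()
            grad[idx[w]] -= 1.0               # d(-log p_target)/d logits
            g_x = W_out @ grad                # gradient w.r.t. x, cf. (7)
            W_out -= lr * np.outer(x, grad)   # update output parameters
            W_in[ctx] -= lr * g_x / len(ctx)  # update context word vectors
    losses.append(total)

print(round(losses[0], 2), '->', round(losses[-1], 2))
```

The decreasing negative log-likelihood mirrors the maximization of formula (1); the `W_in` rows are the distributed word vectors that later feed the DNN.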
4. The XSS intrusion identification method based on semantic analysis and vectorized big data according to claim 3, wherein the specific operation steps of the step 4 comprise:
s41: the mean square error of the deep neural network DNN is defined as:

J(W, b; x, y) = (1/2) ‖h_{W,b}(x) − y‖²

wherein W represents the weights, b represents the biases, (x, y) represents a training sample, h_{W,b}(x) represents the final output, and f(·) represents the activation function;

s42: in order to find the optimal parameters, the gradient descent method is used to minimize this function; the partial derivative assigned to each unit is set as the 'residual' of that unit and recorded as δ; the residual of unit i of the output layer n_l is:

δ_i^{(n_l)} = −(y_i − a_i^{(n_l)}) · f′(z_i^{(n_l)})

wherein z denotes the weighted input of a unit and a = f(z) its activation;

s43: solving, for l = n_l − 1, n_l − 2, …, 2, the residuals of the units of each layer; for example, the residual formula for the units of layer l = n_l − 1 is:

δ_i^{(n_l−1)} = ( Σ_j W_{ji}^{(n_l−1)} δ_j^{(n_l)} ) · f′(z_i^{(n_l−1)})

s44: replacing n_l − 1 and n_l in the above formula with l and l + 1, the following can be obtained:

δ_i^{(l)} = ( Σ_j W_{ji}^{(l)} δ_j^{(l+1)} ) · f′(z_i^{(l)})

the residual of each unit can be solved by this formula, and the partial derivatives with respect to the weights and other variables follow:

∂J/∂W_{ij}^{(l)} = a_j^{(l)} δ_i^{(l+1)},　∂J/∂b_i^{(l)} = δ_i^{(l+1)}

s45: according to the above, the weights change as:

W_{ij}^{(l)} := W_{ij}^{(l)} − α ∂J/∂W_{ij}^{(l)}

and the bias terms change as:

b_i^{(l)} := b_i^{(l)} − α ∂J/∂b_i^{(l)}

wherein α is the learning rate;
s46: comparing the network outputs with the big data sample values until the mean square error of the network training meets the requirement, thereby determining the hyper-parameters of the network.
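The backpropagation of S41-S45 can be sketched as follows (a toy one-hidden-layer network on assumed OR-gate data with sigmoid activation; the layer sizes, seed and learning rate are illustrative, not the patent's network):

```python
import numpy as np

def f(z):
    """Sigmoid activation f(.), with f'(z) = f(z) * (1 - f(z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy training set (illustrative): learn logical OR with one hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [1]], dtype=float)

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)
alpha = 1.0   # learning rate

for _ in range(5000):
    # Forward pass: a = f(z) at every layer, h_{W,b}(x) is the last activation.
    z1 = X @ W1 + b1; a1 = f(z1)
    z2 = a1 @ W2 + b2; a2 = f(z2)
    # S42: output-layer residual  delta = -(y - a) * f'(z).
    d2 = -(Y - a2) * a2 * (1 - a2)
    # S43/S44: propagate the residual back one layer.
    d1 = (d2 @ W2.T) * a1 * (1 - a1)
    # S45: gradient-descent updates of the weights and bias terms.
    W2 -= alpha * a1.T @ d2 / len(X); b2 -= alpha * d2.mean(axis=0)
    W1 -= alpha * X.T @ d1 / len(X); b1 -= alpha * d1.mean(axis=0)

# S46: check that the mean square error has met the requirement.
mse = 0.5 * np.mean((f(f(X @ W1 + b1) @ W2 + b2) - Y) ** 2)
print(mse)
```

In S46's terms, this loop would be repeated for candidate hyper-parameter settings (batch size, learning rate, layer widths) until the mean square error meets the requirement.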
5. The XSS intrusion identification method according to claim 4, wherein the hyper-parameters comprise the batch size (BatchSize), the learning rate, and the number of neurons in each layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011567690.0A CN112580050A (en) | 2020-12-25 | 2020-12-25 | XSS intrusion identification method based on semantic analysis and vectorization big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112580050A true CN112580050A (en) | 2021-03-30 |
Family
ID=75139887
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011567690.0A Pending CN112580050A (en) | 2020-12-25 | 2020-12-25 | XSS intrusion identification method based on semantic analysis and vectorization big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112580050A (en) |
2020-12-25: application CN202011567690.0A filed (publication CN112580050A); status: Pending
Non-Patent Citations (3)
Title |
---|
张海军 (Zhang Haijun) et al., "Intelligent Detection of Big Data Script Attacks Based on Image-like Processing and Vectorization", 《计算机工程》 (Computer Engineering) *
张海军 (Zhang Haijun) et al., "Image-like Processing for Intelligent Detection of XSS Intrusions in Big Data", 《计算机应用与软件》 (Computer Applications and Software) *
张海军 (Zhang Haijun) et al., "Intelligent Detection of Cross-Site Scripting Attacks in Big Data via Semantic Analysis and Vectorization", 《山东大学学报(工学版)》 (Journal of Shandong University (Engineering Science)) *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113312891A (en) * | 2021-04-22 | 2021-08-27 | 北京墨云科技有限公司 | Automatic payload generation method, device and system based on generative model |
CN113312891B (en) * | 2021-04-22 | 2022-08-26 | 北京墨云科技有限公司 | Automatic payload generation method, device and system based on generative model |
CN113536678A (en) * | 2021-07-19 | 2021-10-22 | 中国人民解放军国防科技大学 | XSS risk analysis method and device based on Bayesian network and STRIDE model |
CN113536678B (en) * | 2021-07-19 | 2022-04-19 | 中国人民解放军国防科技大学 | XSS risk analysis method and device based on Bayesian network and STRIDE model |
CN114169432A (en) * | 2021-12-06 | 2022-03-11 | 南京墨网云瑞科技有限公司 | Cross-site scripting attack identification method based on deep learning |
CN114844696A (en) * | 2022-04-28 | 2022-08-02 | 西安交通大学 | Network intrusion dynamic monitoring method, system, equipment and readable storage medium based on risk pool minimization |
CN114844696B (en) * | 2022-04-28 | 2023-01-17 | 西安交通大学 | Network intrusion dynamic monitoring method, system, equipment and readable storage medium based on risk pool minimization |
CN116186698A (en) * | 2022-12-16 | 2023-05-30 | 广东技术师范大学 | Machine learning-based secure data processing method, medium and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112580050A (en) | XSS intrusion identification method based on semantic analysis and vectorization big data | |
Wang et al. | HAST-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection | |
Paula et al. | Deep learning anomaly detection as support fraud investigation in brazilian exports and anti-money laundering | |
Li et al. | A hybrid malicious code detection method based on deep learning | |
Fan et al. | Robust deep auto-encoding Gaussian process regression for unsupervised anomaly detection | |
CN113596007B (en) | Vulnerability attack detection method and device based on deep learning | |
US11699160B2 (en) | Method, use thereof, computer program product and system for fraud detection | |
CN112887325B (en) | Telecommunication network fraud crime fraud identification method based on network flow | |
Guo et al. | Self-trained prediction model and novel anomaly score mechanism for video anomaly detection | |
CN112784010A (en) | Chinese sentence similarity calculation method based on multi-model nonlinear fusion | |
Li et al. | Fast similarity search via optimal sparse lifting | |
Khoshraftar et al. | Dynamic graph embedding via lstm history tracking | |
Yin et al. | Intrusion detection for capsule networks based on dual routing mechanism | |
CN112671703A (en) | Cross-site scripting attack detection method based on improved fastText | |
Çetin et al. | A comprehensive review on data preprocessing techniques in data analysis | |
Omar et al. | Text-defend: Detecting adversarial examples using local outlier factor | |
Wang et al. | An improved deep learning based intrusion detection method | |
CN115982722B (en) | Vulnerability classification detection method based on decision tree | |
CN110245666B (en) | Multi-target interval value fuzzy clustering image segmentation method based on dual-membership-degree driving | |
CN111107082A (en) | Immune intrusion detection method based on deep belief network | |
Panday et al. | A metaheuristic autoencoder deep learning model for intrusion detector system | |
CN115187266B (en) | Credit card fraud detection method and system based on memory variation self-coding model | |
CN115242539B (en) | Network attack detection method and device for power grid information system based on feature fusion | |
Li et al. | An LSTM based cross-site scripting attack detection scheme for Cloud Computing environments | |
CN112651422A (en) | Time-space sensing network flow abnormal behavior detection method and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210330 |