CN112580050A - XSS intrusion identification method based on semantic analysis and vectorization big data - Google Patents


Info

Publication number
CN112580050A
CN112580050A (application CN202011567690.0A)
Authority
CN
China
Prior art keywords
word
data
vectorization
big data
recognition rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011567690.0A
Other languages
Chinese (zh)
Inventor
张海军
陈映辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiaying University
Original Assignee
Jiaying University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiaying University filed Critical Jiaying University
Priority to CN202011567690.0A priority Critical patent/CN112580050A/en
Publication of CN112580050A publication Critical patent/CN112580050A/en
Pending legal-status Critical Current

Classifications

    • G06F 21/562 Static detection (security arrangements for protecting computers against unauthorised activity; detecting local intrusion; computer malware detection or handling)
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30 Semantic analysis
    • G06N 3/045 Combinations of networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods


Abstract

The invention discloses an XSS intrusion identification method based on semantic analysis and vectorized big data. First, data preprocessing such as data acquisition, data cleaning, data sampling and feature extraction is carried out with natural-language-processing methods. Second, a neural-network-based word-vectorization algorithm realizes word vectorization and yields word-vector big data. Third, security-protection detection is realized with deep-neural-network intelligent detection algorithms of different depths. Finally, different hyper-parameters are designed and the models are trained, yielding the maximum recognition rate, the minimum recognition rate, the mean, variance and standard deviation of the recognition rate, and curves of the recognition-rate change process, the loss-error change process and the cosine distance between word-vector samples. The results show that the proposed XSS intrusion identification method has a high recognition rate, good stability and excellent overall performance.

Description

XSS intrusion identification method based on semantic analysis and vectorization big data
Technical Field
The invention belongs to the technical field of intrusion identification detection, and particularly relates to an XSS intrusion identification method based on semantic analysis and vectorization big data.
Background
In recent years, big data technology has generated massive volumes of data while the cyberspace security situation has grown increasingly severe. Attacks on WEB applications have become the dominant form of attack, and Cross-Site Scripting (XSS) is the most common among them. The traditional detection method extracts the features of a sample and searches a virus signature library for matching features in order to identify the virus. This signature-based approach has clear limitations: building and maintaining the rule base consumes manpower and material resources, it is suited to detecting known viruses but struggles with new ones, and its detection efficiency degrades greatly on big data.
With the continuous development of machine learning, the strong adaptivity and self-learning capability of deep learning networks have made them a mainstream trend in network security monitoring; they can detect attack behaviors with unknown signatures and thereby raise the detection rate.
Therefore, how to provide a more advanced intrusion identification method for XSS attacks, compensating for the deficiencies of conventional algorithms when facing big data, is a problem to be solved urgently.
Disclosure of Invention
In view of these problems, the invention aims to provide an XSS intrusion recognition method based on semantic analysis and vectorized big data, which exploits the strong adaptivity and self-learning capability of deep learning networks, designs a deep neural network algorithm to realize security-protection detection, and achieves intelligent detection by training the model on big data.
In order to realize the purpose of the invention, the technical solution of the invention is as follows:
an XSS intrusion identification method based on semantic analysis and vectorization big data is characterized by comprising the following steps:
Step 1: acquiring the data to be detected, and performing text cutting, cleaning, word segmentation, part-of-speech tagging and stop-word removal to obtain preprocessed data;
Step 2: constructing a word-vectorization model based on the continuous bag-of-words model (CBOW), and mapping the preprocessed data into distributed word vectors with the word-vectorization algorithm;
Step 3: counting the word-vectorized samples to obtain a positive-sample data set and a negative-sample data set, merging the two data sets into a word-vectorized big data sample set, and randomly dividing it into a training set and a test set at a 7:3 ratio;
Step 4: inputting the samples of the word-vectorized big data sample set into deep neural networks (DNN) of different depths for training, and determining the hyper-parameters of each DNN;
Step 5: collecting HTTP request data in real time, performing attack detection on the requests, and identifying intrusion attack behaviors.
Compared with the prior art, the method has the following beneficial effects:
The invention uses natural-language-processing methods for data preprocessing such as data acquisition, data cleaning, data sampling and feature extraction; designs a neural-network-based word-vectorization algorithm that realizes word vectorization and yields word-vector big data; adjusts the hyper-parameters of deep neural networks (DNN) of different depths and intelligently monitors XSS attacks with a DNN detection algorithm. The experimental results show that the proposed detection method has a high recognition rate, good stability and excellent overall performance.
Drawings
FIG. 1 is a schematic diagram of intrusion intelligent detection based on semantic context analysis and machine learning;
FIG. 2 is a graph of the recognition rate obtained from 20 experiments on a class I big dataset based on different learning rates μ;
FIG. 3 is a graph of the recognition rate obtained from 20 experiments on a class II big dataset based on different learning rates μ;
FIG. 4 is a graph of the recognition rate obtained from 20 experiments on a large class I dataset based on different BatchSize;
FIG. 5 is a graph of the recognition rate obtained from 20 experiments on a class II large dataset based on different BatchSize;
FIG. 6 is a graph of the recognition rate obtained from 20 experiments on a class I big dataset based on the addition of an embedding layer;
FIG. 7 is a graph of the recognition rate obtained from 20 experiments on a class II big dataset based on the addition of an embedding layer;
FIG. 8 is a bar graph of the recognition rate mean based on different learning rates μ for class I and II large data sets;
FIG. 9 is a bar graph of standard deviation based on different learning rates μ for class I and II large data sets;
FIG. 10 is a bar graph of the recognition rate mean based on different BatchSize for class I and II large datasets;
FIG. 11 is a bar graph of standard deviation based on different BatchSize for class I and II large datasets;
FIG. 12 is a bar graph of the recognition rate mean based on the addition of an embedding layer for class I and II large data sets;
FIG. 13 is a bar graph of the standard deviation based on the added embedding layer for class I and II large datasets;
FIG. 14 is a graph of loss error variation;
FIG. 15 is a cosine distance change graph.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the following further describes the technical solution of the present invention with reference to the drawings and the embodiments.
1. Big data processing and modeling
Web intrusion detection is essentially the analysis of the corpus big data of access traffic. First, natural language processing is performed and the data are processed and modeled; second, word vectorization maps the processed data into a vector space, converting an attack message into a matrix analogous to image pixels, i.e. turning a character-string sample into a vector of fixed dimension; third, numerical features are extracted from the word vectors and analysed; finally, model training, numerical analysis, user-behavior analysis, network-traffic analysis and fraud detection are realized. The process is shown in FIG. 1.
1.1 corpus big data acquisition
The experimental data comprise two parts. First, positive-sample big data (with attack behaviors), crawled from the website http://xssed.com/ with a crawler tool to form payload data. Second, negative-sample big data (normal network requests), collected from two sources in order to embody both specificity and universality: one part comes from the access logs of the unit's network center from May to December, and the other part is obtained from various network platforms through a web crawler; all of it is unprocessed corpus big data.
1.2 big data processing
Big-data corpus processing is realized with the neural-network-based word-vectorization (Word2vec) tool, specifically the continuous bag-of-words model (CBOW): text cutting, cleaning, word segmentation, part-of-speech tagging, stop-word removal and word vectorization are carried out, and one-hot encoded word vectors are mapped into distributed word vectors. This reduces the dimensionality and the sparsity, and the relevance between any two words can be obtained from the Euclidean distance or the cosine of the angle between their vectors. The specific processing is as follows:
Firstly, the data set is traversed, all numbers are replaced with "0", and the prefix variants http:/, https:/ and HTTPS are normalized to "http://"; word segmentation is then performed according to the HTML tags, JavaScript function bodies, "http://" and the parameter rules; a vocabulary is constructed from the log documents, and the words are one-hot encoded;
Secondly, the word-vectorization model is constructed and the samples are input to obtain distributed word vectors;
Thirdly, the positive-sample word set is counted, a lexicon is formed from the words with the highest frequencies, and several iterations are performed.
Because each record occupies a different number of characters, the maximum length is taken as the standard and shorter records are padded with -1. When the labels of the data set are designed, one-hot encoding is used: a positive-sample label (an attack sample) is represented by 1 and a negative-sample label (a normal network request) by 0.
Finally, through the above processing, a positive-sample data set of 40,637 records and two negative-sample data sets of 105,912 and 200,129 records are obtained; the quantities are large and the computational complexity is high, i.e. big data.
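The normalization, segmentation and padding rules of this section can be sketched as follows; the patent does not list the concrete regular expressions, so the patterns below are illustrative assumptions:

```python
import re

def normalize(request: str) -> str:
    """Normalize a raw request string as in section 1.2 (a sketch)."""
    s = request.lower()
    # Collapse the HTTP/HTTPS prefix variants to one canonical form.
    s = re.sub(r'https?:/+', 'http://', s)
    # Replace every digit run with '0'.
    s = re.sub(r'\d+', '0', s)
    return s

def tokenize(request: str) -> list:
    """Segment by HTML tags, JavaScript-style function calls, 'http://'
    and parameter-like tokens; this exact rule set is a guess."""
    s = normalize(request)
    return re.findall(r'<\w+>?|</\w+>|\w+\(|http://|[\w.]+', s)

def pad(tokens, max_len, fill=-1):
    """Right-pad a token-id sequence to the longest sample length with -1."""
    return tokens + [fill] * (max_len - len(tokens))

print(tokenize('<script>alert(123)</script>'))
```

The one-hot labels of the text (1 for an attack sample, 0 for a normal request) would then be attached per padded record.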
2. Algorithm implementation
2.1 Word vectorization algorithm design
The word-vector model is implemented with CBOW, i.e. the probability of the current word is predicted from the known context words. The log-likelihood function to be maximized is therefore

    L = Σ_{w∈C} log p(w | Context(w))        (1)

where w is a word in the corpus C.
The conditional probability of w is first calculated with the Hierarchical Softmax method:

    p(w | Context(w)) = Π_{j=2}^{l^w} p(d_j^w | x_w, θ_{j-1}^w)        (2)

where p^w denotes the path from the root of the Huffman tree to w, l^w the number of nodes on that path, p_j^w the j-th node of the path, d^w the Huffman code of w, d_j^w ∈ {0, 1} the code bit corresponding to the j-th node of the path, and θ_{j-1}^w the parameter vector of the corresponding non-leaf node; x_w is the sum of the context word vectors of w.
Each factor on the right-hand side of (2) is a logistic regression:

    p(d_j^w | x_w, θ_{j-1}^w) = s(x_w^T θ_{j-1}^w)        if d_j^w = 0,
    p(d_j^w | x_w, θ_{j-1}^w) = 1 - s(x_w^T θ_{j-1}^w)    if d_j^w = 1.

Since d_j^w takes only the values 0 and 1, this can be written as

    p(d_j^w | x_w, θ_{j-1}^w) = [s(x_w^T θ_{j-1}^w)]^{1-d_j^w} · [1 - s(x_w^T θ_{j-1}^w)]^{d_j^w}        (3)

Substituting (2) and (3) into (1) gives

    L = Σ_{w∈C} Σ_{j=2}^{l^w} { (1-d_j^w) log s(x_w^T θ_{j-1}^w) + d_j^w log[1 - s(x_w^T θ_{j-1}^w)] }        (4)

Each term of (4), written L(w, j), is maximized separately; its gradient with respect to θ_{j-1}^w is

    ∂L(w, j)/∂θ_{j-1}^w = [1 - d_j^w - s(x_w^T θ_{j-1}^w)] x_w        (5)

where s(x) is the sigmoid function, with s'(x) = s(x)[1 - s(x)]. The iterative update of θ_{j-1}^w is therefore

    θ_{j-1}^w := θ_{j-1}^w + η [1 - d_j^w - s(x_w^T θ_{j-1}^w)] x_w

where η is the learning rate.
Since x_w and θ_{j-1}^w enter (4) symmetrically, the partial derivative with respect to x_w is

    ∂L(w, j)/∂x_w = [1 - d_j^w - s(x_w^T θ_{j-1}^w)] θ_{j-1}^w

Because x_w is the sum of the context word vectors, the whole update value is applied to the word vector of each context word during processing, which gives the word-vector processing model:

    v(w~) := v(w~) + η Σ_{j=2}^{l^w} ∂L(w, j)/∂x_w,    for every w~ ∈ Context(w)

where v(w~) denotes a context word vector.
Based on the established model, the original corpus is taken as input and word vectorization is realized.
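A minimal runnable sketch of CBOW training on these lines is given below. To stay short it uses a plain softmax output instead of the Hierarchical Softmax derivation above, but it keeps the characteristic CBOW structure: x_w is formed from the context vectors, and the gradient with respect to x_w is spread back over the word vector of every context word. Corpus, dimensions and hyper-parameters are toy assumptions:

```python
import numpy as np

def train_cbow(corpus, dim=16, window=2, lr=0.05, epochs=200, seed=0):
    """Minimal CBOW trainer: predict the centre word from the averaged
    context vectors.  A plain softmax output replaces the Hierarchical
    Softmax of the text, purely to keep the sketch short."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    Win = rng.normal(0.0, 0.1, (V, dim))   # context word vectors v(w~)
    Wout = rng.normal(0.0, 0.1, (dim, V))  # output parameters (theta)
    for _ in range(epochs):
        for sent in corpus:
            for t, w in enumerate(sent):
                ctx = [idx[sent[j]]
                       for j in range(max(0, t - window),
                                      min(len(sent), t + window + 1))
                       if j != t]
                if not ctx:
                    continue
                x = Win[ctx].mean(axis=0)        # x_w: averaged context
                scores = x @ Wout
                p = np.exp(scores - scores.max())
                p /= p.sum()
                p[idx[w]] -= 1.0                 # softmax cross-entropy gradient
                gx = Wout @ p                    # gradient w.r.t. x_w
                Wout -= lr * np.outer(x, p)
                Win[ctx] -= lr * gx / len(ctx)   # spread update over the context
    return Win, idx
```

After training, the relevance of two words can be read off the cosine of the angle between their rows of `Win`, as the text notes.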
2.2 Deep Neural Network (DNN)
For the classification learning of big data, deep neural network algorithms have higher recognition rates, stronger robustness and better generalization than traditional neural networks or other ML algorithms. The invention realizes security-protection detection through a deep neural network algorithm and intelligent detection through big-data model training.
Firstly, the mean square error is designed as

    J(W, b; x, y) = (1/2) ||h_{W,b}(x) - y||^2        (6)

where W denotes the weights, b the biases, (x, y) a training sample, h_{W,b}(x) the final output, and f(·) the activation function.
Secondly, to find the optimal parameters, gradient descent is used to minimize this function. The partial derivatives are organized as the "residual" of each unit, written δ_i^(l), where z_i^(l) is the weighted input and a_i^(l) = f(z_i^(l)) the activation of unit i in layer l. The residual of an output-layer unit (layer n_l) is

    δ_i^(n_l) = -(y_i - a_i^(n_l)) f'(z_i^(n_l))        (7)

Thirdly, the residuals of the layers l = n_l - 1, n_l - 2, ..., 2 are solved backwards; for example, the residual of a unit of layer l = n_l - 1 is

    δ_i^(n_l - 1) = ( Σ_{j=1}^{s_{n_l}} W_{ji}^(n_l - 1) δ_j^(n_l) ) f'(z_i^(n_l - 1))

Replacing n_l - 1 and n_l by l and l + 1, the general recursion is obtained:

    δ_i^(l) = ( Σ_{j=1}^{s_{l+1}} W_{ji}^(l) δ_j^(l+1) ) f'(z_i^(l))        (8)

The residual of every unit can be solved from this formula, and the partial derivatives with respect to the weights and biases follow:

    ∂J/∂W_{ij}^(l) = a_j^(l) δ_i^(l+1),    ∂J/∂b_i^(l) = δ_i^(l+1)        (9)

The change process of the weights is then

    W_{ij}^(l) := W_{ij}^(l) - α a_j^(l) δ_i^(l+1)

and the change process of the bias terms is

    b_i^(l) := b_i^(l) - α δ_i^(l+1)

where α is the learning rate. In this way DNN learning and training are realized.
Examples
1. Experimental data
Section 1.2 yielded a positive-sample data set of 40,637 records and two negative-sample data sets of 105,912 and 200,129 records; the quantities are large and the computational complexity is high, i.e. big data. To improve the training effect, each of the two negative-sample sets is combined with the positive-sample set, and each combined set is split 7:3 into a training set and a test set; the results are recorded as big data sets I and II.
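The 7:3 construction of data sets I and II can be sketched as below; the sample contents and the seed are placeholders:

```python
import random

def split_7_3(positives, negatives, seed=42):
    """Merge positive (label 1) and negative (label 0) samples, then
    randomly split 7:3 into training and test sets, as in the
    construction of big data sets I and II."""
    data = [(x, 1) for x in positives] + [(x, 0) for x in negatives]
    random.Random(seed).shuffle(data)
    cut = int(0.7 * len(data))
    return data[:cut], data[cut:]

train, test = split_7_3(range(70), range(30))
```

Applying this once with the 105,912-record negative set and once with the 200,129-record set gives the two data sets used below.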
2. Procedure of experiment
In order to prove the effectiveness of the algorithm, DNNs of different depths (3, 4, 5, 6 and 7 layers) are first constructed; secondly, different hyper-parameters are designed, such as the sample block size (BatchSize), the learning rate μ, and the number of neurons per layer; finally, the word-vector big data set samples are input for training and testing. To verify the stability of the system, 20 experiments are performed on each type of data.
3. Results and analysis of the experiments
(1) Different hyper-parameters are designed for each deep DNN, for example learning rates μ of 0.001, 0.01 and 0.1; the recognition rates obtained from 20 experiments on the class I big data set are shown in Table 1.
TABLE 1 recognition rates of deep DNN for class I big data sets obtained from 20 experiments based on different learning rates μ
As can be seen from Table 1, at learning rates of 0.001 and 0.01 the recognition rates are very high: the lowest reaches 0.9889 and the highest 0.9966, and the recognition rate increases with the number of training iterations and finally stabilizes. At a learning rate of 0.1 the recognition rate is very low, about 0.2783 on average, because the learning rate is too large: the gradient descends too fast during training and overshoots the optimum. In contrast, a small learning rate attains the global optimum or comes close to it. The curves of the high recognition rates are shown in FIG. 2.
(2) The recognition rates obtained from 20 experiments on the class II big data set with different learning rates are shown in Table 2.
TABLE 2 Recognition rates obtained in 20 experiments with deep DNN on class II big data sets based on different learning rates μ
It can be seen that at learning rates of 0.001 and 0.01 the recognition rates are very high: the lowest is 0.9877 and the highest 0.9990, and the recognition rate increases with training and finally stabilizes. At a learning rate of 0.1 the recognition rate, although it increases, is comparatively low, about 0.8311 on average; the curves of the high recognition rates are shown in FIG. 3.
(3) Different hyper-parameters are designed for each deep DNN, such as sample block sizes (BatchSize) of 50, 100 and 500; the recognition rates obtained from 20 experiments on the class I big data set are shown in Table 3.
TABLE 3 recognition rates obtained by deep DNN on the class I big data set in 20 experiments based on different BatchSize
As can be seen from Table 3, the recognition rate is very high for every BatchSize: the lowest is 0.9895 and the highest 0.9968, and the recognition rate increases with training and finally stabilizes. Of the three BatchSize settings, the middle value of 100 works best; the recognition-rate curves of the three settings are shown in FIG. 4.
(4) The recognition rates obtained from 20 experiments on the class II big data set with different BatchSize are shown in Table 4.
TABLE 4 recognition rates for deep DNN on class II big data sets in 20 experiments based on different BatchSize
As can be seen from Table 4, the recognition rate is also very high for every BatchSize: the lowest is 0.9877 and the highest 0.9991, and the recognition rate increases with training and finally stabilizes. The recognition effect of the three BatchSize settings is almost the same, with the middle value of 100 performing best; their recognition-rate curves are shown in FIG. 5.
(5) To further verify the corresponding characteristics of the system, a Dropout (Dp) layer, a BatchNormalization (Bn) layer and noise (N) layers are embedded in the DNN. The Dp layer improves generalization and prevents overfitting; the Bn layer keeps the data distribution consistent and prevents gradient dispersion, so the activation values are amplified; the N layer generates noise values that disturb the recognition result. Two embedded designs are therefore tested: embedding Dp and Bn layers, and embedding two N layers. The recognition rates from 20 experiments on the class I big data set are shown in Table 5.
TABLE 5 recognition rates obtained by 20 experiments with deep DNN on class I big data sets based on the addition of an embedding layer
As can be seen from Table 5, embedding the Dp and Bn layers yields a high recognition rate, 0.989335 on average, while embedding the N layers yields a lower one, only 0.721650 on average; this demonstrates the effect of each added embedding layer. The recognition rates remain essentially unchanged as training proceeds; their curves are shown in FIG. 6.
the results obtained with 20 more experiments on the class ii large dataset, again based on DNN with the embedded layer, are shown in table 6.
TABLE 6 recognition rates obtained by 20 experiments with deep DNN on class II big data sets based on the addition of an embedding layer
As can be seen from Table 6, embedding the Dp and Bn layers yields a high recognition rate, 0.996400 on average, while embedding the N layers yields a lower one, 0.831425 on average; this again demonstrates the effect of each added embedding layer. The recognition rates remain essentially unchanged as training proceeds; their curves are shown in FIG. 7.
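A rough sketch of the three embedded layer types as forward transforms, under the usual definitions (inverted dropout, per-feature batch normalization, additive Gaussian noise); the patent does not give their parameters, so the rates and scales here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, rate=0.5, train=True):
    """Inverted dropout: randomly zero activations and rescale the rest,
    improving generalization at training time; identity at test time."""
    if not train:
        return a
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

def batch_norm(a, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance
    per feature, keeping the data distribution consistent across layers."""
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

def noise(a, scale=1.0):
    """Additive Gaussian noise layer, used in the text to deliberately
    disturb the recognition result."""
    return a + rng.normal(0.0, scale, a.shape)
```

The Dp/Bn pair is the design that preserves the high recognition rate in Tables 5 and 6, while the N layers degrade it.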
Further, through experiments, the average recognition rate of each deep DNN for the class i large data set based on different learning rates was about 99.4385%, the variance was about 0.000001, and the standard deviation was about 0.001246, as shown in table 7.
TABLE 7 mean recognition rate, variance and standard deviation of each deep DNN pair class I big data set based on different learning rates μ
Similarly, the average recognition rate of each deep DNN for the class ii large dataset based on different learning rates was found to be about 99.7710%, the variance was found to be about 0.000005, and the standard deviation was found to be about 0.002329, as shown in table 8.
TABLE 8 mean recognition rate, variance and standard deviation of each deep DNN for class II big data set based on different learning rates μ
Through experiments, the average recognition rate of each deep DNN to the class i big data set based on different BatchSize is about 99.5655%, the variance is 0.000001, and the standard deviation is about 0.001129, as shown in table 9.
TABLE 9 mean discrimination, variance and standard deviation of each deep DNN against class I big data set based on different BatchSize
Similarly, the average recognition rate of each deep DNN for the class ii big data set based on different BatchSize was about 99.8200%, the variance was about 0.000001, and the standard deviation was about 0.001231, as shown in table 10.
TABLE 10 mean discrimination, variance and standard deviation of each deep DNN against class II big data set based on different BatchSize
Through experiments, the average recognition rate of each deep DNN for the class i big data set based on the embedding layer is about 98.9335%, the variance is about 0.000010, and the standard deviation is about 0.000226, as shown in table 11.
TABLE 11 mean recognition rate, variance and standard deviation of deep DNN pairs for class I big data set based on embedding layer
Similarly, the average recognition rate of each deep DNN for the class ii large dataset based on the embedded layer was found to be about 99.6400%, the variance was found to be about 0.000001, and the standard deviation was found to be about 0.000934, as shown in table 12.
TABLE 12 mean recognition rate, variance and standard deviation of each deep DNN pair of class II big data set based on the addition of the embedding layer
The bar graph of the mean recognition rate for the class I and II big data sets based on different learning rates μ is shown in FIG. 8, and the bar graph of the standard deviation in FIG. 9.
The mean bar graph of the recognition rate based on different BatchSize for the class I and II large data sets is shown in FIG. 10 and the standard deviation bar graph is shown in FIG. 11.
The mean bar graph of the recognition rate based on the embedded layer for the class i and ii large datasets is shown in fig. 12 and the standard deviation bar graph is shown in fig. 13.
In order to describe the recognition process of the system, the loss-function curve is obtained, as shown in FIG. 14. It can be seen that the loss error decreases continuously and stabilizes as training proceeds, consistent with the recognition rate, which increases continuously and stabilizes.
Similarly, the cosine-distance curve of the word-vector samples is obtained, as shown in FIG. 15. It can be seen that the cosine distance decreases continuously and stabilizes as training proceeds, reflecting that the correlation among the word-vector samples grows stronger and stronger, consistent with the continuously increasing recognition rate.
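The cosine distance tracked in FIG. 15 is simply one minus the cosine of the angle between two word vectors; a minimal helper:

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; small values mean the two word-vector
    samples are strongly correlated."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Identical directions give a distance of 0 and orthogonal directions a distance of 1, so the falling curve indicates increasingly correlated samples.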
Those details not described in this specification are within the knowledge of those skilled in the art. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described therein or substitute equivalents for some of their features; any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within its scope of protection.

Claims (5)

1. An XSS intrusion identification method based on semantic analysis and vectorization big data is characterized by comprising the following steps:
step 1: acquiring the data to be detected, and performing text cutting, cleaning, word segmentation, part-of-speech tagging and stop-word removal to obtain preprocessed data;
step 2: constructing a word vectorization model based on the continuous bag-of-words model CBOW, and mapping the preprocessed data into distributed word vectors with a word vectorization algorithm;
step 3: counting the word vectorization samples to obtain a positive sample data set and a negative sample data set, merging the two data sets into a word vectorization big data sample set, and randomly dividing the big data sample set into a training set and a test set at a ratio of 7:3;
step 4: inputting the samples of the word vectorization big data sample set into deep neural networks DNN of different depths for training, and determining the hyper-parameters of each deep neural network DNN;
step 5: collecting data of HTTP requests in real time, performing attack detection on the HTTP requests, and identifying intrusion attack behaviors.
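As an illustration of step 3's dataset assembly, the sketch below merges a positive and a negative set of word-vectorized samples and randomly splits them 7:3; the sample counts, labels and vector dimension are hypothetical placeholders, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(42)
# hypothetical word-vectorized samples: each row is one request's averaged word vector
pos = rng.normal(size=(70, 16))    # positive (XSS) samples, label 1
neg = rng.normal(size=(130, 16))   # negative (benign) samples, label 0
X = np.vstack([pos, neg])          # merged word-vectorization big data sample set
y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

perm = rng.permutation(len(X))     # random shuffle before splitting
cut = int(0.7 * len(X))            # 7:3 train/test ratio (step 3)
train_idx, test_idx = perm[:cut], perm[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(len(X_train), len(X_test))   # 140 60
```

The shuffle-then-cut split keeps the positive/negative mix random in both subsets, matching the claim's "randomly dividing" wording.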
2. The XSS intrusion identification method based on semantic analysis and vectorized big data according to claim 1, wherein the specific operation steps of the step 1 comprise:
step 11: traversing the data set, replacing every number with '0', and replacing the protocol prefixes http/, https/ and their variants with the uniform token 'http://';
step 12: performing word segmentation according to the html tags, the JavaScript function bodies, 'http://' and the parameter rules;
step 13: constructing a vocabulary based on the log documents, and then one-hot encoding the words to obtain the processed sample data.
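A minimal sketch of the normalization and segmentation of steps 11–12. The concrete regular expressions are one plausible reading of the claim's rules (digits collapsed to '0', protocol prefixes unified, splitting on html tags and parameter punctuation), not the patent's exact implementation.

```python
import re

def preprocess(payload: str) -> list[str]:
    """Normalize and tokenize one request string (claim 2, steps 11-12)."""
    s = payload.lower()
    s = re.sub(r"\d+", "0", s)              # step 11: every number -> '0'
    s = re.sub(r"https?:/+", "http://", s)  # step 11: unify http/https prefixes
    # step 12: split on html tags, 'http://', words, and parameter punctuation
    return re.findall(r"</?\w+|http://|\w+|[<>()=;'\"/]", s)

print(preprocess("<script>alert(123)</script>?id=HTTPS://evil.com"))
```

Step 13 would then map each distinct token of the resulting vocabulary to a one-hot index before the CBOW model of claim 3 is trained.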
3. The XSS intrusion recognition method according to claim 2, wherein the specific operation steps of the word vectorization algorithm in step 2 comprise:
S21: setting the maximum log-likelihood function of the word vectorization model as:

$$\mathcal{L} = \sum_{w \in C} \log p\big(w \mid Context(w)\big) \qquad (1)$$

wherein w is a word in the corpus C;
the conditional probability of w is first calculated using the Hierarchical Softmax method as follows:

$$p\big(w \mid Context(w)\big) = \prod_{j=2}^{l^w} p\big(d_j^w \mid \mathbf{x}_w, \theta_{j-1}^w\big)$$

wherein $p^w$ represents the path from the root node to w, $l^w$ represents the number of nodes on the path, $p_j^w$ ($j = 1, \ldots, l^w$) represents each node in the path, $d^w \in \{0,1\}^{l^w-1}$ represents the encoding of w, $d_j^w$ represents the code corresponding to the j-th node in the path, and $\theta_j^w$ represents the parameter vector corresponding to a non-leaf node on the path;
S22: deriving from formula (1) the update formula of the context word vectors:

$$v(\widetilde{w}) := v(\widetilde{w}) + \eta \sum_{j=2}^{l^w} \frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}_w}, \quad \widetilde{w} \in Context(w) \qquad (7)$$

wherein $v(\widetilde{w})$ represents a context word vector and $\eta$ is the learning rate;
S23: inputting the sample data into formula (7) to obtain the word vectors of the data.
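The CBOW update of claim 3 can be sketched as follows. To keep the example short, a plain softmax output stands in for the Hierarchical Softmax of the claim, but the context-vector update mirrors formula (7): each $v(\widetilde{w})$ moves along the gradient with respect to the averaged input $\mathbf{x}_w$. The toy corpus and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = [["http://", "0", "<script", "alert", "(", "0", ")"],
          ["get", "index", "html", "http://", "0"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D, eta = len(vocab), 8, 0.2
Vin = rng.normal(0, 0.1, (V, D))    # context (input) vectors v(w~)
Vout = rng.normal(0, 0.1, (V, D))   # output-side vectors

def prob(ctx, tgt):
    x = Vin[ctx].mean(axis=0)       # x_w: mean of the context vectors
    z = Vout @ x
    p = np.exp(z - z.max())
    return (p / p.sum())[tgt]

def train_pair(ctx, tgt):
    x = Vin[ctx].mean(axis=0)
    z = Vout @ x
    p = np.exp(z - z.max()); p /= p.sum()
    g = p.copy(); g[tgt] -= 1.0     # gradient of -log p(w|Context(w)) w.r.t. z
    dx = Vout.T @ g                 # gradient w.r.t. x_w
    Vout[:] -= eta * np.outer(g, x)
    Vin[ctx] -= eta * dx / len(ctx) # formula (7)-style update of each v(w~)

pairs = [([idx[c] for j, c in enumerate(s) if j != i and abs(j - i) <= 2], idx[w])
         for s in corpus for i, w in enumerate(s)]
p0 = prob(*pairs[0])
for _ in range(100):
    for ctx, tgt in pairs:
        train_pair(ctx, tgt)
print(prob(*pairs[0]) > p0)         # likelihood of the observed word rises
```

In practice a library implementation such as gensim's `Word2Vec` with `sg=0` (CBOW) and `hs=1` (hierarchical softmax) would be used instead of this hand-rolled loop.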
4. The XSS intrusion identification method based on semantic analysis and vectorized big data according to claim 3, wherein the specific operation steps of the step 4 comprise:
S41: defining the mean square error of the deep neural network DNN as:

$$J(W, b; x, y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2$$

S42: in order to find the optimal parameters, a gradient descent method is used to minimize this function; the partial derivative associated with each unit is defined as the "residual" of that unit and recorded as $\delta_i^{(l)}$;
the residual of an output-layer unit is:

$$\delta_i^{(n_l)} = -\big(y_i - a_i^{(n_l)}\big)\, f'\big(z_i^{(n_l)}\big)$$

S43: solving, for $l = n_l - 1, n_l - 2, \ldots, 2$, the residual of each unit of the respective layer; e.g. for layer $l = n_l - 1$ the residual solving formula of each unit is:

$$\delta_i^{(n_l-1)} = \left( \sum_{j} W_{ji}^{(n_l-1)} \delta_j^{(n_l)} \right) f'\big(z_i^{(n_l-1)}\big)$$

wherein W represents the weights, b represents the biases, (x, y) represents a training sample, $h_{W,b}(x)$ represents the final output, and $f(\cdot)$ represents the activation function;
S44: replacing $n_l - 1$ and $n_l$ in the above formula with $l$ and $l + 1$, the following can be obtained:

$$\delta_i^{(l)} = \left( \sum_{j} W_{ji}^{(l)} \delta_j^{(l+1)} \right) f'\big(z_i^{(l)}\big)$$

The residual of each unit can be solved by this formula, and the partial derivatives with respect to the weights and biases follow:

$$\frac{\partial J}{\partial W_{ij}^{(l)}} = a_j^{(l)} \delta_i^{(l+1)}, \qquad \frac{\partial J}{\partial b_i^{(l)}} = \delta_i^{(l+1)}$$

S45: the update of the weights is obtained accordingly:

$$W_{ij}^{(l)} := W_{ij}^{(l)} - \alpha \frac{\partial J}{\partial W_{ij}^{(l)}}$$

and the update of the bias terms is:

$$b_i^{(l)} := b_i^{(l)} - \alpha \frac{\partial J}{\partial b_i^{(l)}}$$

S46: comparing the network output with the big data sample values until the mean square error of the network training meets the requirement, thereby determining the hyper-parameters of the network.
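The residual recursion of steps S42–S44 can be checked numerically: the sketch below back-propagates the residuals $\delta^{(l)}$ through a tiny network and compares one analytic weight gradient against a central finite difference of the mean square error of S41. The layer widths, tanh activation and random data are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 5, 3, 2]                        # layer widths; last layer is n_l
W = [rng.normal(0, 0.5, (sizes[i + 1], sizes[i])) for i in range(3)]
b = [np.zeros(sizes[i + 1]) for i in range(3)]
f, fprime = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2
x, y = rng.normal(size=4), np.array([0.5, -0.5])

def forward(Ws):
    a, zs, acts = x, [], [x]
    for Wl, bl in zip(Ws, b):
        z = Wl @ a + bl
        zs.append(z)
        a = f(z)
        acts.append(a)
    return zs, acts

zs, acts = forward(W)
delta = -(y - acts[-1]) * fprime(zs[-1])    # S42: output-layer residual
deltas = [delta]
for l in (1, 2):                            # S43/S44: delta^(l) = (W^(l)T delta^(l+1)) f'(z^(l))
    delta = (W[-l].T @ delta) * fprime(zs[-l - 1])
    deltas.insert(0, delta)
gW = [np.outer(deltas[l], acts[l]) for l in range(3)]  # dJ/dW^(l) = delta^(l+1) a^(l)T

def J(Ws):                                  # S41: mean square error
    _, a2 = forward(Ws)
    return 0.5 * np.sum((a2[-1] - y) ** 2)

eps = 1e-6                                  # central-difference check of one entry
Wp, Wm = [w.copy() for w in W], [w.copy() for w in W]
Wp[1][0, 0] += eps; Wm[1][0, 0] -= eps
num = (J(Wp) - J(Wm)) / (2 * eps)
print(abs(num - gW[1][0, 0]) < 1e-6)        # analytic and numeric gradients agree
```

Agreement of the two gradients confirms that the residual formulas of S43–S44 yield the partial derivatives that the S45 gradient-descent updates consume.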
5. The XSS intrusion identification method according to claim 4, wherein the hyper-parameters comprise the batch size, the learning rate, and the number of neurons in each layer.
CN202011567690.0A 2020-12-25 2020-12-25 XSS intrusion identification method based on semantic analysis and vectorization big data Pending CN112580050A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011567690.0A CN112580050A (en) 2020-12-25 2020-12-25 XSS intrusion identification method based on semantic analysis and vectorization big data


Publications (1)

Publication Number Publication Date
CN112580050A true CN112580050A (en) 2021-03-30

Family

ID=75139887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011567690.0A Pending CN112580050A (en) 2020-12-25 2020-12-25 XSS intrusion identification method based on semantic analysis and vectorization big data

Country Status (1)

Country Link
CN (1) CN112580050A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312891A (en) * 2021-04-22 2021-08-27 北京墨云科技有限公司 Automatic payload generation method, device and system based on generative model
CN113536678A (en) * 2021-07-19 2021-10-22 中国人民解放军国防科技大学 XSS risk analysis method and device based on Bayesian network and STRIDE model
CN114169432A (en) * 2021-12-06 2022-03-11 南京墨网云瑞科技有限公司 Cross-site scripting attack identification method based on deep learning
CN114844696A (en) * 2022-04-28 2022-08-02 西安交通大学 Network intrusion dynamic monitoring method, system, equipment and readable storage medium based on risk pool minimization
CN116186698A (en) * 2022-12-16 2023-05-30 广东技术师范大学 Machine learning-based secure data processing method, medium and equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG HAIJUN et al.: "Intelligent Detection of Big Data Script Attacks Based on Image-like Processing and Vectorization", Computer Engineering *
ZHANG HAIJUN et al.: "Research on Image-like Processing for Intelligent Detection of Big Data XSS Intrusions", Computer Applications and Software *
ZHANG HAIJUN et al.: "Intelligent Detection of Cross-site Scripting Attacks in Big Data via Semantic Analysis and Vectorization", Journal of Shandong University (Engineering Science) *


Similar Documents

Publication Publication Date Title
CN112580050A (en) XSS intrusion identification method based on semantic analysis and vectorization big data
Wang et al. HAST-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection
Paula et al. Deep learning anomaly detection as support fraud investigation in brazilian exports and anti-money laundering
Li et al. A hybrid malicious code detection method based on deep learning
Fan et al. Robust deep auto-encoding Gaussian process regression for unsupervised anomaly detection
CN113596007B (en) Vulnerability attack detection method and device based on deep learning
US11699160B2 (en) Method, use thereof, computer program product and system for fraud detection
CN112887325B (en) Telecommunication network fraud crime fraud identification method based on network flow
Guo et al. Self-trained prediction model and novel anomaly score mechanism for video anomaly detection
CN112784010A (en) Chinese sentence similarity calculation method based on multi-model nonlinear fusion
Li et al. Fast similarity search via optimal sparse lifting
Khoshraftar et al. Dynamic graph embedding via lstm history tracking
Yin et al. Intrusion detection for capsule networks based on dual routing mechanism
CN112671703A (en) Cross-site scripting attack detection method based on improved fastText
Çetin et al. A comprehensive review on data preprocessing techniques in data analysis
Omar et al. Text-defend: Detecting adversarial examples using local outlier factor
Wang et al. An improved deep learning based intrusion detection method
CN115982722B (en) Vulnerability classification detection method based on decision tree
CN110245666B (en) Multi-target interval value fuzzy clustering image segmentation method based on dual-membership-degree driving
CN111107082A (en) Immune intrusion detection method based on deep belief network
Panday et al. A metaheuristic autoencoder deep learning model for intrusion detector system
CN115187266B (en) Credit card fraud detection method and system based on memory variation self-coding model
CN115242539B (en) Network attack detection method and device for power grid information system based on feature fusion
Li et al. An LSTM based cross-site scripting attack detection scheme for Cloud Computing environments
CN112651422A (en) Time-space sensing network flow abnormal behavior detection method and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210330