CN116527357A - Web attack detection method based on a gated Transformer - Google Patents

Web attack detection method based on a gated Transformer

Info

Publication number
CN116527357A
CN116527357A (application number CN202310460958.8A)
Authority
CN
China
Prior art keywords
word
embedding
module
information
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310460958.8A
Other languages
Chinese (zh)
Inventor
鲍张军
易秀双
王宇
张晓燕
于北溟
Original Assignee
东北大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东北大学 filed Critical 东北大学
Priority to CN202310460958.8A priority Critical patent/CN116527357A/en
Publication of CN116527357A publication Critical patent/CN116527357A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F18/00 Pattern recognition
            • G06F18/20 Analysing
              • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
              • G06F18/24 Classification techniques
                • G06F18/243 Classification techniques relating to the number of classes
                  • G06F18/2431 Multiple classes
              • G06F18/25 Fusion techniques
                • G06F18/253 Fusion techniques of extracted features
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/04 Architecture, e.g. interconnection topology
                • G06N3/044 Recurrent networks, e.g. Hopfield networks
                  • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
                • G06N3/045 Combinations of networks
                  • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
                • G06N3/0464 Convolutional networks [CNN, ConvNet]
                • G06N3/048 Activation functions
              • G06N3/08 Learning methods
                • G06N3/09 Supervised learning
    • H ELECTRICITY
      • H04 ELECTRIC COMMUNICATION TECHNIQUE
        • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
          • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
            • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks, using machine learning or artificial intelligence
          • H04L63/00 Network architectures or network communication protocols for network security
            • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
              • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
          • H04L67/00 Network arrangements or protocols for supporting network services or applications
            • H04L67/01 Protocols
              • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D30/00 Reducing energy consumption in communication networks
            • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a Web attack detection method based on a gated Transformer, and relates to the technical field of network maintenance. The method proposes a network model based on a gated Transformer that combines a Transformer with a gated convolution module: the Transformer extracts global semantic information across different spatial dimensions through a multi-head self-attention mechanism, while the gated convolution extracts local spatial information through one-dimensional convolution kernels and screens and filters the text information through a gating mechanism. The invention can effectively extract multi-dimensional global features and local features, and the mixed word-vector table contains more accurate and richer semantic information; the method automatically extracts the features of the effective data in the text sequence, without manual information screening or vocabulary replacement; the accuracy of multi-class attack detection is further improved, the false-alarm rate is reduced, and the security of the Web server system can be fully protected.

Description

Web attack detection method based on a gated Transformer
Technical Field
The invention relates to the technical field of network maintenance, in particular to a Web attack detection method based on a gated Transformer.
Background
With the rapid development of science and technology, Web applications have become the primary channel for acquiring information on devices such as computers and mobile phones, and the browser is one of the most commonly used applications across almost every industry. This convenience, however, brings various security problems. Attack methods against Web applications are continuously evolving; once an attack succeeds, it can directly threaten users' daily applications, even cause privacy problems such as information disclosure, and lead to serious losses if no protection is in place. Common injection-type attacks include SQL injection, XSS attacks, and command injection attacks.
With the continuous development of machine learning and big-data analysis, deep learning has gradually been applied to the field of attack detection, but existing methods have many shortcomings. Deep learning models such as CNN are strong at extracting local sequence features but weak at perceiving global information and modelling text sequences; models such as LSTM and RNN perform poorly on long-distance global dependencies, while the text of HTTP messages is long and complex. In attack detection the model must output results in as short a time as possible, and the Transformer model breaks the limitation that models such as LSTM and RNN cannot be computed in parallel, so results can be obtained more quickly. Long HTTP messages also contain much content that carries no information, such as symbols and numbers; how to automatically screen and extract key effective information from this complex input while ignoring irrelevant information is another key problem for further improving the accuracy of attack detection.
The Chinese patent CN113691542A provides a Web attack detection method based on HTTP request text and related equipment; that method reduces the dictionary space by replacement with an expert dictionary and a special dictionary table, and finally performs multi-category attack detection classification. The multi-head-attention BiLSTM model in that patent performs poorly on long-distance global dependencies and cannot break the limitation on parallel computation. The text of an HTTP message is long and complex and contains much invalid information such as symbols and numbers; methods that filter and screen invalid information through word-list replacement, regular matching and the like depend on expert rule bases and replacement word lists, and cannot truly extract, screen, and filter effective information at the semantic level.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a Web attack detection method based on a gated Transformer, so that the security of the Web server system is effectively protected.
In order to solve the technical problems, the invention adopts the following technical scheme:
A Web attack detection method based on a gated Transformer comprises the following steps:
step 1: traffic collection is carried out through the sniff module of the python scapy library; a pcap traffic file is collected and the application layer data is extracted from it;
step 2: URL decoding is carried out on the message text, and the text information of the URL, the parameter list, and the user-agent, cookie and referer fields is segmented by predefined special characters;
step 3: a mixed word embedding module enhances the robustness of the vector representation by fusing word embedding tables generated in two different ways; the two word embedding tables of the mixed word embedding module are a Word2Vec word embedding table based on the continuous bag-of-words model CBOW and a word embedding table based on an Embedding layer; the Embedding-layer word embedding table is initialized with the Xavier_uniform distribution, and its distributed word-vector representation is continuously updated during training; the Word2Vec word embedding table based on the continuous bag-of-words model CBOW must be generated before the model enters the training stage; the Embedding-layer word embedding table and the CBOW-based Word2Vec word embedding table map word vectors into different discrete spaces, giving a distributed vector representation of each word;
step 4: the text information of the HTTP message, after being processed by the mixed word embedding module, is converted into a series of word vectors; the series of word vectors is input into a Transformer Encoder model for global attention feature extraction; the Transformer Encoder model includes three parts: a position encoding module, a multi-head self-attention module, and a residual connection and layer normalization module; the Transformer Encoder model first adds position encoding information to the vectors processed by the mixed word embedding module, then extracts multi-dimensional sequence features through the multi-head self-attention module, and finally feeds them into the feed-forward neural network module;
step 5: the output of the Transformer Encoder model is input into a gated convolution model, which extracts local features of the data within the local receptive field; non-key data are dynamically screened and filtered by the gated convolution model;
step 6: the output result of the gated convolution model is classified by a final classifier module; a softmax function converts it into a probability distribution, each dimension of which corresponds to an attack category, and the attack category corresponding to the index of the maximum probability in the distribution is the final attack detection classification result.
Further, the specific method of step 1 is as follows:
step 1.1: the sniff network-card sniffing module of the python scapy library is started, traffic is collected from the network and stored as a pcap file;
step 1.2: the collected pcap file is read and parsed through the rd_pcap module of the python scapy library, the application layer data is extracted from it and the data text is analysed; the text information includes the URL (uniform resource locator), the parameter list, and the user-agent, cookie and referer fields;
further, in the word segmentation process of step 2, since the number of words of the input sequence that the Transformer Encoder model can process is limited, there is an upper limit on the number of words after text segmentation; this problem is solved by setting a maximum sentence length: words beyond the maximum sentence length are removed, and if the number of words in a sentence is less than the maximum length, it is filled by padding.
Further, the specific method of the step 3 is as follows:
step 3.1: one-hot encode an input HTTP text word as X ∈ R^V;
step 3.2: multiply the one-hot code X of each word by the input weight matrix W ∈ R^(V×N); the input weight matrix W is shared by all input words; the resulting vectors are summed and averaged to obtain the hidden layer vector H ∈ R^N;
step 3.3: multiply the hidden layer vector by the output weight matrix W' ∈ R^(V×N) to obtain an output vector, which is converted into a probability distribution by a softmax activation function; the index position of the maximum probability is the predicted central word; in the training stage, a cross-entropy loss function is used for model training and the Word2Vec model is iteratively updated;
step 3.4: multiply each input word by the shared input weight matrix W to obtain the word embedding vector of that word, and take the matrix W as the word embedding table T_a of the mixed word embedding module;
step 3.5: initialize the Embedding-layer-based word embedding table T_b with the Xavier_uniform uniform distribution; the word embedding table T_a is trained before the gated Transformer model is trained, while T_b is iteratively updated along with the training process of the gated Transformer model; the final mixed word embedding table T_f is generated by average pooling the two word embedding tables T_a and T_b, as shown in formula (1);
T_f = (T_a + T_b)/2 (1).
further, the specific method in the step 4 is as follows:
step 4.1: the text of the HTTP data is processed by the mixed word embedding module of step 3 before being input into the Transformer Encoder model, converting the text words into the distributed numerical vector representation X_embedding;
step 4.2: since the word-order position information is not considered when a series of word vectors is input, position encoding information is periodically added to the text words with sine and cosine functions; the position encoding generated for each word at each position is fused into the original text word, and the word vector X_embedding-pe fused with position encoding information is generated as shown in formulas (2) and (3);
X_embedding-pe = X_embedding + X_pos (2)
X_pos(pos, 2i) = sin(pos / 10000^(2i/d_emb)), X_pos(pos, 2i+1) = cos(pos / 10000^(2i/d_emb)) (3)
where pos represents the word-order position of a word in the text, and its value ranges over the integers from 0 to the maximum sequence length; so that the position encoding can be added, the position encoding vector X_pos generated for a word has the same dimension as the word vector: both the word vector X_embedding produced by the mixed word embedding module and the position encoding vector X_pos have dimension d_emb; 2i+1 and 2i denote the odd and even positions in the word vector X_embedding and the position encoding vector X_pos respectively, so i ranges over [0, d_emb/2), where d_emb represents the dimension of the word vector X_embedding;
step 4.3: the multi-head self-attention module extracts global sequence features from multiple dimensions of the text; the dimension of the output result is the same as that of the input data, and every word in the text is fused with the global features; from the word vector X_embedding-pe, three pieces of key information are generated through three different linear mapping matrices W^Q, W^K, W^V, namely the information to be queried (Q), the keys of the words (K) and the values of the words (V), as shown in formula (4);
Q = X_embedding-pe · W^Q, K = X_embedding-pe · W^K, V = X_embedding-pe · W^V (4)
when global attention feature extraction is carried out, the attention score is computed from the query information Q and the K corresponding to each word in the sentence; computing the attention score is essentially computing the correlation coefficients between words, and V is then weighted and summed with the inter-word attention scores as weights; this process is the principle of the self-attention mechanism; the attention score is computed with the scaled dot product shown in formula (5);
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V (5)
where the denominator sqrt(d_k) prevents the dot product from becoming too large, which would otherwise lead to overly extreme values after the softmax function; the subscript k denotes the dimension of the Q, K, V matrices;
the multi-head self-attention mechanism performs the self-attention computation on the word vector X_embedding-pe from different subspaces of multiple dimensions; when the self-attention computation is to be performed from h subspaces of different dimensions, each linear mapping matrix is split into h blocks, and the h blocks of split linear mapping matrices correspond to the self-attention computations of the h different subspaces, head_s = Attention(Q_s, K_s, V_s), where Q_s, K_s and V_s are obtained from the s-th blocks of the split linear mapping matrices, s represents the attention of one subspace, and s ∈ [1, h];
The output of the multi-head self-attention module extracts the HTTP text words from the h different-dimension subspaces with the global attention mechanism and concatenates the self-attention outputs head_s of the h different heads; the multi-head self-attention output X_multihead is computed as shown in formula (6), where s denotes the attention mechanism of a certain subspace;
X_multihead = Concat(head_1, …, head_h) (6)
step 4.4: residual connection and layer normalization module;
residual connection adds the position-encoded word embedding vector X_embedding-pe from before the multi-head self-attention module to the output result of the multi-head self-attention module; the output data of the multi-head self-attention module are then standardized with the layer normalization (LayerNorm) method; the formula for applying the residual connection and layer normalization module to the multi-head attention output X_multihead is shown in (7), and the result X_multihead-rn obtained after the output X_multihead of the multi-head self-attention module is processed by residual connection and layer normalization is the final output of the residual connection and layer normalization module;
X_multihead-rn = LayerNorm(X_embedding-pe + X_multihead) (7)
step 4.5: the output result X_multihead-rn of the residual connection and layer normalization module is further processed through a fully connected neural network to extract richer semantic information; finally, the output of the Transformer Encoder model is X_encoded, and the neural network computation is shown in formula (8);
X_encoded = Linear(Relu(Linear(X_multihead-rn))) (8)
where Relu is a nonlinear activation function.
Further, the specific method in the step 5 is as follows:
step 5.1: the output X_encoded of the Transformer Encoder model after global sequence feature extraction is input into the gated convolution module for information filtering and screening; the gated convolution module contains c one-dimensional convolution kernels Kernel_j (j ∈ [1, c]) of different scales; the computation of a single convolution kernel is shown in formula (9);
g_j = Relu(Conv(Kernel_j, X_encoded) + b_j) (9)
where g_j is the output of a single convolution block; Relu is a nonlinear activation function; Conv denotes the convolution operation; b_j is the bias corresponding to the convolution kernel Kernel_j;
step 5.2: the output results of the convolutions at different scales are concatenated, and the multi-scale gated convolution values are mapped into the range 0-1 by a Sigmoid activation function to give the gating value Gatesv, computed as shown in formula (10); the value of Gatesv lies between 0 and 1; a gating value close to 0 represents information that is almost unimportant and is filtered out and ignored; a gating value close to 1 represents data that is key information and is fully retained;
Gatesv = Sigmoid(Concat(g_1, …, g_c)) (10)
the output X_encoded of the Transformer Encoder model is multiplied element-wise by the gating value Gatesv, which completes the filtering and screening of the encoded information X_encoded and yields the output result X_gated of the gated convolution module; the information filtering method is shown in formula (11);
X_gated = X_encoded ⊙ Gatesv, Gatesv ∈ [0, 1] (11)
where the symbol ⊙ denotes element-wise multiplication;
further, the specific method in the step 6 is as follows:
the output result X_gated of the gated convolution module is input into a Classifier formed by a two-layer fully connected network; the output of the classifier network is converted into a probability distribution by a softmax function, each dimension of which corresponds to an attack category; the attack category corresponding to the index of the maximum probability in the probability distribution is the final attack detection classification result X_pred; the attack detection classification process is shown in formula (12);
X_pred = argmax(Softmax(Classifier(X_gated))) (12).
The beneficial effects of the above technical scheme are as follows: the invention provides a Web attack detection method based on a gated Transformer, with a network model that can effectively extract multi-dimensional global features and local features; the word embedding method is improved so that the mixed word-vector table contains more accurate and richer semantic information; the method automatically extracts the features of the effective data in the text sequence, without manual information screening or vocabulary replacement; the accuracy of multi-class attack detection is further improved, the false-alarm rate is reduced, and the security of the Web server system can be fully protected.
Drawings
FIG. 1 is a diagram of an overall architecture of a network model according to an embodiment of the present invention;
FIG. 2 is a diagram of a mixed word embedding table model structure provided by an embodiment of the present invention;
FIG. 3 is a diagram of a gating convolutional network model according to an embodiment of the present invention;
FIG. 4 is a diagram of a simulation architecture of a network attack data set according to an embodiment of the present invention;
fig. 5 is a diagram of an example of network attack load provided in an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
As shown in fig. 1, the overall network structure used by the method of this embodiment is presented; the method of this embodiment is as follows.
Step 1: Traffic collection is performed through the sniff module of the python scapy library; a pcap traffic file is collected and the application layer data is extracted from it. The specific method comprises the following steps:
Step 1.1: the sniff network-card sniffing module of the scapy library is started, the monitored network card iface and the processing function write_cap are set, and traffic is collected from the network and stored as a pcap file.
Step 1.2: the collected pcap file is read and parsed with the rd_pcap module of the python scapy library, and the application layer data is extracted from it. The application-layer text to be analysed includes the URL, the parameter list, and the user-agent, cookie and referer fields. Since injection attacks may also appear in the user-agent, cookie and referer fields, the request parameter information of these fields likewise cannot be ignored.
Step 2: preprocessing the text of the message, and segmenting the URL, the parameter list and the user-agent, cookie, referer field by predefined special characters.
URL decoding is performed on the message text, and special characters such as "/", "&", "=", ":", "?", "+", "-", "<", ">", "%", "(", ")", "_" are used to segment the URL, the parameter list, and the user-agent, cookie and referer fields into words. The limit on the number of words must be considered during segmentation, because the Transformer Encoder model can only process a limited number of words in the input sequence. This problem is solved by setting a maximum sentence length: words beyond the maximum sentence length are removed, and if the number of words in a sentence is less than the maximum length, it is filled by padding. In this embodiment the maximum length is set to 60 and the padding value is 0.
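A minimal sketch of this preprocessing step is given below, assuming urllib and re for decoding and splitting; the exact delimiter set is an assumption based on the characters listed above.

```python
import re
from urllib.parse import unquote

SPECIAL_CHARS = r"""[/&=:?+\-<>%()_'"\s]+"""    # assumed delimiter set
MAX_LEN, PAD = 60, "0"                          # maximum length 60, padding value 0

def tokenize(http_text: str) -> list[str]:
    decoded = unquote(http_text)                              # URL decoding
    words = [w for w in re.split(SPECIAL_CHARS, decoded) if w]
    words = words[:MAX_LEN]                                   # drop words beyond the maximum length
    return words + [PAD] * (MAX_LEN - len(words))             # pad short sentences

tokens = tokenize("GET /search.php?q=%27%20OR%201%3D1--&page=2 HTTP/1.1")
```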
Step 3: The mixed word embedding module enhances the robustness of the vector representation by fusing two word embedding tables generated in different ways. The two word embedding tables of the mixed word embedding module are a Word2Vec word embedding table based on the continuous bag-of-words model (CBOW) and a word embedding table based on an Embedding layer. The Embedding-layer word embedding table is initialized with the Xavier_uniform distribution and its distributed word-vector representation is continuously updated during training, while the Word2Vec word embedding table must be generated before the model enters the training stage. The two tables map word vectors into different discrete spaces to give a distributed vector representation of each word. Compared with traditional one-hot encoding, this word embedding approach has a lower dimensionality and carries more semantic information. The mixed word embedding table learns the semantic relations between HTTP text keywords more effectively and contains richer semantic information, as shown in fig. 2. The specific method is as follows:
Step 3.1: one-hot encode an input HTTP text word as X ∈ R^V;
Step 3.2: multiply the one-hot code X of each word by the input weight matrix W ∈ R^(V×N); the input weight matrix W is shared by all input words; the resulting vectors are summed and averaged to obtain the hidden layer vector H ∈ R^N;
Step 3.3: multiply the hidden layer vector by the output weight matrix W' ∈ R^(V×N) to obtain an output vector, which is converted into a probability distribution by a softmax activation function; the index position of the maximum probability is the predicted central word. In the training stage, a cross-entropy loss function is used for model training and the Word2Vec model is iteratively updated;
Step 3.4: each input word is multiplied by the shared input weight matrix W to obtain its word embedding vector, and the matrix W is used as the word embedding table T_a of the mixed word embedding module. In this embodiment, the vocabulary T_a contains 2000 words, the word vector length is 300, min_count = 10 and window = 3;
Step 3.5: the Embedding-layer-based word embedding table T_b is initialized with the Xavier_uniform uniform distribution, and T_b is iteratively updated along with the training process of the gated Transformer model, while the word embedding table T_a is trained before the gated Transformer model is trained. The final mixed word embedding table T_f is generated by average pooling the two word embedding tables T_a and T_b, as shown in formula (1). In this embodiment the Embedding-layer word embedding table also maps words into 300-dimensional vector representations, so the resulting word vector dimension is the same as that of T_a.
T_f = (T_a + T_b)/2 (1)
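A hedged sketch of the mixed word embedding table follows, assuming gensim for the CBOW Word2Vec table T_a and a PyTorch Embedding layer for T_b; it reuses the tokenize helper and http_texts from the earlier sketches and follows the embodiment's hyper-parameters (vector length 300, min_count = 10, window = 3).

```python
import torch
import torch.nn as nn
from gensim.models import Word2Vec

corpus = [tokenize(t) for t in http_texts]     # tokenized HTTP texts from step 2

# T_a: CBOW Word2Vec table (sg=0), trained before the gated Transformer model.
w2v = Word2Vec(sentences=corpus, vector_size=300, window=3, min_count=10, sg=0)
vocab = {w: i for i, w in enumerate(w2v.wv.index_to_key)}
t_a = torch.tensor(w2v.wv.vectors)             # (vocab_size, 300), kept fixed here

# T_b: Embedding-layer table, Xavier-uniform initialised, updated during training.
t_b = nn.Embedding(len(vocab), 300)
nn.init.xavier_uniform_(t_b.weight)

def mixed_embedding(token_ids: torch.Tensor) -> torch.Tensor:
    """Formula (1): T_f = (T_a + T_b) / 2, average pooling of the two tables."""
    return (t_a[token_ids] + t_b(token_ids)) / 2
```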
Step 4: The text information of the HTTP message, after being processed by the mixed word embedding module, is converted into a series of word vectors; the series of word vectors is input into the Transformer Encoder model for global attention feature extraction. The Transformer Encoder model includes three parts: a position encoding module, a multi-head self-attention module, and a residual connection and layer normalization module. The Transformer Encoder model first adds position encoding information to the vectors processed by the mixed word embedding module, then extracts multi-dimensional sequence features through the multi-head self-attention module, and finally feeds them into the feed-forward neural network module. In this embodiment, the Transformer Encoder model has two layers. The specific steps are as follows:
Step 4.1: before the text of the HTTP data is input into the Transformer Encoder model, the text words are converted into the distributed numerical vector X_embedding by the mixed word embedding module. In this embodiment X_embedding is a 300-dimensional vector;
Step 4.2: since the word-order position information is not considered when a series of word vectors is input, position encoding information is periodically added to the text words with sine and cosine functions; the position encoding generated for each word at each position is fused into the original text word, and the word vector X_embedding-pe fused with position encoding information is generated as shown in formulas (2) and (3). Here pos represents the word-order position of a word in the text, and its value ranges over the integers from 0 to the maximum sequence length. So that the position encoding can be added, the position encoding vector X_pos generated for a word has the same dimension as the word vector: both the word vector X_embedding produced by the mixed word embedding module and the position encoding X_pos have dimension d_emb; 2i+1 and 2i denote the odd and even positions in the word vector X_embedding and the position encoding vector X_pos respectively, so i ranges over [0, d_emb/2), where d_emb represents the dimension of the word vector X_embedding;
X_embedding-pe = X_embedding + X_pos (2)
X_pos(pos, 2i) = sin(pos / 10000^(2i/d_emb)), X_pos(pos, 2i+1) = cos(pos / 10000^(2i/d_emb)) (3)
step 4.3: The multi-head self-attention module extracts global sequence features from multiple dimensions; the output result has the same dimension as the input data, but every word in the text is fused with the global features. The attention computation follows the search-query idea: from the word vector X_embedding-pe, three pieces of key information are generated through three different linear mapping matrices W^Q, W^K, W^V, namely the information to be queried (Q), the keys of the words (K) and the values of the words (V), as shown in formula (4).
Q = X_embedding-pe · W^Q, K = X_embedding-pe · W^K, V = X_embedding-pe · W^V (4)
When global attention feature extraction is performed, the attention score is computed from the query information Q and the K corresponding to each word in the sentence; computing the attention score is essentially computing the correlation coefficients between words, and then V is weighted and summed with the inter-word attention scores as weights; this process is the principle of the self-attention mechanism. The attention score is computed with the scaled dot product shown in formula (5), in which the denominator sqrt(d_k) prevents the dot product from becoming too large, which would otherwise lead to overly extreme values after the softmax function; the subscript k denotes the dimension of the Q, K, V matrices.
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V (5)
The multi-head self-attention mechanism performs the self-attention computation on the word vector X_embedding-pe from different subspaces of multiple dimensions. When the self-attention computation is to be performed from h subspaces of different dimensions, each linear mapping matrix is split into h blocks, and the h blocks of split linear mapping matrices correspond to the self-attention computations of the h different subspaces, head_s = Attention(Q_s, K_s, V_s), where Q_s, K_s and V_s are obtained from the s-th blocks of the split linear mapping matrices, s represents the attention of one subspace, and s ∈ [1, h];
The output of the multi-head self-attention module extracts the HTTP text words from the h different-dimension subspaces with the global attention mechanism and concatenates the self-attention outputs head_s of the h different heads; the multi-head self-attention output X_multihead is computed as shown in formula (6), where s denotes the attention mechanism of a certain subspace. In this embodiment, the number of attention heads is h = 6.
X_multihead = Concat(head_1, …, head_h) (6)
Step 4.4: residual connection and layer normalization (LayerNorm) module;
residual connection adds the position-encoded word embedding vector X_embedding-pe from before the multi-head self-attention module to the output result X_multihead of the multi-head self-attention module; the residual connection lets the gradient pass through directly during back-propagation, which avoids the gradient instability caused by an overly deep model hierarchy. To speed up model convergence and prevent vanishing and exploding gradients, the data are normalized with the layer normalization (LayerNorm) method; the formula for applying the residual connection and layer normalization module to the multi-head attention output X_multihead is shown in (7), and the result X_multihead-rn obtained after the output X_multihead of the multi-head self-attention module is processed by residual connection and layer normalization is the final output of the residual connection and layer normalization module.
X_multihead-rn = LayerNorm(X_embedding-pe + X_multihead) (7)
Step 4.5: finally, the output result X_multihead-rn of the residual connection and layer normalization module is further processed through the fully connected neural network Linear to extract richer semantic information. The final output of the Transformer Encoder is X_encoded, and the neural network computation is shown in formula (8), where Relu is a nonlinear activation function.
X_encoded = Linear(Relu(Linear(X_multihead-rn))) (8)
Step 5: The HTTP text processed by the mixed word embedding module undergoes global attention feature extraction through the Transformer Encoder model, and the output of the Transformer Encoder model is then input into the gated convolution model, which extracts local features of the data within the local receptive field; non-key information is dynamically screened and filtered by the gated convolution model, as shown in fig. 3. This model effectively solves the problem that traditional convolution treats all dimensions of the input data as effective data; it can automatically filter the data and further improve the attack detection accuracy. The specific steps are as follows:
step 5.1: the output of the Transformer Encoder model after global sequence feature extraction is X_encoded, and X_encoded is then input into the gated convolution module for information filtering and screening; the gated convolution module contains c one-dimensional convolution kernels Kernel_j (j ∈ [1, c]) of different scales; the computation of a single convolution kernel is shown in formula (9), where g_j is the output of a single convolution block, Relu is a nonlinear activation function, Conv denotes the convolution operation, and b_j is the bias corresponding to the convolution kernel Kernel_j. In this embodiment, convolution kernels of three different scales, 10, 15 and 25, are used.
g_j = Relu(Conv(Kernel_j, X_encoded) + b_j) (9)
Step 5.2: the output results g_j of the convolutions at different scales are concatenated, and the multi-scale gated convolution values are mapped into the range 0-1 by a Sigmoid activation function to give the gating value Gatesv, computed as shown in formula (10); the value of Gatesv lies between 0 and 1. A gating value close to 0 represents information that is almost unimportant and is filtered out and ignored; a value close to 1 represents data that is key information and is fully retained. The output X_encoded of the Transformer Encoder model is multiplied element-wise by the gating value Gatesv, which completes the filtering and screening of the encoded information X_encoded and yields the output result X_gated of the gated convolution module; the information filtering method is shown in formula (11), where the symbol ⊙ denotes element-wise multiplication.
Gatesv = Sigmoid(Concat(g_1, …, g_c)) (10)
X_gated = X_encoded ⊙ Gatesv, Gatesv ∈ [0, 1] (11)
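A focused sketch of the gated convolution module of formulas (9)-(11) follows, using the three kernel scales 10, 15 and 25 from this embodiment; the projection layer that maps the concatenated multi-scale features back to the model dimension is an assumption about how the concatenation and the element-wise gate are made to match in shape.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    def __init__(self, d_model=300, kernel_sizes=(10, 15, 25)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_model, d_model, k, padding="same") for k in kernel_sizes])
        self.proj = nn.Linear(d_model * len(kernel_sizes), d_model)

    def forward(self, x_encoded):                      # (batch, seq_len, d_model)
        c = x_encoded.transpose(1, 2)                  # Conv1d expects (batch, channels, seq)
        g = [torch.relu(conv(c)) for conv in self.convs]   # formula (9), one scale each
        g = torch.cat(g, dim=1).transpose(1, 2)        # concatenate the multi-scale features
        gatesv = torch.sigmoid(self.proj(g))           # formula (10): gate values in [0, 1]
        return x_encoded * gatesv                      # formula (11): element-wise filtering

x_gated = GatedConv()(torch.randn(8, 60, 300))         # sequence shape is preserved
```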
Step 6: The output result X_gated of the gated convolution module is input into a Classifier composed of the two fully connected layers Linear1 and Linear2, which first increase and then decrease the data dimension; raising the dimension combines the various features, and lowering it fuses the combined features. Finally, the output of the fully connected layers is converted into a probability distribution by a softmax function, each dimension of which corresponds to an attack category; the attack category corresponding to the index of the maximum probability in the probability distribution is the final attack detection classification result X_pred. The attack detection classification prediction method is shown in formula (12). In this embodiment, the attack classification task has 10 classes.
X_pred = argmax(Softmax(Linear2(Linear1(X_gated)))) (12)
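A sketch of the step-6 classifier of formula (12) is given below; the hidden width of Linear1 and the mean-pooling over the sequence before classification are assumptions, since the patent does not state how the sequence dimension is reduced.

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(300, 600),   # Linear1: raise the dimension to combine features
    nn.ReLU(),
    nn.Linear(600, 10),    # Linear2: reduce the dimension and fuse into 10 class scores
)

x_gated = torch.randn(8, 60, 300)               # output of the gated convolution module
logits = classifier(x_gated.mean(dim=1))        # pool over the sequence, then classify
probs = torch.softmax(logits, dim=-1)           # probability distribution over attack classes
x_pred = torch.argmax(probs, dim=-1)            # index of the most probable attack class
```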
The network model based on the gated Transformer combines a Transformer with a gated convolution module: the Transformer extracts global semantic information of different spatial dimensions through a multi-head self-attention mechanism, the gated convolution extracts local spatial information through one-dimensional convolution kernels, and a gating mechanism is adopted to screen and filter the text information. The model mainly has the following advantages:
(1) The following advantages are obtained by using a Transformer model and improving it: the word embedding layer of the Transformer is initialized in two ways and average-pooled, namely CBOW-based word2vec and Xavier_uniform initialization, so that the word vectors can be trained more fully and robustness is improved; the multi-head self-attention mechanism can effectively extract multi-dimensional global sequence features, has no time-sequence dependence compared with other sequence models such as RNN and LSTM, can use GPU parallel computation to shorten the training time, and does not introduce too much computational complexity.
(2) The gated convolution module further extracts local n-gram features and performs effective information masking and screening. It has the following advantages: the convolution gating unit processes the data output by the Transformer with convolution operations; the multi-head self-attention mechanism of the Transformer extracts global information features from multiple spatial dimensions and may be somewhat lacking in local feature extraction, and because the parameters in a URL are n-gram local features of the parameter=value type, a gated convolution with shared parameters can extract local feature information more fully. Several one-dimensional convolution kernels of different scales cope more effectively with words of different lengths, and the output vectors of the different-scale convolution kernels are finally concatenated, which avoids the information loss caused by insufficient local feature extraction when a single-scale convolution kernel cannot fully match words of different lengths. For longer HTTP messages, which contain much uninformative content such as symbols and numbers, no replacement rule dictionary is needed to reduce the word-list space; key effective information is automatically screened and extracted from the complex input while irrelevant information is ignored, further improving the accuracy of attack detection.
This embodiment carried out test experiments on the public dataset CSIC2010 and on network traffic collected from simulated network attacks. The HTTP CSIC2010 dataset was released as an appendix to a paper by the Spanish National Research Council (CSIC); the collected web traffic records normal accesses and web attacks against an e-commerce website and contains 36000 normal requests and 25000 attack requests. The abnormal request samples include attack samples such as SQL injection, file traversal, CRLF injection, XSS and SSI. To improve the generalization of the model, verify its effect in a real network environment, and learn the features of the various attack types more fully, kali linux was used to perform simulated network attack experiments and the traffic generated by the attacks was collected; the main simulated target is a web site that receives both normal access requests and attack requests. As shown in fig. 4, the attacker keeps wireshark open to collect and store the traffic. The attacker attacks the website built on the target host, mainly using tools such as sqlmap, nmap and the metasploit framework; the data contain 50000 normal requests and 100000 attack requests with 10 classification labels: normal, abnormal, SQL injection, buffer overflow, format string, SSI, XPATH, XSS, CRLFi and LDAP injection. The detailed attack payloads are shown in fig. 5.
In this embodiment, three performance evaluation metrics commonly used in attack detection systems are adopted: accuracy (Accuracy), recall (Recall) and F1 score (F1 Score).
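As a hedged illustration, the three metrics can be computed with scikit-learn as below; macro averaging over the 10 classes is an assumption, since the patent does not specify the averaging strategy.

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [0, 3, 3, 5, 1, 0]        # ground-truth attack labels (illustrative)
y_pred = [0, 3, 1, 5, 1, 0]        # model predictions

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
```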
On the CSIC2010 dataset and on the network traffic collected from the simulated network attacks, comparison experiments were carried out against several baseline models such as CNN, LSTM and BiLSTM; the accuracy, recall and F1 score of the models were evaluated. The ten-class attack detection results are shown in Table 1 and Table 2, for the CSIC2010 dataset and for the traffic generated by the simulated attacks respectively.
Table 1 Attack detection results on the CSIC2010 dataset (10 classes)
Table 2 Attack detection results on the simulated network attack dataset (10 classes)
Model  Accuracy  F1 Score  Recall
DT 87.87% 89.35% 94.54%
Linear SVM 87.23% 88.51% 88.15%
BiLSTM+CNN 94.54% 94.12% 94.98%
BiLSTM 93.15% 91.34% 93.46%
CNN 92.87% 93.61% 94.31%
LSTM+CNN 93.43% 93.51% 93.63%
LSTM 91.71% 92.8% 92.96%
LSTM+GatedCNN 93.15% 92.31% 93.54%
Transformer 94.43% 94.32% 94.45%
Gated Transformer 96.64% 96.51% 97.54%
The method of the invention outperforms the comparison models on all three metrics, which proves that the invention brings a large improvement in attack detection. The comparison experiments show that the improved gated Transformer network model can effectively extract global and local features, can automatically extract the effective features in the text sequence, omits the step of manual HTTP keyword-list replacement, and can effectively protect the security of the Web server system.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims (7)

1. A Web attack detection method based on a gated Transformer, characterized by comprising the following steps:
step 1: traffic collection is carried out through the sniff module of the python scapy library; a pcap traffic file is collected and the application layer data is extracted from it;
step 2: URL decoding is carried out on the message text, and the text information of the URL, the parameter list, and the user-agent, cookie and referer fields is segmented by predefined special characters;
step 3: a mixed word embedding module enhances the robustness of the vector representation by fusing word embedding tables generated in two different ways; the two word embedding tables of the mixed word embedding module are a Word2Vec word embedding table based on the continuous bag-of-words model CBOW and a word embedding table based on an Embedding layer; the Embedding-layer word embedding table is initialized with the Xavier_uniform distribution, and its distributed word-vector representation is continuously updated during training; the Word2Vec word embedding table based on the continuous bag-of-words model CBOW must be generated before the model enters the training stage; the Embedding-layer word embedding table and the CBOW-based Word2Vec word embedding table map word vectors into different discrete spaces, giving a distributed vector representation of each word;
step 4: the text information of the HTTP message, after being processed by the mixed word embedding module, is converted into a series of word vectors; the series of word vectors is input into a Transformer Encoder model for global attention feature extraction; the Transformer Encoder model includes three parts: a position encoding module, a multi-head self-attention module, and a residual connection and layer normalization module; the Transformer Encoder model first adds position encoding information to the vectors processed by the mixed word embedding module, then extracts multi-dimensional sequence features through the multi-head self-attention module, and finally feeds them into the feed-forward neural network module;
step 5: the output of the Transformer Encoder model is input into a gated convolution model, which extracts local features of the data within the local receptive field; non-key data are dynamically screened and filtered by the gated convolution model;
step 6: the output result of the gated convolution model is classified by a final classifier module; a softmax function converts it into a probability distribution, each dimension of which corresponds to an attack category, and the attack category corresponding to the index of the maximum probability in the distribution is the final attack detection classification result.
2. The Web attack detection method based on a gated Transformer according to claim 1, characterized in that the specific method of step 1 is as follows:
step 1.1: the sniff network-card sniffing module of the python scapy library is started, traffic is collected from the network and stored as a pcap file;
step 1.2: the collected pcap file is read and parsed through the rd_pcap module of the python scapy library, the application layer data is extracted from it and the data text is analysed; the text information includes the URL (uniform resource locator), the parameter list, and the user-agent, cookie and referer fields.
3. The Web attack detection method based on a gated Transformer according to claim 1, characterized in that, in the word segmentation process of step 2, since the number of words of the input sequence that the Transformer Encoder model can process is limited, there is an upper limit on the number of words after text segmentation; this problem is solved by setting a maximum sentence length: words beyond the maximum sentence length are removed, and if the number of words in a sentence is smaller than the maximum length, it is filled by padding.
4. The Web attack detection method based on a gated Transformer according to claim 1, characterized in that the specific method of step 3 is as follows:
step 3.1: one-hot encode an input HTTP text word as X ∈ R^V;
step 3.2: multiply the one-hot code X of each word by the input weight matrix W ∈ R^(V×N); the input weight matrix W is shared by all input words; the resulting vectors are summed and averaged to obtain the hidden layer vector H ∈ R^N;
step 3.3: multiply the hidden layer vector by the output weight matrix W' ∈ R^(V×N) to obtain an output vector, which is converted into a probability distribution by a softmax activation function; the index position of the maximum probability is the predicted central word; in the training stage, a cross-entropy loss function is used for model training and the Word2Vec model is iteratively updated;
step 3.4: multiply each input word by the shared input weight matrix W to obtain the word embedding vector of that word, and take the matrix W as the word embedding table T_a of the mixed word embedding module;
step 3.5: initialize the Embedding-layer-based word embedding table T_b with the Xavier_uniform uniform distribution; the word embedding table T_a is trained before the gated Transformer model is trained, while T_b is iteratively updated along with the training process of the gated Transformer model; the final mixed word embedding table T_f is generated by average pooling the two word embedding tables T_a and T_b, as shown in formula (1);
T_f = (T_a + T_b)/2 (1).
5. the method for detecting Web attack based on gated fransformer according to claim 1, wherein the method comprises the steps of: the specific method of the step 4 is as follows:
step 4.1: the text of the HTTP data is processed by the mixed word embedding module in the step 3 before being input into the Transformer Encoder model, and the text words are converted into the distributed numerical vector expression X embedding
Step 4.2: since the word sequence position information of the word is not considered when a series of word vectors are input, position coding information is periodically added to the text word by adopting a sine and cosine function; fusing position coding information generated for each word at each position into the original text word, and fusing word vector X after position coding information embedding-pe The generation method of the (B) is shown as a formula (2) and a formula (3);
X embedding-pe =X embedding +X pos (2)
wherein pos represents the word sequence position of a word in the text, and the value range of pos is an integer between 0 and the maximum length of the sequence; to be able to add position-coding information, a position-coding vector X is generated for a word pos The dimension of the word is the same as the dimension of the word vector, and the word vector X processed by the mixed word embedding module embedding Dimension and position encoding vector X pos Are all d in dimension emb Wherein 2i+1 and 2i respectively represent the word vector X embedding And a position-coding vector X pos The value range of i is thatd emb Representative word vector X embedding Is a dimension of (2);
step 4.3: the multi-head self-attention module extracts global sequence features from a plurality of dimensions of the text, the dimension of an output result is the same as the dimension of input data, and each word in the text is fused with the global features; to word vector X embedding-pe By three different linear mapping matrices W Q 、W K 、W V Generating three key information, including information (Q) to be queried, a word key (K) and a word value (V), as shown in a formula (4);
When global attention feature extraction is performed, the attention score is calculated only from the query information Q to be extracted and the K corresponding to each word in the sentence; the attention score calculation essentially computes the correlation coefficients between words, after which a weighted summation over V is performed with the inter-word attention scores as weights; this process is the principle of the self-attention mechanism; the scaled dot product adopted for the attention score calculation is shown in formula (5);
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V (5)
where the denominator sqrt(d_k) is used to prevent the dot product from becoming too large in value, which would in turn lead to overly extreme values after the softmax function; the subscript k represents the dimension of the Q, K, V matrices;
The multi-head self-attention mechanism performs the self-attention calculation on the word vector X_embedding-pe from different subspaces of multiple dimensions; when the self-attention calculation needs to be performed from h subspaces of different dimensions, each linear mapping matrix is split into h blocks, and the h split blocks of the linear mapping matrices correspond respectively to the self-attention calculations of the h different subspaces, where s denotes the attention of one subspace, s ∈ [1, h];
The output of the multi-head self-attention module is obtained by extracting global attention features of the HTTP text words from the h subspaces of different dimensions and splicing the self-attention outputs head_s of the h different heads; the calculation of the multi-head self-attention output X_multihead is shown in formula (6), where s denotes the attention mechanism of a certain subspace;
X_multihead = Concat(head_1, head_2, ..., head_h) (6)
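A sketch of the scaled dot-product and multi-head self-attention of step 4.3 (formulas (4)–(6)), assuming h heads that evenly divide d_emb; an extra output projection is deliberately omitted here because the claim only describes splitting, per-subspace attention and feature splicing.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_emb: int, h: int):
        super().__init__()
        assert d_emb % h == 0, "each of the h subspaces holds d_emb // h dimensions"
        self.h, self.d_k = h, d_emb // h
        self.W_Q = nn.Linear(d_emb, d_emb, bias=False)    # formula (4): Q = X_embedding-pe · W_Q
        self.W_K = nn.Linear(d_emb, d_emb, bias=False)    #              K = X_embedding-pe · W_K
        self.W_V = nn.Linear(d_emb, d_emb, bias=False)    #              V = X_embedding-pe · W_V

    def forward(self, x):                                 # x: (batch, seq_len, d_emb)
        B, L, _ = x.shape
        # split the mapped Q, K, V into h subspaces (heads)
        q = self.W_Q(x).view(B, L, self.h, self.d_k).transpose(1, 2)
        k = self.W_K(x).view(B, L, self.h, self.d_k).transpose(1, 2)
        v = self.W_V(x).view(B, L, self.h, self.d_k).transpose(1, 2)
        # formula (5): attention scores via scaled dot product, sqrt(d_k) keeps the softmax from saturating
        scores = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        heads = scores @ v                                # weighted sum of V per subspace (head_s)
        # formula (6): splice the h head outputs back to (batch, seq_len, d_emb)
        return heads.transpose(1, 2).reshape(B, L, self.h * self.d_k)
```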
step 4.4: residual connection and layer normalization module;
The residual connection adds the position-coded word embedding vector X_embedding-pe from before the multi-head self-attention module to the output result of the multi-head self-attention module; the output data of the multi-head self-attention module are then standardized by layer normalization (LayerNorm); the calculation by which the residual connection and layer normalization module is applied to the multi-head attention output X_multihead is shown in formula (7), and the result X_multihead-rn obtained by processing the multi-head self-attention output X_multihead through residual connection and layer normalization is the final output of the residual connection and layer normalization module;
X_multihead-rn = LayerNorm(X_embedding-pe + X_multihead) (7)
Step 4.5: the output result X_multihead-rn of the residual connection and layer normalization module is further processed by a fully connected neural network to extract richer semantic information; finally, the output of the Transformer Encoder model is X_encoded, whose calculation is shown in formula (8), where Relu is a nonlinear activation function.
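A sketch of how steps 4.4 and 4.5 could be assembled into one encoder layer: residual connection plus LayerNorm around the multi-head attention (formula (7)), then a Relu feed-forward network yielding X_encoded; the hidden width d_ff, the use of PyTorch's nn.MultiheadAttention, and the second residual connection are assumptions of this sketch rather than details stated in the claim.

```python
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_emb: int, h: int, d_ff: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_emb, h, batch_first=True)  # multi-head self-attention module
        self.norm1 = nn.LayerNorm(d_emb)
        self.ffn = nn.Sequential(                                      # fully connected network with Relu
            nn.Linear(d_emb, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_emb),
        )
        self.norm2 = nn.LayerNorm(d_emb)

    def forward(self, x_embedding_pe):                                 # (batch, seq_len, d_emb)
        x_multihead, _ = self.attn(x_embedding_pe, x_embedding_pe, x_embedding_pe)
        # formula (7): X_multihead-rn = LayerNorm(X_embedding-pe + X_multihead)
        x_multihead_rn = self.norm1(x_embedding_pe + x_multihead)
        # step 4.5 (assumed form of formula (8)): feed-forward refinement with a second add & norm
        return self.norm2(x_multihead_rn + self.ffn(x_multihead_rn))
```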
6. The method for detecting Web attack based on gated Transformer according to claim 5, wherein the specific method of step 5 is as follows:
Step 5.1: the output X_encoded obtained after the global sequence feature extraction of the Transformer Encoder model is input into the gated convolution module for information filtering and screening; the gated convolution module comprises c one-dimensional convolution kernels Kernel_j (j ∈ [1, c]) of different scales, and the calculation of a single convolution kernel is shown in formula (9);
g_j = Relu(Conv(Kernel_j, X_encoded) + b_j) (9)
where g_j is the output of a single convolution block, Relu is a nonlinear activation function, Conv denotes the convolution operation, and b_j is the bias corresponding to the convolution kernel Kernel_j;
Step 5.2: the output results of the convolutions of different scales are spliced, and the multi-scale gated convolution value is mapped into the range 0 to 1 through a Sigmoid activation function to obtain the gating value Gatesv, which is calculated according to formula (10); the value range of Gatesv is between 0 and 1; a gating value close to 0 represents information of little importance that is filtered out and ignored, and a gating value close to 1 represents key data that is fully retained;
The output X_encoded of the Transformer Encoder model is multiplied element-wise by the gating value Gatesv to complete the filtering and screening of the encoded information X_encoded and obtain the output result X_gated of the gated convolution module; the information filtering method is shown in formula (11);
X_gated = X_encoded ⊙ Gatesv, Gatesv ∈ [0, 1] (11)
where the symbol ⊙ denotes element-wise multiplication.
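A hedged sketch of the multi-scale gated convolution of step 5 (formulas (9) and (11)): c one-dimensional convolutions of different kernel sizes with Relu, a Sigmoid gate, and element-wise multiplication with X_encoded; the kernel sizes and the linear layer that maps the spliced multi-scale features back to the gate's shape are assumptions, since formula (10) is not reproduced in the claim text.

```python
import torch
import torch.nn as nn

class GatedConvModule(nn.Module):
    def __init__(self, d_emb: int, kernel_sizes=(1, 3, 5)):             # c convolution scales (assumed sizes)
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_emb, d_emb, k, padding=k // 2) for k in kernel_sizes]
        )
        self.proj = nn.Linear(d_emb * len(kernel_sizes), d_emb)         # assumed reduction of spliced features

    def forward(self, x_encoded):                                        # (batch, seq_len, d_emb)
        x = x_encoded.transpose(1, 2)                                    # Conv1d expects (batch, channels, seq_len)
        # formula (9): g_j = Relu(Conv(Kernel_j, X_encoded) + b_j)
        g = [torch.relu(conv(x)) for conv in self.convs]
        spliced = torch.cat(g, dim=1).transpose(1, 2)                    # feature splicing of the c scales
        gatesv = torch.sigmoid(self.proj(spliced))                       # gating value Gatesv in [0, 1]
        # formula (11): X_gated = X_encoded ⊙ Gatesv (element-wise multiplication)
        return x_encoded * gatesv
```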
7. The method for detecting Web attack based on gated Transformer according to claim 6, wherein the specific method of step 6 is as follows:
The output result X_gated of the gated convolution module is input into a Classifier formed by a two-layer fully connected network, and the output of the Classifier network is converted into a probability distribution through a softmax function; each dimension of the probability distribution corresponds to one attack category, and the attack category corresponding to the index of the maximum probability value in the distribution is the final attack detection classification result X_pred; the attack detection classification process is shown in formula (12);
X_pred = argmax(Softmax(Classifier(X_gated))) (12).
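A sketch of the step-6 classification head: a two-layer fully connected Classifier whose softmax output is indexed by argmax to give the predicted attack category (formula (12)); mean-pooling X_gated over the sequence and the hidden width are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class AttackClassifier(nn.Module):
    def __init__(self, d_emb: int, num_classes: int, d_hidden: int = 256):
        super().__init__()
        self.classifier = nn.Sequential(                  # two-layer fully connected Classifier network
            nn.Linear(d_emb, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, num_classes),
        )

    def forward(self, x_gated):                           # x_gated: (batch, seq_len, d_emb)
        pooled = x_gated.mean(dim=1)                      # assumed: pool the gated sequence into one vector
        probs = torch.softmax(self.classifier(pooled), dim=-1)
        # formula (12): X_pred = argmax(Softmax(Classifier(X_gated)))
        return probs.argmax(dim=-1)
```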
CN202310460958.8A 2023-04-26 2023-04-26 Web attack detection method based on gate control converter Pending CN116527357A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310460958.8A CN116527357A (en) 2023-04-26 2023-04-26 Web attack detection method based on gate control converter

Publications (1)

Publication Number Publication Date
CN116527357A true CN116527357A (en) 2023-08-01

Family

ID=87389673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310460958.8A Pending CN116527357A (en) 2023-04-26 2023-04-26 Web attack detection method based on gate control converter

Country Status (1)

Country Link
CN (1) CN116527357A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116913459A (en) * 2023-09-12 2023-10-20 神州医疗科技股份有限公司 Medicine recommendation method and system based on deep convolution network control gate model
CN116913459B (en) * 2023-09-12 2023-12-15 神州医疗科技股份有限公司 Medicine recommendation method and system based on deep convolution network control gate model
CN116992888A (en) * 2023-09-25 2023-11-03 天津华来科技股份有限公司 Data analysis method and system based on natural semantics
CN117236323A (en) * 2023-10-09 2023-12-15 青岛中企英才集团商业管理有限公司 Information processing method and system based on big data
CN117236323B (en) * 2023-10-09 2024-03-29 京闽数科(北京)有限公司 Information processing method and system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination