CN113806739B

CN113806739B - Business access data detection method based on deep learning

Info

Publication number: CN113806739B
Application number: CN202111084993.1A
Authority: CN
Inventors: 田新远
Original assignee: Beijing Huaqing Xin'an Technology Co ltd
Current assignee: Beijing Huaqing Xin'an Technology Co ltd
Priority date: 2021-09-16
Filing date: 2021-09-16
Publication date: 2022-04-19
Anticipated expiration: 2041-09-16
Also published as: CN113806739A

Abstract

The invention discloses a deep learning-based service access data detection method, comprising the following steps: vectorizing the original request data separately for the request header and for the other parts of the request; then inputting the vector matrix of the request header into a fully connected network model for training, and judging whether the request header is white data. In the fully connected network model, the output of the current network layer is the input of the next network layer, and the calculation formula of the current network layer is as follows:

the parameters are updated according to formulas (II) and (III),

the method first checks the request header of the request data and then checks the other parts of the request data, so it can detect white data in service access accurately and quickly, with an accuracy of about 99% and a recall of about 97%.

Description

Business access data detection method based on deep learning

Technical Field

The invention relates to the technical field of network security big data. More particularly, the invention relates to a service access data detection method based on deep learning.

Background

Technology is a double-edged sword. The rapid development of network technology has brought great convenience to people's lives, but it also places higher demands on network security technology. Everyday needs such as clothing, food, housing, and transportation have been digitized through the network, and large companies store all of this data in databases in specific forms. Vulnerabilities in the network are often exploited by criminals, whose attacks are hard to see and often cause extremely serious consequences. In recent years there has been considerable research on network security models, but it has focused on learning the characteristics of malicious data, and the results have not been very good.

Disclosure of Invention

An object of the present invention is to solve at least the above problems and to provide at least the advantages described later.

Another object of the invention is to provide a deep learning-based service access data detection method that first checks the request header of the request data and then checks the other parts of the request data, so that white data in service access can be detected more accurately and quickly, with an accuracy of about 99% and a recall of about 97%.

To achieve these objects and other advantages in accordance with the purpose of the invention, there is provided a deep learning-based service access data detection method, comprising: vectorizing the original request data separately for the request header and for the other parts; inputting the vector matrix of the request header into a fully connected network model for training; and judging whether the request header is white data. In the fully connected network model, the output of the current network layer is the input of the next network layer, and the calculation formula of the current network layer is as follows:

in formula (I), y is the output of the current network layer; w_i is the weight matrix; x_i is the input of the i-th neuron; b is the bias parameter; and n is the number of neurons, where n is a positive integer;

wherein each parameter is updated according to formula (II) and formula (III),

in formulas (II) and (III), α is the learning rate, b_i is the bias parameter of the i-th neuron, b^l is the bias parameter of the l-th network layer, w_i is the weight matrix of the i-th neuron, and W^l is the weight matrix of the l-th network layer, where i is a positive integer from 1 to n, and 1 ≤ l ≤ 4.

Most traditional network security models extract features from abnormal data, but with the rapid development of network technology, the amount of computation these models require has grown greatly, they run more and more slowly, and their response time to abnormal access suffers. In request data, anomalies can appear in any part of the request, and feeding the whole request directly into a model costs a great deal of time and memory. The detection model in this method is divided into two parts. The first part has a simple network structure, computes quickly, and responds to data quickly: if the request header is abnormal, the request is directly judged to be abnormal data; if the request header is not abnormal, the feature vectors of the other parts flow into the second part of the model for judgment. This improves the overall processing speed of the detection model to a certain extent. The invention departs from the traditional approach by focusing on learning the characteristics of white data, so that white data in service access can be identified more quickly and accurately.
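The formula images for (I) through (III) are not reproduced in this text. As a hedged reconstruction from the symbol definitions above, and assuming a standard fully connected layer with an activation function f and a training loss written here as L (the placement of f in formula (I), the symbol L, and the order of the two update rules are assumptions, not taken from the source), the layer output and the gradient descent updates might take the following form:

```latex
y = f\!\left(\sum_{i=1}^{n} w_i x_i + b\right) \tag{I}

b_i \leftarrow b_i - \alpha \, \frac{\partial L}{\partial b_i} \tag{II}

w_i \leftarrow w_i - \alpha \, \frac{\partial L}{\partial w_i} \tag{III}
```

Under this reading, each layer l from 1 to 4 applies the same pair of updates to its own weights W^l and biases b^l, with the learning rate α controlling the step size.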

Preferably, the deep learning-based service access data detection method further includes: when the request header of the original request data is white data, inputting the vector matrix of the other parts of the original request data into a convolutional neural network model for training. The convolutional neural network model consists of convolutional layers, pooling layers, and fully connected layers. The convolution operation is:

α_i = f(W · X_{i:i+h-1} + b_j) (IV)

in formula (IV), α_i represents the feature vector obtained by the i-th convolution operation; f represents the activation function; h represents the height of the convolution kernel; W represents the weight matrix of the convolution kernel; X_{i:i+h-1} represents the window covered by the convolution operation; b_j represents the bias parameter of the j-th convolution kernel;

through the pooling operation, the final feature representation is obtained: t = max{α_1, α_2, ..., α_{n-h+1}}

The prediction result output by the fully connected layer is given by formula (VI):

in formula (VI), T represents the final feature vector and b_m represents the bias parameter of the fully connected layer; the remaining symbols represent the predicted value and the weight matrix of the fully connected layer. The convolutional layers mainly extract features, the pooling layers mainly reduce dimensionality and help prevent overfitting, and the fully connected layers output the final result.

Preferably, the deep learning-based service access data detection method further includes: before vectorizing the original request data separately for the request header and the other parts, cleaning and preprocessing the original request data, specifically: conventional deduplication; similarity-based deduplication; replacing every "nan" in the data with the number 0; decoding; deleting every "\n" and "\r" in the data; replacing every number in the data with 0; replacing all Chinese text in the data with "chinese"; segmenting words with the jieba word segmentation tool; and finally concatenating the processed fields. Replacing numbers, deleting "\n" and "\r", and replacing Chinese character strings reduces the complexity of the data to some extent and makes its features more distinct; the processed data is also generally shorter, which reduces memory consumption. The quality of the data source directly affects the performance of the model, so every data processing step matters.

Preferably, the vectorization processing includes:

extracting vectors for "refer" and "user-agent" in the request header using a BERT word vector model, where the dimension of a word is defined as 768 and the text is converted into a 528 × 768 matrix;

converting "request_body", "url", and "method" into vectors using word2vec word vectors, where the dimension of a word is defined as 128 and the maximum length of each piece of data is defined as 1000. Based on the service access type, the method does not only extract features from the conventional url: request_body, url, and method, together with refer and user-agent in the request header, are all vectorized and input into the models for feature extraction, and this multi-feature extraction for the service type improves the accuracy of data detection. The maximum length of 1000 was determined experimentally from the maximum length, minimum length, and average length of the sample data. If the length is too long, the matrix becomes sparse and wastes space; if it is too short, fragments that carry features may be cut off, which can affect model performance.
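The fixed maximum length of 1000 with 128-dimensional word vectors can be illustrated with a minimal sketch. The function name `pad_or_truncate`, the choice of zero vectors for padding, and truncation from the end are assumptions for illustration; the source only states the maximum length and the vector dimension.

```python
def pad_or_truncate(vectors, max_len=1000, dim=128):
    """Return a max_len x dim matrix (list of lists) for one request.

    Assumption: short inputs are padded at the end with zero vectors and
    long inputs are cut after max_len tokens.
    """
    rows = []
    for vec in vectors[:max_len]:          # drop anything beyond max_len tokens
        if len(vec) != dim:
            raise ValueError(f"expected {dim}-dimensional vectors, got {len(vec)}")
        rows.append(list(vec))
    while len(rows) < max_len:             # pad short inputs with zero vectors
        rows.append([0.0] * dim)
    return rows
```

A request with only a few tokens produces a matrix that is mostly zeros, which is the sparsity trade-off the paragraph above describes; a request longer than 1000 tokens loses its tail.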

Preferably, the number of network layers in the fully-connected network model is 4, wherein,

first network layer: number of neurons 128, activation function activation = "relu";

second network layer: number of neurons 64, activation function activation = "relu";

dropout = 0.2 is added;

third network layer: number of neurons 64, activation function activation = "relu";

fourth network layer: number of neurons 2, activation function activation = "sigmoid". The fully connected network model is preferably set to 4 network layers, which greatly improves detection efficiency while keeping the accuracy of the model at no lower than 93%.

Preferably, in the convolutional neural network model, the loss function is an improved binary cross-entropy, with the following formula:

in formula (VII), y represents the true value, l represents the loss function, and η represents the accuracy of the model; the remaining symbol represents the predicted value. When the difference between the predicted value and the true value is calculated, the predicted value is first multiplied by a coefficient η and the error is then computed; the size of η is chosen according to the actual scenario so that the loss function converges faster.
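Formula (VII) is not reproduced in this text, so the following is only one possible reading of the description: ordinary binary cross-entropy in which the predicted value is multiplied by η before the error is computed. The function name `improved_bce` and the clipping constant `eps` are assumptions added for illustration and numerical safety.

```python
import math

def improved_bce(y_true, y_pred, eta=1.0, eps=1e-7):
    """Binary cross-entropy on a prediction scaled by eta (assumed form)."""
    # Scale the prediction by eta, then clip so both logarithms stay finite.
    scaled = min(max(eta * y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(scaled) + (1 - y_true) * math.log(1.0 - scaled))
```

With `eta=1.0` this reduces to standard binary cross-entropy, which makes it easy to compare the effect of other values of η on convergence.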

Preferably, in the convolutional neural network model, there are 3 convolutional layers and 2 fully connected layers;

first convolutional layer: the number of convolution kernels is 256, and the convolution kernel size is 3 × 3;

MaxPooling1D(padding = "same") is added;

second convolutional layer: the number of convolution kernels is 64, and the convolution kernel size is 3 × 3;

MaxPooling1D(padding = "same") is added;

third convolutional layer: the number of convolution kernels is 32, and the convolution kernel size is 3 × 3;

MaxPooling1D(padding = "same") is added;

Flatten() is added;

Dropout(0.3) is added;

first fully connected layer: the number of neurons is 32;

second fully connected layer: the number of neurons is 2. As the complexity of the model structure increases, the accuracy may improve to a certain extent, but it sometimes decreases instead, while the number of parameters to be computed can rise exponentially and the computation takes longer.
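To make the cost of this stack concrete, the sketch below traces the sequence length through the three convolution and pooling stages. It rests on several assumptions not stated in the source: the convolutions slide along the sequence axis only (the source writes "3 x 3" but pairs it with MaxPooling1D), they use no padding and stride 1, and each pooling layer uses a window and stride of 2 with "same" padding. The helper names are hypothetical.

```python
import math

def conv1d_len(length, kernel_size=3):
    # Unpadded, stride-1 convolution (assumed): n - k + 1 positions.
    return length - kernel_size + 1

def pool1d_same_len(length, stride=2):
    # Pooling with "same" padding: ceil(n / stride) positions.
    return math.ceil(length / stride)

def trace(length=1000, filters=(256, 64, 32)):
    """List the (stage, length, channels) shapes, ending with the flattened size."""
    shapes = []
    for n_filters in filters:
        length = conv1d_len(length)
        shapes.append(("conv", length, n_filters))
        length = pool1d_same_len(length)
        shapes.append(("pool", length, n_filters))
    shapes.append(("flatten", length * filters[-1]))
    return shapes

for shape in trace():
    print(shape)
```

Under these assumptions the sequence length goes 1000, 998, 499, 497, 249, 247, 124, so the flattened vector fed to the first fully connected layer has 124 × 32 = 3968 values.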

Preferably, the deep learning-based service access data detection method further includes model tuning, specifically: during training, continuously adjusting each hyperparameter of the model, with the hyperparameters finally determined as follows:

the amount of data fed to the model in each batch, batch_size = 128;

the sliding window size of the convolutional neural network model, kernel_size = 3;

the neuron dropout rate, dropout = 0.3;

loss = "binary_crossentropy";

the gradient descent optimization algorithm, optimizer = "adam". The final hyperparameters were determined experimentally so that the accuracy and recall of the model reach their best results.

Preferably, the deep learning-based service access data detection method includes the following steps:

step S1, cleaning and preprocessing original request data;

step S2, vectorizing the request header and other parts respectively;

step S3, inputting the vector matrix of the request header into the fully connected network model and judging whether the request header is white data; if not, judging the original request data to be abnormal data; if so, proceeding to step S4;

step S4, inputting the vector matrix of the other parts into the convolutional neural network model and judging whether the other parts are white data; if so, judging the original request data to be white data; otherwise, judging the original request data to be abnormal data.
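The decision logic of steps S3 and S4 can be sketched as a two-stage cascade. The names `detect`, `header_model`, `body_model`, `header_vec`, and `other_vec` are hypothetical, and the sketch assumes each model returns a score for "white" that is compared against a threshold of 0.5; the source does not specify how the model outputs are turned into a decision.

```python
def detect(request, header_model, body_model, threshold=0.5):
    """Return "white" or "abnormal" for one vectorized request."""
    # Step S3: the cheap header model runs first.
    if header_model(request["header_vec"]) < threshold:
        return "abnormal"          # rejected early; the second model never runs
    # Step S4: only header-clean requests reach the convolutional model.
    if body_model(request["other_vec"]) < threshold:
        return "abnormal"
    return "white"
```

Because an abnormal header short-circuits the function, the more expensive second model is evaluated only for requests that pass the first stage, which is the source of the speed-up described earlier.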

The invention provides at least the following beneficial effects: the deep learning-based service access data detection method focuses on learning the characteristics of white data and can identify white data in service access more accurately;

in addition to extracting features from the conventional url, the invention vectorizes request_body, url, and method, together with refer and user-agent in the request header, and inputs them into the corresponding models for feature extraction;

the method first checks the request header of the service access request data. If the request header is normal, the other parts of the request data, namely the request line and the request body, are checked; if this stage judges them normal, the data is white data, and if it judges them abnormal, the data is abnormal data. If the request header is abnormal, the request is directly judged to be abnormal data. Over multiple tests, the cross-validation accuracy of the method reaches about 99%, with a loss value of about 0.01 and a recall of about 97%; in tests in a real environment, the accuracy reaches about 93% and the recall about 93%.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.

Drawings

Fig. 1 is a schematic flow chart of the deep learning-based service access data detection method of the present invention.

Detailed Description

The present invention is further described in detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.

It will be understood that terms such as "having," "including," and "comprising," as used herein, do not preclude the presence or addition of one or more other elements or groups thereof.

As shown in fig. 1, the present invention provides a deep learning-based service access data detection method, which includes the following steps:

step S1, cleaning and preprocessing original request data;

step S2, vectorizing the request header and other parts respectively;

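step S3, inputting the vector matrix of the request header into the fully connected network model and judging whether the request header is white data; if not, judging the original request data to be abnormal data; if so, proceeding to step S4;

step S4, inputting the vector matrix of the other parts into the convolutional neural network model and judging whether the other parts are white data; if so, judging the original request data to be white data; otherwise, judging the original request data to be abnormal data.

In step S1, to reduce the noise in the data set and prepare for vectorization, the following operations are performed on the collected original request data: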

S1-1, conventional deduplication;

S1-2, similarity-based deduplication;

S1-3, replacing every "nan" in the data with the number 0;

S1-4, decoding;

S1-5, deleting every "\n" and "\r" in the data;

S1-6, replacing every number in the data with 0;

S1-7, replacing all Chinese text in the data with "chinese";

S1-8, segmenting words with the jieba word segmentation tool;

S1-9, concatenating the processed fields.
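The cleaning steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: "decoding" is read as URL decoding, "nan" is treated as a whole-field missing value, each run of digits becomes a single 0, similarity-based deduplication (S1-2) is omitted, and a simple regex tokenizer stands in for jieba (a jieba tokenizer such as `jieba.lcut` could be passed as `tokenize` instead). The function names and the record layout are hypothetical.

```python
import re
from urllib.parse import unquote

def clean_field(text):
    """Apply steps S1-3 to S1-7 to a single request field."""
    if text is None or str(text).lower() == "nan":
        return "0"                                        # S1-3: missing value becomes 0
    text = unquote(str(text))                             # S1-4: URL decoding (assumed)
    text = text.replace("\n", "").replace("\r", "")       # S1-5: drop line breaks
    text = re.sub(r"\d+", "0", text)                      # S1-6: each number becomes 0
    text = re.sub(r"[\u4e00-\u9fff]+", "chinese", text)   # S1-7: Chinese runs become "chinese"
    return text

def simple_tokenize(text):
    """Stand-in tokenizer: word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(records, fields=("method", "url", "request_body"), tokenize=simple_tokenize):
    """Deduplicate, clean, tokenize, and join the selected fields of each record."""
    seen, out = set(), []
    for rec in records:
        key = tuple(str(rec.get(f)) for f in fields)
        if key in seen:                                   # S1-1: drop exact duplicates
            continue
        seen.add(key)
        tokens = []
        for f in fields:
            tokens.extend(tokenize(clean_field(rec.get(f))))   # S1-8: segmentation
        out.append(" ".join(tokens))                      # S1-9: concatenate fields
    return out
```

Replacing numbers and Chinese strings with fixed tokens collapses many distinct requests onto the same token sequence, which is why the description notes that the processed data is shorter and its features more distinct.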

In step S2, the vectorization processing specifically includes: extracting vectors for "refer" and "user-agent" in the request header using a BERT word vector model, where the dimension of a word is defined as 768 and the text is converted into a 528 × 768 matrix;

converting "request_body", "url", and "method" into vectors using a word2vec word vector model, where the dimension of a word is defined as 128 and the maximum length of each piece of data is defined as 1000. Using different word vector models for the request header and for the other parts of the request data makes the subsequent model analysis more accurate.

In step S3, the number of network layers in the fully-connected network model is 4, wherein,

first network layer: number of neurons 128, activation function activation = "relu";

second network layer: number of neurons 64, activation function activation = "relu"; dropout = 0.2 is added;

third network layer: number of neurons 64, activation function activation = "relu";

fourth network layer: number of neurons 2, activation function activation = "sigmoid";

the output of the current network layer is the input of the next network layer, and the calculation formula of the current network layer is as follows:

wherein each parameter is updated by gradient descent according to formula (II) and formula (III),

in formulas (II) and (III), α is the learning rate, b_i is the bias parameter of the i-th neuron, b^l is the bias parameter of the l-th network layer, w_i is the weight matrix of the i-th neuron, and W^l is the weight matrix of the l-th network layer, where i is a positive integer from 1 to n, and 1 ≤ l ≤ 4. The fully connected network model performs best when the number of network layers is 4. Choosing the number of neurons in each network layer based on the service access type and on the vectors extracted from the request data features greatly improves the accuracy of the fully connected network model.
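The layer-by-layer computation and the gradient descent update can be sketched in plain Python. This is an illustration under assumptions: each neuron is taken to compute an activation of a weighted sum plus bias, dropout is left out because it only applies during training, the random initialization is arbitrary, and the input size of 16 in the usage note is a placeholder (the source does not say how the 528 × 768 header matrix is fed to the first layer). All function names are hypothetical.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, biases, activation):
    """One layer: each neuron outputs activation(sum_i w_i * x_i + b)."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def build(n_in, sizes=(128, 64, 64, 2), seed=0):
    """Create (weights, biases) for each of the four layers."""
    rng = random.Random(seed)
    layers = []
    for n_out in sizes:
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
        n_in = n_out
    return layers

def forward(x, layers):
    """The output of each layer is the input of the next."""
    for (weights, biases), act in zip(layers, (relu, relu, relu, sigmoid)):
        x = dense(x, weights, biases, act)
    return x

def sgd_update(params, grads, alpha):
    """Gradient descent step: p <- p - alpha * dL/dp for every parameter."""
    return [p - alpha * g for p, g in zip(params, grads)]
```

For example, `forward([0.5] * 16, build(16))` returns two sigmoid outputs, one per class, and `sgd_update` shows the shape of the update applied to every weight and bias during training.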

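In step S4, the word2vec word vector model is embedded in the fully connected network model, and the vector matrix of the other parts of the request data is input for training. Specifically:

first convolutional layer: the number of convolution kernels is 256, and the convolution kernel size is 3 × 3;

MaxPooling1D(padding = "same") is added;

second convolutional layer: the number of convolution kernels is 64, and the convolution kernel size is 3 × 3;

MaxPooling1D(padding = "same") is added;

third convolutional layer: the number of convolution kernels is 32, and the convolution kernel size is 3 × 3;

MaxPooling1D(padding = "same") is added;

Flatten() is added;

Dropout(0.3) is added;

first fully connected layer: the number of neurons is 32;

second fully connected layer: the number of neurons is 2.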

The input of the convolutional layer is the vector matrix of each sentence. Each sentence has n words, and each word is represented by a k-dimensional word vector, so the input matrix has dimension n × k; its width is k, which is both the dimension of the word vector and the width of the convolution kernel. After a convolution kernel W of height h is convolved with h words, the feature vector α_i is obtained through the activation function; with the bias parameter denoted b_j, the convolution operation can be expressed as:

α_i = f(W · X_{i:i+h-1} + b_j)

in the formula, α_i represents the feature vector obtained by the i-th convolution operation; X represents each word in the sentence, and X_{i:i+h-1} the window of h words starting at word i; f represents the activation function; h represents the height of the convolution kernel; W represents the weight matrix of the convolution kernel; b_j represents the bias parameter of the j-th convolution kernel;

after multiple convolutions, the vector α = [α_1, α_2, ..., α_{n-h+1}] is obtained and input into the pooling layer for the max pooling operation:

t = max{α}, where t denotes the feature vector and n denotes the number of words in the sentence.

After three rounds of convolution and pooling, the final feature vector T = [t_1, t_2, ..., t_f] is obtained, where f is the number of convolution kernels.
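The convolution and pooling formulas above can be worked through on a tiny input. The sketch below covers a single convolution stage followed by a maximum over all positions, not the full three-stage network; it assumes ReLU as the activation f and treats each kernel as a pair (W, b). The function names are hypothetical, and indexing is zero-based, so `X[i:i+h]` plays the role of X_{i:i+h-1}.

```python
def relu(z):
    return max(0.0, z)

def conv_feature_map(X, W, b, f=relu):
    """alpha_i = f(W . X[i:i+h] + b) for each of the n - h + 1 window positions."""
    n, h, k = len(X), len(W), len(W[0])
    alphas = []
    for i in range(n - h + 1):
        # Element-wise product of the h x k kernel with the h-word window, summed.
        s = sum(W[r][c] * X[i + r][c] for r in range(h) for c in range(k))
        alphas.append(f(s + b))
    return alphas

def max_pool(alphas):
    """t = max{alpha_1, ..., alpha_{n-h+1}}."""
    return max(alphas)

def final_features(X, kernels):
    """T = [t_1, ..., t_f]: one pooled value per convolution kernel."""
    return [max_pool(conv_feature_map(X, W, b)) for W, b in kernels]
```

With a 4-word input of 2-dimensional vectors and a kernel of height 2, the feature map has 4 - 2 + 1 = 3 entries, and pooling keeps only the largest one, so the length of T depends on the number of kernels rather than on the sentence length.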

Finally, the final feature vector T is passed through the fully connected layers, using their weight matrix, to obtain the prediction result.

The loss function is an improved binary cross-entropy:

in the formula, y represents the true value, l represents the loss function, η represents the accuracy of the model, and b_m represents the bias parameter of the fully connected layer; the remaining symbol represents the predicted value.

The invention also tunes and tests the two models. Model tuning includes: during training, continuously adjusting each hyperparameter of the model, with the hyperparameters finally determined as follows:

the amount of data fed to the model in each batch, batch_size = 128;

the CNN sliding window size, kernel_size = 3;

the neuron dropout rate, dropout = 0.3;

loss = "binary_crossentropy";

the gradient descent optimization algorithm, optimizer = "adam".

The deep learning-based service access data detection method of the invention uses two models to detect service access request data. Over multiple tests, the cross-validation accuracy of the white-data model reaches about 99%, with a loss value of about 0.01 and a recall of about 97%; in tests in a real environment, the accuracy of the model reaches about 93% and the recall about 93%.
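The reported figures can be related to a confusion matrix with a short sketch. The source does not say whether its "accuracy" figure is overall accuracy or precision on white data, so the hypothetical helper below returns both alongside recall; labels are assumed to be 1 for white data and 0 for abnormal data.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for the positive (white) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Recall here is the share of truly white requests that the cascade accepts, so a recall of about 97% would mean roughly 3 in 100 legitimate requests are flagged as abnormal.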

While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.

Claims

1. A deep learning-based service access data detection method, characterized by comprising: vectorizing original request data separately for a request header and for other parts, then inputting the vector matrix of the request header into a fully connected network model for training, and judging whether the request header is white data; in the fully connected network model, the output of the current network layer is the input of the next network layer, and the calculation formula of the current network layer is as follows:

in formulas (II) and (III), α is the learning rate; b_i is the bias parameter of the i-th neuron, with a corresponding bias parameter for the i-th neuron in the l-th network layer; w_i is the weight matrix of the i-th neuron, with a corresponding weight matrix for the i-th neuron in the l-th network layer; f denotes the activation function; and the partial-derivative symbol denotes differentiation; where i is a positive integer from 1 to n, and 1 ≤ l ≤ 4;

further comprising: when the request header is white data, inputting the vector matrix of the other parts of the original request data into a convolutional neural network model for training; the convolutional neural network model consists of convolutional layers, pooling layers, and fully connected layers; the convolution operation of the convolutional layer is:

α_i = f(W · X_{i:i+h-1} + b_j) (IV)

in formula (IV), α_i represents the feature vector obtained by the i-th convolution operation; f represents the activation function; h represents the height of the convolution kernel; W represents the weight matrix of the convolution kernel; X represents the window of the convolution operation; b_j represents the bias parameter of the j-th convolution kernel;

in formula (VI), T represents the final feature vector and b represents the bias parameter of the fully connected layer; the remaining symbols represent the predicted value and the weight matrix of the fully connected layer;

further comprising: inputting the vector matrix of the request header into the fully connected network model and judging whether the request header is white data; if not, judging the original request data to be abnormal data;

if so, inputting the vector matrix of the other parts into the convolutional neural network model and judging whether the other parts are white data; if so, judging the original request data to be white data; if not, judging the original request data to be abnormal data; the request header comprises user-agent and refer in the request data, and the other parts are url, request_body, and method in the request data.

2. The deep learning-based service access data detection method of claim 1, further comprising: before vectorizing the original request data separately for the request header and the other parts, cleaning and preprocessing the original request data, specifically: conventional deduplication; similarity-based deduplication; replacing every nan in the data with the number 0; decoding; deleting every \n and \r in the data; replacing every number in the data with 0; replacing all Chinese text in the data with chinese; segmenting words with the jieba word segmentation tool; and finally concatenating the processed fields.

3. The deep learning-based service access data detection method of claim 1, wherein the vectorization process comprises:

extracting vectors for refer and user-agent in the request header using a BERT word vector model, wherein the dimension of a word is defined as 768 and the text is converted into a 528 × 768 matrix;

converting request_body, url, and method into vectors using word2vec word vectors, wherein the dimension of a word is defined as 128 and the maximum length of each piece of data is defined as 1000.

4. The deep learning-based service access data detection method according to claim 1, wherein the number of network layers in the fully-connected network model is 4, wherein,

first network layer: number of neurons 128, activation function activation = "relu";

dropout = 0.2 is added;

fourth network layer: number of neurons 2, activation function activation = "sigmoid".

5. The deep learning-based service access data detection method according to claim 1, wherein in the convolutional neural network model, the loss function is an improved binary cross-entropy, with the following formula:

in formula (VII), y represents the true value, l represents the loss function, and η represents the accuracy of the model; the remaining symbol represents the predicted value.

6. The deep learning-based service access data detection method of claim 5, wherein there are 3 convolutional layers and 2 fully connected layers; wherein,

MaxPooling1D(padding = "same") is added;

Flatten() is added;

Dropout(0.3) is added;

first fully connected layer: the number of neurons is 32;

second fully connected layer: the number of neurons is 2.

7. The deep learning-based service access data detection method of claim 2, further comprising model tuning, specifically: during training, continuously adjusting each hyperparameter of the model, with the hyperparameters finally determined as follows:

the amount of data fed to the model in each batch, batch_size = 128;

the neuron dropout rate, dropout = 0.3;

loss = "binary_crossentropy";

the gradient descent optimization algorithm, optimizer = "adam".

8. The deep learning-based service access data detection method according to claim 1, comprising the steps of:

step S1, cleaning and preprocessing original request data;

step S2, vectorizing the request header and other parts respectively;

step S3, inputting the vector matrix of the request header into the fully connected network model and judging whether the request header is white data; if not, judging the original request data to be abnormal data; if so, proceeding to step S4;