CN116366312A - Web attack detection method, device and storage medium - Google Patents


Info

Publication number
CN116366312A
CN116366312A
Authority
CN
China
Prior art keywords
data
attack
idf
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310250133.3A
Other languages
Chinese (zh)
Inventor
韩志松
许金旺
洪钧煌
方宏昊
王鑫渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Cloud Technology Co Ltd filed Critical Tianyi Cloud Technology Co Ltd
Priority to CN202310250133.3A priority Critical patent/CN116366312A/en
Publication of CN116366312A publication Critical patent/CN116366312A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/10: ... for controlling access to devices or network resources
    • H04L63/14: ... for detecting or protecting against malicious traffic
    • H04L63/1408: ... by monitoring network traffic
    • H04L63/1416: Event detection, e.g. attack signature detection
    • H04L63/1425: Traffic logging, e.g. anomaly detection
    • H04L63/1441: Countermeasures against malicious traffic
    • H04L63/145: Countermeasures against malicious traffic, the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • H04L63/1466: Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks
    • H04L63/20: ... for managing network security; network security policies in general
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a Web attack detection method, device and storage medium. The disclosed Web attack detection method comprises two stages, training and detection: training data is determined, a preset model is trained with the training data to obtain its optimal parameters, and the optimal parameters are applied to the preset model to obtain a trained model. After training is completed, the trained model is used for detection. The application also provides a Web attack detection device and a storage medium.

Description

Web attack detection method, device and storage medium
Technical Field
The present disclosure relates to the field of computing technologies, and in particular, to a method and apparatus for detecting a Web attack, and a storage medium.
Background
Platforms on the internet are constantly subjected to various types of Web attacks, such as structured query language (SQL) injection attacks, cross-site scripting (XSS) attacks, and so on. Attack detection methods in the prior art use specific matching rules to classify and identify attacks. However, current Web attack patterns are increasingly diverse; the rule-matching approach of the prior art can only detect attack patterns already present in the rule base and is powerless against unknown vulnerabilities or unknown types of attacks. Improving the detection capability for unknown attacks is a technical problem that urgently needs to be solved.
Disclosure of Invention
To address the above technical problem, embodiments of the present application provide a Web attack detection method, device and storage medium, which improve the detection capability for unknown attacks, protect website security and safeguard assets.
In a first aspect, a method for detecting a Web attack provided in an embodiment of the present application includes:
determining training data;
training a preset model by using the training data to obtain optimal parameters of the preset model, and applying the optimal parameters to the preset model to obtain a trained model;
collecting a Web access request;
determining detection data according to the Web access request;
and detecting whether the attack behavior exists in the detection data through the trained model.
In the invention, the Web attack detection can be divided into two steps of training and detection, wherein the training step comprises the following steps:
determining training data, training a preset model by using the training data to obtain optimal parameters of the preset model, and applying the optimal parameters to the preset model to obtain a trained model. After training is completed, the trained model is used for detection, which can include: and acquiring a Web access request, determining detection data according to the Web access request, and detecting whether an attack exists in the detection data through a trained model.
Preferably, the determining training data includes:
collecting Web attack data;
preprocessing the attack data to obtain second attack data;
extracting term frequency-inverse document frequency (TF-IDF) features from the second attack data;
performing dimension reduction processing on the TF-IDF features;
and adding manual features to the TF-IDF features to obtain the training data.
preferably, the determining the detection data according to the Web access request includes:
preprocessing the Web access request to obtain second request data;
extracting word frequency document frequency TF-IDF characteristics from the second request data;
performing dimension reduction treatment on the TF-IDF characteristics;
and adding manual characteristics to the TF-IDF to obtain the detection data.
Preferably, in determining the training data or determining the detection data, the preprocessing comprises one or a combination of the following:
decoding the data;
filling null values with the "NULL" string;
discarding duplicate data;
and manually screening abnormal data.
Preferably, the collecting Web attack data includes:
collecting logs generated by the edge nodes, and collecting attack data according to the logs;
the attack data comprises at least one of the following:
structured query language (SQL) injection attack, cross-site scripting (XSS) attack, command injection attack, directory traversal attack, and scanner attack.
Preferably, the extracting term frequency-inverse document frequency (TF-IDF) features includes:
the TF-IDF feature is the product of a term frequency (TF) feature and an inverse document frequency (IDF) feature, expressed by the following formulas:
TFIDF(t, d, D) = TF(t, d) · IDF(t, D);
TF(t, d) = f_{t,d} / Σ_{t'∈d} f_{t',d};
IDF(t, D) = log( N / |{d ∈ D : t ∈ d}| );
where TF(t, d) denotes the TF feature and IDF(t, D) denotes the IDF feature;
f_{t,d} denotes the number of occurrences of term t in document d, N denotes the total number of documents in the database, |{d ∈ D : t ∈ d}| denotes the number of documents in which term t appears, and D denotes the database;
t' denotes a term, and f_{t',d} denotes the number of occurrences of term t' in document d.
Preferably, the performing dimension reduction processing on the TF-IDF features includes:
performing dimension reduction on the TF-IDF features through truncated singular value decomposition;
the singular value decomposition is represented by the following formula:
M = UΣV^T;
where M is an m×n matrix, U is an m×m unitary matrix, V is an n×n unitary matrix, V^T is the transpose of V, and Σ is an m×n rectangular diagonal matrix;
m denotes the number of rows of the matrix and n denotes the number of columns.
Preferably, the manual feature comprises one or a combination of the following:
the request method;
the uniform resource locator (URL) length;
the number of request parameters;
the average value of the request parameter lengths;
the variance of the request parameter lengths;
the type of the first word of the client identifier (UA);
and the proportion of special characters, other than alphanumerics, in the Referer, URL, and Body.
Preferably, the preset model includes:
the preset model is a CatBoost model;
the loss function L_MultiClass of the CatBoost model is:
L_MultiClass = -(1/N) Σ_{i=1}^{N} log( e^{a_{i,t_i}} / Σ_{j=1}^{M} e^{a_{i,j}} );
where N is the number of samples, M is the number of categories, a_{i,j} is the model's prediction for category j on the i-th sample, and t_i is the label of the i-th sample;
i indexes the samples and j indexes the categories.
Preferably, the training the preset model using the training data to obtain the optimal parameters of the preset model includes:
selecting, through a preset parameter search range and evaluation index, the parameters corresponding to the optimal F1-macro value under a preset verification rule as the optimal parameters;
the F1-macro value is represented by the following formulas:
F1-macro = (1/M) Σ_{i=1}^{M} F1-score_i;
F1-score = 2 · precision · recall / (precision + recall);
where F1-score_i denotes the model's F1-score on category i, precision is the precision, and recall is the recall.
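The F1-macro metric above can be computed directly from predictions; a minimal plain-Python sketch for illustration (function and variable names are my own, not from the patent):

```python
def f1_macro(y_true, y_pred, num_classes):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 = 2 * precision * recall / (precision + recall)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    # Macro average: each class contributes equally, regardless of support.
    return sum(scores) / num_classes
```

Because each class contributes equally, F1-macro penalizes poor performance on rare attack classes, which is why it suits imbalanced attack-type data.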
In a second aspect, an embodiment of the present application further provides a Web attack detection apparatus, including:
a first data module configured to determine training data;
The training module is configured to train a preset model by using the training data to obtain optimal parameters of the preset model, and apply the optimal parameters to the preset model to obtain a trained model;
the second data module is configured to collect a Web access request and determine detection data according to the Web access request;
and the detection module is configured to detect whether the attack behavior exists in the detection data through the trained model.
In a third aspect, an embodiment of the present application further provides a Web attack detection apparatus, including: a memory, a processor, and a user interface;
the memory is used for storing a computer program;
the user interface is used for realizing interaction with a user;
the processor is used for reading the computer program in the memory, and when the processor executes the computer program, the Web attack detection method provided by the invention is realized.
In a fourth aspect, an embodiment of the present application further provides a processor readable storage medium, where a computer program is stored in the processor readable storage medium, and when the processor executes the computer program, the Web attack detection method provided by the present invention is implemented.
According to the Web attack detection method, training data is obtained through preprocessing on the basis of existing log data, a preset model is trained to obtain optimal parameters, and then whether a Web access request comprises an attack is detected according to the model using the optimal parameters. In the present invention, the preset model may be an artificial intelligence model. By the method, the attack types which do not exist in the manually formulated rule base can be identified, the attack types are classified, and the detection and identification capacity of unknown threats is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a Web attack detection method provided in an embodiment of the present application;
fig. 2 is a schematic diagram of a training data determination flow provided in an embodiment of the present application;
fig. 3 is a schematic diagram of a detection data determining process provided in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a Web attack detection device provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of another Web attack detection apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Some words appearing hereinafter are explained:
1. In the embodiments of the present invention, the term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
2. The term "plurality" in the embodiments of the present application means two or more, and other adjectives are similar thereto.
3. TF-IDF: Term Frequency-Inverse Document Frequency.
4. Truncated SVD: Truncated Singular Value Decomposition.
5. XSS: Cross-Site Scripting.
6. SQL: Structured Query Language.
7. URL: Uniform Resource Locator.
8. UA: User-Agent, i.e. the client identifier.
It should be noted that, the display sequence of the embodiments of the present application only represents the sequence of the embodiments, and does not represent the advantages or disadvantages of the technical solutions provided by the embodiments.
Referring to fig. 1, a schematic diagram of a Web attack detection method provided in an embodiment of the present application, as shown in fig. 1, includes steps S101 to S105:
s101, determining training data;
s102, training a preset model by using the training data to obtain optimal parameters of the preset model, and applying the optimal parameters to the preset model to obtain a trained model;
S103, collecting a Web access request;
s104, determining detection data according to the Web access request;
s105, detecting whether the attack behavior exists in the detection data through the trained model.
In the present invention, the Web attack detection method may be divided into two steps of training and detection, wherein the training step includes S101 and S102, and the detection step includes S103, S104 and S105.
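As a minimal sketch of this two-phase flow (the function names and injected callables are hypothetical, not from the patent), the training and detection steps can be wired together like this:

```python
def train(collect_fn, preprocess_fn, featurize_fn, fit_fn):
    """Training phase (S101-S102): collect data, preprocess, featurize, fit."""
    samples, labels = collect_fn()                     # S101: determine training data
    features = featurize_fn(preprocess_fn(samples))
    return fit_fn(features, labels)                    # S102: train, return fitted model

def detect(model, request, preprocess_fn, featurize_fn):
    """Detection phase (S103-S105): same preprocessing/featurization, then predict."""
    features = featurize_fn(preprocess_fn([request]))  # S104: determine detection data
    return model(features)                             # S105: attack / no-attack decision
```

The key design point the sketch makes explicit is that detection must reuse exactly the same preprocessing and featurization callables as training, otherwise the trained model sees inputs from a different distribution.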
In the present invention, in S101, the training data may be determined by collecting attack samples and normal request samples from logs generated by edge node servers, and then performing data preprocessing operations, such as data cleaning and missing value filling, on the collected data. Specifically, as shown in fig. 2, this includes S201 to S205:
s201, collecting Web attack data;
Attack samples and normal request samples are collected through logs generated by edge node servers. For example, logs generated by the edge nodes are collected into a Kafka cluster through Filebeat, the messages in the Kafka cluster are then consumed through Logstash, and the processed messages are written into Elasticsearch. According to the log types, five attack categories, namely SQL injection attack, XSS attack, command injection attack, directory traversal and scanner, can be acquired along with normal access logs. Each log contains five fields: the HTTP request method, the client identifier (UA), the URL, the Referer, and the request body (Body).
Here, Filebeat is a lightweight log data collector; Kafka is a distributed stream processing platform and message queue system that provides message storage and message transmission; Logstash is a data collection and processing engine that can read and process data from Kafka before sending it to Elasticsearch; Elasticsearch is a scalable, highly available search and analysis platform that can be used to search, analyze and store various types of data.
S202, preprocessing the attack data to obtain second attack data;
As a preferred example, the preprocessing includes one or a combination of the following:
decoding the data;
filling null values with the "NULL" string;
discarding duplicate data;
and manually screening abnormal data.
For example, in the collected logs, fields such as the URL are URL-encoded, and these encoded character strings need to be decoded. Furthermore, not all requests contain all fields: the Referer of a request may be empty, a GET request may not include a request body, and so on. Such null values need to be filled, for example with the "NULL" string. Finally, identical log records are discarded, and abnormal logs are manually screened out.
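The three preprocessing operations (decoding, null filling, deduplication) can be sketched as follows; the field names and record shape are assumptions for illustration, not taken from the patent:

```python
from urllib.parse import unquote

def preprocess(records):
    """Decode URL-encoded fields, fill missing fields with the "NULL" string,
    and drop exact duplicate records (field names are assumed)."""
    seen = set()
    cleaned = []
    for rec in records:
        row = {}
        for field in ("method", "url", "ua", "referer", "body"):
            value = rec.get(field)
            # Decode percent-encoding; fill missing/None with the literal "NULL".
            row[field] = unquote(value) if value is not None else "NULL"
        key = tuple(row.values())
        if key not in seen:          # discard identical records
            seen.add(key)
            cleaned.append(row)
    return cleaned
```

Manual screening of abnormal records, the fourth operation, is by definition not automated here.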
S203, extracting term frequency-inverse document frequency (TF-IDF) features from the second attack data;
As a preferred example, the extracted TF-IDF features include:
the TF-IDF feature is the product of a term frequency (TF) feature and an inverse document frequency (IDF) feature, expressed by the following formula:
TFIDF(t, d, D) = TF(t, d) · IDF(t, D);  (equation 1)
where
TF(t, d) = f_{t,d} / Σ_{t'∈d} f_{t',d};  (equation 2)
IDF(t, D) = log( N / |{d ∈ D : t ∈ d}| );  (equation 3)
TF(t, d) denotes the TF feature and IDF(t, D) denotes the IDF feature;
f_{t,d} denotes the number of occurrences of term t in document d, N denotes the total number of documents in the database, |{d ∈ D : t ∈ d}| denotes the number of documents in which term t appears, and D denotes the database;
t' denotes a term, and f_{t',d} denotes the number of occurrences of term t' in document d.
TF-IDF reflects the importance of a word to a document; its value is the product of the term frequency (TF) and the inverse document frequency (IDF), as expressed in equation 1 above. The term frequency TF is given by equation 2. The inverse document frequency IDF measures the amount of information a word can provide, and is calculated as in equation 3.
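Equations 1-3 can be checked with a direct, per-term implementation (illustrative only; a production pipeline would use a vectorizer, and the function names here are my own):

```python
import math
from collections import Counter

def tf(term, doc):
    """TF(t, d) = f_{t,d} / total term count of d (equation 2); doc is a token list."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """IDF(t, D) = log(N / |{d in D : t in d}|) (equation 3); assumes t occurs in D."""
    return math.log(len(corpus) / sum(1 for doc in corpus if term in doc))

def tfidf(term, doc, corpus):
    """TFIDF(t, d, D) = TF(t, d) * IDF(t, D) (equation 1)."""
    return tf(term, doc) * idf(term, corpus)
```

Note that a term occurring in every document gets IDF = log(1) = 0, so its TF-IDF vanishes: terms common to all requests carry no discriminative information.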
In the present invention, TF-IDF features are extracted for the URL, UA, Referer and Body fields in the same way. As a preferred example, the text of the URL, UA, Referer and Body is feature-extracted using TF-IDF, where the analyzer is set to char and the n-gram range is [1,3].
The URL field is taken as an example below.
For the URL http://www.test.com/?a=1&b=2, the terms consist of character n-grams (an n-gram is a natural language processing technique that segments text into contiguous subsequences of length n and treats these subsequences as separate elements), with the n-gram range [1,3]. Specifically, the URL is split with sliding window lengths of 1, 2 and 3, giving h, t, t, p, …, b, =, 2; ht, tt, tp, p:, …, b=, =2; htt, ttp, tp:, …, &b=, b=2. The analyzer is set to char, and when n is greater than 1, no space padding is performed at the edge positions. Each element after segmentation is a term t for TF-IDF, the whole URL is a document d, and all URLs form the database D.
S204, performing dimension reduction processing on the TF-IDF characteristics;
The feature dimension extracted through TF-IDF can reach tens of thousands to hundreds of thousands, and features of this magnitude are inconvenient for subsequent model training, so the features must first be reduced in dimension. For this purpose, the invention uses truncated singular value decomposition (Truncated SVD) to reduce the dimensionality of the features.
Preferably, the performing dimension reduction processing on the TF-IDF features includes:
performing dimension reduction on the TF-IDF features through truncated singular value decomposition;
the singular value decomposition is represented by the following formula:
M = UΣV^T;
where M is an m×n matrix, U is an m×m unitary matrix, V is an n×n unitary matrix, V^T is the transpose of V, and Σ is an m×n rectangular diagonal matrix;
m denotes the number of rows of the matrix and n denotes the number of columns.
The method provided by the invention uses truncated singular value decomposition for feature dimension reduction. For an m×n matrix M, its SVD can be represented by equation 4:
M = UΣV*;  (equation 4)
where U is an m×m unitary matrix, Σ is an m×n rectangular diagonal matrix whose diagonal elements are non-negative real numbers, V* is the conjugate transpose of V, and V is an n×n unitary matrix. In the present invention, the feature matrix M obtained from TF-IDF is guaranteed to be a real matrix, so U and V are orthogonal matrices, and in this case the SVD can be represented by equation 5:
M = UΣV^T;  (equation 5)
The elements on the diagonal of Σ are the singular values of the matrix M, denoted σ_1, σ_2, …, σ_p, with σ_1 ≥ σ_2 ≥ … ≥ σ_p and p = min(m, n). The column vectors of U are called left singular vectors, and the rows of V^T are the right singular vectors. The largest singular values account for most of the sum of all singular values, so the matrix M can be approximated by the k largest singular values together with the corresponding left and right singular vectors. In the present invention, k = 16, 16, 8 and 32 are taken for the TF-IDF features of the URL, Referer, UA and Body respectively; that is, the TF-IDF feature dimension is reduced from hundreds of thousands of dimensions to the order of tens or even units, preserving most of the information while facilitating subsequent model training.
S205, adding manual characteristics on the TF-IDF to obtain training data.
As a preferred example, the manual feature includes one or a combination of the following:
the request method;
the uniform resource locator (URL) length;
the number of request parameters;
the average value of the request parameter lengths;
the variance of the request parameter lengths;
the type of the first word of the client identifier (UA);
and the proportion of special characters, other than alphanumerics, in the Referer, URL, and Body.
In addition to the TF-IDF features, the present invention adds several manually extracted features, one or a combination of the following: the request method, the length of the URL, the number of request parameters, the mean and variance of the request parameter lengths, the type of the first word of the UA (e.g., Mozilla/5.0), and the proportion of special characters other than alphanumerics in the Referer, URL and Body.
As a preferred example, in S204, k = 16, 16, 8 and 32 are taken for the TF-IDF features of the URL, Referer, UA and Body respectively. In this step, the request method, the length of the URL, the number of request parameters, the mean and variance of the request parameter lengths, the type of the first word of the UA, and the proportion of special characters other than alphanumerics in the Referer, URL and Body are extracted, for a total of 9 manual features. Combining the TF-IDF features with the manual features, the resulting 16+16+8+32+9 = 81 dimensional features are used for model training.
The manual features are combined with the TF-IDF features as the final feature representation, which is input to the model for training.
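A sketch of how the nine manual features might be computed from one request; the parameter parsing, the "first word of UA" convention, and the treatment of empty fields are all assumptions for illustration:

```python
from urllib.parse import urlparse, parse_qsl

def manual_features(method, url, ua, referer, body):
    """Return the 9 manual features in the order listed above (assumed layout)."""
    params = parse_qsl(urlparse(url).query, keep_blank_values=True)
    lengths = [len(v) for _, v in params]
    mean = sum(lengths) / len(lengths) if lengths else 0.0
    var = sum((x - mean) ** 2 for x in lengths) / len(lengths) if lengths else 0.0

    def special_ratio(text):
        # Proportion of non-alphanumeric characters; 0.0 for empty fields.
        return sum(1 for ch in text if not ch.isalnum()) / len(text) if text else 0.0

    return [
        method,                              # 1. request method
        len(url),                            # 2. URL length
        len(params),                         # 3. number of request parameters
        mean,                                # 4. mean of parameter value lengths
        var,                                 # 5. variance of parameter value lengths
        ua.split("/")[0] if ua else "NULL",  # 6. first word of UA, e.g. "Mozilla"
        special_ratio(referer),              # 7. special-char ratio in Referer
        special_ratio(url),                  # 8. special-char ratio in URL
        special_ratio(body),                 # 9. special-char ratio in Body
    ]
```

These statistics are cheap to compute and complement TF-IDF: injection payloads tend to produce unusually long parameters and a high special-character ratio.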
As a preferred example, in S102, a preset model is trained using the training data to obtain its optimal parameters, and the optimal parameters are applied to the preset model to obtain a trained model. After the training data is prepared, it is input into the preset model for training to obtain the optimal parameters. The preset model may be an artificial intelligence (AI) model, for example a CatBoost model.
First the training parameters are determined, then all combinations of the training parameters are enumerated, and each combination is trained to obtain its score. After all combinations have been trained, the parameter combination with the highest score gives the optimal parameters. For example, the training model is CatBoost, and the training parameters are: the number of trees, the learning rate, the maximum depth, the number of numerical feature splits, the number of category feature splits, and an early-stopping parameter that prevents overfitting, 6 training parameters in total. The values of each parameter are as follows: the number of trees is 300, 500, 700, 1000, 1200 or 1500, 6 values in total; the learning rate is 0.01, 0.03, 0.09, 0.3 or 0.9, 5 values; the maximum depth is 3, 6 or 9, 3 values; the number of numerical feature splits is 16, 32 or 64, 3 values; the number of category feature splits is 16, 32 or 64, 3 values; the early-stopping parameter is set to 100, 1 value. The number of parameter combinations of these 6 parameters is then 6×6×5×3×3×3×1 = 4860. With the verification metric set to F1-macro, the 4860 combinations are trained separately, yielding 4860 F1-macro scores, and the combination with the highest F1-macro score is taken as the optimal parameters.
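The exhaustive search over the parameter grid can be sketched as follows; the keys in GRID are my mapping of the six parameters described above onto plausible CatBoost option names and may not match the actual API, and score_fn stands in for the cross-validated F1-macro evaluation:

```python
from itertools import product

# Assumed search grid mirroring the six training parameters described above.
GRID = {
    "iterations": [300, 500, 700, 1000, 1200, 1500],
    "learning_rate": [0.01, 0.03, 0.09, 0.3, 0.9],
    "depth": [3, 6, 9],
    "border_count": [16, 32, 64],          # numerical feature splits
    "one_hot_max_size": [16, 32, 64],      # category feature splits
    "early_stopping_rounds": [100],
}

def grid_search(score_fn, grid=GRID):
    """Score every parameter combination with score_fn and keep the best."""
    keys = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)           # e.g. 10-fold cross-validated F1-macro
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With a real model, score_fn would run the full cross-validation per combination, so the search cost is the grid size times k training runs.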
The training method for each parameter combination may be 10-fold cross-validation: the training data is divided into 10 equal parts, of which 9 parts are used for training (the training data part) and 1 part for validation (the validation data part). Training and validation rotate over the parts and are repeated 10 times, and the final validation result is the average of the 10 runs. The training process is as follows:
the first round of training, the 1 st data is used for verifying the data part, the 2 nd to 10 th data are used for training the data part, the training data part is input into the Catboost model for training, and the verification data part is used for verification, so that a first group F is obtained 1 -macro score.
A second round of training, wherein the 2 nd data is used for the verification data part, the 1 st part, the 3 rd to the 10 th parts are used for the training data part, the training data part is input into the Catboost model for training, and the verification data part is used for verification, so that a second group F is obtained 1 -macro score.
Third training, wherein the 3 rd data is used for verifying the data part, the 1 st to 2 nd and the 4 th to 10 th data are used for training the data part, the training data part is input into the Catboost model for training, and the verification data part is used for verification to obtain a third group F 1 -macro score.
Fourth training, wherein the 4 th data is used for verifying the data part, the 1 st to 3 rd and the 5 th to 10 th data are used for training the data part, the training data part is input into the Catboost model for training, and the verification data part is used for verification, so as to obtain a fourth group F 1 -macro score.
Fifth training, wherein the 5 th data is used for verifying the data part, the 1 st to 4 th data is used for training the data part, the 6 th to 10 th data is used for training the data part, the training data part is input into the Catboost model for training, and the verification data part is used for verification to obtain a fifth group F 1 -macro score.
A sixth training, wherein the 6 th data is used for verifying the data part, the 1 st to 5 th data is used for training the data part, the 7 th to 10 th data is used for training the data part, the training data part is input into the Catboost model for training, and the verification data part is used for verification to obtainTo the sixth group F 1 -macro score.
Seventh training, using 7 th data for verification data part, 1 st to 6 th and 8 th to 10 th for training data part, inputting training data part into Catboost model for training, and using verification data part for verification to obtain seventh group F 1 -macro score.
Eighth training, using the 8 th data part for the verification data part, 1 st to 7 th data part, and 9 th to 10 th data part for the training data part, inputting the training data part into the Catboost model for training, and using the verification data part for verification to obtain an eighth group F 1 -macro score.
A ninth training is performed, wherein the 9 th data is used for the verification data part, the 1 st to 8 th data are used for the training data part, the training data part is input into the Catboost model for training, and the verification data part is used for verification, so that a ninth group F is obtained 1 -macro score.
A tenth training is performed, wherein the 10 th data is used for verifying the data part, the 1 st to 9 th data are used for training the data part, the training data part is input into the Catboost model for training, and the verification data part is used for verification, so that a tenth group F is obtained 1 -macro score.
Then the first group F 1 -macro score, second group F 1 -macro score, third group F 1 -macro score, fourth group F 1 -macro score, fifth group F 1 -macro score, sixth group F 1 -macro score, seventh group F 1 -macro score, eighth group F 1 -macro score, ninth group F 1 -macro score and tenth group F 1 Dividing the sum of the macro scores by 10 to obtain F of the combined parameters 1 -macro score.
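A minimal Python sketch of the ten rounds above (the train-and-score callback is a hypothetical stand-in for fitting CatBoost and computing the F1-macro score; the toy scorer exists only to make the example runnable):

```python
# Minimal 10-fold cross-validation skeleton mirroring the ten rounds above.
# train_and_score is a placeholder for "fit CatBoost, return F1-macro score".

def ten_fold_cv(samples, train_and_score, k=10):
    folds = [samples[i::k] for i in range(k)]  # 10 roughly equal parts
    scores = []
    for i in range(k):                         # round i + 1
        validation = folds[i]                  # the (i+1)-th part validates
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(train_and_score(training, validation))
    return sum(scores) / k                     # final result: mean of the 10 scores

# Toy scorer (illustrative only): fraction of validation samples above the training mean.
def toy_scorer(train, val):
    mean = sum(train) / len(train)
    return sum(1 for v in val if v > mean) / len(val)

avg = ten_fold_cv(list(range(100)), toy_scorer)
```

The same skeleton applies per parameter combination: only the callback changes.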
As a preferable example, the preset model is a CatBoost model;
the loss function L_MultiClass of the CatBoost model is:

$$L_{MultiClass} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{a_{i,t_i}}}{\sum_{j=1}^{M} e^{a_{i,j}}}$$

where N is the number of samples, M is the number of categories, a_{i,j} is the model's raw prediction for the i-th sample on category j, and t_i is the label of the i-th sample; i indexes samples and j indexes categories.
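The L_MultiClass loss of the CatBoost model is the standard multiclass softmax cross-entropy averaged over the N samples; a minimal pure-Python sketch (the raw scores and labels shown are illustrative):

```python
import math

# Multiclass softmax cross-entropy: the average over N samples of
# -log(softmax probability assigned to the true class t_i).

def multiclass_loss(raw_scores, labels):
    """raw_scores: N x M raw model outputs a_{i,j}; labels: true classes t_i."""
    total = 0.0
    for a_i, t_i in zip(raw_scores, labels):
        log_denom = math.log(sum(math.exp(a) for a in a_i))
        total += -(a_i[t_i] - log_denom)  # -log(e^{a_{i,t_i}} / sum_j e^{a_{i,j}})
    return total / len(labels)

loss = multiclass_loss([[2.0, 0.5, 0.1], [0.2, 3.0, 0.3]], [0, 1])
```

Uniform raw scores give a loss of log(M), and confidently correct scores drive the loss toward zero.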
Selecting, through a preset parameter search range and evaluation index, the parameter corresponding to the optimal F1-macro value under a preset validation rule as the optimal parameter;
the F1-macro value is expressed by the following formulas:

$$F_1\text{-}score_i = \frac{2 \times precision_i \times recall_i}{precision_i + recall_i}$$

$$F_1\text{-}macro = \frac{1}{M}\sum_{i=1}^{M} F_1\text{-}score_i$$

where F1-score_i denotes the F1 score of the model on class i, precision denotes the precision, and recall denotes the recall.
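The F1-macro evaluation can be computed directly from per-class precision and recall as in the formulas above (a pure-Python sketch; the labels and predictions are illustrative):

```python
# F1-macro: per-class F1 from precision and recall, then the unweighted
# mean across the M classes.

def f1_macro(y_true, y_pred, num_classes):
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / num_classes

score = f1_macro([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2], 3)
```

Because every class is weighted equally, F1-macro penalizes poor performance on rare attack classes more than plain accuracy would.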
For example, CatBoost parameter tuning uses grid parameter search with 10-fold cross-validation. The number of trees is chosen from 300, 500, 700, 1000, 1200 and 1500; the learning rate from 0.01, 0.03, 0.09, 0.3 and 0.9; the maximum depth from 3, 6 and 9; the number of numerical feature divisions from 16, 32 and 64; and the number of categorical feature divisions from 16, 32 and 64. The early-stop parameter for preventing overfitting is set to 100, and the validation metric is set to F1-macro.
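The grid search over these parameter ranges can be sketched as follows (the cross-validation scorer is a hypothetical placeholder for the 10-fold CatBoost procedure; a toy scoring function is used so the example runs):

```python
import itertools

# Exhaustive grid search over the parameter ranges listed above; each
# combination is scored by a cross-validation callback and the best mean
# score wins. cv_score stands in for "run 10-fold CV with CatBoost".

param_grid = {
    "n_trees": [300, 500, 700, 1000, 1200, 1500],
    "learning_rate": [0.01, 0.03, 0.09, 0.3, 0.9],
    "max_depth": [3, 6, 9],
    "num_splits": [16, 32, 64],   # numerical feature divisions
    "cat_splits": [16, 32, 64],   # categorical feature divisions
}

def grid_search(grid, cv_score):
    best_params, best_score = None, float("-inf")
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = cv_score(params)  # would return the mean F1-macro over 10 folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scorer (illustrative only): favors depth 6 and a small learning rate.
best, best_score = grid_search(
    param_grid, lambda p: -abs(p["max_depth"] - 6) - p["learning_rate"]
)
```

This grid has 6 × 5 × 3 × 3 × 3 = 810 combinations, each of which would trigger one full 10-fold cross-validation run.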
In S103 of the invention, a Web access request is acquired in a manner similar to the Web attack data collection of S201: for example, logs generated by edge nodes are shipped to a Kafka cluster via Filebeat, and the messages in the Kafka cluster are then consumed by Logstash, processed, and written into Elasticsearch. According to the log types, five attack categories (SQL injection, XSS, command injection, directory traversal, and scanner) as well as normal access logs can be acquired. Each log contains five fields: the HTTP request method, the client identifier (UA), the URL, the Referer, and the request Body.
The determining detection data according to the Web access request in S104 of the present invention may include, as shown in fig. 3:
s301, preprocessing the Web access request to obtain second request data;
in this step, the preprocessing method is the same as in S202, and may include:
the preprocessing includes one or a combination of the following:
decoding the data;
filling null values with the string "NULL";
discarding identical data.
For example, in the collected logs, fields such as the URL are URL-encoded, and these encoded strings need to be decoded. Furthermore, not all requests contain all fields: the Referer may be empty, and a GET request does not include a request Body, etc. Such null cases require padding, for example filling the null value with the string "NULL". Finally, identical log records are discarded, and abnormal logs are screened manually.
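These preprocessing steps can be sketched in Python using only the standard library (the field names such as `url` and `referer` are illustrative, not the invention's actual schema):

```python
import urllib.parse

# Preprocessing sketch matching the three steps above: URL-decode encoded
# fields, fill missing values with the "NULL" string, and drop duplicates.

def preprocess(records):
    cleaned, seen = [], set()
    for rec in records:
        row = {}
        for field in ("url", "referer", "ua", "body"):  # illustrative field names
            value = rec.get(field)
            if value is None:
                row[field] = "NULL"                      # null -> "NULL" string
            else:
                row[field] = urllib.parse.unquote(value)  # decode %xx escapes
        key = tuple(row.values())
        if key not in seen:                              # discard identical records
            seen.add(key)
            cleaned.append(row)
    return cleaned

logs = [
    {"url": "/index.php?id=1%27%20OR%201%3D1", "ua": "curl/7.68"},
    {"url": "/index.php?id=1%27%20OR%201%3D1", "ua": "curl/7.68"},  # duplicate
]
rows = preprocess(logs)
```

After decoding, the SQL-injection payload `' OR 1=1` becomes visible to the downstream tokenizer, which is the point of decoding before feature extraction.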
S302, extracting term frequency-inverse document frequency TF-IDF features from the second request data;
in this step, the method for extracting TF-IDF features, similar to S203, includes:
the TF-IDF feature is composed of the product of a term frequency TF feature and an inverse document frequency IDF feature, and is expressed by the following formula:

TFIDF(t, d, D) = TF(t, d) · IDF(t, D);   (Equation 1)

where

$$TF(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \quad \text{(Equation 2)}$$

$$IDF(t,D) = \log\frac{N}{|\{d \in D : t \in d\}|} \quad \text{(Equation 3)}$$

TF(t, d) denotes the TF feature and IDF(t, D) denotes the IDF feature; f_{t,d} denotes the number of occurrences of term t in document d, N denotes the total number of documents in the database, |{d ∈ D : t ∈ d}| denotes the number of documents containing term t, and D denotes the database; t denotes a term, and f_{t',d} denotes the number of occurrences of term t' in document d.
TF-IDF reflects the importance of a word to a document. Its value is the product of the term frequency (TF) and the inverse document frequency (IDF), as expressed in Equation 1 above. The term frequency TF is given by Equation 2. The inverse document frequency IDF measures the amount of information a word can provide, and is calculated as shown in Equation 3.
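Equations 1-3 can be implemented directly (a minimal sketch over tokenized documents; the corpus shown is illustrative):

```python
import math

# Direct implementation of Equations 1-3:
#   TF(t, d)    = f_{t,d} / sum_{t' in d} f_{t',d}
#   IDF(t, D)   = log(N / |{d in D : t in d}|)
#   TFIDF(t,d,D) = TF(t, d) * IDF(t, D)

def tf(term, doc):
    return doc.count(term) / len(doc)          # doc is a list of tokens

def idf(term, corpus):
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)  # assumes the term occurs somewhere

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Illustrative tokenized request fields.
corpus = [["select", "from", "users"], ["union", "select", "passwd"], ["get", "index"]]
w = tfidf("union", corpus[1], corpus)
```

A term appearing in every document gets IDF = log(1) = 0, so ubiquitous tokens contribute nothing, which is exactly the weighting the method relies on.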
S303, performing dimension reduction processing on the TF-IDF characteristics;
in this step, the method of dimension reduction processing is the same as S204, and includes:
performing dimension reduction on the TF-IDF characteristic through truncated singular value decomposition;
the singular value decomposition is represented by the following formula:
$$M = U\Sigma V^{T}$$

where M is an m×n matrix, U is an m×m unitary matrix, V is an n×n unitary matrix, V^T is the transpose of V, and Σ is an m×n rectangular diagonal matrix; m denotes the number of rows of the matrix and n the number of columns.
The method provided by the invention adopts truncated singular value decomposition for feature dimension reduction. For an m×n matrix M, its SVD can be written as Equation 4:

$$M = U\Sigma V^{*} \quad \text{(Equation 4)}$$

where U is an m×m unitary matrix, Σ is an m×n rectangular diagonal matrix whose diagonal elements are non-negative real numbers, V* is the conjugate transpose of V, and V is an n×n unitary matrix. For the present invention, the feature matrix M obtained from TF-IDF is guaranteed to be a real matrix, so U and V are orthogonal matrices, and in this case the SVD can be written as Equation 5:

$$M = U\Sigma V^{T} \quad \text{(Equation 5)}$$

The diagonal elements of Σ are the singular values of M, denoted σ_1, σ_2, …, σ_p, with σ_1 ≥ σ_2 ≥ … ≥ σ_p and p = min(m, n). The column vectors of U are the left singular vectors, and the rows of V^T are the right singular vectors. The largest singular values account for most of the sum of all singular values, so the matrix M can be approximated by the k largest singular values together with the corresponding left and right singular vectors. In the invention, k = 16, 16, 8 and 32 are taken for the TF-IDF features of the URL, Referer, UA and Body respectively, i.e. the TF-IDF feature dimension is reduced from hundreds of thousands to the order of tens or even units, retaining most of the information while facilitating subsequent model training.
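A minimal sketch of the rank-k truncation (using NumPy's full SVD as a stand-in for a dedicated truncated-SVD routine; the matrix sizes are illustrative):

```python
import numpy as np

# Rank-k approximation via truncated SVD (Equation 5): keep the k largest
# singular values and the corresponding singular vectors. The reduced
# representation U_k * Sigma_k is what would feed the classifier.

def truncated_svd(M, k):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)  # s is sorted descending
    return U[:, :k] * s[:k], Vt[:k, :]  # (m x k) reduced features, (k x n) basis

rng = np.random.default_rng(0)
M = rng.random((20, 100))               # stand-in for a TF-IDF feature matrix
Z, basis = truncated_svd(M, 8)          # e.g. k = 8 as for the UA field
approx = Z @ basis                      # rank-8 reconstruction of M
```

Keeping only the top k singular triples is what collapses a TF-IDF matrix of hundreds of thousands of columns down to tens of features per field.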
S304, adding manual features on the TF-IDF to obtain the detection data.
As a preferred example, the manual feature includes one or a combination of the following:
a request method;
uniform resource locator (URL) length;
the number of request parameters;
the average of the request parameter lengths;
the variance of the request parameter lengths;
the type of the first word of the client identifier (UA);
the Referer, URL, and Body.
The manual feature is combined with the TF-IDF feature as a final feature representation to obtain the test data.
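A sketch of extracting such manual features from a single request (the field handling and the UA first-word heuristic are assumptions for illustration, not the invention's exact definitions):

```python
import statistics
import urllib.parse

# Manual-feature sketch for one request, following the list above: request
# method, URL length, parameter count, mean/variance of parameter lengths,
# and the first word of the client identifier (UA).

def manual_features(method, url, ua):
    parsed = urllib.parse.urlparse(url)
    params = urllib.parse.parse_qsl(parsed.query)
    lengths = [len(value) for _, value in params]
    return {
        "method": method,
        "url_length": len(url),
        "param_count": len(params),
        "param_len_mean": statistics.mean(lengths) if lengths else 0.0,
        "param_len_var": statistics.pvariance(lengths) if lengths else 0.0,
        # Heuristic: the product name before the first "/" or space in the UA.
        "ua_first_word": ua.split("/")[0].split()[0] if ua else "NULL",
    }

feat = manual_features("GET", "/item?id=12&name=abcd", "Mozilla/5.0 (X11)")
```

These scalar features would then be concatenated with the SVD-reduced TF-IDF vectors to form the final feature representation.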
In S105 of the invention, the trained model detects whether an attack exists in the detection data.
Through the method of the embodiment, the Web attack detection can be divided into two steps of training and detection, wherein the training step comprises:
determining training data, training a preset model by using the training data to obtain optimal parameters of the preset model, and applying the optimal parameters to the preset model to obtain a trained model. After training is completed, the trained model is used for detection, which can include: and acquiring a Web access request, determining detection data according to the Web access request, and detecting whether an attack exists in the detection data through a trained model.
In order to identify and classify unknown attack threats, the invention uses an artificial-intelligence approach: features of the request message are extracted, a classifier identifies whether the request is abnormal, and if an abnormality exists its class can be output. The invention can identify attack types that do not exist in manually formulated rule bases and classify them, providing good detection and identification capability for unknown threats (such as 0-day vulnerabilities).
Different from the traditional rule matching model, the invention uses an artificial intelligence mode to detect the request message and does not depend on specific hard rule matching, thereby being capable of finding abnormal Web request messages and classifying the abnormal Web request messages.
The method of the invention extracts TF-IDF features of the URL, Referer, UA and Body, and also extracts some manual features; the manual features can be determined from business experience and combined with the TF-IDF features for a better detection effect.
The invention uses truncated SVD for feature dimension reduction, retaining most of the TF-IDF information while greatly reducing training complexity. The CatBoost model handles categorical features well, which improves model accuracy; in addition, the optimal model parameters are determined by 10-fold cross-validation with grid parameter search.
Based on the same inventive concept, the embodiment of the invention also provides a Web attack detection device, as shown in fig. 4, which comprises:
a first data module 401 configured to determine training data;
the training module 402 is configured to train a preset model by using the training data to obtain an optimal parameter of the preset model, and apply the optimal parameter to the preset model to obtain a trained model;
a second data module 403 configured to collect a Web access request, and determine detection data according to the Web access request;
a detection module 404 configured to detect whether an attack exists in the detection data through the trained model.
As a preferred example, the first data module 401 is further configured for determining training data, comprising:
collecting Web attack data;
preprocessing the attack data to obtain second attack data;
extracting term frequency-inverse document frequency TF-IDF features from the second attack data;
performing dimension reduction treatment on the TF-IDF characteristics;
and adding manual characteristics on the TF-IDF to obtain training data.
As a preferred example, the second data module 403 is further configured to determine detection data according to the Web access request, including:
Preprocessing the Web access request to obtain second request data;
extracting term frequency-inverse document frequency TF-IDF features from the second request data;
performing dimension reduction treatment on the TF-IDF characteristics;
and adding manual characteristics to the TF-IDF to obtain the detection data.
As a preferred example, the first data module 401 is further configured to gather Web attack data, including:
collecting logs generated by the edge nodes, and collecting attack data according to the logs;
the attack data comprises at least one of the following:
structured query language SQL injection attack, cross site scripting attack XSS, command injection attack, directory traversal attack, and scanner attack.
As a preferred example, the first data module 401 is further configured for preprocessing comprising:
the preprocessing includes one or a combination of the following:
decoding the data;
filling null values with the string "NULL";
discarding identical data;
manually screening abnormal data.
As a preferred example, the first data module 401 or the second data module 403 is further configured to extract term frequency-inverse document frequency TF-IDF features, including:
the TF-IDF feature is composed of the product of a term frequency TF feature and an inverse document frequency IDF feature, and is expressed by the following formula:

TFIDF(t, d, D) = TF(t, d) · IDF(t, D);

where

$$TF(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

$$IDF(t,D) = \log\frac{N}{|\{d \in D : t \in d\}|}$$

TF(t, d) denotes the TF feature and IDF(t, D) denotes the IDF feature; f_{t,d} denotes the number of occurrences of term t in document d, N denotes the total number of documents in the database, |{d ∈ D : t ∈ d}| denotes the number of documents containing term t, and D denotes the database; t denotes a term, and f_{t',d} denotes the number of occurrences of term t' in document d.
As a preferred example, the first data module 401 or the second data module 403 is further configured to perform a dimension reduction process on the TF-IDF feature, including:
performing dimension reduction on the TF-IDF characteristic through truncated singular value decomposition;
the singular value decomposition is represented by the following formula:
$$M = U\Sigma V^{T}$$

where M is an m×n matrix, U is an m×m unitary matrix, V is an n×n unitary matrix, V^T is the transpose of V, and Σ is an m×n rectangular diagonal matrix; m denotes the number of rows of the matrix and n the number of columns.
As a preferred example, the preset model includes:
the preset model is a Catboost model;
the loss function L_MultiClass of the CatBoost model is:

$$L_{MultiClass} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{a_{i,t_i}}}{\sum_{j=1}^{M} e^{a_{i,j}}}$$

where N is the number of samples, M is the number of categories, a_{i,j} is the model's raw prediction for the i-th sample on category j, and t_i is the label of the i-th sample; i indexes samples and j indexes categories.
As a preferred example, the training module 402 is further configured to select, through a preset parameter search range and evaluation index, the parameter corresponding to the optimal F1-macro value under a preset validation rule as the optimal parameter;
the F1-macro value is expressed by the following formulas:

$$F_1\text{-}score_i = \frac{2 \times precision_i \times recall_i}{precision_i + recall_i}$$

$$F_1\text{-}macro = \frac{1}{M}\sum_{i=1}^{M} F_1\text{-}score_i$$

where F1-score_i denotes the F1 score of the model on class i, precision denotes the precision, and recall denotes the recall.
It should be noted that, the first data module 401 provided in the present embodiment can implement all the functions included in step S101 in the first embodiment, solve the same technical problem, achieve the same technical effect, and are not described herein again;
it should be noted that, the training module 402 provided in the present embodiment can implement all the functions included in the step S102 in the first embodiment, solve the same technical problem, achieve the same technical effect, and are not described herein again;
it should be noted that, the second data module 403 provided in the present embodiment can implement all the functions included in step S103 in the first embodiment, solve the same technical problem, achieve the same technical effect, and are not described herein again;
it should be noted that, the detection module 404 provided in the present embodiment can implement all the functions included in the steps S104 and S105 in the first embodiment, solve the same technical problem, achieve the same technical effect, and are not described herein again;
it should be noted that, the device and the method belong to the same inventive concept, solve the same technical problem, achieve the same technical effect, and the device provided in the second embodiment can implement all the methods in the first embodiment, and the same points are not repeated.
Based on the same inventive concept, the embodiment of the invention also provides a Web attack detection device, as shown in fig. 5, which comprises:
including a memory 502, a processor 501, and a user interface 503;
the memory 502 is used for storing a computer program;
the user interface 503 is configured to interact with a user;
the processor 501 is configured to read a computer program in the memory 502, where the processor 501 implements:
determining training data;
training a preset model by using the training data to obtain optimal parameters of the preset model, and applying the optimal parameters to the preset model to obtain a trained model;
collecting a Web access request;
determining detection data according to the Web access request;
and detecting whether the attack behavior exists in the detection data through the trained model.
In fig. 5, the bus architecture may comprise any number of interconnected buses and bridges, linking together one or more processors represented by processor 501 and various memory circuits represented by memory 502. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further herein. The bus interface provides an interface. The processor 501 is responsible for managing the bus architecture and general processing, and the memory 502 may store data used by the processor 501 in performing operations.
The processor 501 may be a CPU, ASIC, FPGA or CPLD, and may also employ a multi-core architecture.
When the processor 501 executes the computer program stored in the memory 502, any Web attack detection method in the first embodiment is implemented.
It should be noted that, the device provided in the third embodiment and the method provided in the first embodiment belong to the same inventive concept, solve the same technical problem, achieve the same technical effect, and the device provided in the third embodiment can implement all the methods in the first embodiment, and the same points are not repeated.
The present application also proposes a processor readable storage medium. The processor-readable storage medium stores a computer program, and when the processor executes the computer program, the processor implements any one of the Web attack detection methods in the first embodiment.
It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice. In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A method for detecting Web attacks on a Web page, comprising:
determining training data;
training a preset model by using the training data to obtain optimal parameters of the preset model, and applying the optimal parameters to the preset model to obtain a trained model;
collecting a Web access request;
determining detection data according to the Web access request;
and detecting whether the attack behavior exists in the detection data through the trained model.
2. The method of claim 1, wherein the determining training data comprises:
collecting Web attack data;
preprocessing the attack data to obtain second attack data;
extracting term frequency-inverse document frequency TF-IDF features from the second attack data;
performing dimension reduction treatment on the TF-IDF characteristics;
and adding manual characteristics on the TF-IDF to obtain training data.
3. The method according to claim 2, characterized in that:
the determining detection data according to the Web access request comprises:
preprocessing the Web access request to obtain second request data;
extracting term frequency-inverse document frequency TF-IDF features from the second request data;
performing dimension reduction treatment on the TF-IDF characteristics;
adding manual features on the TF-IDF to obtain the detection data;
the preprocessing includes one or a combination of the following:
decoding the data;
filling null values with the string "NULL";
discarding identical data;
manually screening abnormal data;
the collecting Web attack data includes:
collecting logs generated by the edge nodes, and collecting attack data according to the logs;
the attack data comprises at least one of the following:
Structured query language SQL injection attack, cross site scripting attack XSS, command injection attack, directory traversal attack, and scanner attack.
4. A method according to claim 3, characterized in that:
the extracting of the term frequency-inverse document frequency TF-IDF features comprises:
the TF-IDF feature is composed of the product of a term frequency TF feature and an inverse document frequency IDF feature, and is expressed by the following formula:

TFIDF(t, d, D) = TF(t, d) · IDF(t, D);

wherein

$$TF(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

$$IDF(t,D) = \log\frac{N}{|\{d \in D : t \in d\}|}$$

TF(t, d) denotes the TF feature and IDF(t, D) denotes the IDF feature;
f_{t,d} denotes the number of occurrences of term t in document d, N denotes the total number of documents in the database, |{d ∈ D : t ∈ d}| denotes the number of documents in which term t appears, and D denotes the database;
t denotes a term, and f_{t',d} denotes the number of occurrences of term t' in document d;
the dimension reduction processing of the TF-IDF feature comprises the following steps:
performing dimension reduction on the TF-IDF characteristic through truncated singular value decomposition;
the singular value decomposition is represented by the following formula:
$$M = U\Sigma V^{T}$$

wherein M is an m×n matrix, U is an m×m unitary matrix, V is an n×n unitary matrix, V^T is the transpose of V, and Σ is an m×n rectangular diagonal matrix; m denotes the number of rows of the matrix and n denotes the number of columns.
5. The method of claim 4, wherein the manual feature comprises one or a combination of:
a request method;
uniform resource locator URL length;
the number of request parameters;
the average of the request parameter lengths;
the variance of the request parameter lengths;
the type of the first word of the client identifier UA;
the Referer, URL, and Body.
6. The method of claim 1, wherein the predetermined model comprises:
the preset model is a Catboost model;
the loss function L_MultiClass of the CatBoost model is:

$$L_{MultiClass} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{a_{i,t_i}}}{\sum_{j=1}^{M} e^{a_{i,j}}}$$

where N is the number of samples, M is the number of categories, a_{i,j} is the model's raw prediction for the i-th sample on category j, and t_i is the label of the i-th sample; i indexes samples and j indexes categories.
7. The method of claim 6, wherein training a pre-set model using the training data to obtain optimal parameters for the pre-set model comprises:
selecting, through a preset parameter search range and evaluation index, the parameter corresponding to the optimal F1-macro value under a preset validation rule as the optimal parameter;
the F1-macro value is expressed by the following formulas:

$$F_1\text{-}score_i = \frac{2 \times precision_i \times recall_i}{precision_i + recall_i}$$

$$F_1\text{-}macro = \frac{1}{M}\sum_{i=1}^{M} F_1\text{-}score_i$$

wherein F1-score_i denotes the F1 score of the model on class i, precision denotes the precision, and recall denotes the recall.
8. A Web attack detection apparatus, comprising:
a first data module configured to determine training data;
the training module is configured to train a preset model by using the training data to obtain optimal parameters of the preset model, and apply the optimal parameters to the preset model to obtain a trained model;
the second data module is configured to collect a Web access request and determine detection data according to the Web access request;
and the detection module is configured to detect whether the attack behavior exists in the detection data through the trained model.
9. A Web attack detection device, comprising a memory, a processor and a user interface;
the memory is used for storing a computer program;
the user interface is used for realizing interaction with a user;
the processor is configured to read a computer program in the memory, and when the processor executes the computer program, implement the Web attack detection method according to one of claims 1 to 7.
10. A processor-readable storage medium, characterized in that the processor-readable storage medium stores a computer program, which when executed by the processor implements the Web attack detection method according to one of claims 1 to 7.
CN202310250133.3A 2023-03-13 2023-03-13 Web attack detection method, device and storage medium Pending CN116366312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310250133.3A CN116366312A (en) 2023-03-13 2023-03-13 Web attack detection method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116366312A true CN116366312A (en) 2023-06-30

Family

ID=86913247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310250133.3A Pending CN116366312A (en) 2023-03-13 2023-03-13 Web attack detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116366312A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116743509A (en) * 2023-08-15 2023-09-12 山东万里红信息技术有限公司 Intrusion detection method and device based on multi-algorithm combination and computer equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination