CN116796326B

CN116796326B - SQL injection detection method

Info

Publication number: CN116796326B
Application number: CN202311048644.3A
Authority: CN
Inventors: 刘雨蒙; 赵怡婧; 李思登; 苏毅; 王潮; 徐帆江
Original assignee: Beijing Institute of Remote Sensing Equipment
Current assignee: Beijing Institute of Remote Sensing Equipment
Priority date: 2023-08-21
Filing date: 2023-08-21
Publication date: 2023-11-14
Anticipated expiration: 2043-08-21
Also published as: CN116796326A

Abstract

The application discloses an SQL injection detection method comprising the following steps: collecting an SQL query data set for training and testing, wherein the SQL query data set comprises query statements of the SQL injection category and query statements of the non-SQL-injection category; performing feature extraction on the SQL query data set by using the chi-square test, and selecting feature vectors relevant to SQL injection attack detection; and training a probabilistic neural network (PNN) model that takes the feature vectors relevant to SQL injection attack detection as input and outputs the category of each SQL query statement. By automatically extracting features and learning patterns from a large amount of sample data through deep learning, the method has strong generalization capability and adaptability and can effectively identify unknown and novel SQL injection attacks. Using the chi-square test for feature extraction selects the features most strongly correlated with SQL injection attack detection, which improves the accuracy and performance of the model.

Description

SQL injection detection method

Technical Field

The application relates to the technical field of database system safety, in particular to an SQL injection detection method.

Background

SQL injection is considered one of the most dangerous vulnerabilities: by injecting malicious SQL code, an attacker can bypass an application's authentication, obtain sensitive information, modify database contents, or perform other malicious operations. Traditional defenses use static methods that rely primarily on predefined rules to identify previously known SQL injection attacks. Specifically, a static method checks whether an SQL query statement contains specific keywords, symbols or grammatical structures that are usually associated with SQL injection attacks; if such a pattern is matched, the system marks the query as a potential SQL injection attack. However, this traditional approach requires rules to be adjusted manually to adapt to different attack patterns, has difficulty handling complex attack patterns, and cannot capture unknown or novel attacks.

Disclosure of Invention

The application provides an SQL injection detection method aiming at the defects in the prior art.

The embodiment of the application provides an SQL injection detection method, which comprises the following steps:

collecting an SQL query data set for training and testing, wherein the SQL query data set comprises query sentences of SQL injection categories and query sentences of non-SQL injection categories;

feature extraction is carried out on the SQL query data set by using chi-square test, and feature vectors relevant to SQL injection attack detection are selected;

training the PNN model of the probabilistic neural network, taking the feature vector related to SQL injection attack detection as the input of the PNN model, and outputting the SQL query statement category.

In some embodiments, feature extraction by using chi-square test, selecting feature vectors related to SQL injection attack detection, comprises:

the chi-square test formula is shown as formula (1), wherein $O_n$ represents the observed frequency, $E_n$ represents the expected frequency, and $n$ is the index of each cell in the contingency table;

$$\chi^2 = \sum_{n} \frac{(O_n - E_n)^2}{E_n} \tag{1}$$

and selecting the characteristic with the chi-square value larger than a preset threshold value, and determining the characteristic as a characteristic vector related to SQL injection attack detection.

In some embodiments, the PNN model has four types of layers, an input layer, a mode layer, a summation layer, and an output layer, training the PNN model, including:

the input layer acquires input data and distributes the input to neurons;

the mode layer calculates the similarity between the input data and the sample according to the Gaussian kernel function corresponding to each neuron, and generates mode output;

the Gaussian kernel function is as in formula (2):

$$\omega_{ij} = \frac{1}{(2\pi)^{d/2}\sigma^{d}} \exp\left(-\frac{\lVert X_{new} - X_{ij} \rVert^{2}}{2\sigma^{2}}\right) \tag{2}$$

wherein $\omega_{ij}$ represents the kernel function output for the input data and the $j$-th training sample and has a value between 0 and 1, a larger value indicating a higher similarity; $d$ represents the input data dimension, i.e. the number of features in the feature vector related to SQL injection attack detection obtained by the chi-square test; $\sigma$ is a smoothing factor, and the smoothing factor $\sigma$ in formula (2) is trained by a particle swarm algorithm; $X_{new}$ represents the input data; $X_{ij}$ represents the $j$-th training sample in category $C_i$; and $\lVert X_{new} - X_{ij} \rVert$ represents the Euclidean distance between the input data and the sample data;

the summation layer sums the probability outputs belonging to the same class to approximate the conditional probability density of that class; the summation layer formula is formula (3), in which the $\omega_{ij}$ of class $C_i$ are summed to obtain the conditional probability density $p(X_{new} \mid C_i)$ of class $C_i$, which represents the probability that the input data $X_{new}$ belongs to class $C_i$:

$$p(X_{new} \mid C_i) = \sum_{X_{ij} \in C_i} \omega_{ij} \tag{3}$$

the output layer selects the class with the maximum posterior probability as the final output based on the Bayesian rule, and the output layer formula is as formula (4):

$$\mathit{Class}(X_{new}) = \underset{1 \le i \le NC}{\arg\max}\; \{\, p(X_{new} \mid C_i) \,\} \tag{4}$$

wherein $\mathit{Class}(X_{new})$ is the category assigned to the new sample $X_{new}$ and $NC$ is the number of categories; in SQL injection detection, $NC$ covers two categories, SQL injection statements and non-SQL-injection statements, so $\mathit{Class}(X_{new})$ indicates whether the sample $X_{new}$ is an SQL injection statement or a non-injection statement.

In some embodiments, training the smoothing factor $\sigma$ in formula (2) by a particle swarm algorithm comprises:

defining the search space as one-dimensional, randomly generating the initial positions and velocities of the particles to initialize the particle swarm, the initial position and velocity of the $m$-th particle being $X_m$ and $V_m$ respectively;

for each $\sigma$ value, building a PNN model and calculating the classification accuracy through K-fold cross-validation to obtain a fitness value;

recording the highest fitness value corresponding to each $\sigma$ value as the individual optimal solution $\mathit{pbest}_m$, and recording the highest fitness value among all $\sigma$ values as the global optimal solution $\mathit{gbest}$;

optimizing the fitness function by continuously updating the position and velocity of the particles, the velocity and position update formulas being formula (5) and formula (6):

$$V_m^{t+1} = c_0 V_m^{t} + c_1 r_1 \left(\mathit{pbest}_m - X_m^{t}\right) + c_2 r_2 \left(\mathit{gbest} - X_m^{t}\right) \tag{5}$$

$$X_m^{t+1} = X_m^{t} + V_m^{t+1} \tag{6}$$

wherein $X_m^{t}$, $V_m^{t}$, $X_m^{t+1}$ and $V_m^{t+1}$ are the current position and velocity of the particle and its position and velocity in the next iteration, respectively; $c_0$ is an inertia weight that controls how strongly the particle keeps its previous velocity direction, and the value range of $c_0$ is usually $[0, 1]$; $c_1$ and $c_2$ are the individual learning factor and the social learning factor, whose value range is $[0, 1]$; and $r_1$ and $r_2$ are random numbers in the range $[0, 1]$ that introduce randomness and increase the exploratory power of the algorithm;

iterating the optimization and returning the $\sigma$ value corresponding to $\mathit{gbest}$ as the optimal solution.

In some embodiments, when the smoothing factor $\sigma$ is trained, the training set is further divided into K subsets, K-fold cross-validation is carried out on the PNN model, and the classification accuracy is calculated to measure the performance of the PNN model under the current $\sigma$;

wherein in each round of cross-validation, the PNN model is trained on the training portion of the training set and validated on the validation subset held out from the training set.
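The K-fold procedure above can be sketched as follows. This is a minimal illustration only: the function names `k_fold_indices` and `cross_val_accuracy`, and the `predict(train_X, train_y, x)` hook standing in for the PNN, are hypothetical and do not come from the application. The folds here are contiguous, so the sketch assumes the samples have already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k nearly equal contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(X, y, k, predict):
    """Average accuracy over k folds.

    predict(train_X, train_y, x) -> label is a hypothetical classifier hook;
    in the method described here it would wrap the PNN for a given sigma.
    """
    correct = 0
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        # Train on every sample outside the current fold.
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        # Validate on the held-out fold.
        correct += sum(predict(train_X, train_y, X[i]) == y[i] for i in fold)
    return correct / len(X)
```

The returned accuracy can serve directly as the fitness value of one candidate smoothing factor.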

In some embodiments, after training the probabilistic neural network (PNN) model, which takes the feature vectors related to SQL injection attack detection as input and outputs the SQL query statement category, the method further includes:

the PNN model is evaluated with a test set in the SQL query dataset as input by the confusion matrix of the classifier.

In some embodiments, the confusion matrix of the classifier is displayed as cells in a 2x2 grid, and the evaluation of the PNN model by the confusion matrix of the classifier with the test set in the SQL query data set as input comprises:

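determining, by using the confusion matrix of the classifier, the number of statements $TP$ of the first classification result, in which the true value in the test set is positive and the predicted value is positive; the number of statements $FN$ of the second classification result, in which the true value is positive and the predicted value is negative; the number of statements $FP$ of the third classification result, in which the true value is negative and the predicted value is positive; and the number of statements $TN$ of the fourth classification result, in which the true value is negative and the predicted value is negative; wherein a positive true value indicates that SQL injection actually exists, a negative true value indicates that SQL injection actually does not exist, a positive predicted value indicates that SQL injection is predicted to exist, and a negative predicted value indicates that SQL injection is predicted not to exist;

calculating, according to $TP$, $FN$, $FP$ and $TN$, the classification accuracy $\mathit{Accuracy}$, the precision $\mathit{Precision}$, the recall $\mathit{Recall}$ and the F-Measure by the following four evaluation methods, as shown in the following formulas:

$$\mathit{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\mathit{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\mathit{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$F\text{-}\mathit{Measure} = \frac{2 \times \mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} \tag{10}$$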

in some embodiments, before feature extraction by using chi-square test, further comprising:

and carrying out data preprocessing on the training set in the SQL query data set.

In some embodiments, data preprocessing the SQL query dataset comprises:

data in the training set is standardized, and noise and repeated data are removed;

and performing word segmentation processing on the query sentences in the training set, splitting the query into Token, encoding the Token, and converting the Token into a numerical form which can be understood by a model.

In some embodiments, word segmentation processing is performed on query sentences in the training set, and splitting the query into Token includes:

splitting the query sentence into a keyword Token, an identifier Token, an operator Token and a constant Token, and converting the query sentence into a lexical unit sequence.
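As a minimal sketch of the word segmentation and encoding described above, the following splits a query into keyword, identifier, operator and constant tokens and maps them to integers. The keyword list `SQL_KEYWORDS`, the regular expression, the placeholder symbols `<IDENTIFIER>` and `<CONSTANT>`, and the function names are all assumptions made for illustration; the application does not specify a concrete lexer or encoding scheme.

```python
import re

# Hypothetical, deliberately short keyword list; a real lexer would use the
# full reserved-word list of the target SQL dialect.
SQL_KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "UNION", "INSERT",
                "UPDATE", "DELETE", "DROP", "INTO", "VALUES", "LIKE", "NOT", "NULL"}

TOKEN_RE = re.compile(
    r"(?P<constant>'(?:[^']|'')*'|\d+(?:\.\d+)?)"        # quoted strings, numbers
    r"|(?P<word>[A-Za-z_][A-Za-z0-9_]*)"                 # keywords or identifiers
    r"|(?P<operator><>|<=|>=|!=|--|[=<>+\-*/%(),;.#])"   # operators, punctuation
)


def tokenize(query):
    """Return a list of (kind, text) pairs; unrecognized characters are skipped."""
    tokens = []
    for match in TOKEN_RE.finditer(query):
        kind, text = match.lastgroup, match.group()
        if kind == "word":
            kind = "keyword" if text.upper() in SQL_KEYWORDS else "identifier"
        tokens.append((kind, text))
    return tokens


def encode(tokens, vocab):
    """Map tokens to integer ids, growing vocab as new symbols appear.

    Identifiers and constants are collapsed to placeholder symbols so that
    the encoding reflects query structure rather than specific names or values.
    """
    ids = []
    for kind, text in tokens:
        if kind == "keyword":
            symbol = text.upper()
        elif kind == "operator":
            symbol = text
        else:
            symbol = "<" + kind.upper() + ">"
        ids.append(vocab.setdefault(symbol, len(vocab)))
    return ids
```

Collapsing identifiers and constants is one possible design choice; it keeps the vocabulary small, at the cost of discarding the literal values.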

In the embodiments of the application, features are extracted automatically and patterns are learned from a large amount of sample data through deep learning, so that the method has strong generalization capability and adaptability and can effectively identify unknown and novel SQL injection attacks. In addition, using the chi-square test for feature extraction selects the features most strongly correlated with SQL injection attack detection, which improves the accuracy and performance of the model.

Drawings

FIG. 1 is a flowchart of an SQL injection detection method according to an embodiment of the application;

FIG. 2 is a PNN structure diagram according to an embodiment of the present application;

fig. 3 is a 2x2 grid of confusion matrix for a classifier according to an embodiment of the present application.

Detailed Description

Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Embodiments described herein may be described with reference to plan and/or cross-sectional views with the aid of idealized schematic diagrams of the present disclosure. Accordingly, the example illustrations may be modified in accordance with manufacturing techniques and/or tolerances. Thus, the embodiments are not limited to the embodiments shown in the drawings, but include modifications of the configuration formed based on the manufacturing process. Thus, the regions illustrated in the figures have schematic properties and the shapes of the regions illustrated in the figures illustrate the particular shapes of the regions of the elements, but are not intended to be limiting.

The application provides an SQL injection detection method. The following detailed description is provided with reference to the accompanying drawings of the embodiments of the application.

As shown in fig. 1, an embodiment of the present application provides an SQL injection detection method, including:

step S101, collecting an SQL query data set for training and testing, wherein the SQL query data set comprises query sentences of SQL injection categories and query sentences of non-SQL injection categories;

step S102, feature extraction is carried out on the SQL query data set by using chi-square test, and feature vectors relevant to SQL injection attack detection are selected;

and step S103, training a probabilistic neural network (PNN) model that takes the feature vectors related to SQL injection attack detection as input and outputs the SQL query statement category.

Probabilistic neural networks (PNNs), which are based on Bayesian reasoning, are used for classification and regression tasks, such as predicting prices or ranking search results. Whenever the PNN model creates a piece of text, it can better predict the words that are likely to appear next. This process is similar to how humans learn language and grammar from their environment. A PNN predicts the value of one variable by inferring probabilities in combination with information about other variables. The kernel performs the basic operation in the PNN: it is used to calculate the probability of a particular result given an input. The kernel can be seen as a function that accepts an input and maps it to another dimension (a feature map). In view of these advantages of the PNN, the application extends it to the detection of SQL injection attacks.

In step S101, a test case set of the SQL injection vulnerability is obtained from the WASC, and the sample size of the obtained SQL query data set is large enough and diversified, so that the PNN model can be better trained and tested.

Conventional approaches typically use static rules to detect known SQL injection attacks, but this approach suffers from the limitation of being unable to capture unknown or new attacks. In contrast, deep learning has significant advantages over traditional approaches in preventing SQL injection. The deep learning model can automatically adapt to complex nonlinear relations and changes, so that different skills and strategies adopted by an attacker can be better dealt with. In addition, the deep learning model has the advantages of strong robustness, automation, complex mode capturing and the like, and can provide an SQL injection detection solution with high accuracy and reliability. In summary, deep learning can solve the limitations faced by the traditional method in terms of SQL injection defense, and provide more effective security protection.

The data preprocessing is a key stage of developing a machine learning model, and can clean and prepare original data so as to be suitable for the model.

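In some embodiments, data preprocessing the SQL query dataset comprises: standardizing the data in the training set and removing noise and repeated data; and performing word segmentation processing on the query statements in the training set, splitting each query into tokens, encoding the tokens, and converting them into a numerical form that the model can process.

In step S102, the chi-square value of each feature is calculated as in formula (1), wherein $O_n$ represents the observed frequency, $E_n$ represents the expected frequency, and $n$ is the index of each cell in the contingency table:

$$\chi^2 = \sum_{n} \frac{(O_n - E_n)^2}{E_n} \tag{1}$$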

The preset threshold is a relatively large number, so selecting the features whose chi-square value exceeds the preset threshold means selecting the features with larger chi-square values. The chi-square test examines the independence between a feature and the target class, and evaluates the importance of the feature by calculating the difference between the observed frequency $O_n$ and the expected frequency $E_n$. By selecting the features with larger chi-square values, the features that are most discriminative and informative for SQL injection attack detection can be identified. The feature extraction method based on the chi-square test has the advantages of robustness, computational efficiency and comprehensive test information.

In the embodiment of the application, through using chi-square test, the characteristics related to SQL injection attack detection can be more accurately selected, and the accuracy and performance of the model are improved. By reducing the dimensions of the feature space, we can better understand the relationships between features and provide a more discriminative and informative representation of the features, providing a more valuable input for subsequent model training and evaluation.
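The chi-square selection can be illustrated with a small sketch. It assumes binary features (for example, whether a given token occurs in a query) and binary labels, so each feature yields a 2x2 contingency table. The function names and the threshold of 3.84 (the 5% critical value of the chi-square distribution with one degree of freedom) are assumptions for illustration; the application only states that the threshold is preset.

```python
def chi_square(feature_present, labels):
    """Chi-square statistic of one binary feature against binary labels.

    Both arguments are equal-length lists of 0/1 values.
    """
    n = len(labels)
    # observed[f][c] counts samples with feature value f and class c.
    observed = [[0, 0], [0, 0]]
    for f, c in zip(feature_present, labels):
        observed[f][c] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][c] + observed[1][c] for c in (0, 1)]
    chi2 = 0.0
    for f in (0, 1):
        for c in (0, 1):
            # Expected count if feature and class were independent.
            expected = row_totals[f] * col_totals[c] / n
            if expected > 0:
                chi2 += (observed[f][c] - expected) ** 2 / expected
    return chi2


def select_features(X, y, threshold):
    """Return indices of features whose chi-square value exceeds threshold."""
    scores = [chi_square([row[k] for row in X], y) for k in range(len(X[0]))]
    selected = [k for k, score in enumerate(scores) if score > threshold]
    return selected, scores


# Feature 0 tracks the label exactly; feature 1 is independent of it.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
selected, scores = select_features(X, y, threshold=3.84)
```

In this toy data set only feature 0 passes the threshold, since its observed counts deviate most from the counts expected under independence.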

In some embodiments, the PNN model has four types of layers, an input layer, a mode layer, a summation layer, and an output layer, which cooperate to form an end-to-end classifier in the structural relationship shown in fig. 2. Training the PNN model, comprising:

the input layer acquires input data and distributes the input to neurons;

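the mode layer calculates the similarity between the input data and each training sample according to the Gaussian kernel function corresponding to each neuron, and generates the mode output; the Gaussian kernel function is as in formula (2), wherein $\omega_{ij}$ is the kernel output for the input data $X_{new}$ and the $j$-th training sample $X_{ij}$ of category $C_i$, $d$ is the input data dimension, and $\sigma$ is the smoothing factor:

$$\omega_{ij} = \frac{1}{(2\pi)^{d/2}\sigma^{d}} \exp\left(-\frac{\lVert X_{new} - X_{ij} \rVert^{2}}{2\sigma^{2}}\right) \tag{2}$$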

the summation layer sums the probability outputs belonging to the same category to approximate the conditional probability density of that category; the summation layer formula is shown as formula (3), in which the $\omega_{ij}$ of category $C_i$ are summed to obtain the conditional probability density $p(X_{new} \mid C_i)$ of category $C_i$, i.e. $p(X_{new} \mid C_i)$ represents the probability that the input data $X_{new}$ belongs to category $C_i$:

$$p(X_{new} \mid C_i) = \sum_{X_{ij} \in C_i} \omega_{ij} \tag{3}$$

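the output layer selects the class with the maximum posterior probability as the final output based on the Bayesian rule, and the output layer formula is as formula (4), wherein $NC$ is the number of categories:

$$\mathit{Class}(X_{new}) = \underset{1 \le i \le NC}{\arg\max}\; \{\, p(X_{new} \mid C_i) \,\} \tag{4}$$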
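The four layers can be sketched in a few lines. The sketch assumes the standard normalized Gaussian kernel for formula (2) and a plain per-class sum for formula (3); the function names are hypothetical. Because a plain sum favors the class with more training samples, a common variant divides each class sum by the class size, which is not shown here.

```python
import math


def gaussian_kernel(x_new, x_train, sigma):
    """Mode-layer output for one training sample (assumed form of formula (2))."""
    d = len(x_new)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_new, x_train))
    norm = (2 * math.pi) ** (d / 2) * sigma ** d
    return math.exp(-sq_dist / (2 * sigma ** 2)) / norm


def pnn_predict(x_new, train_X, train_y, sigma):
    """Return (predicted label, per-class density estimates)."""
    densities = {}
    # Summation layer: accumulate kernel outputs per class (formula (3)).
    for x, label in zip(train_X, train_y):
        densities[label] = densities.get(label, 0.0) + gaussian_kernel(x_new, x, sigma)
    # Output layer: pick the class with the largest density (formula (4)).
    return max(densities, key=densities.get), densities


# Toy data: label 0 = non-injection, label 1 = injection (hypothetical vectors).
train_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
train_y = [0, 0, 1, 1]
label_a, _ = pnn_predict([0.2, 0.1], train_X, train_y, sigma=1.0)
label_b, _ = pnn_predict([5.1, 5.4], train_X, train_y, sigma=1.0)
```

Note that a PNN has no iterative weight training: the training samples themselves are stored as mode-layer neurons, and only the smoothing factor needs to be chosen.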

The selection of the smoothing factor in formula (2) is an important step in the probabilistic neural network (PNN). The smoothing factor $\sigma$ controls the spread of the kernel function when the similarity between the input sample and the stored patterns is calculated, and thereby affects both the fitting capacity and the generalization capacity of the model. The quality of the chosen $\sigma$ directly determines how well the PNN fits the training data and how well it generalizes to unknown data. If $\sigma$ is too small, the kernel spread is narrow and the model may overfit the training data, resulting in poor performance on unknown data. If $\sigma$ is too large, the kernel spread is wide and the model may underfit the training data, so that the relationships between samples cannot be captured well.

The particle swarm algorithm is a heuristic optimization algorithm inspired by the collective behavior of groups in nature, such as flocks of birds or schools of fish. It is a global optimization algorithm for finding the optimal or near-optimal solution of a function. With the particle swarm algorithm, the optimal $\sigma$ value can be found so that the probability density function better fits the training data, thereby improving the performance of the model.

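For each $\sigma$ value, a PNN is modeled and the classification accuracy (see formula (7)) is calculated through K-fold cross-validation to obtain a fitness value. The velocity and position of each particle are then updated according to formula (5) and formula (6), wherein $c_0$ is the inertia weight, $c_1$ and $c_2$ are the individual and social learning factors, $r_1$ and $r_2$ are random numbers in $[0, 1]$, and $\mathit{pbest}_m$ and $\mathit{gbest}$ are the individual and global optimal solutions:

$$V_m^{t+1} = c_0 V_m^{t} + c_1 r_1 \left(\mathit{pbest}_m - X_m^{t}\right) + c_2 r_2 \left(\mathit{gbest} - X_m^{t}\right) \tag{5}$$

$$X_m^{t+1} = X_m^{t} + V_m^{t+1} \tag{6}$$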

Conventional tuning of $\sigma$ usually proceeds by trial: a series of $\sigma$ values is selected manually, the model is trained and evaluated for each value, and the best-performing $\sigma$ is finally chosen. Although simple and easy to implement, this approach is time-consuming, and its computational cost can be very high when the search space is large. The particle swarm algorithm, as a heuristic optimization algorithm, can find the optimal $\sigma$ value automatically. By continuously iterating and exchanging information about particle positions, the particle swarm algorithm gradually converges to the optimal solution, so that a suitable $\sigma$ value is found quickly and computation time is greatly reduced. Optimizing $\sigma$ with the particle swarm algorithm not only improves the performance of the model but also searches the hyper-parameter space more efficiently to find the optimal $\sigma$ value. This intelligent optimization method provides a novel and efficient solution for model tuning and strong support for the optimization and improvement of SQL injection detection algorithms.
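A one-dimensional particle swarm search of this kind can be sketched as follows. The update rule is the standard inertia-weight form assumed for formulas (5) and (6); the function name `pso_maximize`, the default parameter values, the clamping of positions to the search interval and the toy fitness function are all assumptions for illustration. In the method described here, the fitness of a position would be the K-fold cross-validation accuracy of a PNN using that position as its smoothing factor.

```python
import random


def pso_maximize(fitness, lower, upper, n_particles=30, n_iters=50,
                 c0=0.7, c1=1.0, c2=1.0, seed=0):
    """Maximize fitness(x) for a scalar x in [lower, upper]."""
    rng = random.Random(seed)
    span = upper - lower
    pos = [rng.uniform(lower, upper) for _ in range(n_particles)]
    vel = [0.1 * rng.uniform(-span, span) for _ in range(n_particles)]
    pbest = pos[:]                                  # best position per particle
    pbest_fit = [fitness(p) for p in pos]
    best = max(range(n_particles), key=lambda m: pbest_fit[m])
    gbest, gbest_fit = pbest[best], pbest_fit[best]  # best position overall
    for _ in range(n_iters):
        for m in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Velocity update (assumed form of formula (5)).
            vel[m] = (c0 * vel[m]
                      + c1 * r1 * (pbest[m] - pos[m])
                      + c2 * r2 * (gbest - pos[m]))
            # Position update (formula (6)), clamped to the search interval.
            pos[m] = min(upper, max(lower, pos[m] + vel[m]))
            fit = fitness(pos[m])
            if fit > pbest_fit[m]:
                pbest[m], pbest_fit[m] = pos[m], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[m], fit
    return gbest, gbest_fit


def toy_fitness(sigma):
    """Stand-in for cross-validated accuracy; peaks at sigma = 0.7."""
    return -(sigma - 0.7) ** 2


best_sigma, best_fit = pso_maximize(toy_fitness, lower=0.01, upper=2.0)
```

Fixing the seed makes the run repeatable, which helps when comparing settings; the global best is updated as soon as any particle improves on it, which is one of several common variants.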

According to the embodiment of the application, the chi-square test allows the features related to SQL injection attack detection to be selected more accurately, improving the accuracy and performance of the model, and the particle swarm optimization method finds the optimal $\sigma$ value more quickly, so that the probability density function better fits the training data, thereby improving the performance of the model.

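In some embodiments, after training the probabilistic neural network (PNN) model, which takes the feature vectors related to SQL injection attack detection as input and outputs the SQL query statement category, the method further includes evaluating the PNN model through the confusion matrix of the classifier with the test set in the SQL query data set as input, which comprises:

determining, by using the confusion matrix of the classifier, the number of statements $TP$ of the first classification result, in which the true value in the test set is positive and the predicted value is positive; the number of statements $FN$ of the second classification result, in which the true value is positive and the predicted value is negative; the number of statements $FP$ of the third classification result, in which the true value is negative and the predicted value is positive; and the number of statements $TN$ of the fourth classification result, in which the true value is negative and the predicted value is negative; wherein a positive true value indicates that SQL injection actually exists, a negative true value indicates that SQL injection actually does not exist, a positive predicted value indicates that SQL injection is predicted to exist, and a negative predicted value indicates that SQL injection is predicted not to exist;

calculating the classification accuracy, precision, recall and F-Measure according to $TP$, $FN$, $FP$ and $TN$, as shown in the following formulas:

$$\mathit{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\mathit{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\mathit{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$F\text{-}\mathit{Measure} = \frac{2 \times \mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} \tag{10}$$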

In an embodiment of the application, the model is evaluated with the help of a confusion matrix. The confusion matrix is a powerful statistical tool for measuring the performance of a binary classifier, and it shows how well the classifier performs on the hold-out set. The hold-out method trains a model and evaluates its performance by dividing the original data set into two mutually exclusive subsets. Typically, one portion (e.g., 70% to 80%) of the original data set is used as the training set, and the remaining portion (20% to 30%) is used as the test set.
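A hold-out split of this kind can be sketched as below. The function name, the default test fraction of 25% and the fixed seed are assumptions chosen for illustration within the 20% to 30% range mentioned above.

```python
import random


def hold_out_split(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle indices once and split into disjoint training and test sets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(samples) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([samples[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


samples = list(range(20))
labels = [s % 2 for s in samples]
(train_X, train_y), (test_X, test_y) = hold_out_split(samples, labels)
```

Shuffling before splitting avoids a test set made up of only one class when the data set is stored sorted by label.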

If the classifier performs poorly or the model is not well understood, the model is likely to perform poorly on the test set. The best way to capture and correct such errors is error analysis: the model predicts a value for each example in the test set, and each predicted value is compared with the actual value. After the test set has been processed, the false positives and false negatives produced by the classifier become visible. Given this information, one may decide to retrain the model or make adjustments to improve the performance of the classifier.

As shown in fig. 3, the confusion matrix of the classifier is typically displayed as cells in a 2x2 grid. The row describes the "true state" of each example in the test set, and the column describes the "predicted state" of the PNN model based on the present application.

TP represents the statement number of the first classification result with the true value in the test set being positive and the predicted value being positive;

FN represents the number of sentences of the second classification result with positive true values and negative predicted values in the test set;

FP represents the number of sentences of the third classification result with the true value being negative and the predicted value being positive in the test set;

TN represents the number of sentences of the fourth classification result whose true value is negative and whose predicted value is negative in the test set.
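The four counts and the metrics of formulas (7) to (10) can be computed as in the following sketch, where label 1 denotes an SQL injection statement and label 0 a non-injection statement. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the application specifies.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, with 1 as the positive class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1          # injection correctly detected
        elif truth == 1 and pred == 0:
            fn += 1          # injection missed
        elif truth == 0 and pred == 1:
            fp += 1          # normal query flagged as injection
        else:
            tn += 1          # normal query correctly passed
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-Measure from the four counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
scores = metrics(*counts)
```

For this toy prediction the counts are TP = 3, FN = 1, FP = 1 and TN = 3, so all four metrics equal 0.75.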

Confusion matrices can help us handle data sets with class imbalance. In SQL injection detection, the number of samples of normal queries and injection attacks is typically unbalanced, which can have an impact on the performance of the model. The confusion matrix can provide detailed positive class and negative class prediction results, help us adjust the threshold value of the model and optimize the classification result, and improve the accuracy and the robustness of the model. The comprehensiveness and intuitiveness of the confusion matrix makes it a powerful tool that can help us comprehensively evaluate and improve the performance of the model. By introducing the confusion matrix, the classification result of the model can be known more accurately, the problem can be found and solved, and a better classification effect can be obtained. The innovation has important significance for improving the effect of SQL injection detection, and provides a new evaluation method and thought for research and application in related fields.

It is to be understood that the above embodiments are merely illustrative of the application of the principles of the present application, but not in limitation thereof. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the application, and are also considered to be within the scope of the application.

Claims

1. An SQL injection detection method, comprising:

training a probabilistic neural network PNN model, wherein feature vectors related to SQL injection attack detection are input of the PNN model, and outputting SQL query statement types;

feature extraction by using chi-square test, selecting feature vectors related to SQL injection attack detection, comprising:

the chi-square test formula is shown as formula (1), wherein $O_n$ represents the observed frequency, $E_n$ represents the expected frequency, and $n$ is the index of each cell in the contingency table;

$$\chi^2 = \sum_{n} \frac{(O_n - E_n)^2}{E_n} \tag{1}$$

selecting the characteristic with the chi-square value larger than a preset threshold value, and determining the characteristic as a characteristic vector related to SQL injection attack detection;

the PNN model has four types of layers, an input layer, a mode layer, a summation layer, and an output layer, which train the PNN model, including:

the input layer acquires input data and distributes the input to neurons;

the Gaussian kernel function is as in formula (2):

$$\omega_{ij} = \frac{1}{(2\pi)^{d/2}\sigma^{d}} \exp\left(-\frac{\lVert X_{new} - X_{ij} \rVert^{2}}{2\sigma^{2}}\right) \tag{2}$$

wherein $\omega_{ij}$ represents the kernel function output for the input data and the $j$-th training sample and has a value between 0 and 1, a larger value indicating a higher similarity; $d$ represents the dimension of the input data, namely the number of features in the feature vector related to SQL injection attack detection obtained by the chi-square test; $\sigma$ is a smoothing factor, and the smoothing factor $\sigma$ in formula (2) is trained by a particle swarm algorithm; $X_{new}$ represents the input data; $X_{ij}$ represents the $j$-th training sample in class $C_i$; and $\lVert X_{new} - X_{ij} \rVert$ represents the Euclidean distance between the input data and the sample data;

the summation layer sums the probability outputs belonging to the same class to approximate the conditional probability density of that class; the summation layer formula is as in formula (3), in which the $\omega_{ij}$ of class $C_i$ are summed to obtain the conditional probability density of class $C_i$, and $p(X_{new} \mid C_i)$ represents the probability that the input data $X_{new}$ belongs to class $C_i$:

$$p(X_{new} \mid C_i) = \sum_{X_{ij} \in C_i} \omega_{ij} \tag{3}$$

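the output layer selects the class with the maximum posterior probability as the final output based on the Bayesian rule, and the output layer formula is as formula (4):

$$\mathit{Class}(X_{new}) = \underset{1 \le i \le NC}{\arg\max}\; \{\, p(X_{new} \mid C_i) \,\} \tag{4}$$

wherein $\mathit{Class}(X_{new})$ is the category assigned to the new sample $X_{new}$ and $NC$ is the number of categories; as applied in SQL injection detection, $NC$ covers two categories, SQL injection statements and non-SQL-injection statements, and $\mathit{Class}(X_{new})$ indicates whether the new sample $X_{new}$ is an SQL injection statement or a non-injection statement;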

training the smoothing factor $\sigma$ in formula (2) by a particle swarm algorithm includes:

defining the search space as one-dimensional, randomly generating the initial positions and velocities of the particles to initialize the particle swarm, the initial position and velocity of the $m$-th particle being $X_m$ and $V_m$ respectively;

for each $\sigma$ value, building a PNN model and calculating the classification accuracy through K-fold cross-validation to obtain a fitness value;

recording the highest fitness value corresponding to each $\sigma$ value as the individual optimal solution $\mathit{pbest}_m$, and recording the highest fitness value among all $\sigma$ values as the global optimal solution $\mathit{gbest}$;

optimizing the fitness function by continuously updating the position and velocity of the particles, the velocity and position update formulas being formula (5) and formula (6):

$$V_m^{t+1} = c_0 V_m^{t} + c_1 r_1 \left(\mathit{pbest}_m - X_m^{t}\right) + c_2 r_2 \left(\mathit{gbest} - X_m^{t}\right) \tag{5}$$

$$X_m^{t+1} = X_m^{t} + V_m^{t+1} \tag{6}$$

wherein $X_m^{t}$, $V_m^{t}$, $X_m^{t+1}$ and $V_m^{t+1}$ are the current position and velocity of the particle and its position and velocity in the next iteration, respectively; $c_0$ is an inertia weight used for controlling how strongly the particle keeps its previous velocity direction, and the value range of $c_0$ is $[0, 1]$; $c_1$ and $c_2$ are the individual learning factor and the social learning factor, and the value range of $c_1$ and $c_2$ is $[0, 1]$; and $r_1$ and $r_2$ are random numbers in the range $[0, 1]$ used for introducing randomness, so that the exploratory power of the algorithm is increased;

iteratively optimizing and returning the $\sigma$ value corresponding to $\mathit{gbest}$ as the optimal solution; the training set is further divided into K subsets when the smoothing factor $\sigma$ is trained, K-fold cross-validation is carried out on the PNN model, and the classification accuracy is calculated to measure the performance of the PNN model under the current $\sigma$;

in each cross verification, training a PNN model by using a training set, and verifying on a verification set in the training set; training the probabilistic neural network PNN model, wherein the feature vector related to SQL injection attack detection is the input of the PNN model, and after outputting the SQL query statement class, the method further comprises the following steps:

evaluating the PNN model by using a test set in the SQL query data set as input through a confusion matrix of the classifier;

the confusion matrix of the classifier is displayed as cells in a 2x2 grid, and the PNN model is evaluated by the confusion matrix of the classifier with the test set in the SQL query data set as input, comprising:

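determining, by using the confusion matrix of the classifier, the number of statements $TP$ of a first classification result, in which the true value in the test set is positive and the predicted value is positive; the number of statements $FN$ of a second classification result, in which the true value is positive and the predicted value is negative; the number of statements $FP$ of a third classification result, in which the true value is negative and the predicted value is positive; and the number of statements $TN$ of a fourth classification result, in which the true value is negative and the predicted value is negative; wherein a positive true value indicates that SQL injection actually exists, a negative true value indicates that SQL injection actually does not exist, a positive predicted value indicates that SQL injection is predicted to exist, and a negative predicted value indicates that SQL injection is predicted not to exist;

calculating the classification accuracy $\mathit{Accuracy}$, the precision $\mathit{Precision}$, the recall $\mathit{Recall}$ and the F-Measure according to $TP$, $FN$, $FP$ and $TN$ by the following four evaluation methods, as shown in the following formulas:

$$\mathit{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\mathit{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\mathit{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$F\text{-}\mathit{Measure} = \frac{2 \times \mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} \tag{10}$$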

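2. The SQL injection detection method according to claim 1, further comprising, before feature extraction by using chi-square test:

performing data preprocessing on the training set in the SQL query data set.

3. The SQL injection detection method according to claim 2, wherein performing data preprocessing on the SQL query dataset comprises:

standardizing the data in the training set, and removing noise and repeated data;

performing word segmentation processing on the query sentences in the training set, splitting the query into Token, encoding the Token, and converting the Token into a numerical form which can be understood by a model.

4. The method of claim 3, wherein the step of word segmentation of the query sentence in the training set to split the query into Token comprises:

splitting the query sentence into a keyword Token, an identifier Token, an operator Token and a constant Token, and converting the query sentence into a lexical unit sequence.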