CN109670307A

CN109670307A - A SQL injection recognition method based on CNN and massive logs

Info

Publication number: CN109670307A
Application number: CN201811473916.3A
Authority: CN
Inventors: 谢鑫; 邓金城
Original assignee: Chengdu Zhidaochuangyu Information Technology Co Ltd
Current assignee: Chengdu Zhidaochuangyu Information Technology Co Ltd
Priority date: 2018-12-04
Filing date: 2018-12-04
Publication date: 2019-04-23

Abstract

The present invention provides a SQL injection recognition method based on CNN and massive logs, comprising the following steps. Step 1: extract URL access records from web logs and extract the query parameter part of each URL. Step 2: preprocess the data extracted in step 1. Step 3: build a CNN and train it on the data obtained in step 2 to obtain a CNN model. Step 4: use the model obtained in step 3 to detect query-string-based SQL injection attacks in real time. The invention automatically extracts the hidden common features of SQL injection, detects SQL injection effectively and quickly, and has low false negative and false positive rates.

Description

A SQL injection recognition method based on CNN and massive logs

Technical field

The present invention relates to SQL injection recognition methods, and in particular to a SQL injection recognition method based on CNN and massive logs.

Background art

SQL injection takes many forms, and new methods that bypass traditional detection inevitably appear. Existing detection methods mainly include support vector machines (SVM), regular-expression matching, decision trees and naive Bayes. Regular-expression matching is the most widely used and works comparatively well. However, because SQL injection takes many forms, new methods that bypass regular-expression matching inevitably appear, and a method based on existing rules has difficulty identifying new attacks. Traditional machine learning techniques (statistical learning methods such as SVM, naive Bayes and decision trees) require manually supplied features. The best features are hard to obtain, and it is hard to judge whether a given feature is the best one or whether a better one exists, so recognition accuracy is low and many false positives and false negatives occur.

Summary of the invention

The present invention provides a SQL injection recognition method based on CNN and massive logs that can extract the hidden common features of SQL injection and identify SQL injection attack traffic that bypasses regular-expression matching.

The technical solution adopted by the present invention is a SQL injection recognition method based on CNN and massive logs, comprising the following steps:

Step 1: extract URL access records from web logs and extract the query parameter part of each URL;

Step 2: preprocess the data extracted in step 1;

Step 3: build a CNN and train it on the data obtained in step 2 to obtain a CNN model;

Step 4: use the model obtained in step 3 to detect query-string-based SQL injection attacks in real time.

Further, the URL access records extracted in step 1 are labeled records in the web logs that contain a query string and whose request method is GET.

Further, after the query parameter part of the URL is extracted in step 1, the samples are also deduplicated.

Further, the preprocessing in step 2 includes the following:

S1: compute the average length of the samples, i.e. the number of characters;

S2: train word vectors on the samples to obtain a word vector model;

S3: vectorize the samples using the model from S2, converting each sample into a two-dimensional matrix;

S4: convert the two-dimensional matrix obtained in S3 into a matrix of the specified size.

Further, step 3 proceeds as follows:

S11: shuffle the positive and negative samples, taking one part of the samples as the training set and the remainder as the test set;

S12: build the CNN and train it;

S13: test the model obtained in S12 on the validation set; stop if the error meets the preset requirement, otherwise adjust the CNN parameters and continue training.

The beneficial effects of the present invention are:

(1) The present invention automatically extracts the hidden common features of SQL injection, identifies SQL injection attack traffic that bypasses regular-expression matching, and is fast;

(2) The present invention vectorizes single characters, so the vocabulary is small, all the information in the query string is retained, and training difficulty and training cost are reduced.

Brief description of the drawings

Fig. 1 is a flow diagram of the present invention.

Fig. 2 is the ROC curve in an embodiment of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the drawings and specific embodiments.

As shown in Fig. 1, a SQL injection recognition method based on CNN and massive logs comprises the following steps:

Step 1: extract URL access records from web logs and extract the query parameter part of each URL (data cleaning).

Labeled URL access records that contain a query string and whose request method is GET are extracted from the web logs, including both normal requests and SQL injection requests.

For example, in http://www.abc.com/a/b/?query#fragment, a URL with a query parameter, the query is the string after the question mark and before the pound sign (normally in the format key1=value1&key2=value2...). To ensure that the samples are sufficiently representative, a large amount of data is needed (hundreds of thousands to millions of records); if the volume of labeled logs is too large, the logs can be sampled uniformly over time.

The query parameter part of the URL is then extracted; for http://www.abc.com/a/b/?query#fragment, only the query part is retained.

Both positive and negative samples may contain repetitions. Deduplication reduces the redundancy caused by repetition and improves the quality of the training samples.
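Step 1 can be sketched in Python as follows. This is a minimal illustration only, assuming each labeled log record has already been reduced to a (url, label) pair; the function names and the label encoding (0 for normal, 1 for SQL injection) are hypothetical and are not taken from the patent.

```python
from urllib.parse import urlsplit


def extract_query(url: str) -> str:
    """Return the raw query part of a URL: the text after '?' and before '#'.

    urlsplit does not percent-decode, so the query is kept exactly as logged.
    """
    return urlsplit(url).query


def build_samples(records):
    """Extract query strings from (url, label) pairs and drop duplicates.

    Records without a query part are skipped; the first occurrence of each
    (query, label) pair is kept, preserving the original order.
    """
    seen = set()
    samples = []
    for url, label in records:
        query = extract_query(url)
        if not query:
            continue
        key = (query, label)
        if key not in seen:
            seen.add(key)
            samples.append(key)
    return samples


# Hypothetical labeled GET records: 0 = normal, 1 = SQL injection.
records = [
    ("http://www.abc.com/a/b/?key1=value1&key2=value2#fragment", 0),
    ("http://www.abc.com/a/b/?key1=value1&key2=value2", 0),  # duplicate query
    ("http://www.abc.com/a/b/?id=1%27%20or%20%271%27=%271", 1),
    ("http://www.abc.com/a/b/", 0),  # no query part, skipped
]
samples = build_samples(records)
```

Deduplicating on the (query, label) pair rather than on the query alone is a design assumption of this sketch; it keeps a query that was labeled both ways visible for later inspection.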

Step 2: preprocess the data extracted in step 1 (data preprocessing).

This includes the following:

S1: compute the average length of the samples, i.e. the number of characters. The number of matrix rows is determined from the average sample length and is approximately equal to it; for example, the length of "abc" is 3.

S2: Word vectors are trained on the samples with the word2vec method (a word vector method open-sourced by Google), yielding a word vector model that stores the vector corresponding to each character. The vector dimension should preferably not exceed the average sample length and is preferably a power of 2, such as 16, 32 or 64. Each single character is treated as one vector unit, and the sample is not URL-decoded: decoding would greatly increase the number of distinct characters (a large number of Chinese characters would appear, and filtering them out would lose useful information, because some SQL injections make use of Chinese characters).

For example, "a" may be represented as the 5-element vector [0.1,1.2,0,1,3].

S3: The samples are vectorized using the model from S2. If a character in a sample does not exist in the model, it is filled with a ones vector, i.e. every element of its row vector is 1. Each sample is thus converted into a two-dimensional matrix whose number of columns is the vector dimension of each character and whose number of rows is the number of characters in the sample.

For example, "abc" is vectorized as follows:

a[0.1,1.2,0,1,3]

b[0,1.6,10,11,2.3]

c[0.8,1.2,0.67,1.983,1.41]

S4: The two-dimensional matrix obtained in S3 is converted into a matrix of the specified size.

If the number of rows is less than the specified value, zero vectors are appended at the bottom until the matrix reaches the specified size; if the number of rows is greater than the specified value, the last rows are removed so that the matrix has the specified size.
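The matrix handling in S1, S3 and S4 can be sketched in plain Python as follows. This is an illustration under stated assumptions: the character table stands in for a trained word2vec model and reuses the 5-element example vectors for "a", "b" and "c" given above (the patent recommends a power-of-2 dimension in practice), and the function names are hypothetical.

```python
def average_length(samples):
    """S1: average number of characters per sample, rounded to an integer."""
    return round(sum(len(s) for s in samples) / len(samples))


def vectorize(sample, char_vectors, dim):
    """S3: map each character to its vector; unseen characters become all ones."""
    unknown = [1.0] * dim
    return [list(char_vectors.get(ch, unknown)) for ch in sample]


def fit_rows(matrix, rows, dim):
    """S4: pad with zero vectors at the bottom, or drop trailing rows."""
    fitted = matrix[:rows]
    fitted += [[0.0] * dim for _ in range(rows - len(fitted))]
    return fitted


DIM = 5
# Stand-in for the trained model: one vector per known character.
char_vectors = {
    "a": [0.1, 1.2, 0, 1, 3],
    "b": [0, 1.6, 10, 11, 2.3],
    "c": [0.8, 1.2, 0.67, 1.983, 1.41],
}

rows = average_length(["abc", "abcab", "a"])  # (3 + 5 + 1) / 3 = 3
short = fit_rows(vectorize("ab", char_vectors, DIM), rows, DIM)     # padded
long_ = fit_rows(vectorize("abcab", char_vectors, DIM), rows, DIM)  # truncated
unseen = vectorize("z", char_vectors, DIM)                          # ones vector
```

Every sample ends up as a matrix of `rows` rows and `DIM` columns, which gives the CNN a fixed input size regardless of the original query length.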

Step 3: build a CNN and train it on the data obtained in step 2 to obtain a CNN model (model training).

The detailed process is as follows:

S11: shuffle the positive and negative samples, taking one part of the samples as the training set and the remainder as the validation set.

Because all samples are fed into training in batches, shuffling the positive and negative samples prevents an entire training batch from consisting only of positive samples or only of negative samples. A portion of the labeled positive and negative samples (roughly 100,000 to 200,000) is randomly selected as the training set, and the remainder is used as the validation set.

S12: build the CNN and train it.

In general, the more training data the better, but the training cost also rises.

S13: The trained model is tested on the validation set. If the error (accuracy, loss function value, etc.) is within an acceptable range, training stops; otherwise training continues after the neural network parameters are adjusted. These parameters include the CNN structure, the number of layers, the number of training iterations, the convolution kernel size and number, the pooling function, the classification function, the activation function, and so on. A trained model usually achieves a classification accuracy of 99% or more on the validation set.
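The control flow of S11 to S13 can be sketched as follows. The patent does not specify a CNN architecture or framework, so the training call is left as a caller-supplied function; `shuffle_and_split`, `train_until_acceptable`, the `kernel_size` key and the accuracy values are all hypothetical placeholders used only to make the sketch runnable, not results from the patent.

```python
import random


def shuffle_and_split(samples, train_size, seed=0):
    """S11: shuffle labeled samples, then split into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


def train_until_acceptable(configs, train_and_validate, min_accuracy):
    """S12-S13: train with each candidate configuration in turn and stop at the
    first one whose validation accuracy meets the preset requirement.

    Returns (config, accuracy), or (None, best_accuracy) if none qualifies.
    """
    best = 0.0
    for config in configs:
        accuracy = train_and_validate(config)
        if accuracy >= min_accuracy:
            return config, accuracy
        best = max(best, accuracy)
    return None, best


# Ten placeholder samples with alternating labels.
samples = [("q%d" % i, i % 2) for i in range(10)]
train_set, val_set = shuffle_and_split(samples, train_size=8)

# Stand-in for real CNN training: a lookup of pretend validation accuracies.
pretend_accuracy = {3: 0.97, 5: 0.992, 7: 0.995}
configs = [{"kernel_size": k} for k in (3, 5, 7)]
chosen, acc = train_until_acceptable(
    configs, lambda cfg: pretend_accuracy[cfg["kernel_size"]], min_accuracy=0.99
)
```

In a real setup `train_and_validate` would build the CNN from the configuration, train it on `train_set` in shuffled batches, and return the accuracy measured on `val_set`.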

Step 4: Deploying the model in a server environment allows SQL injection attacks based on query strings to be detected in real time; the invention can also be used to examine historical logs.

Embodiment

The ROC curve of the method of the present invention is shown in Fig. 2, from which it can be seen that the classification performance is good. The model built by the invention still detects well those SQL injections that look random, i.e. have no obvious visible features, and that can bypass traditional regular-expression feature matching (which checks whether the URL contains sensitive strings, so all kinds of elaborate bypass tricks exist). SQL injections that traditional regular-expression matching cannot detect can still be detected by the model, and detection is fast.

It can be seen from Fig. 2 that the false negative rate and false positive rate of the present invention are very low. Testing shows that the accuracy of the invention exceeds the average accuracy in image recognition, the field for which CNNs were originally designed; in a test on 3 million records the accuracy was 99.7%.

ROC curve stands for receiver operating characteristic curve. It is a comprehensive indicator reflecting the sensitivity and specificity of a continuous variable, and it reveals the relationship between sensitivity and specificity graphically. Multiple different critical values (thresholds) are set on the continuous variable, a sensitivity and a specificity are computed for each, and a curve is drawn with sensitivity as the ordinate and 1 - specificity as the abscissa. The larger the area under the curve, the higher the diagnostic accuracy. On the ROC curve, the point closest to the upper left corner of the plot is the critical value at which both sensitivity and specificity are high.
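How such a curve is computed can be sketched as follows, plotting sensitivity against the false positive rate (1 - specificity), which is the conventional ROC abscissa. The scores and labels are made-up illustration values, not data from the patent, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs, one per threshold.

    A sample is predicted positive when its score is >= the threshold.
    labels use 1 for positive (SQL injection) and 0 for negative (normal).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up model scores for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

An area close to 1 means that positive samples are almost always scored above negative samples, which corresponds to a curve hugging the upper left corner.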

In further tests on about 2.11 million positive samples, about 1.18 million negative samples, and about 3.3 million samples in total (2,117,860 positive and 1,187,504 negative), the accuracy was 99.97%, 99.9% and 99.95% respectively.
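As a consistency check on these figures, the combined accuracy can be recomputed as the sample-weighted average of the two per-class accuracies. The sketch below assumes the reported percentages are rounded values, so only approximate agreement is expected.

```python
positives, negatives = 2_117_860, 1_187_504
acc_pos, acc_neg = 0.9997, 0.999  # reported per-class accuracies (rounded)

total = positives + negatives
# Expected number of correctly classified samples across both classes.
correct = positives * acc_pos + negatives * acc_neg
combined = correct / total
```

The weighted average comes out at roughly 99.94%, which agrees with the reported 99.95% to within the rounding of the per-class figures.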

The terms used in the present invention are as follows:

SQL injection: inserting SQL commands into a web form submission or into the query string of a domain name or page request, ultimately tricking the server into executing malicious SQL commands. Specifically, it exploits the ability of an existing application to pass (malicious) SQL commands to the back-end database engine for execution: by entering (malicious) SQL statements in a web form, an attacker can obtain the database of a website that has a security vulnerability, instead of the SQL statements being executed as the designer intended.

CNN: a convolutional neural network (Convolutional Neural Network) is a kind of deep feedforward artificial neural network. The CNN was designed for image recognition problems and imitates the multi-layer process by which humans recognize images.

URL: Uniform Resource Locator.

SQL: Structured Query Language.

The present invention is mainly used to identify SQL injection carried out through the URL query string. It performs binary classification of URL query strings into normal and SQL injection. Normal query strings and SQL injection query strings differ in common features that are not obvious, and classification can be performed on the basis of these differences.

The query strings of SQL injection often contain a great deal of redundancy, which matches the ability of CNNs to remove redundancy. The features that distinguish normal query strings from SQL injection query strings are hidden and not obvious; that is, good classification features cannot be supplied manually, and bad features lead to a large number of recognition errors. The differences often lie in features that cannot be obtained by observation. A CNN can automatically extract features that humans cannot recognize, so a CNN can be used to extract the hidden common features.

Existing SQL injection detection mainly uses regular-expression matching. Regular-expression matching has high recognition accuracy and is fast, but it cannot identify new attacks, and new bypass methods that evade the regular expressions inevitably appear, for example multiple rounds of URL encoding. The present invention, in contrast, automatically extracts the hidden common features of SQL injection, identifies SQL injection attack traffic that bypasses regular-expression matching, and is fast. Traditional machine learning methods require manual feature selection; good features are hard to obtain, and feature quality directly affects recognition accuracy. Some deep learning approaches require cumbersome URL preprocessing and may also lose important information. For example, when URLs are decoded, Chinese characters may appear: keeping them greatly increases the vocabulary size, while removing them loses important feature information for some SQL injections. The present invention vectorizes single characters, so the vocabulary is small yet all the information in the query string is retained; the word2vec method also preserves relational information between characters, such as position information, and the small vocabulary reduces training difficulty and training cost.
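The effect of skipping URL decoding can be illustrated with a short sketch. The query string below is a hypothetical example: left undecoded, it consists only of ASCII characters, so the character vocabulary stays small, whereas decoding introduces Chinese characters.

```python
from urllib.parse import unquote

# Hypothetical query string; %E5%BC%A0%E4%B8%89 is the UTF-8 encoding of 张三.
raw = "name=%E5%BC%A0%E4%B8%89&id=1"
decoded = unquote(raw)

# One vector unit per character of the undecoded query, as the method describes.
raw_units = list(raw)
raw_is_ascii = all(ord(ch) < 128 for ch in raw_units)
decoded_has_non_ascii = any(ord(ch) > 127 for ch in decoded)
```

Because percent-encoding only ever produces "%" followed by hexadecimal digits, the undecoded form keeps every byte of the original input while drawing on a limited set of characters.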

Claims

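1. A SQL injection recognition method based on CNN and massive logs, characterized by comprising the following steps:

Step 1: extract URL access records from web logs and extract the query parameter part of each URL;

Step 2: preprocess the data extracted in step 1;

Step 3: build a CNN and train it on the data obtained in step 2 to obtain a CNN model;

Step 4: use the model obtained in step 3 to detect query-string-based SQL injection attacks in real time.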

2. The SQL injection recognition method based on CNN and massive logs according to claim 1, characterized in that the URL access records extracted in step 1 are labeled records in the web logs that contain a query string and whose request method is GET.

3. The SQL injection recognition method based on CNN and massive logs according to claim 1, characterized in that after the query parameter part of the URL is extracted in step 1, the samples are also deduplicated.

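4. The SQL injection recognition method based on CNN and massive logs according to claim 1, characterized in that the preprocessing in step 2 includes the following:

S1: compute the average length of the samples, i.e. the number of characters;

S2: train word vectors on the samples to obtain a word vector model;

S3: vectorize the samples using the model from S2, converting each sample into a two-dimensional matrix;

S4: convert the two-dimensional matrix obtained in S3 into a matrix of the specified size.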

5. The SQL injection recognition method based on CNN and massive logs according to claim 1, characterized in that step 3 proceeds as follows:

S11: shuffle the positive and negative samples, taking one part of the samples as the training set and the remainder as the test set;

S12: build the CNN and train it;

S13: test the model obtained in S12 on the validation set; stop if the error meets the preset requirement, otherwise adjust the CNN parameters and continue training.