CN109670307A - An SQL injection recognition method based on CNN and massive logs - Google Patents

An SQL injection recognition method based on CNN and massive logs

Info

Publication number
CN109670307A
Authority
CN
China
Prior art keywords
cnn
sql injection
sample
model
recognition methods
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811473916.3A
Other languages
Chinese (zh)
Inventor
谢鑫
邓金城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Zhidaochuangyu Information Technology Co Ltd
Original Assignee
Chengdu Zhidaochuangyu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Zhidaochuangyu Information Technology Co Ltd
Priority to CN201811473916.3A
Publication of CN109670307A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an SQL injection recognition method based on a CNN and massive logs, comprising the following steps. Step 1: extract URL access records from web logs and extract the query parameter part of each URL. Step 2: preprocess the data extracted in step 1. Step 3: build a CNN and train it on the data obtained in step 2 to obtain a CNN model. Step 4: use the model obtained in step 3 to detect query-string-based SQL injection attacks in real time. The invention can automatically extract the hidden common features of SQL injection, detects SQL injection well and quickly, and has low false negative and false positive rates.

Description

An SQL injection recognition method based on CNN and massive logs
Technical field
The present invention relates to SQL injection recognition methods, and in particular to an SQL injection recognition method based on a CNN and massive logs.
Background art
SQL injection takes many forms, and new methods that bypass traditional detection inevitably appear. Existing detection methods mainly include support vector machines (SVM), regular-expression matching, decision trees and naive Bayes. Of these, regular-expression matching is the most widely used and works comparatively well. However, because SQL injection is so varied, new methods that bypass regular-expression matching inevitably appear, and because the matching relies on existing rules it struggles to recognize new attacks. Traditional machine learning techniques (statistical learning methods such as SVM, naive Bayes and decision trees) require manually supplied features; it is hard to obtain the best features, to judge whether given features are the best, or to tell whether better ones exist. As a result recognition accuracy is low, and large numbers of false positives and false negatives occur.
Summary of the invention
The present invention provides an SQL injection recognition method based on a CNN and massive logs that can extract the hidden common features of SQL injection and identify SQL injection attack traffic that bypasses regular-expression matching.
The technical solution adopted by the present invention is an SQL injection recognition method based on a CNN and massive logs, comprising the following steps:
Step 1: extract URL access records from web logs and extract the query parameter part of each URL;
Step 2: preprocess the data extracted in step 1;
Step 3: build a CNN and train it on the data obtained in step 2 to obtain a CNN model;
Step 4: use the model obtained in step 3 to detect query-string-based SQL injection attacks in real time.
Further, the URL access records extracted in step 1 are labeled records in the web logs that contain a query string and whose request method is GET.
Further, after the query parameter part of the URL is extracted in step 1, the samples are also deduplicated.
Further, the preprocessing in step 2 includes the following procedure:
S1: compute the average length of the samples, i.e. the number of characters;
S2: train word vectors on the samples to obtain a word vector model;
S3: vectorize each sample using the model from S2, converting every sample into a two-dimensional matrix;
S4: convert the two-dimensional matrix obtained in S3 into a matrix of the specified size.
Further, the detailed procedure of step 3 is as follows:
S11: shuffle the positive and negative samples, then use part of them as the training set and the rest as the validation set;
S12: build the CNN and train it;
S13: test the model obtained in S12 on the validation set; stop if the error meets the preset requirement, otherwise adjust the CNN parameters and continue training.
The beneficial effects of the present invention are:
(1) The invention can automatically extract the hidden common features of SQL injection and identify SQL injection attack traffic that bypasses regular-expression matching, and it is fast;
(2) The invention vectorizes single characters, so the vocabulary is small while all the information in the query string is retained, which reduces training difficulty and training cost.
Brief description of the drawings
Fig. 1 is a flow diagram of the present invention.
Fig. 2 is the ROC curve in an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and specific embodiments.
As shown in Fig. 1, an SQL injection recognition method based on a CNN and massive logs comprises the following steps:
Step 1 (data cleaning): extract URL access records from web logs and extract the query parameter part of each URL.
Extract the labeled URL access records in the web logs that contain a query string and whose request method is GET, including both normal requests and SQL injection requests.
Take http://www.abc.com/a/b/?query#fragment as an example of a URL with a query parameter: the query is the string after the question mark and before the pound sign (normally of the form key1=value1&key2=value2...). To ensure that the samples are sufficiently representative, a large amount of data is needed (hundreds of thousands to millions of records); if the volume of labeled logs is too large, it can be sampled uniformly over time.
Then extract the query parameter part of the URL; for http://www.abc.com/a/b/?query#fragment, only the query part is kept.
Both positive and negative samples may contain duplicates; deduplication reduces the redundancy they introduce and improves the quality of the training samples.
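Step 1 can be illustrated with a minimal Python sketch. It assumes the GET request URLs have already been pulled out of the log lines; the function name extract_queries and the sample URLs are hypothetical and are not taken from the patent. The standard-library urlsplit call returns the text between the question mark and the pound sign without percent-decoding it.

```python
from urllib.parse import urlsplit

def extract_queries(urls):
    """Return the de-duplicated raw query strings of the given URLs, in first-seen order."""
    seen = set()
    queries = []
    for url in urls:
        query = urlsplit(url).query  # text between '?' and '#', left percent-encoded
        if query and query not in seen:
            seen.add(query)
            queries.append(query)
    return queries

# Made-up example URLs (assumption, for illustration only).
urls = [
    "http://www.abc.com/a/b/?key1=value1&key2=value2#fragment",
    "http://www.abc.com/a/b/?id=1%27%20OR%20%271%27=%271",
    "http://www.abc.com/c/?key1=value1&key2=value2",  # same query again, dropped
    "http://www.abc.com/static/logo.png",             # no query, skipped
]
print(extract_queries(urls))
```

In practice the label of each record would be carried along with its query string; that bookkeeping is left out here to keep the sketch short.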
Step 2 (data preprocessing): preprocess the data extracted in step 1.
This includes the following procedure:
S1: compute the average length of the samples, i.e. the number of characters. The number of matrix rows is determined from the average sample length and is approximately equal to it; for example, the length of abc is 3.
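As a small illustration of S1, the following Python sketch computes the average sample length and rounds it to obtain a row count. The rounding rule and the sample strings are assumptions; the text only says the row count is approximately equal to the average length.

```python
def average_length(samples):
    """Mean number of characters per sample."""
    return sum(len(s) for s in samples) / len(samples)

samples = ["abc", "id=1", "a=1&b=22"]   # lengths 3, 4 and 8
rows = round(average_length(samples))   # assumed rule: round to the nearest integer
print(rows)  # 5
```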
S2: train word vectors on the samples to obtain a word vector model.
The samples are used to train word vectors with the word2vec method (the word vector method open-sourced by Google), which yields a word vector model that stores the vector corresponding to each character. The vector dimension should preferably not exceed the average sample length and is best chosen as a power of 2, such as 16, 32 or 64. Each single character is treated as one vector unit, and the sample is not URL-decoded: decoding would greatly increase the number of distinct characters (large numbers of Chinese characters would appear, and filtering them out would lose useful information, because some SQL injections exploit Chinese characters).
For example, "a" might be represented as the 5-element vector [0.1,1.2,0,1,3].
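The two choices described for S2, a power-of-two dimension no larger than the average length and single characters as units without URL decoding, can be sketched in Python as follows. The helper names embedding_dim and to_characters are hypothetical, and the word2vec training itself is not shown; the resulting character lists would be passed to whichever word2vec implementation is used.

```python
def embedding_dim(avg_len, minimum=2):
    """Largest power of two that does not exceed avg_len (never below `minimum`)."""
    dim = minimum
    while dim * 2 <= avg_len:
        dim *= 2
    return dim

def to_characters(query):
    """Split a raw, still percent-encoded query string into single-character tokens."""
    return list(query)

print(embedding_dim(40))          # 32
print(to_characters("id=1%27"))   # ['i', 'd', '=', '1', '%', '2', '7']
```

Because the string is not decoded, "%27" stays as three ordinary ASCII characters, which keeps the set of distinct tokens small.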
S3: vectorize each sample using the model from S2, converting every sample into a two-dimensional matrix.
If a character in a sample does not occur in the model, it is filled with a ones vector, i.e. every element of the row vector is 1. Each sample is thus converted into a two-dimensional matrix whose number of columns is the vector dimension chosen for each character and whose number of rows is the number of characters in the sample.
For example, abc is vectorized as follows:
a [0.1,1.2,0,1,3]
b [0,1.6,10,11,2.3]
c [0.8,1.2,0.67,1.983,1.41]
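A minimal Python sketch of S3, assuming the word vector model can be treated as a plain dictionary from character to vector (an assumption; a real word2vec model object would be queried through its own interface). The function name vectorize is hypothetical, and the vectors are the example values given above.

```python
def vectorize(sample, char_vectors, dim):
    """Turn a string into a list of rows, one embedding vector per character."""
    unknown = [1.0] * dim  # characters missing from the model become an all-ones row
    return [list(char_vectors.get(ch, unknown)) for ch in sample]

char_vectors = {
    "a": [0.1, 1.2, 0, 1, 3],
    "b": [0, 1.6, 10, 11, 2.3],
    "c": [0.8, 1.2, 0.67, 1.983, 1.41],
}
matrix = vectorize("abz", char_vectors, dim=5)  # 'z' is not in the model
for row in matrix:
    print(row)
```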
S4: convert the two-dimensional matrix obtained in S3 into a matrix of the specified size.
If the matrix has fewer rows than the specified value, zero vectors are appended at the bottom until it reaches the specified size; if it has more rows than the specified value, the trailing rows are removed so that it becomes a matrix of the specified size.
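The padding and truncation rule of S4 can be sketched in a few lines of Python. The function name fit_rows and its parameters are hypothetical; the matrix is represented as a list of rows.

```python
def fit_rows(matrix, target_rows, dim):
    """Pad with zero rows at the bottom, or drop trailing rows, to reach target_rows."""
    fitted = [list(row) for row in matrix[:target_rows]]                 # truncate if too long
    fitted.extend([0.0] * dim for _ in range(target_rows - len(fitted)))  # pad if too short
    return fitted

print(fit_rows([[1, 2], [3, 4], [5, 6]], target_rows=2, dim=2))  # [[1, 2], [3, 4]]
print(fit_rows([[1, 2]], target_rows=3, dim=2))                  # [[1, 2], [0.0, 0.0], [0.0, 0.0]]
```

After this step every sample has the same shape, which is what allows the samples to be stacked into batches for the CNN.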
Step 3 (model training): build a CNN and train it on the data obtained in step 2 to obtain a CNN model.
The detailed procedure is as follows:
S11: shuffle the positive and negative samples, then use part of them as the training set and the rest as the validation set.
Because the samples are fed into training in batches, shuffling the positive and negative samples prevents an entire training batch from consisting only of positive samples or only of negative samples. Part of the labeled positive and negative samples (roughly 100,000 to 200,000) is randomly selected as the training set, and the rest is used as the validation set.
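S11 can be sketched with the Python standard library. The function name shuffle_and_split, the fixed seed and the tiny labeled list are assumptions for illustration; here label 1 is taken to mean SQL injection and 0 normal, which the patent text does not specify.

```python
import random

def shuffle_and_split(samples, train_size, seed=0):
    """Shuffle labeled samples and split them into (training set, validation set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[:train_size], shuffled[train_size:]

# Made-up (query, label) pairs; 1 = SQL injection, 0 = normal (assumed convention).
labeled = [("id=1", 0), ("q=shoes", 0), ("id=1%27%20OR%201=1", 1), ("u=x%27--", 1)]
train, val = shuffle_and_split(labeled, train_size=3)
print(len(train), len(val))  # 3 1
```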
S12: build the CNN and train it.
In general, the more training data the better, but the training cost rises as well.
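The patent does not specify the network architecture. Purely to illustrate what a convolutional layer does with the character matrix from step 2, the following Python sketch computes the forward pass of a single filter: it slides a kernel covering a few consecutive characters down the matrix, applies ReLU, and keeps the maximum response (global max pooling). The function name, the kernel weights and the input are all made up, and a real model would use a deep learning framework with many filters and trained weights.

```python
def conv_relu_maxpool(matrix, kernel, bias=0.0):
    """Slide a kernel spanning full rows down the matrix, apply ReLU, keep the max.

    Assumes the matrix has at least as many rows as the kernel, which holds
    once every sample has been brought to the specified size.
    """
    k = len(kernel)
    responses = []
    for start in range(len(matrix) - k + 1):
        total = bias
        for row, weights in zip(matrix[start:start + k], kernel):
            total += sum(x * w for x, w in zip(row, weights))
        responses.append(max(0.0, total))  # ReLU
    return max(responses)                  # global max pooling over positions

matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 4 characters, dimension 2
kernel = [[1.0, -1.0], [0.5, 0.5]]                         # one filter over 2 characters
print(conv_relu_maxpool(matrix, kernel))  # 1.5
```

Each filter yields one number per sample; a classifier layer on top of many such numbers would produce the normal or SQL injection decision.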
S13: test the model obtained in S12 on the validation set; stop if the error meets the preset requirement, otherwise adjust the CNN parameters and continue training.
The trained model is tested on the validation set. If the error (accuracy, loss function value and so on) is within the acceptable range, training stops; otherwise training continues after the neural network parameters are adjusted. These parameters include the CNN structure, the number of network layers, the number of training iterations, the size and number of convolution kernels, the pooling function, the classification function and the activation function. A trained model usually reaches a classification accuracy of 99% or more on the validation set.
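The adjust-and-retrain loop of S13 can be sketched in Python as below. The function tune_until_acceptable, the callables train_fn and evaluate_fn, and the configuration dictionaries are hypothetical; the stand-in scores are placeholders chosen only to make the example run, not measured results.

```python
def tune_until_acceptable(train_fn, evaluate_fn, configs, min_accuracy=0.99):
    """Try each configuration in turn; return the first (config, model, accuracy)
    whose validation accuracy reaches min_accuracy, else the best one seen."""
    best = None
    for config in configs:
        model = train_fn(config)
        accuracy = evaluate_fn(model)
        if best is None or accuracy > best[2]:
            best = (config, model, accuracy)
        if accuracy >= min_accuracy:
            break  # error is within the acceptable range, stop adjusting
    return best

# Stand-ins for real training and validation routines (placeholders, not results).
fake_scores = {3: 0.97, 5: 0.992, 7: 0.995}
train_fn = lambda config: config["kernel_size"]
evaluate_fn = lambda model: fake_scores[model]
configs = [{"kernel_size": k} for k in (3, 5, 7)]

config, model, accuracy = tune_until_acceptable(train_fn, evaluate_fn, configs)
print(config, accuracy)  # {'kernel_size': 5} 0.992
```

The 0.99 default mirrors the 99% validation accuracy mentioned above; any other acceptance criterion, such as a loss threshold, could be substituted.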
Step 4: use the model obtained in step 3 to detect query-string-based SQL injection attacks in real time.
Deploying the model in a server environment allows query-string-based SQL injection attacks to be detected in real time; the invention can also be used to examine historical logs.
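A hedged Python sketch of how step 4 might be wired into a request path. The function is_sql_injection, the threshold of 0.5 and the score_fn parameter are assumptions; score_fn stands for the whole pipeline of preprocessing plus the trained CNN, and toy_score below is only a placeholder so that the example runs.

```python
from urllib.parse import urlsplit

def is_sql_injection(url, score_fn, threshold=0.5):
    """Score the raw query string of a GET request; True means 'flag as injection'."""
    query = urlsplit(url).query
    if not query:
        return False  # no query string, nothing to classify
    return score_fn(query) >= threshold

# Placeholder scorer standing in for preprocessing + the trained CNN (hypothetical).
def toy_score(query):
    return 0.9 if "%27" in query else 0.1

print(is_sql_injection("http://www.abc.com/a/b/?id=1%27%20OR%201=1", toy_score))  # True
print(is_sql_injection("http://www.abc.com/a/b/?id=1", toy_score))                # False
```

The same function can be mapped over archived log lines to examine historical traffic.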
Embodiment
The ROC curve of the method is shown in Fig. 2, from which it can be seen that the classification performance of the invention is good. The model built by the invention still detects well SQL injections that look random, i.e. have no obvious visible characteristics, and can therefore bypass traditional regular-expression feature matching (which checks whether the URL contains sensitive strings, which is why so many elaborate bypass techniques exist). SQL injections that traditional regular-expression matching cannot detect can still be detected by the model, and detection is fast.
As can be seen from Fig. 2, the false negative rate and false positive rate of the invention are very low. Testing shows that the accuracy of the invention exceeds the average accuracy achieved in image recognition, the field for which CNNs were originally designed: in a test on 3,000,000 records the accuracy was 99.7%.
ROC curve stands for receiver operating characteristic curve. It is a composite index of sensitivity and specificity for a continuous variable and reveals the relationship between the two graphically. A series of different critical values (thresholds) is set on the continuous variable, the sensitivity and specificity are calculated for each, and a curve is drawn with sensitivity as the ordinate and 1 - specificity (the false positive rate) as the abscissa. The larger the area under the curve, the higher the diagnostic accuracy. On the ROC curve, the point closest to the upper left corner of the plot is the critical value at which both sensitivity and specificity are high.
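The construction of an ROC curve described above can be reproduced with a short pure-Python sketch. The function names are hypothetical and the six labels and scores are toy values invented for illustration; they are not data from the patent's experiments.

```python
def roc_points(labels, scores):
    """Return (false positive rate, sensitivity) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]                # toy ground truth, 1 = positive
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # toy model scores
points = roc_points(labels, scores)
print(points[-1])                # (1.0, 1.0)
print(area_under_curve(points))  # 8/9, about 0.889
```

For these toy values, eight of the nine positive/negative pairs are ranked correctly, which is why the area comes out as 8/9.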
In further testing, 2.11 million positive samples, 1.18 million negative samples and the combined 3.3 million samples (2,117,860 positive and 1,187,504 negative) were tested, and the accuracy was 99.97%, 99.9% and 99.95% respectively.
The terms used in the present invention are as follows:
SQL injection: inserting SQL commands into the query string of a web form submission, an entered domain name or a page request so that the server is ultimately tricked into executing malicious SQL commands. Specifically, it exploits the ability of an existing application to pass (malicious) SQL commands to the back-end database engine for execution: by entering (malicious) SQL statements into a web form, an attacker can obtain the database of a website that has a security vulnerability, instead of the SQL statements being executed as the designer intended.
CNN: convolutional neural network, a type of deep feedforward artificial neural network. The CNN was designed for image recognition problems and imitates the multi-layer process by which humans recognize images.
URL: uniform resource locator.
SQL: structured query language (Structured Query Language).
The present invention is mainly used to identify SQL injection carried out through the URL query string. It performs binary classification of URL query strings into normal and SQL injection. Normal query strings and SQL injection query strings differ in common characteristics, but the differences are not obvious; classification can nevertheless be performed on the basis of these differences.
SQL injection query strings often contain a great deal of redundancy, which matches the ability of a CNN to remove redundancy. The features that separate normal query strings from SQL injection query strings are hidden and not obvious; in other words, good classification features cannot be supplied manually, and poor features lead to a large number of recognition errors. The differences often lie in features that cannot be found by observation, whereas a CNN can automatically extract features that humans cannot recognize, so a CNN can be used to extract the hidden common features.
Existing SQL injection detection mainly relies on regular-expression matching. Regular-expression techniques are accurate and fast but cannot recognize new attacks, and new methods for evading the expressions inevitably appear, such as encoding the URL multiple times. The present invention, by contrast, automatically extracts the hidden common features of SQL injection, identifies SQL injection attack traffic that bypasses regular-expression matching, and is fast. Traditional machine learning methods require manual feature selection; good features are hard to obtain, and feature quality directly affects recognition accuracy. Some deep learning approaches involve cumbersome URL preprocessing that can also lose important information: for example, decoding the URL may produce Chinese characters, and keeping them greatly increases the vocabulary size while removing them loses important feature information for some SQL injections. The present invention vectorizes single characters, so the vocabulary is small yet all the information in the query string is retained; the word2vec method also preserves the combined information between characters, such as positional information, and the small vocabulary reduces training difficulty and training cost.

Claims (5)

1. An SQL injection recognition method based on a CNN and massive logs, characterized by comprising the following steps:
Step 1: extract URL access records from web logs and extract the query parameter part of each URL;
Step 2: preprocess the data extracted in step 1;
Step 3: build a CNN and train it on the data obtained in step 2 to obtain a CNN model;
Step 4: use the model obtained in step 3 to detect query-string-based SQL injection attacks in real time.
2. The SQL injection recognition method based on a CNN and massive logs according to claim 1, characterized in that the URL access records extracted in step 1 are labeled records in the web logs that contain a query string and whose request method is GET.
3. The SQL injection recognition method based on a CNN and massive logs according to claim 1, characterized in that, after the query parameter part of the URL is extracted in step 1, the samples are also deduplicated.
4. The SQL injection recognition method based on a CNN and massive logs according to claim 1, characterized in that the preprocessing in step 2 comprises the following procedure:
S1: compute the average length of the samples, i.e. the number of characters;
S2: train word vectors on the samples to obtain a word vector model;
S3: vectorize each sample using the model from S2, converting every sample into a two-dimensional matrix;
S4: convert the two-dimensional matrix obtained in S3 into a matrix of the specified size.
5. The SQL injection recognition method based on a CNN and massive logs according to claim 1, characterized in that the detailed procedure of step 3 is as follows:
S11: shuffle the positive and negative samples, then use part of them as the training set and the rest as the validation set;
S12: build the CNN and train it;
S13: test the model obtained in S12 on the validation set; stop if the error meets the preset requirement, otherwise adjust the CNN parameters and continue training.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811473916.3A CN109670307A (en) 2018-12-04 2018-12-04 A kind of SQL injection recognition methods based on CNN and massive logs

Publications (1)

Publication Number Publication Date
CN109670307A true CN109670307A (en) 2019-04-23

Family

ID=66143542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811473916.3A Pending CN109670307A (en) 2018-12-04 2018-12-04 A kind of SQL injection recognition methods based on CNN and massive logs

Country Status (1)

Country Link
CN (1) CN109670307A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107908965A (en) * 2017-11-14 2018-04-13 北京知道创宇信息技术有限公司 The detection method and device of doubtful SQL injection type
CN108173708A (en) * 2017-12-18 2018-06-15 北京天融信网络安全技术有限公司 Anomalous traffic detection method, device and storage medium based on incremental learning
CN108259494A (en) * 2018-01-17 2018-07-06 北京邮电大学 A kind of network attack detecting method and device
CN108667816A (en) * 2018-04-19 2018-10-16 重庆邮电大学 A kind of the detection localization method and system of Network Abnormal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MING ZHANG,ET AL.: "A Deep Learning Method to Detect Web Attacks Using a Specially Designed CNN", <ICONIP 2017:NEURAL INFORMATION PROCESSING> *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135166A (en) * 2019-05-08 2019-08-16 北京国舜科技股份有限公司 A kind of detection method and system for the attack of service logic loophole
CN110135166B (en) * 2019-05-08 2021-03-30 北京国舜科技股份有限公司 Detection method and system for service logic vulnerability attack
CN113821791A (en) * 2020-06-18 2021-12-21 中国电信股份有限公司 Method, system, storage medium and apparatus for detecting SQL injection
CN112019497A (en) * 2020-07-10 2020-12-01 上海大学 Word embedding-based multi-stage network attack detection method
CN112019497B (en) * 2020-07-10 2021-12-03 上海大学 Word embedding-based multi-stage network attack detection method
CN113254942A (en) * 2021-05-24 2021-08-13 浙江网商银行股份有限公司 Data processing method, system and device
US20230169180A1 (en) * 2021-11-30 2023-06-01 Oracle International Corporation Detecting security vulnerabilities in queries via an information flow graph
US11860679B2 (en) * 2021-11-30 2024-01-02 Oracle International Corporation Detecting security vulnerabilities in queries via an information flow graph

Similar Documents

Publication Publication Date Title
CN109670307A (en) A kind of SQL injection recognition methods based on CNN and massive logs
Long et al. Understanding membership inferences on well-generalized learning models
CN106096405B (en) A kind of Android malicious code detecting method abstract based on Dalvik instruction
CN110233849B (en) Method and system for analyzing network security situation
Pirscoveanu et al. Analysis of malware behavior: Type classification using machine learning
CN109547423B (en) WEB malicious request deep detection system and method based on machine learning
CN108833409B (en) Webshell detection method and device based on deep learning and semi-supervised learning
CN109005145B (en) Malicious URL detection system and method based on automatic feature extraction
CN107786575A (en) A kind of adaptive malice domain name detection method based on DNS flows
CN108920954B (en) Automatic malicious code detection platform and method
CN109859029A (en) Abnormal application detection method, device, computer equipment and storage medium
CN108200054A (en) A kind of malice domain name detection method and device based on dns resolution
CN108667816A (en) A kind of the detection localization method and system of Network Abnormal
CN109684840A (en) Based on the sensitive Android malware detection method for calling path
CN109194677A (en) A kind of SQL injection attack detection, device and equipment
CN105069355A (en) Static detection method and apparatus for webshell deformation
CN107360152A (en) A kind of Web based on semantic analysis threatens sensory perceptual system
CN108616545A (en) A kind of detection method, system and electronic equipment that network internal threatens
CN110266675A (en) A kind of xss attack automated detection method based on deep learning
CN111324797B (en) Method and device for precisely acquiring data at high speed
US10990672B2 (en) Method and apparatus for obtaining virus library, device, server, and system
CN108229170B (en) Software analysis method and apparatus using big data and neural network
CN108718306A (en) A kind of abnormal flow behavior method of discrimination and device
CN110191096A (en) A kind of term vector homepage invasion detection method based on semantic analysis
CN115361176B (en) SQL injection attack detection method based on FlexUDA model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190423