CN114117419A

CN114117419A - Template injection attack detection method, device, equipment and storage medium

Info

Publication number: CN114117419A
Application number: CN202111340270.3A
Authority: CN
Inventors: 牛彬
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2022-03-01

Abstract

The invention belongs to the technical field of injection attack detection and provides a method, device, equipment and storage medium for detecting template injection attacks. The method comprises the following steps: collecting sample data and processing the collected data to generate a word set vector; inputting the generated word set vector and the corresponding length matrix into a hidden Markov model for training to obtain a template injection detection model; processing the sample data to be detected and inputting the processed data into the template injection detection model to obtain the probability of the corresponding parameters; and comparing the obtained probability with a probability threshold set when the template injection detection model was trained, and judging parameters whose probability is lower than the threshold to be abnormal. Unknown template injection payloads are judged by the established model, with high execution speed and high accuracy.

Description

Template injection attack detection method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of injection attack detection, in particular to a method, a device, equipment and a storage medium for detecting template injection attack.

Background

The essence of an injection vulnerability is that the server accepts user input and, because that input is not filtered or is filtered too loosely, executes code that the user has spliced in, which leads to various injection attacks.

Server-side template injection (SSTI, or template injection for short) refers to the vulnerabilities that arise when malicious user input is spliced into a server-side template. A template engine (here, one used for Web development) exists to separate the user interface from business data (content); it generates documents in a specific format, and a template engine for a website generates standard HTML documents. By separating the interface from the data and the business code from the logic code of the website program, a template engine greatly improves development efficiency, and a good design makes code reuse easier. SSTI exploits existing website template engines, such as jinja2, mako, tornado and django in Python: when these frameworks call rendering functions to generate HTML without checking the legality of user input, SSTI problems occur.

For the detection of template injection attacks, manual methods are slow and inefficient, and existing automatic detection tools have high false positive and false negative rates.

Disclosure of Invention

To address the problems that manual detection of template injection attacks is slow and inefficient and that existing automatic detection tools have high false positive and false negative rates, the invention provides a template injection attack detection method based on a hidden Markov model (HMM). An HMM is a statistical model used to describe a Markov process with hidden, unknown parameters.

The technical scheme of the invention is as follows:

in a first aspect, a technical solution of the present invention provides a method for detecting a template injection attack, including the steps of:

collecting sample data, wherein the sample data comprises black sample data and white sample data; the black sample data is a uniform resource locator with template injection attack, and the white sample data is a uniform resource locator without template injection attack;

processing the collected data to generate a word set vector;

inputting the generated word set vector and the corresponding length matrix into a hidden Markov model for training to obtain a template injection detection model;

performing data processing on the sample data to be detected, and inputting the processed data into the template injection detection model to obtain the probability of the corresponding parameters;

and comparing the obtained probability with a probability threshold set when the template injection detection model was trained, and judging parameters whose probability is lower than the threshold to be abnormal.

Machine learning builds a model from samples with high accuracy, strong fuzzy-judgment capability and high execution speed. The invention collects a large number of template injection attack samples, fits a model automatically with the HMM algorithm, and then calls the trained model to judge unknown template injection payloads, so execution is fast and accuracy is high.

Preferably, the step of collecting sample data comprises:

crawling a uniform resource locator from a web log as black sample data through a crawler tool;

and crawling a normal website through a crawler tool to obtain the uniform resource locator as white sample data.

The collected data is used to build the template injection detection model; the more data is collected, the more template injection types it covers and the more accurate the resulting model is.

Preferably, the step of processing the collected data to generate a word set vector comprises:

performing uniform resource locator decoding on the collected data;

generalizing the decoded uniform resource locator parameters;

and extracting parameters of the uniform resource locator and generating a word set vector through lexical segmentation.

Because the collected data may contain duplicates, simple deduplication is required after collection to improve the usability of the data.

Preferably, the step of generalizing the decoded uniform resource locator parameter includes:

replacing the part before the uniform resource locator parameter with U;

replacing all numbers in the uniform resource locator with O;

the comment section is deleted.

In order to reduce vector space and improve calculation speed and efficiency, the uniform resource locator parameters are subjected to generalization processing.

Preferably, the step of extracting parameters of the uniform resource locator and generating a word set vector by lexical segmentation comprises:

extracting parameters of the uniform resource locator;

constructing a vocabulary list by the extracted uniform resource locator parameters;

and traversing the vocabulary table to perform duplication removal processing to obtain a word set vector.

Through lexical segmentation, the attack payloads can be serialized into observation sequences.

In a second aspect, the technical solution of the present invention provides a device for detecting template injection attack, including a collection module, a data processing module, a test training module, a template verification module, and a result processing module;

the collecting module is used for collecting sample data, wherein the sample data comprises black sample data and white sample data; the black sample data is a uniform resource locator with template injection attack, and the white sample data is a uniform resource locator without template injection attack;

the data processing module is used for processing the collected data to generate a word set vector;

the test training module is used for inputting the generated word set vector and the corresponding length matrix into a hidden Markov model for training to obtain a template injection detection model;

the template verification module is used for performing data processing on the sample data to be detected and inputting the processed data into the template injection detection model to obtain the probability of the corresponding parameters;

and the result processing module is used for comparing the obtained probability with a probability threshold set when the template injection detection model was trained, and judging parameters whose probability is lower than the threshold to be abnormal.

Preferably, the collecting module is specifically configured to crawl a uniform resource locator from the web log as black sample data through a crawler tool; and crawling a normal website through a crawler tool to obtain the uniform resource locator as white sample data.

Preferably, the data processing module comprises a decoding unit, a generalization processing unit and a lexical segmentation unit;

the decoding unit is used for performing uniform resource locator decoding on the collected data;

the generalization processing unit is used for generalizing the decoded uniform resource locator parameter;

and the lexical division unit is used for extracting parameters of the uniform resource locator and generating a word set vector through lexical division.

Preferably, the generalization processing unit is specifically configured to replace the part before the uniform resource locator parameters with U, replace all numbers in the uniform resource locator with O, and delete the comment section.

Preferably, the lexical segmentation unit comprises a parameter extraction submodule, a vocabulary generation submodule and a word set vector generation submodule;

the parameter extraction submodule is used for extracting parameters of the uniform resource locator;

the vocabulary generation submodule is used for forming the extracted uniform resource locator parameters into a vocabulary;

and the word set vector generation submodule is used for traversing the vocabulary table to perform duplication removal processing to obtain a word set vector.

In a third aspect, the present invention further provides a computer device, including a processor and a memory, where the processor and the memory complete communication with each other through a bus; the memory stores program instructions executable by the processor, the processor invoking the program instructions capable of performing the method of detecting a template injection attack as described in the first aspect.

In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method for detecting a template injection attack according to the first aspect.

According to the technical scheme, the invention has the following advantages: the invention provides a template injection attack detection method based on the HMM algorithm. Machine learning builds a model from samples with high accuracy, strong fuzzy-judgment capability and high execution speed. A large number of template injection attack samples are collected, the HMM algorithm fits the model automatically, and the trained model is then called to judge unknown template injection payloads, so execution is fast and accuracy is high.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Therefore, compared with the prior art, the invention has prominent substantive features and remarkable progress, and the beneficial effects of the implementation are also obvious.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that those skilled in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.

FIG. 2 is a schematic block diagram of an apparatus of one embodiment of the present invention.

Detailed Description

For the detection of template injection attacks, manual methods are slow and inefficient. Existing automatic detection tools mostly adopt traditional methods, whose main defect is high false positive and false negative rates. The invention provides a template injection attack detection method based on the HMM algorithm. Machine learning builds a model from samples with high accuracy, strong fuzzy-judgment capability and high execution speed. The invention collects a large number of template injection attack samples, fits a model automatically with the HMM algorithm, and then calls the trained model to judge unknown template injection payloads, so execution is fast and accuracy is high.

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, an embodiment of the present invention provides a method for detecting a template injection attack, including the following steps:

step 1, collecting sample data, wherein the sample data comprises black sample data and white sample data; the black sample data is a uniform resource locator with template injection attack, and the white sample data is a uniform resource locator without template injection attack;

step 2, processing the collected data to generate a word set vector;

step 3, inputting the generated word set vector and the corresponding length matrix into a hidden Markov model for training to obtain a template injection detection model;

step 4, performing data processing on the sample data to be detected, and inputting the processed data into the template injection detection model to obtain the probability of the corresponding parameters;

step 5, comparing the obtained probability with a probability threshold set when the template injection detection model was trained, and judging parameters whose probability is lower than the threshold to be abnormal.

In some embodiments, the data collected in step 1 is used for machine learning to build the template injection detection model. The sample data mainly comprises black samples and white samples. A black sample is a url containing a template injection attack, obtained by crawling web logs with a Python crawler tool and by day-to-day accumulation; a white sample is a url without a template injection attack, obtained by crawling normal websites with a Python crawler tool. That is, uniform resource locators are crawled from web logs through a crawler tool as black sample data, and normal websites are crawled through a crawler tool to obtain uniform resource locators as white sample data.

The collected data is used to build the template injection detection model; the more data is collected, the more template injection types it covers and the more accurate the resulting model is.
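As a rough illustration of the collection step, the sketch below pulls request URLs out of web log lines and removes exact duplicates. The patent does not describe the log format or the crawler; the access-log layout, the `REQUEST_RE` pattern and the `urls_from_log` helper are assumptions made only for this example.

```python
import re

# Assumed log layout: the request appears as "METHOD <url> HTTP/x.y" in quotes.
REQUEST_RE = re.compile(r'"(?:GET|POST|PUT|DELETE|HEAD) (\S+) HTTP/[\d.]+"')


def urls_from_log(lines):
    """Extract request URLs from log lines, dropping exact duplicates in order."""
    found = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            found.append(match.group(1))
    # dict.fromkeys keeps the first occurrence of each URL.
    return list(dict.fromkeys(found))


sample_lines = [
    '203.0.113.7 - - [12/Nov/2021:10:00:00 +0800] "GET /page?name=%7B%7B5*5%7D%7D HTTP/1.1" 200 512',
    '203.0.113.7 - - [12/Nov/2021:10:00:01 +0800] "GET /page?name=%7B%7B5*5%7D%7D HTTP/1.1" 200 512',
    '198.51.100.2 - - [12/Nov/2021:10:00:02 +0800] "GET /index.html HTTP/1.1" 200 1024',
]
print(urls_from_log(sample_lines))
```

Labelling each URL as a black or white sample still has to happen separately; this sketch only covers extraction and deduplication.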

Analysis of the collected uniform resource locator information reveals the following characteristics. The samples are generally URL-encoded, and some may be encoded multiple times, so the uniform resource locator must be decoded cyclically. The path structures of the uniform resource locators are basically consistent; although the specific path information differs, it consists of letters, digits and special characters and can be generalized. Template injection attacks typically start from basic objects such as strings, lists, tuples, dictionaries and sets to obtain a class, obtain the base class object from that class, and obtain sensitive class objects from the base class object. Different templates have fixed syntax, for example {{%s}} for jinja2 (e.g., {{5*5}}) and ${%s} for mako, and attack statements usually contain __class__, __subclasses__, __init__, popen, read and similar terms, which can serve as detection terms. For example: [].__class__.__bases__[0].__subclasses__()[40]('/etc/passwd').read().

In some embodiments, in step 2 of processing the collected data to generate the word set vector, the data processing includes url decoding, url parameter generalization and url lexical segmentation to obtain the word set vector. The specific steps include:

step 21: performing uniform resource locator decoding on the collected data;

Because sample data may be url-encoded, url decoding is performed first to improve the accuracy of url parameter extraction; it can be implemented with functions such as unquote in Python. For example, the following template injection statement is clearly url-encoded:

%7B%25%20for%20c%20in%20%5B%5D.__class__.__base__.__subclasses__()%20%25%7D%0A;

after decoding: {% for c in [].__class__.__base__.__subclasses__() %};
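The cyclic decoding described above can be sketched with `urllib.parse.unquote` from the Python standard library. The helper name `decode_url_fully` and the `max_rounds` cap are assumptions for illustration; the patent only states that decoding must be repeated for samples that were encoded more than once.

```python
from urllib.parse import unquote


def decode_url_fully(raw, max_rounds=10):
    """Apply URL decoding repeatedly until the string stops changing."""
    current = raw
    for _ in range(max_rounds):
        decoded = unquote(current)
        if decoded == current:
            break
        current = decoded
    return current


encoded = "%7B%25%20for%20c%20in%20%5B%5D.__class__.__base__.__subclasses__()%20%25%7D%0A"
print(decode_url_fully(encoded))

# A doubly encoded payload ("%257B" -> "%7B" -> "{") needs two rounds.
print(decode_url_fully("%257B%257B5*5%257D%257D"))
```

Decoding until a fixed point is a simple stopping rule, but it can over-decode data that legitimately contains a literal percent sequence, so the cap on rounds is a design choice rather than part of the described method.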

step 22: generalizing the decoded uniform resource locator parameters;

To reduce the vector space and improve calculation speed and efficiency, the uniform resource locator parameters are generalized according to the following rules: the part before the uniform resource locator parameters is replaced with U; all numbers in the uniform resource locator are replaced with O; and the comment section is deleted.
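The three generalization rules can be sketched as follows. Several details are assumptions, because the patent does not specify them: the parameters are taken to start after the first "?", each run of digits is collapsed to a single O, and comments are assumed to use the Jinja2-style `{# ... #}` or C-style `/* ... */` syntax. The name `generalize` is hypothetical.

```python
import re

# Assumed comment syntaxes: Jinja2-style {# ... #} and C-style /* ... */.
COMMENT_RE = re.compile(r"\{#.*?#\}|/\*.*?\*/", re.DOTALL)
# Assumption: a whole run of digits becomes one "O".
NUMBER_RE = re.compile(r"\d+")


def generalize(url):
    """Replace the part before the parameters with U, numbers with O, drop comments."""
    head, sep, params = url.partition("?")
    if not sep:
        # No "?" found: treat the input as a bare parameter string.
        params = url
    params = COMMENT_RE.sub("", params)
    params = NUMBER_RE.sub("O", params)
    return "U?" + params if sep else params


print(generalize("http://example.com/page?name={{5*5}}{# note 1 #}"))
print(generalize("id=40&x=/*c*/abc123"))
```

Comments are removed before digits are replaced so that numbers inside a comment never reach the output.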

Step 23: extracting parameters of the uniform resource locator and generating a word set vector through lexical segmentation. Through lexical segmentation, the attack payload can be serialized into an observation sequence. The word set model and the bag-of-words model are data processing models commonly used in machine learning to characterize string data. Specifically, the url parameters are extracted; because different url parameters are separated by spaces, they can be told apart by splitting on spaces. The extracted url parameters form a vocabulary table, and the word set is obtained by traversing the vocabulary table and removing duplicates. Specifically, step 23 of extracting parameters of the uniform resource locator and generating a word set vector through lexical segmentation includes:

step 231: extracting parameters of the uniform resource locator;

step 232: constructing a vocabulary list by the extracted uniform resource locator parameters;

step 233: and traversing the vocabulary table to perform duplication removal processing to obtain a word set vector.
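Steps 231 to 233 can be sketched as below. The tokenizer is an assumption: the patent speaks only of lexical segmentation, so the `TOKEN_RE` pattern (template delimiters, identifiers, then any other single non-space character) and the helper names are hypothetical. The final `encode` step, which maps tokens to word-set indices to form the observation sequence, is likewise one plausible reading.

```python
import re

# Assumed token pattern: template delimiters, identifiers, then any other single character.
TOKEN_RE = re.compile(r"\{\{|\}\}|\{%|%\}|\$\{|[A-Za-z_]\w*|\S")


def extract_params(generalized_url):
    """Step 231: return the parameter values of a URL such as 'U?a=..&b=..'."""
    _, _, query = generalized_url.partition("?")
    values = []
    for pair in query.split("&"):
        _, sep, value = pair.partition("=")
        if sep and value:
            values.append(value)
    return values


def tokenize(value):
    """Split one parameter value into tokens; whitespace is dropped."""
    return TOKEN_RE.findall(value)


def build_word_set(token_lists):
    """Steps 232-233: build the vocabulary, then deduplicate in first-seen order."""
    vocabulary = [token for tokens in token_lists for token in tokens]
    return list(dict.fromkeys(vocabulary))


def encode(tokens, word_set):
    """Map tokens to their word-set indices to form an observation sequence."""
    index = {word: i for i, word in enumerate(word_set)}
    return [index[token] for token in tokens]


url = "U?name={{O*O}}&q={% for c in [].__class__ %}"
token_lists = [tokenize(value) for value in extract_params(url)]
word_set = build_word_set(token_lists)
sequences = [encode(tokens, word_set) for tokens in token_lists]
print(word_set)
print(sequences)
```

As written, `encode` raises a `KeyError` for a token that is not in the word set, so a detector built this way would need a policy for unseen tokens at detection time.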

Regarding step 3, it should be further explained that hmmlearn is an HMM implementation for Python, a project that was split out of scikit-learn. The inputs are the word set vector M obtained through lexical segmentation and the corresponding length matrix M_len. M_len is needed because the parameter samples may differ in length, so the lengths must be supplied separately. Training with hmmlearn yields the template injection detection model.
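A minimal sketch of the training call, assuming the encoded observation sequences are lists of integer indices. The patent names hmmlearn but not the model class or its settings, so `GaussianHMM`, `n_states=3`, `covariance_type="full"` and `n_iter=100` are assumptions, as are the helper names. The point of the example is the data layout: all sequences are stacked into one column M, and M_len tells hmmlearn where each sequence ends.

```python
def build_training_matrix(sequences):
    """Stack encoded sequences into one column M and record each length in M_len."""
    M = [[float(index)] for seq in sequences for index in seq]
    M_len = [len(seq) for seq in sequences]
    return M, M_len


def train_detector(sequences, n_states=3, n_iter=100):
    """Fit an HMM on the stacked sequences (model class and settings are assumed)."""
    # numpy and hmmlearn are third-party packages; they are imported here so that
    # build_training_matrix can be used without them installed.
    import numpy as np
    from hmmlearn import hmm

    M, M_len = build_training_matrix(sequences)
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="full", n_iter=n_iter)
    model.fit(np.array(M), lengths=M_len)
    return model


M, M_len = build_training_matrix([[0, 1, 2, 1, 3], [4, 5]])
print(M_len)   # lengths of the individual sequences
print(len(M))  # total number of observations across all sequences
```

With a realistic number of training sequences, `train_detector(sequences)` returns a fitted model, and `model.score(X)` then returns the log-likelihood of a new observation sequence `X` shaped like M.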

In step 4, the sample data to be detected is processed and the processed data is input into the detection model to obtain the probability of the corresponding parameters. The processing of the sample to be detected likewise comprises url decoding, url parameter generalization and url lexical segmentation, and can be implemented in the same way as the specific steps of step 2.
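The patent does not spell out how the model turns an observation sequence into a probability. For an HMM the standard computation is the forward algorithm, sketched here in pure Python for discrete observations; the two-state toy parameters are made up for illustration and are not taken from the patent.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs) under a discrete HMM using the scaled forward algorithm."""
    if not obs:
        return 0.0
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by the emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        # Rescale at every step to avoid underflow on long sequences.
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob


# Toy parameters (assumed): 2 hidden states, 3 observable symbols.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
print(forward_log_likelihood([0, 1], start, trans, emit))
```

Working in log space is why the scores compared against the threshold are typically negative numbers rather than raw probabilities.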

Correspondingly, in step 5, the legality of the observation sequence is judged from the obtained probability: R is defined as a threshold, and parameters with a probability lower than R are identified as abnormal. R is usually set slightly larger than the minimum value over the training set. By analyzing the characteristics of template injection vulnerabilities and drawing on the application of machine learning algorithms in the field of web security, a template injection attack detection method based on the HMM algorithm is provided. Machine learning builds a model from samples with high accuracy, strong fuzzy-judgment capability and high execution speed; a template injection detection model is formed by collecting a large number of template injection samples and fitting them automatically with the HMM algorithm.
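The thresholding rule of step 5 can be sketched as follows, assuming the model returns one score (for example a log-likelihood) per parameter. How much larger than the training minimum R should be is not stated in the patent; the `margin` fraction and both function names are assumptions.

```python
def choose_threshold(training_scores, margin=0.05):
    """Pick R slightly above the lowest score observed on the training set."""
    lowest = min(training_scores)
    spread = max(training_scores) - lowest
    # margin is an assumed fraction of the score range, not a value from the patent.
    return lowest + margin * spread


def is_abnormal(score, threshold):
    """A parameter whose score is lower than R is judged abnormal."""
    return score < threshold


training_scores = [-12.0, -10.0, -8.0]
R = choose_threshold(training_scores)
print(R)
print(is_abnormal(-30.0, R), is_abnormal(-9.0, R))
```

Because R sits above the training minimum, the lowest-scoring training samples fall below it themselves; the margin therefore trades missed detections against false alarms.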

As shown in FIG. 2, an embodiment of the present invention provides a device for detecting a template injection attack, including a collection module, a data processing module, a test training module, a template verification module, and a result processing module.

In some embodiments, the collection module is specifically configured to crawl uniform resource locators from web logs as black sample data via a crawler tool; and crawling a normal website through a crawler tool to obtain the uniform resource locator as white sample data.

In some embodiments, the data processing module comprises a decoding unit, a generalization processing unit and a lexical segmentation unit;

By analyzing the collected url information, the data processing module can identify the following characteristics. The samples are generally URL-encoded, and some may be encoded multiple times, so URL cyclic decoding is required. The url path structures are basically consistent; although the specific path information differs, it consists of letters, digits and special characters and can be generalized. Template injection attacks typically start from basic objects such as strings, lists, tuples, dictionaries and sets to obtain a class, obtain the base class object from that class, and obtain sensitive class objects from the base class object. Different templates have fixed syntax, for example {{%s}} for jinja2 (e.g., {{5*5}}) and ${%s} for mako, and attack statements usually contain __class__, __subclasses__, __init__, popen, read and similar terms, which can be used as detection terms. Because the collected data may contain duplicates, simple deduplication is required after collection to improve the usability of the data.

In some cases, the url parameters are generalized to reduce the vector space and increase computation speed and efficiency. The generalization processing unit is specifically configured to replace the part before the uniform resource locator parameters with U, replace all numbers in the uniform resource locator with O, and delete the comment section.

In some embodiments, the attack payload may be serialized into an observation sequence by lexical segmentation. The word set model and the bag-of-words model are data processing models commonly used in machine learning to characterize string data. Specifically, the url parameters are extracted; because different url parameters are separated by spaces, they can be told apart by splitting on spaces. The extracted url parameters form a vocabulary table, and a word set is obtained by traversing the vocabulary table and removing duplicates. That is, the lexical segmentation unit comprises a parameter extraction submodule, a vocabulary generation submodule and a word set vector generation submodule.

hmmlearn is an HMM implementation for Python, a project that was split out of scikit-learn. The inputs are the word set vector M obtained through lexical segmentation and the corresponding length matrix M_len; M_len is needed because the parameter samples may differ in length, so the lengths must be supplied separately. Training with hmmlearn yields the template injection detection model. The data processing module first processes the sample data to be detected, which includes url decoding, url parameter generalization, url lexical segmentation and the like. The processed data is then input into the detection model to obtain a probability, from which the legality of the observation sequence is judged. R is defined as a threshold, and parameters with a probability lower than R are identified as abnormal; R is usually set slightly larger than the minimum value over the training set.

The computer device provided by the embodiment of the invention may comprise a processor, a communication interface, a memory and a bus, wherein the processor, the communication interface and the memory communicate with each other through the bus. The bus may be used for information transfer between the electronic device and the sensor. The processor may call logic instructions in the memory to perform the following method: step 1, collecting sample data, wherein the sample data comprises black sample data and white sample data; the black sample data is a uniform resource locator with template injection attack, and the white sample data is a uniform resource locator without template injection attack; step 2, processing the collected data to generate a word set vector; step 3, inputting the generated word set vector and the corresponding length matrix into a hidden Markov model for training to obtain a template injection detection model; step 4, performing data processing on the sample data to be detected, and inputting the processed data into the template injection detection model to obtain the probability of the corresponding parameters; step 5, comparing the obtained probability with a probability threshold set when the template injection detection model was trained, and judging parameters whose probability is lower than the threshold to be abnormal.

In some specific embodiments, the program instructions executed by the processor may specifically implement the following steps: step 21: performing uniform resource locator decoding on the collected data; step 22: generalizing the decoded uniform resource locator parameters; step 23: extracting parameters of the uniform resource locator and generating a word set vector through lexical segmentation.

In some specific embodiments, the program instructions executed by the processor may specifically implement the following steps: step 231: extracting parameters of the uniform resource locator; step 232: constructing a vocabulary list by the extracted uniform resource locator parameters; step 233: and traversing the vocabulary table to perform duplication removal processing to obtain a word set vector.

In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method provided by the above method embodiments, for example, including: step 1, collecting sample data, wherein the sample data comprises black sample data and white sample data; the black sample data is a uniform resource locator with template injection attack, and the white sample data is a uniform resource locator without template injection attack; step 2, processing the collected data to generate a word set vector; step 3, inputting the generated word set vector and the corresponding length matrix into a hidden Markov model for training to obtain a template injection detection model; step 4, performing data processing on the sample data to be detected, and inputting the processed data into the template injection detection model to obtain the probability of the corresponding parameters; step 5, comparing the obtained probability with a probability threshold set when the template injection detection model was trained, and judging parameters whose probability is lower than the threshold to be abnormal.

In some specific embodiments, the program instructions executed by the processor in the readable storage medium may specifically implement the following steps: step 21: performing uniform resource locator decoding on the collected data; step 22: generalizing the decoded uniform resource locator parameters; step 23: extracting parameters of the uniform resource locator and generating a word set vector through lexical segmentation.

In some specific embodiments, the program instructions executed by the processor in the readable storage medium may specifically implement the following steps: step 231: extracting parameters of the uniform resource locator; step 232: constructing a vocabulary list by the extracted uniform resource locator parameters; step 233: and traversing the vocabulary table to perform duplication removal processing to obtain a word set vector.

Although the present invention has been described in detail with reference to the drawings and in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and these modifications or substitutions fall within the scope of the present invention. Any change or substitution that a person skilled in the art can easily conceive of within the technical scope disclosed by the present invention shall also be covered. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A detection method for template injection attack is characterized by comprising the following steps:

processing the collected data to generate a word set vector;

2. The method of claim 1, wherein the step of collecting sample data comprises:

3. The method of claim 1, wherein the step of processing the collected data to generate a word set vector comprises:

performing uniform resource locator decoding on the collected data;

generalizing the decoded uniform resource locator parameters;

4. The method of claim 3, wherein the step of generalizing the decoded URL parameters comprises:

replacing the part before the uniform resource locator parameter with U;

replacing all numbers in the uniform resource locator with O;

the comment section is deleted.

5. The method of claim 3, wherein the step of performing parameter extraction on the uniform resource locator and generating a word set vector by lexical segmentation comprises:

extracting parameters of the uniform resource locator;

6. A detection device for template injection attack is characterized by comprising a collection module, a data processing module, a test training module, a template verification module and a result processing module;

7. The apparatus for detecting template injection attack as claimed in claim 6, wherein the data processing module comprises a decoding unit, a generalization processing unit, and a lexical segmentation unit;

8. The apparatus for detecting template injection attack as claimed in claim 7, wherein the lexical segmentation unit includes a parameter extraction submodule, a vocabulary generation submodule, and a word set vector generation submodule;

9. A computer device comprising a processor and a memory, wherein the processor and the memory communicate with each other via a bus; the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform a method of detecting a template injection attack as claimed in any one of claims 1 to 5.

10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for detecting a template injection attack according to any one of claims 1 to 5.