CN115269427A

CN115269427A - Intermediate language representation method and system for WEB injection vulnerability

Info

Publication number: CN115269427A
Application number: CN202210940451.8A
Authority: CN
Inventors: 张国栋; 刘子龙; 郭薇
Original assignee: Shenyang Aerospace University
Current assignee: Shenyang Aerospace University
Priority date: 2022-08-03
Filing date: 2022-08-03
Publication date: 2022-11-01

Abstract

The invention discloses an intermediate language representation method and system for WEB injection vulnerabilities. When vulnerabilities are detected by analyzing source code directly, a large amount of irrelevant information is included, which degrades detection. Once the source code has been escaped (translated) into the intermediate language representation, the semantic information related to the vulnerability can be extracted without noise data unrelated to it, so the logic code directly related to the vulnerability is described more accurately. The intermediate language representation is independent of the actual development language and development environment and can be regarded as a language for describing vulnerability structure. Using the intermediate language representation as input to train a Bi-LSTM network yields a vulnerability detection model, which can then be used to detect vulnerabilities in source code.

Description

Intermediate language representation method and system for WEB injection vulnerability

Technical Field

The invention relates to the field of vulnerability detection for WEB programs, and in particular to an intermediate language representation method and system for WEB injection vulnerabilities and to a PHP WEB injection vulnerability detection method and system based on the intermediate language representation.

Background

Since distributed computing models (e.g., cyber-physical systems and the Internet of Things) became popular, the number of Web applications has grown rapidly, and so has the number of vulnerabilities. Among all WEB vulnerabilities, injection attacks are widely considered one of the most devastating threats. An injection attack may embed scripts in user input, causing the relational database management system behind a WEB application to execute malicious SQL statements, or causing malicious scripts to run in the user's browser. According to the vulnerability distribution published by the national information security vulnerability sharing platform in 2020, WEB application vulnerabilities account for as much as 27.7% of the total, and injection attacks account for half of them across the various vulnerability types. Detecting and preventing injection vulnerabilities is therefore critical to improving the reliability and trustworthiness of modern WEB applications. Among the programming languages used for WEB applications, PHP is flexible to write and fast to run, and it is widely used for WEB development. According to W3Techs statistics, PHP accounted for 78.8% of the programming languages used on WEB servers in 2021, which is still a very high share. WEB vulnerability detection for PHP therefore remains a current research focus.

Generally, there are two methods for detecting PHP injection vulnerabilities. One is dynamic testing, whose basic idea is to simulate a hacker attacking a WEB application: various attack vectors are designed, the system is attacked with them, and the outcome is analyzed to determine whether the attack succeeded. For example, Liu M. et al. observe that malicious SQL statements have distinctive semantics, so test cases with the same semantics can be generated from semantic knowledge, and the generated test cases can detect SQLi vulnerabilities more comprehensively; DeepSQLi was designed on this idea. DeepSQLi uses a neural language model to mutate a given test case or normal user input, thereby generating a more diverse set of test cases. However, because of the nature of black-box testing, inputs cannot be tested completely and exhaustively; for example, detection fails when the test cases lack the semantics needed to cover all execution paths.

The other method is static testing, which analyzes the dependency relationships among variables offline, mainly through lexical and syntactic analysis, to detect whether tainted data can propagate from taint sources to taint sinks. The target program is not run during this process, and the code does not need to be modified. Static detection is itself divided into two types. One is rule-based static detection. Son et al. combine taint analysis with a control flow graph to find semantic vulnerabilities such as missing authorization and denial-of-service attacks, but the detection results have a high false alarm rate and cannot be used directly; they must be checked manually. Wasef et al. propose a detection method for PHP web applications based on genetic algorithms and static analysis that reduces the false positive rate by eliminating infeasible paths in the control flow graph. Russell et al. and Nguyen et al. emphasize the disadvantages of static and dynamic analysis, which can produce a high percentage of errors and false positives when detecting software vulnerabilities. In addition, Seokmo et al. also emphasize the low detection precision of static analysis for vulnerability detection. The other type is learning-based static detection, which avoids manually designing a rule base and can detect novel vulnerabilities. Its main idea is to analyze or preprocess the source code with static analysis and then train a neural network to detect vulnerabilities. For example, Medeiros et al. use data mining techniques instead of manual auditing, thereby reducing false positives. Moreover, Zheng et al. compare conventional machine learning and deep learning methods for software vulnerability detection and show that deep learning methods do perform better than conventional machine learning methods.

Because conventional static detection methods produce a large number of false alarms, introducing machine learning into vulnerability detection is the current trend. Detecting vulnerabilities with machine learning requires preprocessing the source code, locating the vulnerabilities, and extracting vulnerability features, but existing feature extraction and representation methods, such as source code text, AST features, opcodes, and PHP tokens, lack vulnerability-related information. Source code text is what remains after meaningless markup symbols are removed from the source code. AST features express the source code information as an AST with special markup symbols removed. Opcodes are the intermediate code executed by the Zend engine in PHP; the source code, and hence the vulnerability, can be represented in opcode form. PHP tokens are obtained by splitting the source code at PHP markers with the PHP built-in function token_get_all, and the resulting token sequence is used as the vulnerability feature representation of the source code.

However, all of the above intermediate representations have problems. Source code text and AST features contain a large amount of irrelevant information, which makes vulnerability detection inaccurate. Fidalgo et al. extract features from opcodes, but opcodes cannot carry much semantic information, so vulnerability semantics are lost. Fang et al. redesign the PHP tokens to preserve the function names in the source code; however, this method discards function parameters during escaping, and because some functions behave differently depending on their parameters, vulnerability-related information is lost. This representation is also hard to read and contains much redundant information.

To solve these problems, the invention proposes an intermediate language for vulnerability feature extraction that targets SQL injection, cross-site scripting, and command injection vulnerabilities in PHP-based Web applications, and designs a vulnerability detection tool based on it. The intermediate language extracts from the source code only the semantic information related to the vulnerability and represents the vulnerability abstractly; the abstract representation retains the semantic information of the vulnerability and remains readable. Compared with the representations above, the intermediate language representation of the invention is a feature representation dedicated to vulnerabilities: it describes the semantic information of the vulnerability more accurately, eliminates irrelevant code, and keeps the code readable. In addition, the invention trains a Bi-LSTM network with the intermediate language representation as input to obtain a vulnerability detection model, which can be used to detect vulnerabilities in source code.

Disclosure of Invention

The invention discloses an intermediate language representation for WEB injection vulnerabilities that can accurately describe the vulnerability information in code, and discloses a WEB injection vulnerability detection method for PHP based on this intermediate representation. The intermediate language representation accurately extracts the information directly related to the vulnerability and avoids introducing excessive noise data, so the vulnerability detection method that uses the intermediate language also achieves high accuracy.

According to one aspect of the invention, an intermediate language representation method for WEB injection vulnerability is provided, wherein the intermediate language is a classification language for representing various variables and functions; the method comprises the following steps:

when the source code is escaped, performing classified representation according to the semantics of the code, including string operations and sensitive functions; and

escaping each line during the escape process, and, after the escaping is finished, combining the results according to a specific grammar as the intermediate representation of the target code.

Preferably, the different code semantics are classified as:

sensitive functions, cleaning functions, variables, comparators, character strings, character string operations, white lists, type verification, conditional statements, and loop statements;

a sensitive function is a function that can trigger SQL injection, XSS, or command injection; a cleaning function is a function that formally verifies user input; the variables comprise polluted variables and non-polluted variables; the type verification comprises number verification, number conversion, character string verification, and character string conversion; the character string operations comprise connection, addition, and replacement of character strings; and the result of each classification is represented by a specific keyword.

Preferably, each statement of the source code, rather than each word, is classified and escaped to obtain the intermediate language representation of the vulnerability; when the source code is escaped into the intermediate language, a statement is the minimum escape object, and classified escaping is performed according to the semantics of each statement in the source code; the escaping process is context dependent: escaping follows the propagation path of the pollution source, parts unrelated to the vulnerability are not escaped, and the escape of one code statement is a combination of several keywords.

Preferably, the target program is preprocessed before escaping to extract the code containing vulnerabilities; the preprocessing analyzes and extracts the vulnerable part of the code and marks its information, and the marked part serves as the code to be escaped; the marked information comprises the vulnerability type, the file where the vulnerability is located, and the code line where the vulnerability is located.

Preferably, the combination is performed according to a specific syntax when the respective keywords are combined; determining a combination mode of the keywords according to the positions of the keywords in the source codes and the semantics, and performing specific modification on the keywords according to the semantics; the content of the modification includes the length of the character string, and the position of the variable in the character string.

Preferably, the combination of keywords covers both the combination within one code statement and the combination across the entire source code; the escape result of a statement may contain several keywords, which are combined according to a specific rule; after all statements in the source code have been escaped, the results are combined according to the context and the semantic features of the statements, and the combination is used to express the vulnerabilities in the code.

Preferably, the position of the keyword in the source code determines the escape combination mode; the keywords after the code escape of each line are stored according to the execution sequence of the source code, the keywords are rearranged and combined according to the calling relation during combination, and the specific information of the vulnerability, including the file where the vulnerability is located, the line where the sensitive function is located and the line where the pollution source is located, is added after combination.

Preferably, the intermediate representation can directly represent the WEB injection vulnerability and is irrelevant to the language used by the source code; different types of vulnerabilities can be uniformly represented in the intermediate language for projects using different programming languages, rather than for a particular vulnerability in the case of a particular programming language.

According to another aspect of the invention, an intermediate language representation system for WEB injection vulnerability is provided, wherein the system is used for representing classification languages of various variables and functions; the system comprises:

a representing device, used for performing classified representation according to the semantics of the code when the source code is escaped, including string operations and sensitive functions; and

a processing device, used for escaping each line during the escape process and, after the escaping is finished, combining the results according to a specific grammar as the intermediate representation of the target code.

According to another aspect of the present invention, there is provided a method for detecting a WEB injection vulnerability of a PHP based on an intermediate language representation, including the following steps:

S1: acquiring source code, analyzing the source code, and segmenting the source code into different code slices, wherein each code slice is regarded as an independent piece of code;

S2: escaping the code slices according to the context and the semantics of the code, and combining the escaped keywords according to a specific grammar to obtain the intermediate language representation of each code slice;

S3: vectorizing the escape result of each code slice and inputting it into the Bi-LSTM as training data for model training to obtain a vulnerability detection model.

Preferably, when the source code is segmented into code slices, the parts suspected of containing a vulnerability are retained and the safe parts are discarded entirely; in the slicing process, the dangerous function is taken as the starting point, the initial definition position of the function's variables is determined by backward analysis from that function, and the code between the definition position and the dangerous function is extracted as a code slice.

Preferably, the code between the definition position and the dangerous function is determined from the data flow and the control flow, not from the line positions of the two statements.

Preferably, in the process of escaping the source code, the relationships between variables are analyzed, the pollution path of the variables is recorded, the polluted statements are escaped, the escape result of each code statement is stored in the execution order of the code, and after all statements of a code slice have been escaped, the results are combined according to a specific grammar as the intermediate language representation of the code slice.

Preferably, the escape result of each code slice is not isolated and is available to other code slices; a function may be called repeatedly in the source code, and if the function has already been escaped, the first escape result is inserted directly, according to a specific rule, into the escape result at the next occurrence.

Preferably, the escape result of each code slice is vectorized, the vectorized results are brought to a uniform length and used as the input of the Bi-LSTM, an attention mechanism is added to the Bi-LSTM, and the input is classified using the softmax activation function.
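The attention and softmax step named above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the network of the invention: the hidden states are assumed to have been produced already by a Bi-LSTM (which is not implemented here), dot-product attention is one common choice that the text does not specify, and `query` and `class_weights` are hypothetical stand-ins for learned parameters.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend_and_classify(hidden: List[List[float]],
                        query: List[float],
                        class_weights: List[List[float]]) -> List[float]:
    """Pool per-token hidden states with attention, then classify with softmax."""
    # One attention weight per token (assumed dot-product scoring).
    alphas = softmax([dot(h, query) for h in hidden])
    # Context vector: attention-weighted sum of the hidden states.
    context = [sum(a * h[i] for a, h in zip(alphas, hidden))
               for i in range(len(query))]
    # One score per class, normalized into class probabilities.
    return softmax([dot(w, context) for w in class_weights])


# Toy values standing in for Bi-LSTM outputs and learned parameters.
hidden_states = [[0.1, 0.3], [0.7, -0.2], [0.0, 0.5]]
probs = attend_and_classify(hidden_states,
                            query=[1.0, 0.5],
                            class_weights=[[1.0, 0.0], [0.0, 1.0]])
print(probs)
```

The two output values can be read as the probabilities of the "vulnerable" and "safe" classes; which index means which is a convention the sketch leaves open.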

According to another aspect of the present invention, there is provided a system for detecting a WEB injection vulnerability for PHP based on an intermediate language representation, the system comprising:

an acquisition device, used for acquiring source code, analyzing the source code, and segmenting the source code into different code slices, wherein each code slice is regarded as an independent piece of code;

an escape device, used for escaping the code slices according to the context and the semantics of the code and combining the escaped keywords according to a specific grammar to obtain the intermediate language representation of each code slice;

a representing device, used for vectorizing the escape result of each code slice and inputting it into the Bi-LSTM as training data for model training to obtain the vulnerability detection model.

According to a further aspect of the present invention, there is provided a computer-readable storage medium storing a computer program for executing the method according to any one of the above embodiments.

According to still another aspect of the present invention, there is provided an electronic apparatus including:

a processor;

a memory for storing the processor-executable instructions;

the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method according to any of the above embodiments.

According to a further aspect of the embodiments of the present invention, there is provided a computer program product including computer readable code, wherein, when the computer readable code runs on a device, a processor in the device executes instructions for implementing the method of any one of the above embodiments.

The intermediate language representation for WEB injection vulnerability provided by the invention specifically comprises the following points:

1. The intermediate language representation for the WEB injection vulnerability is a classification language for representing various variables and functions. When the source code is escaped, it is classified and represented according to the semantics of the code, including string operations and sensitive functions. During escaping, each line is escaped, and after escaping is finished the results are combined according to a specific grammar.

2. The intermediate language representation aiming at WEB injection vulnerability can be classified into the following classes according to code semantics: sensitive functions (functions that cause bugs such as SQL injection, XSS, and command injection), clean functions (functions that formally validate user input), variables, comparators, strings, string operations, whitelists, type validation, conditional statements, and loop statements. Wherein the variables comprise contaminated variables and uncontaminated variables, the type verification comprises number verification, number conversion, character string verification and character string conversion, and the character string operation comprises connection, addition and replacement of character strings. The results of each classification are represented by a specific keyword.

3. After the source code is escaped to the intermediate language representation, it is combined according to a specific grammar. The combination mode of the keywords is determined according to the positions of the keywords in the source code and the semantics, and the keywords are specifically modified according to the semantics, wherein the modified contents comprise the length of the character string, the positions of variables in the character string and the like.

The implementation technical scheme provided by the invention is specifically a WEB injection vulnerability detection method aiming at PHP based on intermediate representation, and the method comprises the following steps:

S1: acquiring source code, analyzing the source code, and segmenting the source code into different code slices, wherein each code slice is regarded as an independent piece of code;

S2: escaping the code slices according to the context and the semantics of the code, and combining the escaped keywords according to a specific grammar to obtain the intermediate language representation of each code slice;

Program slicing is a program decomposition technique: it searches for relevant characteristics inside a program, decomposes the program accordingly, and then analyzes the resulting slices. In vulnerability detection, program slicing can strip out the code segments that contain vulnerabilities. When a vulnerability is sliced, the dangerous function is taken as the starting point, the initial definition position of the function's variables is determined by backward analysis from that function, and the code between the definition position and the dangerous function is extracted as the code slice; this is the code that processes the entry point of user input or that depends on the entry point and the dangerous function. Slicing is both an intra-procedural and an inter-procedural analysis, because it tracks entry points and their dependencies through the source code, traversing different files and functions. The analysis is also context sensitive, because it takes the results of function calls into account.
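The backward analysis described above can be illustrated with a toy slicer. This is a minimal sketch under simplifying assumptions: each statement is given with hand-written sets of defined and used variables (a real tool would derive them from a parser), only data flow within one straight-line file is followed, and the `SINKS` set and the sample PHP lines are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set

SINKS = {"system", "mysql_query"}  # hypothetical list of dangerous functions


@dataclass
class Stmt:
    line: int
    text: str
    defs: Set[str]   # variables assigned by the statement
    uses: Set[str]   # variables read by the statement
    call: str = ""   # name of the function called, if any


def backward_slice(program: List[Stmt]) -> List[Stmt]:
    """Keep the dangerous call and every earlier statement its arguments depend on."""
    sink_idx = next(i for i, s in enumerate(program) if s.call in SINKS)
    relevant = set(program[sink_idx].uses)
    kept = [program[sink_idx]]
    for stmt in reversed(program[:sink_idx]):
        if stmt.defs & relevant:
            kept.append(stmt)
            relevant |= stmt.uses  # conservative: definitions are never killed
    return list(reversed(kept))


program = [
    Stmt(1, '$tainted = $_GET["id"];', {"$tainted"}, {"$_GET"}),
    Stmt(2, '$title = "report";', {"$title"}, set()),
    Stmt(3, '$query = "find / size " . $tainted;', {"$query"}, {"$tainted"}),
    Stmt(4, 'print_header($title);', set(), {"$title"}, call="print_header"),
    Stmt(5, 'system($query);', set(), {"$query"}, call="system"),
]

for stmt in backward_slice(program):
    print(stmt.line, stmt.text)
```

Lines 2 and 4 are dropped because nothing the dangerous call reads depends on them, which mirrors the idea that the slice is determined by dependencies and not by line positions.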

A code slice is escaped statement by statement, analyzing each statement together with its context. The method first checks whether any variable in the statement appears in the pollution variable list; if none does, the statement is not escaped. If one does, the statement is analyzed according to its operation, such as an assignment statement, a branch statement, or another kind of statement, and a different escape is applied for each kind of operation. If new polluted variables arise after the escape, the pollution variable list is updated. The result is recorded after the slice has been escaped, and if the slice is a class, the escaped intermediate language is stored in a dedicated location.
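The loop over statements and the pollution variable list can be sketched as follows. This is an illustrative sketch only: the statement kinds (`source`, `alias`, `concat`, `sink_ci`) are a hypothetical simplification, the keywords `tainted_var`, `addstr_char5`, and `sink_ci` are borrowed from the worked example in the Detailed Description, and the length suffix is fixed for brevity.

```python
from typing import List, Set, Tuple

# Toy statement: (kind, target variable or "" if none, variables read)
Statement = Tuple[str, str, Set[str]]


def escape_slice(statements: List[Statement]) -> List[str]:
    """Escape only statements that touch a polluted variable."""
    tainted: Set[str] = set()  # the pollution variable list
    keywords: List[str] = []
    for kind, target, reads in statements:
        if kind == "source":  # user input enters the program
            tainted.add(target)
            keywords.append("tainted_var")
            continue
        if not (reads & tainted):  # no polluted variable involved: not escaped
            continue
        if kind == "alias":  # plain copy: new polluted variable, no keyword
            tainted.add(target)
        elif kind == "concat":
            tainted.add(target)
            keywords.append("addstr_char5")  # suffix fixed in this sketch
        elif kind == "sink_ci":
            keywords.append("sink_ci")
    return keywords


statements = [
    ("source", "$tainted", set()),
    ("alias", "$id", {"$tainted"}),
    ("concat", "$title", {"$name"}),  # reads nothing polluted, so it is skipped
    ("concat", "$query", {"$id"}),
    ("sink_ci", "", {"$query"}),
]
result = escape_slice(statements)
print(result)
```

The keywords come out in execution order; combining them into the final representation is a separate step.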

The escaped intermediate language representation is stored as a character string, which the model cannot consume directly, so it cannot be used as an input variable as is. The invention uses word2vec to vectorize the string-form intermediate language representation and obtain word vectors binData that the model can use. Because the training data must have a uniform length, binData is zero-padded or truncated. If the length of binData is smaller than a specified threshold w (w = 200 here), zeros are appended after binData; if the length of binData is greater than w, it is truncated from the back. Finally, the processed vectors are stored in a training data set, and the model is trained with this data set to obtain the vulnerability detection model.
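The padding and truncation rule can be sketched in a few lines. This sketch assumes that binData is a list of per-token vectors of equal dimension `dim`, which the text does not state explicitly, and it leaves out the vectorization step itself; the function name `fit_length` is hypothetical.

```python
from typing import List


def fit_length(bin_data: List[List[float]], dim: int, w: int = 200) -> List[List[float]]:
    """Return exactly w vectors: truncate the tail or append zero vectors."""
    if len(bin_data) >= w:
        return bin_data[:w]  # truncate from the back
    padding = [[0.0] * dim for _ in range(w - len(bin_data))]
    return bin_data + padding  # zeros are appended after the data


short = [[1.0, 2.0] for _ in range(3)]
long_ = [[1.0, 2.0] for _ in range(250)]
print(len(fit_length(short, dim=2)), len(fit_length(long_, dim=2)))
```

Padding at the end and cutting at the end both keep the beginning of the sequence intact, which matches the order given in the text.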

The WEB injection vulnerability detection method aiming at the PHP based on the intermediate representation provided by the invention utilizes the intermediate language representation to accurately describe vulnerability information in the source code. The intermediate language representation can extract semantic information related to the vulnerability, and the extracted information does not contain noise data irrelevant to the vulnerability information and can more accurately describe information which has a direct relation with the vulnerability. The WEB injection vulnerability detection method aiming at the PHP based on the intermediate representation is used for replacing the traditional static detection method, so that the vulnerability detection accuracy can be improved, and the false alarm rate in the detection can be reduced.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

In order to more clearly illustrate the embodiments or prior art solutions of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

FIG. 1 is a grammar rule for an intermediate language representation provided by a disclosed embodiment of the invention;

FIG. 2 is an exemplary diagram of a program slicing process provided by the disclosed embodiments of the invention;

FIG. 3 is an exemplary diagram of a process for escaping a code chip provided by a disclosed embodiment of the invention;

FIG. 4 is a process for vulnerability code analysis based on intermediate language representation provided by the disclosed embodiments of the present invention;

FIG. 5 is a flowchart of an intermediate language representation method for WEB injection vulnerabilities in accordance with an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of an intermediate language representation system for WEB injection vulnerabilities according to an embodiment of the present disclosure;

FIG. 7 is a flowchart of a method for detecting a WEB injection vulnerability of a PHP based on intermediate language representation according to an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of a system for detecting a WEB injection vulnerability of a PHP based on an intermediate language representation according to an embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems consistent with certain aspects of the invention, as detailed in the appended claims.

To design the keywords of the intermediate language, the present invention studies which code elements can operate on entry points and either lead to vulnerabilities or prevent them (e.g., a function that performs cleaning or replaces characters in a string). In addition, many code segments, with and without vulnerabilities, were examined to define these code elements. The code elements that represent PHP functions were also studied to determine which of their parameters are relevant to vulnerability detection. Some code elements are represented by multiple tokens; for example, a concatenation onto an unclosed string, such as "cat " . $tainted, is represented as conc_char0.

Table 1 shows all keywords proposed in the present application. The first column lists the keywords of the intermediate language, which represent vulnerability-related information such as pollution type, cleaning functions, and string operations. The second column describes the role of each keyword, and the third column gives an example. Besides the keywords in the table, there are additional modifiers that do not occur alone but qualify the keywords in the table. For example, a string concatenation such as "find / size '" . $tainted . "'" is qualified according to the location of the pollution source and the form of the surrounding strings: it is denoted conc_m_char5, where conc denotes the concatenation operation, m denotes that the pollution source is in the middle of the concatenation, and char5 denotes that the length of the string closing the pollution source is greater than 0 and less than 5.
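The way such a qualified keyword is composed can be sketched as follows. This is a hedged illustration: only the two length buckets (char0 and char5) and the position marker m are taken from the text; the function names are hypothetical, and lengths of 5 or more are rejected because no bucket is given for them here.

```python
from typing import Optional


def length_suffix(closing_len: int) -> str:
    """Map the length of the string closing the pollution source to a suffix."""
    if closing_len < 0:
        raise ValueError("length cannot be negative")
    if closing_len == 0:
        return "char0"
    if closing_len < 5:
        return "char5"
    # Only the two buckets above are described; longer strings are left open.
    raise ValueError("no bucket is described for lengths of 5 or more")


def concat_keyword(op: str, closing_len: int, position: Optional[str] = None) -> str:
    """Join the operation, an optional position marker, and the length suffix."""
    parts = [op]
    if position is not None:
        parts.append(position)
    parts.append(length_suffix(closing_len))
    return "_".join(parts)


print(concat_keyword("conc", 1, position="m"))
print(concat_keyword("conc", 0))
```

With a closing string of length 1 and the pollution source in the middle, the first call yields conc_m_char5; with no closing string and no position marker, the second yields conc_char0.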

TABLE 1

Fig. 1 illustrates the syntax rules of the intermediate language proposed in the present application. A code slice can be converted into the intermediate language according to this syntax, and the conversion process is shown in fig. 3, where the left side is the code after slicing and the right side is the escape result of each line. The first line is identified as the pollution source and is escaped as tainted_var. The second line is the condition part of a branch statement, so it is first escaped as cond; the condition validates the format of the input with the filter_var function, which is escaped as format_v, so the second line is escaped as cond format_v. The third line is the assignment executed when the condition holds; it essentially gives the polluted variable an alias and needs no escaping, since all polluted variables are identified as tainted_var in the intermediate language. The fourth and fifth lines can be escaped together: they form the assignment executed when the condition does not hold, which reassigns the polluted variable with a constant and is escaped as assign, so the fourth and fifth lines are escaped as else assign. The sixth line is a string concatenation operation in which the length of the string closing the polluted variable $tainted is 1, so it is escaped as addstr_char5; if the length of the closing string were 0, it would be escaped as addstr_char0. The last line is a sensitive function that can cause a command injection vulnerability and is escaped as sink_ci. After all statements have been escaped, the results are assembled into the intermediate language representation of the code slice, so this code slice is escaped as "sink_ci addstr_char5 cond format_v else assign tainted_var".
After escaping, the files and lines in which the pollution source and the sensitive function are located are appended as a comment, for example "sink_ci addstr_char5 cond format_v else assign tainted_var // tainted.php 5 and tainted.php 15".
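The assembly of the example above can be sketched as follows. This is a minimal illustration and not the implementation of the invention: the function names length_suffix and assemble are hypothetical, the grouping of the branch keywords and the sink-first ordering are inferred from the example string only, and the naming of enclosing strings of length 5 or more is not stated in the text, so the sketch refuses those inputs.

```python
from typing import List


def length_suffix(enclosing_len: int) -> str:
    """Map the length of the string closing the polluted variable to a suffix."""
    if enclosing_len == 0:
        return "char0"
    if 0 < enclosing_len < 5:
        return "char5"
    # The text does not say how lengths of 5 or more are named.
    raise ValueError("suffix for this length is not specified in the text")


def assemble(groups: List[List[str]], source_loc: str, sink_loc: str) -> str:
    """Join per-statement keyword groups from sink back to source, then
    append the locations of the pollution source and sensitive function."""
    keywords = [kw for group in reversed(groups) for kw in group]
    return " ".join(keywords) + f" // {source_loc} and {sink_loc}"


# Per-statement escape results of the example, in execution order.
# The alias assignment (third line) contributes no keyword.
groups = [
    ["tainted_var"],                         # pollution source
    ["cond", "format_v", "else", "assign"],  # branch statement, kept together
    ["addstr_" + length_suffix(1)],          # concatenation, closing length 1
    ["sink_ci"],                             # command injection sink
]

result = assemble(groups, "tainted.php 5", "tainted.php 15")
print(result)
```

Keeping each statement's keywords as one group lets the statement order be reversed without reordering the keywords inside a branch.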

In summary, the intermediate language representation for the WEB injection vulnerability provided by the present invention specifically includes the following points:

1. The intermediate language representation for the WEB injection vulnerability is a classification language for representing various variables and functions. When the source code is escaped, it is represented by category according to the semantics of the code, including character string operations, sensitive functions, and the like. During escaping, each line is escaped individually, and after escaping is finished the results are combined according to a specific grammar.

2. The intermediate language representation for the WEB injection vulnerability can be divided into the following classes according to code semantics: sensitive functions (functions that cause vulnerabilities such as SQL injection, XSS, and command injection), cleaning functions (functions that formally validate user input), variables, comparators, strings, string operations, whitelists, type validation, conditional statements, and loop statements. Several of these classes contain subclasses.

In addition, since conventional static detection methods produce a large number of false alarms, introducing machine learning into vulnerability detection has become a development trend. Vulnerability detection by machine learning requires preprocessing the source code, locating the vulnerability, and extracting vulnerability features, but existing feature extraction methods lack vulnerability-related information. Therefore, this embodiment provides a WEB injection vulnerability detection method for PHP based on the intermediate language representation, which comprises the following steps:

S1: acquiring the source code, analyzing the source code, and dividing it into different code slices, each of which is regarded as an independent piece of code;

S2: escaping the code slice according to the context and semantics of the code, and combining the escaped keywords according to a specific grammar to obtain the intermediate language representation of the code slice;

S3: vectorizing the escape result of each code slice and inputting it into the Bi-LSTM as training data for model training to obtain a vulnerability detection model.

Program slicing is a program decomposition technique: the program is decomposed by searching for relevant features inside it, and the resulting program slices are then analyzed. The slicing process is shown in fig. 2. The slicing operation first intercepts the code of the vulnerable portion, cutting the source code according to the positions of the pollution source and the sensitive function. In the source code illustrated in the figure, the pollution source is the user input $_POST["Submit"] and the sensitive function is mysql_query, so the code between them is intercepted, and comments and other extraneous characters are then removed. Slicing the source code yields many code slices, each of which may contain an exploitable vulnerability. Each slice is then escaped; escaping further extracts the key information, and whether a vulnerability exists is judged from the escape result.
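The interception step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name cut_slice and the PHP lines are hypothetical (the code of fig. 2 is not reproduced here), the slice is taken purely by line position, and the comment removal is naive in that it does not recognize comment markers inside string literals.

```python
import re
from typing import List


def cut_slice(lines: List[str], source_line: int, sink_line: int) -> List[str]:
    """Keep the lines from the pollution source to the sensitive function
    (1-indexed, inclusive) and drop comments and blank lines."""
    kept = []
    for line in lines[source_line - 1:sink_line]:
        # Naive removal of // and # line comments.
        line = re.sub(r"(//|#).*$", "", line).strip()
        if line:
            kept.append(line)
    return kept


# Hypothetical PHP source, one statement per line.
php_lines = [
    '<?php',
    '$id = $_POST["Submit"];  // pollution source',
    '# build the query',
    '$sql = "SELECT * FROM users WHERE id = \'" . $id . "\'";',
    'mysql_query($sql);',
    'echo "done";',
]

code_slice = cut_slice(php_lines, source_line=2, sink_line=5)
for statement in code_slice:
    print(statement)
```

Only the three statements between the pollution source and mysql_query remain; the comment line and the trailing echo are discarded.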

The escaping of a code slice analyzes each statement together with its context and can be decomposed into the following steps. First, it is checked whether any variable in the statement under analysis is in the pollution variable list; if not, the statement is not escaped. If so, the statement is analyzed according to its specific operation, such as an assignment statement, a branch statement, or another kind of statement, and a different escape is performed for each kind of operation. After the statement is escaped, the pollution variable list is updated if new polluted variables have appeared. The result is recorded after the slice is escaped, and if the slice is a class, the escaped intermediate language is stored in a special position. The specific escaping process is shown in fig. 3.
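The steps above can be sketched as a loop over pre-parsed statements. This is only an illustrative skeleton: the dictionary fields (kind, uses, target, keyword) and the function name escape_statements are hypothetical, and the keyword of each statement is supplied by hand rather than derived from real PHP code.

```python
from typing import Dict, List, Set


def escape_statements(statements: List[Dict]) -> List[List[str]]:
    """Escape only statements that involve a polluted variable and keep
    the pollution variable list up to date."""
    tainted: Set[str] = set()            # pollution variable list
    results: List[List[str]] = []
    for stmt in statements:
        kind = stmt["kind"]
        uses = set(stmt.get("uses", []))
        if kind != "source" and not (uses & tainted):
            continue                     # no polluted variable: not escaped
        if kind == "source":
            results.append(["tainted_var"])
        elif kind == "alias":
            pass                         # alias of a polluted variable emits nothing
        elif kind in ("concat", "sink"):
            results.append([stmt["keyword"]])
        else:
            raise ValueError(f"unsupported statement kind: {kind}")
        if "target" in stmt:
            tainted.add(stmt["target"])  # new polluted variable
    return results


statements = [
    {"kind": "source", "target": "$tainted"},
    {"kind": "alias", "uses": ["$tainted"], "target": "$arg"},
    {"kind": "concat", "uses": ["$unrelated"], "target": "$log",
     "keyword": "addstr_char0"},         # touches no polluted variable
    {"kind": "concat", "uses": ["$arg"], "target": "$cmd",
     "keyword": "addstr_char5"},
    {"kind": "sink", "uses": ["$cmd"], "keyword": "sink_ci"},
]

escaped = escape_statements(statements)
print(escaped)
```

The third statement is skipped because none of its variables is in the pollution variable list, while the alias extends the list so that the later concatenation is recognized as polluted.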

The escaped intermediate language representation is stored as a character string, and a feature representation in this format cannot be recognized directly by the model, so it cannot be used directly as an input variable. The invention uses word2vec to vectorize the string-form intermediate language representation and obtain the word vectors binData that the model can use. Since the training data must have a uniform length, binData is zero-padded or truncated: if the length of binData is smaller than the prescribed threshold w (here w = 200), zeros are appended after binData; if the length is greater than w, binData is truncated from the back. Finally, the processed vectors are stored in the training data set, which is used to train the model and obtain the vulnerability detection model. The specific training process is shown in fig. 4.
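The length alignment described above can be sketched as follows. The function name fix_length and the explicit dim parameter are hypothetical, and "truncation from the back" is read here as discarding the elements beyond position w, which is an assumption.

```python
from typing import List

W = 200  # threshold w given in the text


def fix_length(bin_data: List[List[float]], dim: int, w: int = W) -> List[List[float]]:
    """Pad with zero vectors or truncate so that exactly w vectors remain."""
    if len(bin_data) >= w:
        return bin_data[:w]              # drop everything after position w
    padding = [[0.0] * dim for _ in range(w - len(bin_data))]
    return bin_data + padding            # zeros appended after the data


# Illustrative two-dimensional word vectors.
short = [[1.0, 2.0] for _ in range(3)]
long = [[1.0, 2.0] for _ in range(250)]

padded = fix_length(short, dim=2)
truncated = fix_length(long, dim=2)
print(len(padded), len(truncated))
```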

To demonstrate the effectiveness of the model in vulnerability detection, the application tests the model with data from SARD (Software Assurance Reference Dataset). The CWE-78, CWE-79, and CWE-89 vulnerability types were selected for testing; all data sets were checked manually, and many errors in the PHP test cases were corrected. The classification results are collated into the confusion matrix shown in table 2 and the evaluation index results shown in table 3. Tables 2 and 3 show that a vulnerability detection model using the intermediate language representation as input can correctly classify different vulnerabilities and that, owing to the nature of the intermediate language representation, the model does not classify a vulnerability as another type of vulnerability. In summary, the vulnerability detection method based on the intermediate language representation can effectively detect vulnerabilities.

TABLE 2

TABLE 3

Fig. 5 is a flowchart of an intermediate language representation method for WEB injection vulnerabilities according to an embodiment of the present disclosure. The method 500 is used to represent a classification language for various variables and functions; the method 500 includes:

step 501, when the source code is escaped, performing classified representation according to the semantics of the code, including character string operations and sensitive functions; and

step 502, escaping each line during the escaping process, and after escaping is finished combining the results according to a specific grammar as the intermediate representation of the target code.

In one embodiment, different code semantics are classified as:

the sensitive functions are functions that trigger SQL injection, XSS, and command injection; the cleaning functions are functions that formally verify user input; the variables comprise polluted variables and unpolluted variables; the type verification comprises number verification, number conversion, character string verification, and character string conversion; the character string operations comprise connection, addition, and replacement of character strings; the result of each classification is represented by a specific keyword.

In one embodiment, instead of classifying each word in the source code to obtain the intermediate language representation of a vulnerability, each statement in the source code is classified and escaped; when the source code is escaped into the intermediate language, a statement is the minimum unit of escaping, and classified escaping is carried out according to the semantics of each statement in the source code; the escaping process is context-dependent, targeted escaping is carried out along the propagation path of the pollution source, parts unrelated to the vulnerability are not escaped, and the escape result of one code statement is a combination of several keywords.

In one embodiment, the target program is preprocessed before escaping, and the code containing vulnerabilities is extracted; the preprocessing first analyzes and extracts the part of the code that contains a vulnerability and marks it with information, and the marked part serves as the code to be escaped; the marked information comprises the vulnerability type, the file, and the code line.

In one embodiment, the individual keywords are combined according to a specific grammar; the combination mode of the keywords is determined by their positions in the source code and their semantics, and the keywords are specifically modified according to the semantics; the content of the modification includes the length of the character string and the position of the variable in the character string.

In one embodiment, the combination of keywords includes both combination within a single code statement and combination across the entire source code; the escape result of a statement may contain several keywords, which are combined according to a specific rule; after all statements in the source code have been escaped, the results are combined according to the context and semantic features of the statements to represent the vulnerabilities in the code.

In one embodiment, the position of a keyword in the source code determines how it is combined after escaping; the keywords obtained by escaping each line of code are stored in the execution order of the source code, the keywords are rearranged and combined according to the calling relationship during combination, and the specific information of the vulnerability, including the file where the vulnerability is located, the line where the sensitive function is located, and the line where the pollution source is located, is added after combination.

In one embodiment, the intermediate representation may directly represent the WEB injection vulnerability, independent of the language used by the source code; different types of vulnerabilities can be uniformly represented in the intermediate language for projects using different programming languages, rather than for a particular vulnerability in the case of a particular programming language.

Fig. 6 is a schematic structural diagram of an intermediate language representation system for WEB injection vulnerabilities according to an embodiment of the present disclosure. The system 600 is used to represent a classification language for various variables and functions; the system 600 includes:

the representing device 601 is used for performing classified representation according to the semantics of the code when the source code is subjected to escaping, and comprises character string operation and sensitive functions; and

and a processing device 602, configured to perform escape for each line during the escape process, and combine the lines according to a specific syntax after the escape is completed as an intermediate representation of the target code.

Fig. 7 is a flowchart of a method for detecting a WEB injection vulnerability for PHP based on intermediate language representation according to an embodiment of the present disclosure. The detection method 700 includes the steps of:

step 701: acquiring the source code, analyzing the source code, and dividing it into different code slices, each of which is regarded as an independent piece of code;

step 702: escaping the code slices according to the context and semantics of the code, and combining the escaped keywords according to a specific grammar to obtain the intermediate language representation of the code slices;

step 703: vectorizing the escape result of each code slice and inputting it into the Bi-LSTM as training data for model training to obtain a vulnerability detection model.

In one embodiment, when the source code is divided into code slices, the portions suspected of containing vulnerabilities are retained and the safe portions are discarded completely; in the slicing process, the danger function is taken as the program entry point, reverse analysis is carried out from this function to determine the initial definition position of the function's variables, and the code between the definition position and the danger function is intercepted as a code slice.

In one embodiment, the code between the definition position and the danger function is determined from the data flow and the control flow, not from the code line positions of the two statements.
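The reverse analysis from the danger function can be sketched as a backward walk over data and control dependencies. This is a conservative simplification and not the analysis of the invention: the statement fields (defines, uses, guard) and the function name backward_slice are hypothetical, every earlier definition of a needed variable is kept, and the example statements are illustrative.

```python
from typing import Dict, List, Set


def backward_slice(statements: List[Dict], sink_index: int) -> List[int]:
    """Return the indices of the statements the sink depends on, in source order."""
    needed: Set[str] = set()
    keep: Set[int] = {sink_index}
    for i in range(sink_index, -1, -1):
        stmt = statements[i]
        if i not in keep and stmt.get("defines") not in needed:
            continue                         # unrelated to the sink
        keep.add(i)
        needed.update(stmt.get("uses", []))  # data dependencies
        if "guard" in stmt:
            keep.add(stmt["guard"])          # controlling condition
    return sorted(keep)


statements = [
    {"defines": "$id", "uses": []},                  # 0: $id = $_POST["Submit"]
    {"defines": "$title", "uses": []},               # 1: unrelated assignment
    {"uses": ["$id"]},                               # 2: if (is_numeric($id))
    {"defines": "$sql", "uses": ["$id"], "guard": 2},  # 3: builds the query
    {"uses": ["$title"]},                            # 4: echo $title
    {"uses": ["$sql"]},                              # 5: mysql_query($sql)
]

slice_indices = backward_slice(statements, sink_index=5)
print(slice_indices)
```

Statements 1 and 4 lie between the pollution source and the danger function by line position, yet they are excluded because neither the data flow nor the control flow connects them to the sink.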

In one embodiment, in the process of escaping the source code, the relationships among variables are analyzed, the pollution path of the variables is recorded, the polluted statements are escaped, and the escape result of each code statement is stored in the execution order of the code; after the whole code slice has been escaped, the results are combined according to a specific grammar as the intermediate language representation of the code slice.

In one embodiment, the escape result of each code slice is not independent and is available to other code slices; the source code may contain repeated calls to the same function, and if the function has already been escaped, the first escape result is inserted directly into the escape result, according to a specific rule, the next time the function is escaped.
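The reuse of an earlier escape result can be sketched with a simple cache keyed by function name. The class EscapeCache, the stand-in escape function, and the function name check_input are all hypothetical, and the "specific rule" for inserting the cached result is not given in the text, so the sketch simply returns a copy of the stored keywords.

```python
from typing import Callable, Dict, List


class EscapeCache:
    """Store the escape result of each function the first time it is escaped."""

    def __init__(self, escape_fn: Callable[[List[str]], List[str]]):
        self._escape_fn = escape_fn
        self._results: Dict[str, List[str]] = {}
        self.calls = 0                       # how often escape_fn really ran

    def escape(self, function_name: str, body: List[str]) -> List[str]:
        if function_name not in self._results:
            self.calls += 1
            self._results[function_name] = self._escape_fn(body)
        return list(self._results[function_name])


def fake_escape(body: List[str]) -> List[str]:
    # Stand-in for the real escaping of a function body.
    return ["cond", "format_v"]


cache = EscapeCache(fake_escape)
body = ["if (filter_var($x, FILTER_VALIDATE_INT))"]
first = cache.escape("check_input", body)
second = cache.escape("check_input", body)   # served from the cache
print(first, second, cache.calls)
```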

In one embodiment, the escape result of each slice is vectorized, the vectorized result is aligned to a uniform length and used as the input to the Bi-LSTM, an attention mechanism is added to the Bi-LSTM, and the input is classified using the softmax activation function.
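The attention and softmax steps can be sketched in plain Python as follows. The sketch does not implement the Bi-LSTM itself: the hidden vectors stand in for its outputs, the dot-product scoring against a query vector is an assumed form of attention (the text does not specify one), and all function names and numeric values are illustrative.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden: List[List[float]], query: List[float]) -> List[float]:
    """Weight each hidden state by its softmax-normalized score against the query."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]


def classify(hidden: List[List[float]], query: List[float],
             class_weights: List[List[float]]) -> List[float]:
    """Attention pooling followed by a linear layer and softmax."""
    context = attention_pool(hidden, query)
    logits = [sum(w * c for w, c in zip(row, context)) for row in class_weights]
    return softmax(logits)


# Stand-ins for Bi-LSTM outputs, the attention query, and the output layer.
hidden = [[1.0, 0.0], [1.0, 0.0]]
query = [0.5, 0.5]
class_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

context = attention_pool(hidden, query)
probs = classify(hidden, query, class_weights)
print(context, probs)
```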

Fig. 8 is a schematic structural diagram of a system for detecting a WEB injection vulnerability of a PHP based on intermediate language representation according to an embodiment of the present disclosure. The detection system 800 includes:

an acquiring device 801, configured to acquire the source code, analyze the source code, and divide it into different code slices, each of which is regarded as an independent piece of code;

an escape device 802, configured to escape the code slices according to the context and semantics of the code, and combine the escaped keywords according to a specific grammar to obtain the intermediate language representation of the code slices;

and a representing device 803, configured to vectorize the escape result of each code slice and input it into the Bi-LSTM as training data for model training to obtain a vulnerability detection model.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

Claims

1. The intermediate language representation method for WEB injection vulnerability is characterized in that the method is used for representing classification languages of various variables and functions; the method comprises the following steps:

when source codes are subjected to escaping, classified representation is carried out according to the semantics of the codes, including character string operation and sensitive functions; and

and performing escape on each line in the escape process, and combining the lines according to a specific grammar after the escape is finished to be used as an intermediate representation of the target code.

2. Intermediate language representation method according to claim 1, characterized in that different code semantics are classified as:

3. The intermediate language representation method of claim 2, wherein classification and escaping are performed on each statement of the source code, instead of classifying each word in the source code to obtain the intermediate language representation of the vulnerability; when the source code is escaped into the intermediate language, a statement is the minimum unit of escaping, and classified escaping is carried out according to the semantics of each statement in the source code; the escaping process is context-dependent, targeted escaping is performed along the propagation path of the pollution source, parts unrelated to the vulnerability are not escaped, and the escape result of one code statement is a combination of several keywords.

4. An intermediate language representation method as claimed in claim 3, wherein the target program is preprocessed before escaping to extract the code containing vulnerabilities; the preprocessing analyzes and extracts the part of the code that contains a vulnerability and marks it with information, and the marked part serves as the code to be escaped; the marked information comprises the vulnerability type, the file where the vulnerability is located, and the code line where the vulnerability is located.

5. An intermediate language representation method according to claim 1, wherein the respective keywords are combined in accordance with a specific grammar when combined; determining a combination mode of the keywords according to the positions of the keywords in the source codes and the semantics, and performing specific modification on the keywords according to the semantics; the content of the modification includes the length of the character string, and the position of the variable in the character string.

6. An intermediate language representation method according to claim 5, wherein the combination of keywords includes both combination within a single code statement and combination across the entire source code; the escape result of a statement may contain several keywords, which are combined according to a specific rule; after all statements in the source code have been escaped, the results are combined according to the context and semantic features of the statements to represent the vulnerabilities in the code.

7. An intermediate language representation method as claimed in claim 6, wherein the position of the keyword in the source code determines the combination after its escape; the keywords after the code escape of each line are stored according to the execution sequence of the source code, the keywords are rearranged and combined according to the calling relation during combination, and the specific information of the vulnerability, including the file where the vulnerability is located, the line where the sensitive function is located and the line where the pollution source is located, is added after combination.

8. An intermediate language representation system aiming at WEB injection vulnerability, which is used for representing classification languages of various variables and functions; the system comprises:

9. A PHP-based WEB injection vulnerability detection method based on intermediate language representation is characterized by comprising the following steps:

S2: escaping the code slices according to the context and semantics of the code, and combining the escaped keywords according to a specific grammar to obtain the intermediate language representation of the code slices;

10. A system for detecting a WEB injection vulnerability for PHP based on an intermediate language representation, the system comprising:

the acquisition device is used for acquiring the source code, analyzing the source code, and dividing it into different code slices, each of which is regarded as an independent piece of code;

the escape device is used for escaping the code slices according to the context and semantics of the code, and combining the escaped keywords according to a specific grammar to obtain the intermediate language representation of the code slices;

and the representing device is used for vectorizing the escape result of each code slice and inputting it into the Bi-LSTM as training data for model training to obtain the vulnerability detection model.