CN117454378A - Static detection method and device for stored XSS vulnerabilities in modern Web applications

Info

Publication number: CN117454378A
Application number: CN202311241254.8A
Authority: CN
Inventors: 苏赫; 李丰; 许丽丽; 晁会娜; 霍玮
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2023-07-17
Filing date: 2023-09-25
Publication date: 2024-01-26

Abstract

The invention discloses a static detection method and device for stored XSS vulnerabilities in modern Web applications. The method comprises the following steps: acquiring the source code of a target program; obtaining database operation triple information based on the source code, wherein the database operation triple information comprises the operation action on the database, the table name targeted by the operation action, and the field name targeted by the operation action; and performing two-stage taint analysis, from a taint source to a database write operation and from a database read operation to a taint sink, based on the database operation triple information, so as to obtain a static detection result for stored XSS vulnerabilities. The invention helps a static taint analysis tool identify where tainted data is read from and written to the database, and locates stored XSS vulnerabilities in the system after the taint read paths and taint write paths are spliced.

Description

Static detection method and device for stored XSS vulnerabilities in modern Web applications

Technical Field

The invention relates to the field of software security and vulnerability discovery, and in particular to a static detection method and device for stored XSS vulnerabilities in modern Web applications.

Background

Among the injection vulnerabilities of Web applications, a second-order injection vulnerability (Second-order Vulnerabilities) arises when an attacker sends a payload to the target system and the payload does not trigger the vulnerability directly, but is stored in a database and triggered later.

Second-order injection vulnerabilities are taint-style vulnerabilities. They arise because a user passes a payload into the system through one of its interaction points, and the payload, after propagating through the system, eventually reaches a position where it causes harm. The stored XSS vulnerability is a typical second-order injection vulnerability in Web applications: a payload that triggers XSS is stored in a database, and is read from the database and triggered in a later operation. Because the payload can remain permanently in the persistent storage of the system, it is more harmful, and can be used in scenarios such as user Cookie theft, keystroke logging, and large-scale worm attacks.

In general, those skilled in the art refer to the system interaction point through which the payload enters as the taint source, the position where harm is caused as the taint sink, and the path along which the payload propagates in the system as the taint propagation path. In a second-order injection vulnerability, a complete taint propagation path must pass through the database. Mining such vulnerabilities therefore requires identifying database reads and writes and connecting the two taint propagation paths, which in turn requires obtaining information about the database read and write operations in the code.

Traditional static mining of second-order injection vulnerabilities generally adopts a method based on String Analysis: the string contents in the source code of the target Web application are collected, the possible values of the string expressions are statically computed for all strings, and complete SQL query statements are then recovered from the code, yielding the read and write positions of database operations. However, this method is not suitable for locating database read and write operations in modern Web applications. To meet code decoupling requirements, modern Web applications typically adopt a development mode in which business logic is separated from the data model, and the whole system is layered, mainly into a business logic layer and a data access layer. The data access layer (Data Access Layer, DAL) is specifically responsible for model construction and operation of the database, and provides interfaces for the business logic layer to use. That is, in modern Web applications the data access layer controls database reads and writes, and the business logic layer often calls the data access layer dynamically. The dynamic behavior introduced by these calls makes it difficult for static program analysis to accurately track how database query statements are constructed. In particular, because the Web application needs to adapt to different databases, the data access layer usually loads drivers dynamically, so complete database query statements are only assembled at run time. For these reasons, conventional string analysis methods fail in Web applications that introduce a data access layer.

Disclosure of Invention

To address the above problems, the invention provides a static detection method and device for stored XSS vulnerabilities in modern Web applications, particularly suited to mining stored XSS (Stored-XSS) vulnerabilities in Web applications. The method identifies the potential database read and write positions in the source code, thereby helping a static taint analysis tool identify where tainted data is read from and written to the database, and locates stored XSS vulnerabilities in the system after the taint read paths and taint write paths are spliced.

A static detection method for stored XSS vulnerabilities in modern Web applications comprises the following steps:

acquiring the source code of a target program;

obtaining database operation triple information based on the source code, wherein the database operation triple information comprises: an operation action on the database, a table name targeted by the operation action, and a field name targeted by the operation action;

and performing two-stage taint analysis, from a taint source to a database write operation and from a database read operation to a taint sink, based on the database operation triple information, so as to obtain a static detection result for stored XSS vulnerabilities.
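As an illustration only, the database operation triple described above could be represented as a small immutable record. The names `DbAction`, `DbOpTriple`, and `overlaps`, and the `comments` table used in the example, are assumptions introduced here and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum


class DbAction(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class DbOpTriple:
    """Hypothetical record: operation action, table name, field names."""
    action: DbAction
    table: str
    fields: frozenset  # field names touched by the operation

    def overlaps(self, other: "DbOpTriple") -> bool:
        # A write and a read are linked when they hit the same table
        # and share at least one field.
        return (
            {self.action, other.action} == {DbAction.WRITE, DbAction.READ}
            and self.table == other.table
            and bool(self.fields & other.fields)
        )


write = DbOpTriple(DbAction.WRITE, "comments", frozenset({"body", "author"}))
read = DbOpTriple(DbAction.READ, "comments", frozenset({"body"}))
print(write.overlaps(read))
```

Keeping the record hashable (frozen, with a `frozenset` of fields) makes it usable as a dictionary key when write paths and read paths are later grouped by database position.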

Further, obtaining database operation triple information based on the source code comprises:

attempting to extract complete SQL statements from the source code using string analysis;

when a complete SQL statement is obtained, performing semantic analysis on the complete SQL statement to obtain the database operation triple information;

when a complete SQL statement is not obtained, constructing the database model corresponding to the target program and an analyzable form of the program, and obtaining the database operation triple information from the database model and the analyzable form; wherein the analyzable form comprises: a code property graph or CodeQL.
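The branch between exact SQL analysis and the fallback can be pictured with the minimal sketch below. It is not the patent's parser: `triple_from_sql` is a hypothetical helper that handles only simple `INSERT` and `SELECT` statements with regular expressions, and returns `None` to signal that the fuzzy route would be needed.

```python
import re

# Deliberately narrow patterns: a real implementation would use a full SQL parser.
_INSERT = re.compile(r"^\s*INSERT\s+INTO\s+`?(\w+)`?\s*\(([^)]*)\)", re.I)
_SELECT = re.compile(r"^\s*SELECT\s+(.+?)\s+FROM\s+`?(\w+)`?", re.I | re.S)


def _names(cols):
    """Split a comma-separated column list and strip backticks."""
    return [c.strip().strip("`") for c in cols.split(",") if c.strip()]


def triple_from_sql(text):
    """Return (action, table, fields) for a complete statement, else None."""
    m = _INSERT.match(text)
    if m:
        return ("write", m.group(1), _names(m.group(2)))
    m = _SELECT.match(text)
    if m:
        return ("read", m.group(2), _names(m.group(1)))
    return None  # no complete SQL statement: fall back to fuzzy parsing


print(triple_from_sql("INSERT INTO posts (title, body) VALUES (?, ?)"))
print(triple_from_sql("Db::name('posts')->select()"))
```

The second call models a data access layer invocation from which no complete statement can be recovered, which is the case the anchor API and fuzzy parsing steps are meant to cover.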

Further, obtaining database operation triple information from the database model and the analyzable form comprises:

deriving anchor APIs based on the analyzable form, wherein an anchor API is an API whose call sites correspond to source code fragments containing database operation triple information;

and parsing the code at the anchor APIs and comparing it against the database model to obtain the database operation triple information.

Further, deriving anchor APIs based on the analyzable form comprises:

performing forward data-flow analysis from the taint sources and backward data-flow analysis from the taint sinks on the analyzable form, and collecting all API calls that appear on the resulting paths;

removing the irrelevant API calls from all API calls that appear on the paths; wherein the irrelevant API calls include: PHP built-in functions, sanitization functions, and string processing functions;

and computing call frequency statistics for the remaining API calls, and forming the anchor API set according to the call frequency statistics.

Further, parsing the code at the anchor APIs and comparing it against the database model to obtain database operation triple information comprises:

finding the call sites of each anchor API in the source code to obtain the code set corresponding to the anchor API;

performing PHP analysis and SQL analysis on the code set to form the token sequence corresponding to the code set;

and comparing the token sequence with the database model to obtain the position corresponding to the database operation triple information.
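A hedged sketch of how the PHP pass and the SQL pass might turn a collected code fragment into a token sequence follows. The helper names `php_fold` and `sql_tokens`, the built-in list, the `?` and `0` placeholders, and the column names are all assumptions made for illustration; the sketch only covers literal concatenation, simple variables, zero-argument built-ins, the `*` wildcard, and `COUNT(...)`.

```python
import re

_PIECE = re.compile(r"""
    '((?:\\.|[^'\\])*)'      # single-quoted PHP literal
  | "((?:\\.|[^"\\])*)"      # double-quoted PHP literal
  | (\w+)\s*\(\s*\)          # zero-argument call such as time()
  | (\$\w+)                  # PHP variable
""", re.X)

PHP_BUILTINS = {"time", "date", "max"}  # assumed, not an exhaustive list


def php_fold(fragment):
    """PHP pass: join concatenated literals, unescape them,
    and replace variables and built-in calls with placeholders."""
    out = []
    for sq, dq, call, var in _PIECE.findall(fragment):
        if call:
            out.append("0" if call in PHP_BUILTINS else call + "()")
        elif var:
            out.append("?")
        else:
            out.append(re.sub(r"\\(.)", r"\1", sq or dq))
    return "".join(out)


def sql_tokens(sql, columns):
    """SQL pass: tokenize, expand '*' to the table's columns,
    and map COUNT(...) to the type name INT."""
    tokens = []
    for tok in re.findall(r"COUNT\([^)]*\)|\*|\w+", sql, re.I):
        if tok == "*":
            tokens.extend(columns)
        elif tok.upper().startswith("COUNT("):
            tokens.append("INT")
        else:
            tokens.append(tok.lower())
    return tokens


frag = r"'SELECT * FROM posts WHERE author = \'' . $name . '\' AND t < ' . time()"
folded = php_fold(frag)
print(folded)
print(sql_tokens(folded, ["id", "author", "body"]))
```

Running the two passes in this order matters: the SQL keywords and the wildcard only become visible once the PHP string pieces have been joined.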

Further, finding the call sites of each anchor API in the source code to obtain the code set corresponding to the anchor API comprises:

judging whether the anchor API is a chained API;

when the anchor API is a chained API, obtaining the code set corresponding to the anchor API from the call sites that share the same starting identifier of the chain; wherein the starting identifier of the chain comprises: a class name or an object name;

and when the anchor API is a non-chained API, obtaining the function call sites of the anchor API and performing backward data-flow analysis on the parameters of each function call site to obtain the code set corresponding to the anchor API.
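One simple way to approximate the chained versus non-chained decision on a call expression is shown below. This is a textual heuristic assumed for illustration (at least two `::` or `->` links after a leading class or object name); the patent only states that the system judges this from the characteristics of the function, and `chain_start` is a hypothetical name.

```python
import re

_STRINGS = re.compile(r"'[^']*'|\"[^\"]*\"")      # drop string literals first
_START = re.compile(r"^\s*(\$?\w+)\s*(?:::|->)")  # leading class or object name


def chain_start(call_expr):
    """Return the starting identifier of a chained call, or None
    when the expression does not look like a chain."""
    code = _STRINGS.sub("", call_expr)
    links = len(re.findall(r"::|->", code))
    m = _START.match(code)
    return m.group(1) if m and links >= 2 else None


print(chain_start("Db::name('posts')->where('id', 1)->select()"))
print(chain_start("Request::post('title')"))
```

For a chained call, the returned identifier is what the call sites would be grouped by; for a non-chained call, the function name itself would be the anchor.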

Further, comparing the token sequence with the database model to obtain the position corresponding to the database operation triple information comprises:

comparing the token sequence with the database model according to set conditions; wherein the set conditions include: ignoring case, prefixes and suffixes, and the order in which tokens appear; setting a similarity threshold between each token and the information in the database model; checking the field attributes; and checking the length of string-type fields;

and obtaining the position corresponding to the database operation triple information according to the comparison result.

Further, performing two-stage taint analysis, from a taint source to a database write operation and from a database read operation to a taint sink, based on the database operation triple information to obtain a static detection result for stored XSS vulnerabilities comprises:

finding all taint propagation paths from the taint sources to the database write positions, and obtaining the set of write taint propagation paths by analyzing whether a taint source can propagate taint to a database position that it can contaminate;

finding all taint propagation paths from the database read positions to the taint sinks, locating through analysis the positions where tainted data is read from the database, and judging whether each such position can reach some taint sink, to obtain the set of read taint propagation paths;

and matching each read taint propagation path with each write taint propagation path according to their database read and write positions, thereby obtaining the static detection result for stored XSS vulnerabilities.

A static detection device for stored XSS vulnerabilities in modern Web applications comprises:

a source code acquisition module, configured to acquire the source code of a target program;

a source code analysis module, configured to obtain database operation triple information based on the source code, wherein the database operation triple information comprises: an operation action on the database, a table name targeted by the operation action, and a field name targeted by the operation action;

and a taint analysis module, configured to perform two-stage taint analysis, from a taint source to a database write operation and from a database read operation to a taint sink, based on the database operation triple information, so as to obtain a static detection result for stored XSS vulnerabilities.

A computer device, comprising: a processor and a memory storing computer program instructions; the processor executes the computer program instructions to implement any of the static detection methods for stored XSS vulnerabilities in modern Web applications described above.

Compared with the prior art, the invention does not need to obtain complete database query statements from the code. Instead, it finds the entry positions of the data access layer directly in the PHP code and collects code fragments at those positions. The research behind the invention found that the read taint propagation paths and write taint propagation paths can be spliced as long as three elements are found in these code fragments: the operation action on the database (read or write), the table name, and the field name targeted by the operation. The invention therefore defines these three elements as a database operation triple (Database Operate Triple), and refers to the APIs at whose call sites database operation triples can be found as anchor APIs (Anchor Point API). On the other hand, because the code collected at an anchor API usually does not conform to the grammar of SQL statements, the database operation triple information is difficult to obtain through ordinary parsing. The invention therefore processes the code fragments with a fuzzy parsing technique (Fuzzy Parse): the fragments are first processed as PHP code, then processed as SQL code, and finally compared against the database model (database schema) to extract the database operation triple information. The read taint propagation paths and write taint propagation paths are then spliced, and second-order injection vulnerabilities are found.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a flowchart of a static detection method for stored XSS vulnerabilities in a modern Web application, according to an example embodiment.

FIG. 2 is a diagram illustrating code fragment collection and parsing according to an example embodiment.

FIG. 3 is a diagram illustrating the comparison of database operation triple information against the database model, according to an example embodiment.

FIG. 4 shows a database write operation in Catfish, according to an example embodiment.

FIG. 5 shows a database read operation in Catfish, according to an example embodiment.

Detailed Description

Exemplary embodiments will be described in detail below with reference to the accompanying drawings.

Traditional methods cannot obtain complete SQL statements from a target Web application that includes a data access layer, and therefore cannot find second-order injection vulnerabilities in it. To address this problem, the invention provides a static detection method for stored XSS vulnerabilities in modern Web applications. FIG. 1 is a flowchart of the method; the specific implementation steps are as follows.

The first stage is the target program processing stage. It processes the target program, extracts abstract syntax tree (Abstract Syntax Tree, AST) information, control flow information, data flow information and the like from the code, and converts them into a form that can be analyzed later (the system currently uses a code property graph; other basic analysis tools such as CodeQL can also be used). This stage also obtains the database model corresponding to the target program, namely all table names, field names, and field attributes in the database.

The second stage is the database operation triple derivation stage. Database operation triples are the key to linking the read paths and write paths of taint propagation. A triple contains three key pieces of information about a database operation: the target table name, the target field name, and the specific operation (read or write). In the taint analysis stage, the system uses the database operation triple information to join the taint propagation read and write paths that target the same database position, so as to find stored XSS. To identify database operation triples accurately, the system first determines whether complete SQL statements can be obtained directly from the target source code. If they can, the target program accesses the database through direct queries, so exact semantic analysis is applied: a relatively precise string analysis technique extracts the SQL statements, and the information about the database operations is parsed from them. If complete SQL statements cannot be obtained, the system uses the fuzzy parsing technique to obtain the database operation triple information. With this technique, the derivation of database operation triple information is divided into two steps: anchor API derivation and fuzzy parsing. The details are as follows:

Step one is anchor API derivation. Observation shows that database operation triple information is usually contained in the target source code. To obtain it, the source code fragments containing such information must first be collected from the target. To this end, the system first finds the API call sites in the target code at which this source code information can be gathered, that is, the anchor API set. An anchor API is usually an API call interface that the data access layer provides to the business logic layer. Depending on how the data access layer is implemented, anchor APIs can be further divided into chained-call APIs and non-chained-call APIs. A data access layer with chained-call APIs is accessed through an object, and the whole object needs to be marked as an anchor API; a non-chained API is accessed directly through a function, and that function needs to be marked as the anchor API. The system judges whether a call is chained according to the characteristics of the function. The detailed process is as follows:

(1) Because calls to anchor APIs usually appear on taint propagation paths, forward data-flow analysis is performed from the taint source set of the target project and backward data-flow analysis is performed from its taint sink set, so as to collect all API calls that appear on the paths.

(2) The function body of each API in the collected API set is analyzed, and all irrelevant APIs are removed, including PHP built-in functions, sanitization functions, string processing functions and the like.

(3) Call frequency statistics are computed for the remaining APIs, and the APIs with the highest frequencies are taken as potential anchor APIs according to a set threshold (the default threshold is 3), forming the potential anchor API set.

(4) Each anchor API is judged to be either a chained API or a non-chained API.
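Steps (2) and (3) above can be sketched as a filter followed by a frequency count. The sketch assumes the data-flow analysis has already produced, for each path, the list of API names on it. The function name `derive_anchor_apis`, the contents of the irrelevant sets, and the reading of the threshold as "keep the top 3 by frequency" are assumptions; the text could also be read as a minimum call count.

```python
from collections import Counter

# Assumed examples of the three irrelevant categories named in step (2).
PHP_BUILTINS = {"time", "date", "max", "count"}
SANITIZERS = {"htmlspecialchars", "strip_tags"}
STRING_FUNCS = {"trim", "substr", "str_replace"}
IRRELEVANT = PHP_BUILTINS | SANITIZERS | STRING_FUNCS


def derive_anchor_apis(path_calls, top_k=3):
    """path_calls: one list of API names per data-flow path.
    Returns the top_k most frequently called relevant APIs."""
    counts = Counter(
        api
        for calls in path_calls
        for api in calls
        if api not in IRRELEVANT
    )
    return [api for api, _ in counts.most_common(top_k)]


paths = [
    ["Request::post", "trim", "Db::name->insert"],
    ["Request::post", "htmlspecialchars", "Db::name->insert"],
    ["Db::name->select", "substr", "View::assign"],
    ["Request::post", "Db::name->select"],
]
print(derive_anchor_apis(paths))
```

In this toy input `View::assign` (a made-up name) appears only once and is dropped, while the three APIs that recur across paths are kept as potential anchors.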

Step two is the fuzzy parsing stage. After the anchor API set is obtained, the system analyzes all call sites of each anchor API in the target system to obtain the source code fragments required for parsing. The process is shown in FIG. 2, and the detailed steps are as follows.

(1) The call sites of each anchor API are found in the target code. For a chained API, the call sites that share the same starting identifier of the chain (class name or object name) are found; otherwise the function call sites are found directly. Backward data-flow analysis is then performed on the parameters to obtain a code set within a single procedure.

(2) Fuzzy parsing is performed on each code set. Because two target languages must be handled, namely the back-end language of the Web application (here PHP) and the SQL language used for database queries, the two languages are analyzed separately.

First, PHP code analysis is carried out. The purpose of this stage is to analyze and process the PHP-related syntactic and semantic information in the code set. The analysis identifies the PHP string concatenation operators in the code (such as the dot and the plus sign) and performs the concatenation, identifies escape sequences and restores the escaped characters, and removes built-in functions or converts them into concrete values (such as time(), date(), max(), etc.).

Next, SQL analysis is carried out, which identifies wildcards and analyzes SQL built-in functions. In the example of FIG. 2, PHP analysis is performed on the collected code (a): the strings are joined at the dot operator, the single quotes are unescaped, and the built-in function time() is removed, giving the code set (b). SQL analysis is then performed on the code set (b): the wildcard "*" is identified and converted into the names of all columns in the table, and the built-in function COUNT() is analyzed and converted into the integer type (INT).

(3) Finally, the code set produced by PHP analysis and SQL analysis is turned into a token sequence and compared with the database model (FIG. 3) to find the position of the database operation triple. During the comparison, the system ignores case, prefixes and suffixes, and the order in which tokens appear, and sets a similarity threshold between each token and the information in the database model (an edit distance of 1). The field attributes are also checked: the system treats only fields of string type with a length of at least 8 as possible storage positions for a payload. Database operation positions that do not meet these conditions are automatically ignored. At the end of this stage, the system outputs all database operation triples found in the target program.
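The comparison rules in step (3) can be illustrated with the following sketch, which matches tokens against the fields of one table. The table prefix `cf_`, the set of string type names, and the function names are assumptions for the example; the edit distance bound of 1 and the minimum length of 8 are the values stated above.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalize(name, prefixes=("cf_",)):
    """Ignore case and strip an assumed table or column prefix."""
    name = name.lower()
    for p in prefixes:
        if name.startswith(p):
            name = name[len(p):]
    return name


STRING_TYPES = {"varchar", "char", "text"}  # assumed type names


def match_fields(tokens, table_schema, max_dist=1, min_len=8):
    """table_schema maps field name -> (type, length).
    Token order is ignored; only long enough string fields qualify."""
    hits = []
    for field, (ftype, length) in table_schema.items():
        if ftype not in STRING_TYPES or length < min_len:
            continue
        target = normalize(field)
        if any(edit_distance(normalize(t), target) <= max_dist for t in tokens):
            hits.append(field)
    return hits


schema = {
    "id": ("int", 11),
    "cf_title": ("varchar", 255),
    "code": ("char", 4),
    "body": ("text", 65535),
}
print(match_fields(["INSERT", "Title", "bodi", "code"], schema))
```

Here `id` is skipped because it is not a string field and `code` because it is shorter than 8 characters, which mirrors the rule that such positions are ignored as payload storage locations.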

The third stage is taint analysis. Based on the collected database operation triple information, the system performs two-stage taint analysis, from a taint source to a database write operation (write taint propagation path) and from a database read operation to a taint sink (read taint propagation path), and finally discovers vulnerabilities. The detailed steps are as follows:

(1) All taint propagation paths from the taint sources to the database write positions are found. The inputs are the taint sources and the set of database write positions. By analyzing whether a taint source can propagate taint to a database position that it can contaminate, the set of all possible write taint propagation paths is output.

(2) All taint propagation paths from the database read positions to the taint sinks are found. The inputs are the taint sinks and the database read position information. The positions where tainted data is read from the database are found through analysis, each position is checked for whether it can reach some taint sink, and finally the set of all possible read taint propagation paths is output.

(3) After the two rounds of taint analysis, the invention matches each read taint propagation path with each write taint propagation path according to their database read and write positions, thereby finding the complete second-order injection taint propagation paths in the target code.
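The matching in step (3) amounts to a join of write paths and read paths on their database position. A minimal sketch follows, assuming each path has already been reduced to an identifier plus the table and fields of its triple. The tuple layout and the name `splice_paths` are hypothetical; the marker `all_fields` is borrowed from the worked example later in this description, where it denotes a wildcard read.

```python
from collections import defaultdict


def splice_paths(write_paths, read_paths):
    """Each path is (path_id, table, fields).
    Returns (write_id, read_id, table, shared_fields) for every pair
    that writes and reads at least one common field of the same table."""
    by_table = defaultdict(list)
    for wid, table, fields in write_paths:
        by_table[table].append((wid, set(fields)))

    findings = []
    for rid, table, fields in read_paths:
        for wid, wfields in by_table.get(table, []):
            # A wildcard read touches every written field of the table.
            shared = wfields if "all_fields" in fields else wfields & set(fields)
            if shared:
                findings.append((wid, rid, table, sorted(shared)))
    return findings


writes = [("w1", "posts", ["title", "body"]), ("w2", "users", ["name"])]
reads = [("r1", "posts", ["all_fields"]), ("r2", "users", ["email"])]
print(splice_paths(writes, reads))
```

Each finding corresponds to one candidate second-order path: the write path up to the database, followed by the read path from the database to a sink.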

In summary, the invention does not need to obtain complete database query statements from the code. Instead, it finds the entry positions of the data access layer directly in the PHP code and collects code fragments at those positions. The research behind the invention found that the read taint propagation paths and write taint propagation paths can be spliced as long as three elements are found in these code fragments: the operation action on the database (read or write), the table name, and the field name targeted by the operation. The invention therefore defines these three elements as a database operation triple (Database Operate Triple), and refers to the APIs at whose call sites database operation triples can be found as anchor APIs (Anchor Point API). On the other hand, because the code collected at an anchor API usually does not conform to the grammar of SQL statements, the database operation triple information is difficult to obtain through ordinary parsing. The invention therefore processes the code fragments with a fuzzy parsing technique (Fuzzy Parse): the fragments are first processed as PHP code, then processed as SQL code, and finally compared against the database model (database schema) to extract the database operation triple information. The read taint propagation paths and write taint propagation paths are then spliced, and second-order injection vulnerabilities are found.

Take the real PHP project Catfish 5.4.0 as an example. After program processing produces the code property graph and the database index, the system performs forward slicing from the taint sources in the code and backward slicing from the taint sinks, gathers all APIs for frequency statistics, and analyzes the potential anchor APIs. The potential anchor APIs finally obtained include Db::name->select, Db::name->insert, and Request::post. Further analysis finds that the first two anchor APIs are called in a chained style and that the entry class for data access is Db, so the Db class and Request::post() are taken as anchor APIs. Next, the system gathers code sets from the anchor APIs, for example at the call site of forum_db->query_build, resulting in the following code fragments:

The system performs backward slicing from the call site, converts the obtained code into list form, and stores it as a code set. The target code set is then fuzzy parsed. First, PHP processing identifies the array symbols "[", "]" and "=>" and the string concatenation operator ".", and merges the strings. Then MySQL processing identifies the SQL keywords "insert" and "select", and converts the wildcard "*" into all_fields (that is, all fields). After comparison with the database model, the system finds that the database operation triple in the code of FIG. 4 is:

1. Table name: terminal

2. Field names: pid, jname, jvalue, time

3. Operation: write

In the code of FIG. 5, the system finds the following database operation triple information:

1. Table name: terminal

2. Field name: all_fields

3. Operation: read

This completes the identification of the database operation triple information. The information is stored and later read by the taint analysis stage. In the taint analysis stage, the system combines the taint sources, the taint sinks, the sanitization functions, and the database operation triple information provided by the previous stage to analyze the controllable write taint propagation paths and the read taint propagation paths. Finally, the results are integrated: using the collected database operation triples as the basis, the write taint propagation paths and read taint propagation paths that share the same table name and field name are spliced, and the second-order injection vulnerabilities are found.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. The specification and embodiments are to be regarded as exemplary only. The disclosure is not limited to the exact construction illustrated and described above, and various modifications and changes may be made without departing from its scope.

Claims

1. A static detection method for stored XSS vulnerabilities in modern Web applications, characterized by comprising the following steps: acquiring the source code of a target program;

2. The method of claim 1, wherein obtaining database operation triple information based on the source code comprises:

3. The method of claim 2, wherein obtaining database operation triple information from the database model and the analyzable form comprises:

4. The method of claim 3, wherein deriving anchor APIs based on the analyzable form comprises:

5. The method of claim 3, wherein parsing the code at the anchor APIs and comparing it against the database model to obtain database operation triple information comprises:

6. The method of claim 5, wherein finding the call sites of each anchor API in the source code to obtain the code set corresponding to the anchor API comprises: judging whether the anchor API is a chained API;

7. The method of claim 5, wherein comparing the token sequence with the database model to obtain the position corresponding to the database operation triple information comprises:

8. The method of claim 1, wherein performing two-stage taint analysis, from a taint source to a database write operation and from a database read operation to a taint sink, based on the database operation triple information to obtain a static detection result for stored XSS vulnerabilities comprises:

9. A static detection device for stored XSS vulnerabilities in modern Web applications, the device comprising: a source code acquisition module, configured to acquire the source code of a target program;

10. A computer device, comprising: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements the static detection method for stored XSS vulnerabilities in modern Web applications as claimed in any one of claims 1-8.