CN111966346B - Taint analysis method and device of application system - Google Patents

Publication number
CN111966346B
Authority
CN
China
Prior art keywords
variable
call
graph
pruning
taint analysis
Prior art date
Legal status
Active
Application number
CN202010938928.XA
Other languages
Chinese (zh)
Other versions
CN111966346A (en)
Inventor
王杰
吴云广
周刚
于一鸣
郭振宇
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010938928.XA
Publication of CN111966346A
Application granted
Publication of CN111966346B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/34 Graphical or visual programming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/443 Optimisation

Abstract

Embodiments of the present specification provide a method and an apparatus for taint analysis of an application system. In the method, a control flow graph is generated from a call relation graph that is built from the application layer code in the program code of the application system using a first call relation construction algorithm. Taint analysis is then performed on the program code of the application system using the control flow graph. When the taint analysis result indicates that a call statement has no edge relation in the call relation graph, a second call relation construction algorithm is used to expand the edge relation for that call statement in the call relation graph.

Description

Taint analysis method and device of application system
Technical Field
The embodiments of the present disclosure generally relate to the field of security, software engineering, software compilation or program analysis, and more particularly, to a taint analysis method and apparatus for an application system.
Background
In recent years, the industry has increasingly demanded static taint analysis techniques, particularly taint analysis tools with high scalability and accuracy. Taint analysis techniques can help the industry track data propagation links, thereby solving data problems in many complex scenarios, such as privacy disclosure, asset analysis, change management and control, data consistency, and the like.
In view of this, the taint analysis tools Flowdroid and Ptaint have been proposed. However, significant performance and accuracy issues are faced when applying the taint analysis tools Flowdroid and Ptaint to industrial application processes.
Disclosure of Invention
In view of the foregoing, embodiments of the present specification provide a taint analysis method and apparatus for an application system. By using the taint analysis method and the taint analysis device, the performance and the precision of taint analysis can be improved.
According to an aspect of embodiments herein, there is provided a method for taint analysis of an application system, comprising: generating a control flow graph from a call relation graph, the call relation graph being constructed from application layer code in program code of the application system using a first call relation construction algorithm; performing taint analysis on program code of an application system using the control flow graph; and when the taint analysis result indicates that the calling statement does not have an edge relationship in the calling relationship graph, using a second calling relationship construction algorithm to expand the edge relationship for the calling statement in the calling relationship graph.
Optionally, in one example of the above aspect, using a second call relation construction algorithm to expand an edge relation for the call statement in the call relation graph comprises: using the second call relation construction algorithm to expand the edge relation for the call statement in both the call relation graph and the control flow graph. The method further comprises: generating a data flow graph according to the expanded control flow graph.
Optionally, in an example of the above aspect, generating a dataflow graph from the extended control flow graph includes: constructing a candidate data propagation path according to the expanded control flow graph; pruning the candidate data propagation paths; and generating the data flow graph based on the candidate data propagation path after pruning.
Optionally, in one example of the above aspect, the nodes of the dataflow graph include code fields or database fields, and the edge relationships represent data flow directions between the fields.
Optionally, in an example of the above aspect, the taint analysis includes taint analysis considering context information of call points, and pruning the candidate data propagation paths includes: pruning the candidate data propagation paths based on call site context information.
Optionally, in one example of the above aspect, the call site context information includes variable usage information of variables that can be accessed across procedures.
Optionally, in an example of the above aspect, pruning the candidate data propagation paths based on the call site context information includes: determining whether there is a variable usage information inconsistency in the candidate data propagation paths based on variable usage information for the cross-range accessible variables; and when determining that the variable use information in the candidate data propagation path is inconsistent, pruning the candidate data propagation path.
Optionally, in an example of the above aspect, the taint analysis includes taint analysis considering a relationship between a variable usage point and a variable definition point, and pruning the candidate data propagation paths includes: pruning the candidate data propagation paths based on a relationship between the variable usage points and the variable definition points.
Optionally, in an example of the above aspect, pruning the candidate data propagation paths based on a relationship between the variable usage point and the variable definition point includes: when a first variable in the program code is assigned as a heap variable, searching an alias variable of the first variable based on a relation between a variable using point and a variable defining point, wherein the first variable is a local variable used in a process; judging whether the alias variable is used in subsequent program codes; and pruning the candidate data propagation paths associated with the alias variable when the alias variable is not used in subsequent program code.
Optionally, in an example of the above aspect, pruning the candidate data propagation paths based on a relationship between the variable usage point and the variable definition point includes: when a first variable in a program code is assigned as a heap variable, mapping the first variable to a second variable in a calling context or a calling method, wherein the first variable is a local variable used in a process; searching for an alias variable of the second variable based on a relationship between a variable usage point and a variable definition point; judging whether the alias variable is used in subsequent program codes; and pruning the candidate data propagation paths associated with the alias variable when the alias variable is not used in subsequent program code.
Optionally, in an example of the above aspect, an accuracy of the first call relation construction algorithm is higher than an accuracy of the second call relation construction algorithm, and a performance of the first call relation construction algorithm is lower than a performance of the second call relation construction algorithm.
Optionally, in one example of the above aspect, the program code includes code-converted program code.
Optionally, in one example of the above aspect, the transcoding includes: performing packet supplementing processing, class replacement and/or method replacement on the program codes; inserting a statement near the method call; and/or inserting statements at the beginning or end of a method.
Optionally, in one example of the above aspect, the taint analysis comprises a slice-based parallel taint analysis.
Optionally, in an example of the above aspect, the slice includes a slice sliced based on a program entry point or a slice sliced based on a contamination start point.
Optionally, in one example of the above aspect, the method further comprises: determining a program entry point, a contamination start point, and a contamination end point of the program code.
According to another aspect of embodiments herein, there is provided an apparatus for taint analysis of an application system, comprising: a control flow graph generating unit that generates a control flow graph from a call relation graph that is constructed from application layer code in program code of the application system by using a first call relation construction algorithm; a taint analysis unit that performs taint analysis on program codes of an application system using the control flow graph; and the edge relation expansion unit is used for expanding the edge relation for the calling statement in the calling relation graph by using a second calling relation construction algorithm when the taint analysis result indicates that the calling statement does not have the edge relation in the calling relation graph.
Optionally, in an example of the above aspect, when the taint analysis result indicates that an edge relationship does not exist in the calling relationship graph for a calling statement, the edge relationship extension unit uses a second calling relationship construction algorithm to extend an edge relationship for the calling statement in the calling relationship graph and the control flow graph, and the apparatus further includes: and the data flow graph generating unit generates a data flow graph according to the expanded control flow graph.
Optionally, in an example of the above aspect, the dataflow graph generating unit includes: the data propagation path construction module is used for constructing a candidate data propagation path according to the expanded control flow graph; the pruning processing module is used for carrying out pruning processing on the candidate data propagation paths; and the data flow graph generating module is used for generating the data flow graph based on the candidate data propagation path after pruning.
Optionally, in one example of the above aspect, the nodes of the dataflow graph include code fields or database fields, and the edge relationships represent data flow directions between the fields.
Optionally, in an example of the above aspect, the taint analysis includes taint analysis that considers call point context information, and the pruning processing module prunes the candidate data propagation paths based on the call point context information.
Optionally, in an example of the above aspect, the call site context information includes variable usage information of variables that can be accessed across procedures, and the pruning processing module: determines, based on the variable usage information of the cross-range accessible variables, whether the variable usage information in a candidate data propagation path is inconsistent; and prunes the candidate data propagation path when the variable usage information in it is determined to be inconsistent.
Optionally, in an example of the above aspect, the taint analysis includes taint analysis that considers relationships between variable usage points and variable definition points, and the pruning processing module prunes the candidate data propagation paths based on the relationships between the variable usage points and the variable definition points.
Optionally, in an example of the above aspect, the pruning processing module: when a first variable in the program code is assigned as a heap variable, searching an alias variable of the first variable based on a relation between a variable using point and a variable defining point, wherein the first variable is a local variable used in a process; judging whether the alias variable is used in subsequent program codes; and pruning the candidate data propagation paths associated with the alias variable when the alias variable is not used in subsequent program code.
Optionally, in an example of the above aspect, the pruning processing module: when a first variable in a program code is assigned as a heap variable, mapping the first variable to a second variable in a calling context or a calling method, wherein the first variable is a local variable used in a process; searching for an alias variable of the second variable based on a relationship between a variable usage point and a variable definition point; judging whether the alias variable is used in subsequent program codes; and pruning the candidate data propagation paths associated with the alias variable when the alias variable is not used in subsequent program code.
Optionally, in an example of the above aspect, the apparatus further comprises: a transcoding unit that transcodes an application framework of the application system, the taint analysis unit performing taint analysis on the transcoded program code using the control flow graph.
Optionally, in an example of the above aspect, the apparatus further comprises: an element determination unit that determines a program entry point, a contamination start point, and a contamination end point of the program code.
Optionally, in an example of the above aspect, the apparatus further comprises: a slice processing unit that slices the analysis task of the taint analysis, wherein the taint analysis comprises slice-based parallel taint analysis.
According to another aspect of embodiments of the present specification, there is provided an electronic apparatus including: at least one processor, and a memory coupled with the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform a taint analysis method as described above.
According to another aspect of embodiments of the present description, there is provided a machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform a taint analysis method as described above.
Drawings
A further understanding of the nature and advantages of the present disclosure may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.
Fig. 1 shows an example schematic diagram of a privacy data disclosure process.
FIG. 2 shows an example schematic of a taint analysis process for a Flowdroid-based application system.
FIG. 3 illustrates an example flow diagram of a method for taint analysis of procedure calls of an application system in accordance with embodiments of the present description.
Fig. 4A and 4B illustrate example schematic diagrams of program code of a transcoded application according to an embodiment of the present specification.
FIG. 5 illustrates an example schematic of a process for performing taint analysis on program code of an application according to an embodiment of the present description.
FIG. 6 illustrates an example flow diagram of a process of generating a data flow graph according to embodiments of the present specification.
FIG. 7 illustrates an example diagram of a pruning process based on call site context information in accordance with an embodiment of the present specification.
Fig. 8 is a flowchart showing one example of a pruning processing procedure based on the relationship between the variable use point and the variable definition point according to an embodiment of the present specification.
FIG. 9 illustrates an example of code analyzed within a process according to an embodiment of the present specification.
Fig. 10 is a flowchart showing another example of a pruning processing procedure based on the relationship between the variable use point and the variable definition point according to an embodiment of the present specification.
FIG. 11 shows an example of code for inter-process analysis according to an embodiment of the present description.
FIG. 12 illustrates an example flow diagram of an apparatus for taint analysis of an application system in accordance with embodiments of the present description.
Fig. 13 is a block diagram illustrating an implementation example of a dataflow graph generating unit according to an embodiment of the present specification.
Fig. 14 shows a block diagram of an implementation example of a pruning processing module according to an embodiment of the present description.
Fig. 15 shows a block diagram of another implementation example of the pruning processing module according to an embodiment of the present specification.
FIG. 16 shows a schematic diagram of an electronic device for implementing a taint analysis process for an application system, according to an embodiment of the present description.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and thereby implement the subject matter described herein, and are not intended to limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as needed. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may also be combined in other examples.
As used herein, the term "include" and its variants are open-ended terms meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same object. Other definitions, whether explicit or implicit, may be included below. The definition of a term is consistent throughout the specification unless the context clearly dictates otherwise.
In industrial applications, a single application contains a large number of inter-procedure calls (e.g., a service layer method calls a DAO layer interface to obtain data from a database), and there are also service calls between applications (e.g., service calls made via RPC or REST); that is, data in a single application can be propagated to other applications by means of service calls. Data security issues, such as privacy disclosure and asset damage, can arise if the data is used illegally by the calling application.
Fig. 1 shows an example schematic diagram of a privacy data disclosure process. As shown in Fig. 1, it is assumed that the data column IDCard in the database owned by application app_1 is labeled as private data. In response to a remote procedure call from application app_2, the private data of the data column IDCard is retrieved from the database and sent to application app_2 via several translation layers (POJO translation layers). Application app_2 further exposes the private data to other applications. Finally, application app_n obtains this private data, stores it as the data column IDinfo in its own database, and shows it to the user. In this case, if the user of application app_n does not know that the data column IDinfo is derived from the private data IDCard, there is a security risk that the private data is misused.
In view of the foregoing, data propagation during procedure calls of an application system needs to be traced and analyzed. Taint analysis techniques are widely used for this kind of data propagation tracking. Taint analysis refers to techniques that analyze how data propagates through a program. It is an important means of analyzing privacy disclosure and code bugs in the field of data security, and is widely applied in security and software engineering. The taint analysis process mainly comprises three aspects: pollution source marking, pollution propagation rule specification, and taint propagation. A pollution source is untrusted data, such as user-sensitive data or untrusted external input. A pollution propagation rule is an inference rule that specifies how polluted data spreads according to the semantics of program instructions and functions. For example, given a = source, b = a, and sink = b, the data reaching sink is affected by the pollution of the variable source. Taint analysis techniques include static taint analysis and dynamic taint analysis.
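The propagation rule in the example above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the program is a straight-line list of assignments, and the function name propagate_taint and the (target, operands) encoding are hypothetical, not part of the described method.

```python
def propagate_taint(assignments, sources):
    """Return the set of tainted variable names after all assignments.

    assignments: ordered list of (target, [operands]) pairs; ("b", ["a"])
                 stands for the statement b = a (hypothetical encoding).
    sources:     names that are marked as pollution sources up front.
    """
    tainted = set(sources)
    for target, operands in assignments:
        if any(op in tainted for op in operands):
            tainted.add(target)      # taint flows from right side to left side
        else:
            tainted.discard(target)  # target is overwritten with clean data
    return tainted


# a = source; b = a; sink = b
program = [("a", ["source"]), ("b", ["a"]), ("sink", ["b"])]
tainted = propagate_taint(program, {"source"})
```

Here the name sink ends up in the tainted set, while a variable that is reassigned from clean data before reaching the sink would not.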
Taint analysis involves three elements: a contamination start point (Source), a contamination end point (Sink), and a program analysis entry (Entry Point). In the taint analysis process, a call relation graph (Call Graph) of the calls between procedures (functions) needs to be built starting from the program analysis entry. The Call Graph presents the call relationships between procedures (functions) in a computer program. Nodes in the Call Graph are procedures (e.g., methods) in the program code, and edges (which may also be referred to as "edge relationships") represent the calling relationships between procedures (e.g., methods).
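As a rough illustration of a Call Graph built from an analysis entry, the following Python sketch walks a table of already-resolved call targets starting at the entry points. The table, the method names, and the function build_call_graph are assumptions made for this example; a real tool would have to resolve call targets itself.

```python
from collections import deque


def build_call_graph(calls, entry_points):
    """Collect the methods reachable from the entry points.

    calls:        dict mapping a method name to the list of methods it calls
                  (assumed to be resolved already; hypothetical input).
    entry_points: method names at which the analysis starts.
    Returns a dict node -> list of callees, i.e. nodes and edge relationships.
    """
    graph = {}
    worklist = deque(entry_points)
    while worklist:
        method = worklist.popleft()
        if method in graph:
            continue                      # already visited
        callees = list(calls.get(method, []))
        graph[method] = callees           # one edge per (method, callee)
        worklist.extend(callees)
    return graph


calls = {
    "Controller.get": ["Service.find"],
    "Service.find": ["Dao.query"],
    "Unused.helper": ["Dao.query"],       # never reached from the entry
}
graph = build_call_graph(calls, ["Controller.get"])
```

Methods that cannot be reached from any entry point, such as Unused.helper here, do not appear in the resulting graph.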
Examples of taint analysis techniques may include static taint analysis tools Flowdroid and Ptaint (Doop based). In the taint analysis process of the Flowdroid-based application system, the taint analysis object is the source code or intermediate representation (intermediate code or byte code under Java framework) of a program, so that the explicit flow static analysis in taint propagation can be converted into the analysis for the static data dependency in the program.
In the taint analysis, first, a call relation graph (Call Graph) is constructed for all the program code of the application from the function call relationships in the program. Specific taint analyses are then performed between functions or within functions, depending on the program characteristics. Examples of explicit taint propagation may include, but are not limited to, direct assignment propagation, propagation through function (procedure) calls, propagation through aliases (pointers), and the like.
FIG. 2 shows an example schematic of a taint analysis process for a Flowdroid-based application system. The following description will be given taking the Java program shown in fig. 2 as an example.
In the example shown in FIG. 2, variable b in line 3 program code is the initial taint mark variable. The 4 th line of program codes directly assigns the calculation result of an arithmetic expression containing the variable b to the variable c, and the taint mark can be directly propagated from the variable at the right part of the assignment statement to the variable at the left part of the assignment statement due to the direct assignment relationship between the variable c and the variable b. This taint propagation approach is direct assignment propagation.
The variable c is then passed as an argument to the function foo in line 5 of the program code, whereby the taint mark on c is propagated through the function call to the parameter z of the function foo. The taint mark of z is propagated by direct assignment to x.f in line 8 of the program code. Since the other two parameters x and y of foo are both references to object a, they are aliases of each other. Therefore, the taint mark of x.f can be propagated through the alias to the taint propagation end point (taint convergence point, or taint end point) sink(y.f) in line 9 of the program code, thereby causing a privacy data disclosure.
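The three propagation steps described for FIG. 2 (direct assignment, function call, alias) can be imitated at run time with a small Python analog. This is not the code of FIG. 2, which is a Java program; the classes Box and Tainted, the list leaked, and the literal values are hypothetical and only mirror the names a, b, c, x, y, z, f, foo, and sink used in the description.

```python
class Box:
    """Hypothetical heap object with a single field f."""

    def __init__(self):
        self.f = None


class Tainted(str):
    """Marker type: any value of this type carries a taint mark."""


leaked = []


def sink(value):
    # Record every tainted value that reaches the taint end point.
    if isinstance(value, Tainted):
        leaked.append(value)


def foo(x, y, z):
    x.f = z      # direct assignment: the taint mark of z reaches x.f
    sink(y.f)    # y refers to the same object as x, so y.f is tainted too


a = Box()
b = Tainted("id-card-number")   # initial taint mark
c = Tainted(b + "-copy")        # value derived from b, re-marked by hand
foo(a, a, c)                    # x and y are aliases of a; c flows into z
```

After the call, leaked contains the value derived from b. Calling foo with two distinct Box objects would leave leaked unchanged, which is why alias information matters for the result.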
The Flowdroid-based taint analysis scheme can provide high-precision taint analysis. However, the Flowdroid analysis requires that the entire application be preloaded and the Call Graph be constructed for the program code of the entire application. The code of the application includes application layer code and library file code, and the percentage of library files in an enterprise application is about 92%, but only a small portion of the code will propagate the pollution, and therefore, the taint analysis for library files is mostly unnecessary, resulting in a serious performance problem of the taint analysis based on Flowdroid.
In addition, native methods, libraries, and frameworks (such as the Spring framework) are used in large numbers in enterprise applications. Native methods, missing libraries, and dynamically generated framework code are not visible to static analysis, and this hidden logic is called implicit dependency. In the presence of implicit dependencies, the Call Graph constructed by Flowdroid contains missing edges and false edges, which leads to serious false negatives and false positives and thus to serious correctness problems for Flowdroid-based taint analysis.
Furthermore, in the presence of unbalanced edges, the accuracy of the taint analysis cannot be guaranteed. Moreover, during taint analysis the Flowdroid scheme initiates pointer analysis each time a heap object is assigned, resulting in poor performance.
In view of the foregoing, embodiments of the present specification provide a taint analysis method for an application system. In the method, first, an initial Call Graph is constructed for the application layer code in the program code of the application system using a first call relation construction algorithm. Next, a control flow graph is generated from the initial Call Graph, and taint analysis is performed on the program code using the control flow graph. When the analysis finds that a call statement has no edge relation in the initial Call Graph, a second call relation construction algorithm is used to expand the edge relation for that call statement in the initial Call Graph, thereby obtaining a final Call Graph. With this taint analysis method, a smaller-scale Call Graph is constructed only from the application layer code and the necessary part of the library file code, and taint analysis is performed on this smaller Call Graph, so the workload of the taint analysis is greatly reduced and its performance is improved.
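The on-demand expansion described above might be sketched as follows. The sketch assumes that the initial Call Graph is given as a mapping from call sites to callees and that the second algorithm can be queried per call site; the names expand_missing_edges and fallback_resolver, the call-site labels, and the lookup table standing in for the second algorithm are all hypothetical.

```python
def expand_missing_edges(call_sites, initial_graph, fallback_resolver):
    """Extend the call graph only where the analysis finds no edge.

    call_sites:        call statements met during taint analysis, in order.
    initial_graph:     dict call_site -> set of callees, built in advance
                       over application layer code by the first algorithm.
    fallback_resolver: stand-in for the second algorithm; maps a call site
                       to a set of callees.
    Returns the final graph and the list of call sites that were expanded.
    """
    graph = {site: set(callees) for site, callees in initial_graph.items()}
    expanded = []
    for site in call_sites:
        if not graph.get(site):                 # no edge relation recorded
            graph[site] = set(fallback_resolver(site))
            expanded.append(site)
    return graph, expanded


initial = {"Service.find:12": {"Dao.query"}}


def fallback(site):
    # Hypothetical table in place of a real second construction algorithm.
    table = {"Dao.query:30": {"JdbcTemplate.query"}}
    return table.get(site, set())


final_graph, expanded = expand_missing_edges(
    ["Service.find:12", "Dao.query:30"], initial, fallback
)
```

Only the call site that lacked an edge triggers the fallback, so library code is pulled into the graph where the analysis actually needs it rather than up front.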
The term "taint analysis" refers in a narrow sense to taint analysis on data of interest. In this specification, the term "taint analysis" should be interpreted broadly as taint analysis with respect to all data involved in program code or all accessed data. Furthermore, in this specification, the term "contamination" may be used interchangeably with "data dissemination". In addition, in this specification, the term "application system" may also be understood as "application", "application program", or "system in which application programs are installed".
A method and apparatus for taint analysis for an application system according to embodiments of the present specification will be described in detail below with reference to the accompanying drawings.
FIG. 3 illustrates an example flow diagram of a method 300 for taint analysis of an application system in accordance with embodiments of the present description.
As shown in FIG. 3, at step 310, the application framework of the application system is transcoded. That is, at the application framework level, the program code of the application system is transcoded. In one example, the program code may be program source code or intermediate code obtained by compiling the program source code. For example, in the case where the program source code is Java code, the intermediate code may be bytecode. Transcoding may include, for example: performing packet supplementing processing, class replacement and/or method replacement on the program code; inserting a statement near a method call; and/or inserting statements at the beginning or end of a method. Transcoding the application framework of the application system makes it possible to simulate the functions of the application framework, to simulate the contamination start point and the contamination end point, and to call the callback functions of implicit methods. In one example, the program code may be transcoded according to a configuration file. Fig. 4A and 4B illustrate example schematic diagrams of program code of a transcoded application according to an embodiment of the present specification. For example, the line 19 code inserted in FIG. 4B is an example of inserting a statement at the beginning or end of a method. In FIG. 4B, the type associated with the annotated variable in the xml configuration file is SampleServiceImpl, so the inserted line 19 code initializes the variable to a SampleServiceImpl instance. At the intermediate code level, the line 19 code is inserted at the beginning of the default constructor of ServerFacadeImple, that is, at the beginning of a method. Further, in FIG. 4B, statements are inserted before and after the method call in line 4 to mark the method input as a taint start point, or source (line 3 code), and to mark the return result as a taint end point, or sink (line 5 code).
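Inserting statements around a method call can be pictured with a toy transformation over a list of statement strings. This is a sketch only: real transcoding works on source code or bytecode, and the marker names markSource and markSink as well as the function wrap_call_with_markers are hypothetical.

```python
def wrap_call_with_markers(statements, call_index, argument, result):
    """Return a new statement list with marker statements around one call.

    statements: method body as a list of statement strings (toy model).
    call_index: position of the call statement to wrap.
    argument:   variable passed into the call, marked as taint start point.
    result:     variable holding the return value, marked as taint end point.
    """
    before = f"markSource({argument})"   # hypothetical marker statement
    after = f"markSink({result})"        # hypothetical marker statement
    return (
        statements[:call_index]
        + [before, statements[call_index], after]
        + statements[call_index + 1:]
    )


body = ["r = facade.getSamplesByName(name)", "return r"]
transcoded = wrap_call_with_markers(body, 0, "name", "r")
```

The original list is left untouched and the transformed body carries one marker before and one after the call.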
After the program code is transcoded as described above, the various elements of the taint analysis, i.e., the program entry point, the taint start point, and the taint end point, are determined at 320. The determination of the entry point of the program is very simple. Typically, accessible from the outside [inline image BDA0002672944850000101, text not recoverable] Methods and controller methods are identified as program entry points. Program entry points are used to build the Call Graph of inter-procedure (function) calls. The input data may be considered as a contamination start point (Source) and the output data may be considered as a contamination end point (Sink). Examples of input data may include: input parameters for program entry points, return values for remote procedure calls, fields retrievable by the database. Examples of output data may include: return values for program entry points, parameters for remote procedure calls, and fields that may be saved to a database. In one example of the present specification, the field includes a code field or a database field. For example, field "id" in "source.id" and "dto.id" in line 9 in fig. 4A.
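The input/output classification above can be written down as a small rule table. The sketch below covers only entry points and remote procedure calls and leaves database fields out; the Method record, its kind values, and the label format are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Method:
    name: str
    params: list
    kind: str = "internal"   # "entry", "remote", or "internal" (hypothetical)


def sources_and_sinks(methods):
    """Classify parameters and return values as taint start or end points."""
    sources, sinks = [], []
    for m in methods:
        if m.kind == "entry":
            # Data enters through the parameters and leaves via the result.
            sources += [f"{m.name}#param:{p}" for p in m.params]
            sinks.append(f"{m.name}#return")
        elif m.kind == "remote":
            # Data leaves through the arguments and enters via the reply.
            sources.append(f"{m.name}#return")
            sinks += [f"{m.name}#param:{p}" for p in m.params]
    return sources, sinks


methods = [
    Method("Facade.getUser", ["userId"], "entry"),
    Method("RemoteClient.fetch", ["query"], "remote"),
    Method("Util.format", ["text"]),
]
sources, sinks = sources_and_sinks(methods)
```

Note the symmetry: what is a source at a program entry point (its parameters) is a sink at a remote procedure call, and the reverse holds for return values.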
However, in some cases there is no explicit statement or method that can be used to configure the taint start point or the taint end point. For example, the return value of getSamplesByName at line 22 of FIG. 4B is tainted and sent out, and the name or ID field of each element in the list should be marked as a taint end point. To do this, a method getSamplesByNameWrap that calls the above method is simulated at lines 2-14. Before the call, an injected statement marks the variable as a taint end point (line 3), and after the call, injected statements simulate tainted data being written to a taint end point (lines 5 to 12).
After the program entry point, the taint start point, and the taint end point are obtained as described above, at 330, a Call Graph (i.e., the initial Call Graph) is constructed for the application layer code using a first call relation construction algorithm. The first call relation construction algorithm is selected for accuracy alone; for example, a high-accuracy algorithm such as the Spark algorithm may be selected without regard to its performance.
It is noted that operations 310-330 may be performed in advance. After the initial Call Graph is obtained as above, a taint analysis process is performed for the application system, for example an inter-procedural taint analysis process.
At 340, a control flow graph is generated from the initial Call Graph. A control flow graph is an abstract representation of a procedure, typically used in compilers and static analysis, and represents all paths that a program may traverse during execution. In embodiments of the present description, the control flow graph may also include inter-procedural control flows, such as call flow, return flow, and the like. Nodes in a control flow graph may be statements or basic blocks in the program code, and edges represent the flow of control between the nodes. Next, at 350, a taint analysis is performed on the program code of the application system using the control flow graph.
FIG. 5 illustrates an example schematic diagram of a process for performing taint analysis on program code of an application according to an embodiment of the present description.
In FIG. 5, the diagram on the far left is the initial Call Graph constructed based on main() and foo(). In this Call Graph, main(), foo(), source() and sink() are nodes, and each connecting line between nodes represents an edge. As shown in FIG. 5, main() has edges to foo() and sink(), and foo() has an edge to source().
The diagram in the middle is a control flow graph, also called an inter-procedural control flow graph (ICFG). In this example, x = new X(), x.f = source(), return x, b = foo(a), and sink(b.f) are nodes. b = foo(a) has edges to x = new X(), return x, and sink(b.f); x = new X() has an edge to x.f = source(); and x.f = source() has an edge to return x.
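As a minimal sketch of how such an ICFG could be derived from a Call Graph, the following Python snippet encodes the FIG. 5 example by hand. The dictionary layout, the function name build_icfg, and the assumption of straight-line procedure bodies are illustrative choices, not the representation used by the described embodiment.

```python
# Hypothetical encoding of the FIG. 5 example; all names are illustrative.
call_graph = {"main": {"foo", "sink"}, "foo": {"source"}}

# Statements of each procedure in program order (straight-line code assumed).
bodies = {
    "main": ["b = foo(a)", "sink(b.f)"],
    "foo": ["x = new X()", "x.f = source()", "return x"],
}

# Call statements whose callee body is available: statement -> (caller, callee).
call_sites = {"b = foo(a)": ("main", "foo")}


def build_icfg(bodies, call_sites, call_graph):
    """Return ICFG edges as (from_stmt, to_stmt, kind) triples."""
    edges = set()
    # Intra-procedural edges between consecutive statements.
    for stmts in bodies.values():
        for cur, nxt in zip(stmts, stmts[1:]):
            edges.add((cur, nxt, "normal"))
    # Call and return edges, added only where the Call Graph has the edge.
    for stmt, (caller, callee) in call_sites.items():
        if callee in call_graph.get(caller, set()):
            edges.add((stmt, bodies[callee][0], "call"))
            edges.add((bodies[callee][-1], stmt, "return"))
    return edges


icfg = build_icfg(bodies, call_sites, call_graph)
for edge in sorted(icfg):
    print(edge)
```

The five resulting edges correspond to the five edge relationships listed for the middle diagram; the return edge is attached to the call statement itself to mirror that description.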
The rightmost diagram is a data flow graph of the procedure calls of the application system. In one example of this specification, nodes in the data flow graph are fields, and edges are data flow directions between the fields, i.e., data propagation directions or data flow relationships. In one example, the fields may include code fields or database fields. In one example, the fields may include, for example, variable fields.
In performing taint analysis, the ICFG is first constructed based on the initial Call Graph. Taint propagation is then computed on the ICFG (taint analysis). When a call statement is encountered while computing taint propagation, it is checked whether the call statement has a corresponding edge in the initial Call Graph. If it does, the computation continues.
If a call statement is encountered and the call statement has no corresponding edge in the Call Graph, then at 360, a second call relation construction algorithm is used to extend the Call Graph with an edge for the call statement. The precision of the second call relation construction algorithm is lower than that of the first call relation construction algorithm, but its performance is better. An example of the second call relation construction algorithm is the class hierarchy analysis (CHA) algorithm.
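The fallback can be sketched as follows. This is a simplified illustration only: the class hierarchy, the call-site keys, and the helper names cha_targets and resolve_call are hypothetical, and the CHA shown here considers only methods declared directly on each subtype (inherited methods are ignored).

```python
# Hypothetical class hierarchy: type -> direct subtypes, and methods declared per type.
SUBTYPES = {"Callback": ["Cb1", "Cb2"], "Cb1": [], "Cb2": []}
DECLARED = {"Callback": set(), "Cb1": {"query"}, "Cb2": {"query"}}


def cha_targets(declared_type, method):
    """Simplified CHA: every subtype of the declared type that declares the method."""
    targets, stack = set(), [declared_type]
    while stack:
        current = stack.pop()
        if method in DECLARED.get(current, set()):
            targets.add(f"{current}#{method}")
        stack.extend(SUBTYPES.get(current, []))
    return targets


def resolve_call(call_graph, call_site, declared_type, method):
    """Use the precise edge if present; otherwise extend the graph with CHA."""
    if call_site not in call_graph:
        call_graph[call_site] = cha_targets(declared_type, method)
    return call_graph[call_site]


# Initial Call Graph from the precise algorithm; the second call site was missed.
call_graph = {("A#run", "cb.query()"): {"Cb1#query"}}
known = resolve_call(call_graph, ("A#run", "cb.query()"), "Callback", "query")
missed = resolve_call(call_graph, ("B#run", "cb.query()"), "Callback", "query")
```

The precise edge for the first call site is left untouched, while the missed call site receives every target that the cheaper algorithm considers possible, trading precision for recall.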
Optionally, in an example, if a call statement is encountered and the call statement has no corresponding edge in the Call Graph, the second call relation construction algorithm may be used to extend not only the Call Graph but also the control flow graph with an edge for the call statement, thereby obtaining an extended control flow graph.
Further optionally, in one example, after the extended control flow graph is obtained as above, at 370, a data flow graph is generated from the extended control flow graph, thereby obtaining the data propagation paths of the application system. In an embodiment of the present specification, a data propagation path is a path from a taint start point to a taint end point, for example, as shown in FIG. 5, the data propagation path x.f = source() -> sink(b.f).
FIG. 6 illustrates an example flow diagram of a generation process 600 of a data flow graph according to embodiments of the present specification.
As shown in FIG. 6, at 610, candidate data propagation paths are constructed from the expanded control flow graph. Next, at 620, pruning is performed on the candidate data propagation paths. The pruning process may be implemented using any suitable pruning processing algorithm known in the art. Then, at 630, a dataflow graph is generated based on the candidate data propagation paths that have undergone pruning processing.
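The three steps of FIG. 6 can be sketched as below, under the assumption that the extended control flow graph has already been reduced to a successor map over fields. The field names, the "tmp" node, and the pruning rule are placeholders for illustration; a real pruning step would use one of the strategies described in this specification.

```python
def candidate_paths(succ, sources, sinks):
    """Step 610: enumerate simple paths from taint start points to taint end points."""
    paths = []

    def dfs(node, path):
        if node in sinks:
            paths.append(path)
            return
        for nxt in succ.get(node, ()):
            if nxt not in path:  # avoid cycles
                dfs(nxt, path + [nxt])

    for start in sources:
        dfs(start, [start])
    return paths


def build_dfg(paths):
    """Step 630: merge the retained paths into a graph of field -> successor fields."""
    dfg = {}
    for path in paths:
        for a, b in zip(path, path[1:]):
            dfg.setdefault(a, set()).add(b)
    return dfg


# Hypothetical field-level successor map.
succ = {
    "source()": ["x.f"],
    "x.f": ["b.f", "tmp"],
    "b.f": ["sink(b.f)"],
    "tmp": ["sink(b.f)"],
}
paths = candidate_paths(succ, {"source()"}, {"sink(b.f)"})
kept = [p for p in paths if "tmp" not in p]  # step 620: placeholder pruning rule
dfg = build_dfg(kept)
```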
It is noted that the example shown in fig. 6 includes a pruning process for the candidate data propagation paths, and in other embodiments of the present description, the pruning process for the candidate data propagation paths may not be included.
Further optionally, in one example, call site context information can be considered in performing the taint analysis. Accordingly, pruning the candidate data propagation paths may include pruning the candidate data propagation paths based on the call site context information.
In one example, the call site context information may include variable usage information of variables that are accessible across procedures. Accordingly, when pruning the candidate data propagation paths based on the call site context information, it is determined, based on the variable usage information of the variables accessible across procedures, whether the variable usage information in a candidate data propagation path is inconsistent. When the variable usage information in the candidate data propagation path is determined to be inconsistent, the candidate data propagation path is considered an erroneous data propagation path and is pruned. When the variable usage information in the candidate data propagation path is determined to be consistent, the candidate data propagation path is considered a correct data propagation path and is retained.
FIG. 7 illustrates an example diagram of a pruning process implemented based on call site context information.
FIG. 7 illustrates a common unbalanced path in an enterprise application. In FIG. 7, solid boxes represent program code, solid arrows represent call relations, and the top two solid boxes represent the entries of two paths. Path 1 is: Entry1#queryByName() -> Executor#exec(cb) -> Cb1#query() -> Dao1#queryByName(name); its taint start point is the data returned by Dao1#queryByName(name) (DAO layer), and its taint end point is the entry method Entry1 (facade layer). Path 2 is: Entry2#queryById() -> Executor#exec(cb) -> Cb2#query() -> Dao2#queryById(id); its taint start point is the data returned by Dao2#queryById(id) (DAO layer), and its taint end point is the entry method Entry2 (facade layer). In a FlowDroid-based approach, the analysis starts from the taint start points, in this example Dao1#queryByName(name) and Dao2#queryById(id), where the preceding call context is unknown; an empty context is therefore typically injected and propagated backwards. When the analysis reaches Executor#exec(cb), the context is still empty, so returning to either call site, Entry1 or Entry2, is considered feasible. However, path 1 can only return to Entry1 and path 2 can only return to Entry2, so two false-positive data propagation paths are ultimately reported.
In the example of FIG. 7, assume that the variable res1 in Cb1#query() is tainted. In Cb1#query(), the this variable is known to be of type Cb1 and to contain a method named query, so this information is added to the context and returned to the caller method along the return edge for reference during subsequent analysis. In subsequent analysis, pruning is performed as soon as the context information is found to be inconsistent. Specifically, when the context information reaches Executor#exec(cb), the variable this in Cb1#query() corresponds to the variable cb in Executor#exec(cb), so the context information this: Cb1 is updated to cb: Cb1. Similarly, when Entry1#queryByName() and Entry2#queryById() are reached, cb2 in Entry2#queryById() is found to be recorded as type Cb1 in the context but actually used as type Cb2. The variable usage information is therefore contradictory (i.e., inconsistent), so the data propagation path from Cb1#query() to Entry2#queryById() is determined to be an erroneous data propagation path and is pruned.
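The consistency check can be illustrated with a small sketch. The method names below are an illustrative reading of the FIG. 7 names, and the dictionary-based context, the rename step, and the helper names are assumptions made for this example rather than the embodiment's data structures.

```python
def propagate_context(context, renames):
    """Rename context entries when crossing a return edge (callee var -> caller var)."""
    return {renames.get(var, var): typ for var, typ in context.items()}


def is_consistent(context, actual_types):
    """A context is consistent if no variable is known to have a different type."""
    return all(actual_types.get(var, typ) == typ for var, typ in context.items())


# Context recorded in Cb1#query(): the receiver is of type Cb1.
ctx = {"this": "Cb1"}
# Returning to Executor#exec(cb): `this` in the callee corresponds to `cb`.
ctx = propagate_context(ctx, {"this": "cb"})

# Hypothetical types actually passed as `cb` by each entry method.
entries = {
    "Entry1#queryByName": {"cb": "Cb1"},
    "Entry2#queryById": {"cb": "Cb2"},
}
kept = [entry for entry, types in entries.items() if is_consistent(ctx, types)]
pruned = [entry for entry in entries if entry not in kept]
```

Only the return to Entry1 survives; the return to Entry2 is pruned because the recorded type Cb1 contradicts the type Cb2 used there.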
Further optionally, in one example, the relationship between the variable usage point and the variable definition point may also be considered in performing the taint analysis. Correspondingly, the pruning processing of the candidate data propagation paths comprises: the candidate data propagation paths are pruned based on the relationship (use-def relationship) between the variable use point and the variable definition point.
FIG. 8 shows a flowchart of one example of a pruning process 800 based on the relationship between variable use points and variable definition points, according to an embodiment of the present description. The example shown in FIG. 8 handles local variables used within a procedure (intra-procedural analysis).
As shown in FIG. 8, at 810, it is determined whether a first variable (the current variable) in the program code is assigned to a heap variable. Here, the first variable is a local variable used within a procedure. If the first variable is not assigned to a heap variable, the process returns to 810 to continue monitoring.
When the first variable in the program code is assigned to a heap variable, at 820, alias variables of the first variable are searched for based on the relationship between variable use points and variable definition points.
After the alias variable of the first variable is searched, at 830, it is determined whether the alias variable is used in subsequent program code.
If the alias variable is not used in subsequent program code, the candidate data propagation path associated with the alias variable is pruned at 840. In other words, every candidate data propagation path that passes through the alias variable is pruned.
If the alias variable is used in subsequent program code, then at 850, the candidate data propagation path associated with the alias variable is retained.
An example of code for intra-procedural analysis according to an embodiment of the present description is shown in FIG. 9. The pruning process shown in FIG. 8 is explained below in conjunction with the code example shown in FIG. 9.
As shown in FIG. 9, in code example 1, source() represents a taint start point, and sink() represents a taint end point. Each time a propagated variable is assigned to a heap variable, its aliases are searched for. In this example, at line 3 the propagated variable is assigned to the heap variable b.id, which triggers a search for alias variables of b; a is found to be an alias of b, and the search for the data propagation flow (taint flow) related to a continues, starting from line 1. In this example, the variable a is not used in the later code, so analysis of the variable a is unnecessary; if there is an edge relation corresponding to the variable a in the Call Graph, it is pruned.
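A minimal sketch of this decision follows, assuming a straight-line method in which aliases arise only from copy statements. The four-statement program is hypothetical and is not the FIG. 9 code; the statement dictionaries and helper names are likewise illustrative, and the alias search is a flow-insensitive simplification of a real use-def analysis.

```python
# Hypothetical straight-line method, one dict per statement (not the FIG. 9 code).
STMTS = [
    {"text": "a = new Data()", "defs": "a", "uses": []},
    {"text": "b = a", "defs": "b", "uses": ["a"], "copy": True},
    {"text": "b.id = source()", "defs": None, "uses": ["b"], "heap_store": True},
    {"text": "sink(b.id)", "defs": None, "uses": ["b"]},
]


def find_aliases(stmts, var, store_idx):
    """Collect aliases of var by following copy statements before the heap store."""
    aliases, work = set(), [var]
    while work:
        cur = work.pop()
        for stmt in stmts[:store_idx]:
            if not stmt.get("copy"):
                continue
            dst, src = stmt["defs"], stmt["uses"][0]
            other = src if dst == cur else dst if src == cur else None
            if other is not None and other != var and other not in aliases:
                aliases.add(other)
                work.append(other)
    return aliases


def used_after(stmts, var, store_idx):
    """True if var is used by any statement after the heap store."""
    return any(var in stmt["uses"] for stmt in stmts[store_idx + 1:])


def prune_decisions(stmts):
    """For every heap store, decide per alias whether its paths are kept or pruned."""
    decisions = {}
    for idx, stmt in enumerate(stmts):
        if stmt.get("heap_store"):
            base = stmt["uses"][0]
            for alias in find_aliases(stmts, base, idx):
                decisions[alias] = "keep" if used_after(stmts, alias, idx) else "prune"
    return decisions


decisions = prune_decisions(STMTS)
# Appending a later use of `a` flips the decision for that alias.
extended = STMTS + [{"text": "log(a.id)", "defs": None, "uses": ["a"]}]
decisions_extended = prune_decisions(extended)
```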
FIG. 10 is a flowchart showing another example of a pruning process based on the relationship between variable use points and variable definition points according to an embodiment of the present specification. The example shown in FIG. 10 handles local variables used across procedures (inter-procedural analysis).
As shown in FIG. 10, at 1010, it is determined whether a first variable (the current variable) in the program code is assigned to a heap variable. Here, the first variable is a local variable used across procedures. If the first variable is not assigned to a heap variable, the process returns to 1010 to continue monitoring.
When the first variable in the program code is assigned to a heap variable, at 1020, the first variable is mapped to a second variable in the calling context or calling method.
At 1030, an alias variable for the second variable is searched for based on the relationship between the variable usage point and the variable definition point.
At 1040, it is determined whether the alias variable is used in subsequent program code.
If the alias variable is not used in subsequent program code, the candidate data propagation path associated with the alias variable is pruned at 1050.
If the alias variable is used in subsequent program code, the candidate data propagation path associated with the alias variable is retained at 1060.
An example of code for inter-procedural analysis according to an embodiment of the present description is shown in FIG. 11. The pruning process shown in FIG. 10 is explained below in conjunction with the code example shown in FIG. 11.
For inter-procedural analysis, in addition to taint that propagates in the forward direction through parameters, return values, and this-related variables, enterprise code often contains aliases that run in the reverse direction across procedures. As shown in FIG. 11, in code example 2, the result variable is referenced and tainted in an inner class, then returned to the outer class method, and finally passed to the taint end point. Therefore, the first variable is first mapped to the second variable in the calling context or the calling method, i.e., variables are mapped between different procedures or different classes, and the intra-procedural analysis method is then used to perform pruning based on the relationship between variable use points and variable definition points.
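The mapping step can be sketched as follows, assuming the simplest case in which the first variable is a formal parameter of the callee and the second variable is the matching actual argument at the call site. The caller and callee below are hypothetical (not the FIG. 11 code), and the alias search only follows copy statements that precede the call.

```python
# Hypothetical caller/callee pair (not the FIG. 11 code).
CALLEE = {"name": "fill", "params": ["result"], "heap_store_base": "result"}
CALLER = [
    {"text": "r = new Result()", "uses": []},
    {"text": "view = r", "uses": ["r"], "copy": ("view", "r")},
    {"text": "fill(r)", "uses": ["r"], "call": "fill", "args": ["r"]},
    {"text": "sink(view.id)", "uses": ["view"]},
]


def map_to_caller(callee, call_stmt, callee_var):
    """Map a callee formal parameter to the actual argument at the call site."""
    position = callee["params"].index(callee_var)
    return call_stmt["args"][position]


def caller_aliases(caller, var, call_idx):
    """The mapped variable plus everything copied to or from it before the call."""
    names = {var}
    for stmt in caller[:call_idx]:
        if "copy" in stmt:
            dst, src = stmt["copy"]
            if dst in names or src in names:
                names.update((dst, src))
    return names


def keep_path(callee, caller):
    """Keep the path only if the mapped variable or an alias is used after the call."""
    for idx, stmt in enumerate(caller):
        if stmt.get("call") == callee["name"]:
            second_var = map_to_caller(callee, stmt, callee["heap_store_base"])
            names = caller_aliases(caller, second_var, idx)
            return any(names & set(later["uses"]) for later in caller[idx + 1:])
    return False


with_sink = keep_path(CALLEE, CALLER)
without_sink = keep_path(CALLEE, CALLER[:3])
```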
A method for taint analysis for an application system according to embodiments of the present description is described above with reference to fig. 3 through 11.
In the above taint analysis method, a small-scale Call Graph is constructed only for the application layer code and the necessary part of the library code in the program code, and taint analysis is performed on this small-scale Call Graph, which greatly reduces the workload of taint analysis and improves its performance. An efficient taint analysis scheme with high accuracy and recall can therefore be provided for large-scale enterprise applications, especially where heavy use of native methods, libraries and frameworks introduces implicit dependencies.
Furthermore, in the above taint analysis method, by using taint analysis that considers call site context information, erroneous data propagation paths can be pruned from the data flow graph, thereby improving the accuracy of taint analysis. Moreover, by adding context information only for variables that are accessible across procedures, and computing this context information only once for each method, the accuracy of taint analysis can be improved with minimal performance overhead.
In addition, in the above taint analysis method, alias searching and pruning are performed on local variables through use-def analysis, which reduces the complexity of alias analysis and achieves faster performance than global search or on-demand search. In addition, by interpreting the use relationships, erroneous data propagation paths in the data flow graph can be pruned.
In addition, in the above taint analysis method, the Call Graph is constructed for the application layer code using a high-precision first call relation construction algorithm, which improves the overall accuracy of the constructed Call Graph. In addition, a second call relation construction algorithm with relatively low precision but better performance is used to extend the edge relations of the Call Graph and the control flow graph, so that missed edges can be added efficiently, thereby safeguarding the recall rate.
Optionally, in an example, after the data flow graph of the application system is obtained as above, the method may further include: obtaining a data propagation path list of the application system from the data flow graph; and saving the data propagation path list in a relational database so that, when there are a plurality of application systems, it can be used together with the data propagation path lists of the other application systems to create a cross-application-system data propagation path list. An example of a data propagation path list is shown in Table 1.
Edge data ID | Taint start point | Taint end point
0001 | x.f=source() | sink(b.f)
…… | …… | ……
TABLE 1
Further optionally, in one example, the cross-application-system data propagation path list is represented as data propagation path graph data. For example, after the data propagation path list of a single application system is obtained as described above, it is stored in the relational database, then synchronized to an offline data warehouse, and finally synchronized to a graph database, thereby obtaining cross-application-system data propagation path graph data. This graph data can be applied to scenarios such as data leakage, change management, and data consistency checking.
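As a rough sketch of the storage step, the snippet below saves per-application path lists in an in-memory SQLite database and links them across applications. The table and column names, the endpoint strings, and in particular the rule of joining one application's taint end point to another application's taint start point are assumptions made for illustration; the specification does not state how the lists are combined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_propagation_path ("
    "app TEXT, edge_id TEXT, taint_start TEXT, taint_end TEXT)"
)

# Hypothetical path lists of two application systems.
rows = [
    ("app_a", "0001", "x.f=source()", "rpc:app_b.save(arg0)"),
    ("app_b", "0001", "rpc:app_b.save(arg0)", "db:sample.name"),
]
conn.executemany("INSERT INTO data_propagation_path VALUES (?, ?, ?, ?)", rows)

# Assumed linking rule: a path ending where another application's path starts.
cross_app = conn.execute(
    "SELECT a.app, a.taint_start, b.app, b.taint_end "
    "FROM data_propagation_path AS a "
    "JOIN data_propagation_path AS b "
    "ON a.taint_end = b.taint_start AND a.app <> b.app"
).fetchall()
print(cross_app)
```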
Further optionally, in one example, the taint analysis may include a slice-based parallel taint analysis. In other words, the taint analysis task may be split into multiple slice tasks, and the resulting slice tasks distributed to different servers for parallel taint analysis. In one example, the slicing may be based on program entry points or on taint start points (Source). Accordingly, the slices may include slices divided by program entry point or slices divided by taint start point.
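Entry-point-based slicing can be sketched as below. Worker threads stand in for the different servers, and the per-entry-point findings are a hypothetical lookup table that replaces the real analysis; the function names and the one-entry-point-per-slice policy are likewise assumptions for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-entry-point findings standing in for a real analysis.
FINDINGS = {
    "Entry1#queryByName": [("Dao1#queryByName", "Entry1#queryByName")],
    "Entry2#queryById": [("Dao2#queryById", "Entry2#queryById")],
}


def slice_by_entry_point(entry_points):
    """One slice per program entry point (an assumed slicing policy)."""
    return [[entry] for entry in entry_points]


def analyze_slice(entry_slice):
    """Placeholder analysis: look up the (start, end) pairs for each entry point."""
    paths = []
    for entry in entry_slice:
        paths.extend(FINDINGS.get(entry, []))
    return paths


def run_parallel(entry_points, workers=2):
    """Analyze all slices concurrently and merge the results in slice order."""
    slices = slice_by_entry_point(entry_points)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(analyze_slice, slices)
        return [path for result in results for path in result]


all_paths = run_parallel(["Entry1#queryByName", "Entry2#queryById"])
```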
Further, it is noted that what is shown in fig. 3 is merely an example embodiment, and in other embodiments of this description, one, more, or all of operations 310, 320, 330, and 370 may not be included.
FIG. 12 illustrates an example block diagram of an apparatus for taint analysis of an application system (hereinafter "taint analysis apparatus") 1200 according to an embodiment of the present description. As shown in FIG. 12, the taint analysis apparatus 1200 includes a code conversion unit 1210, an element determination unit 1220, a slicing processing unit 1230, a control flow graph generation unit 1240, a taint analysis unit 1250, an edge relation extension unit 1260, and a data flow graph generation unit 1270.
The transcoding unit 1210 is configured to transcode an application framework of the application system. The operation of the transcoding unit 1210 may refer to the operation of 310 described above with reference to fig. 3.
The element determination unit 1220 is configured to determine a program entry point, a taint start point, and a taint end point of the program code. Accordingly, taint analysis includes taint analysis based on program entry points, taint start points, and taint end points. The operation of the element determination unit 1220 may refer to the operation of 320 described above with reference to FIG. 3.
The slicing processing unit 1230 is configured to perform slicing processing on an analysis task of the taint analysis. Accordingly, taint analysis includes slice-based parallel taint analysis.
The control flow graph generating unit 1240 is configured to generate a control flow graph from a call relation graph that is built from application layer code in program code of an application system by using a first call relation building algorithm. The operation of the control flow graph generation unit 1240 may refer to the operation of 340 described above with reference to fig. 3.
Taint analysis unit 1250 is configured to perform taint analysis on program code of the application system using the control flow graph. The operation of taint analysis unit 1250 may refer to the operation of 350 described above with reference to fig. 3.
The edge relationship extension unit 1260 is configured to use a second call relationship construction algorithm to extend an edge relationship for a call statement in the call relationship graph when the taint analysis result indicates that the call statement does not have an edge relationship in the call relationship graph. The operation of the edge relationship extension unit 1260 may refer to the operation of 360 described above with reference to FIG. 3.
Further optionally, in one example, the edge relationship extension unit 1260 may be further configured to use the second call relationship construction algorithm to extend an edge relationship for the call statement in the control flow graph when the taint analysis result indicates that the call statement does not have an edge relationship in the call relationship graph. The data flow graph generation unit 1270 is configured to generate a data flow graph from the extended control flow graph.
Fig. 13 shows a block diagram of an implementation example of the dataflow graph generating unit 1300 according to an embodiment of the present specification. As shown in fig. 13, the dataflow graph generating unit 1300 includes a data propagation path building module 1310, a pruning processing module 1320, and a dataflow graph generating module 1330.
The data propagation path construction module 1310 is configured to construct candidate data propagation paths from the expanded control flow graph. The pruning processing module 1320 is configured to prune the candidate data propagation paths. The data flow graph generation module 1330 is configured to generate a data flow graph based on the candidate data propagation paths after the pruning process.
It is noted that fig. 13 shows only an exemplary embodiment of the data flow graph generating unit, and in other embodiments of the present specification, the data flow graph generating unit may not include the pruning processing module.
Further optionally, in one example, the taint analysis includes taint analysis that takes into account call site context information. The pruning processing module 1320 is configured to prune the candidate data propagation paths based on the call site context information.
Further optionally, in one example, the call site context information may include variable usage information of variables that are accessible across procedures. Accordingly, the pruning processing module 1320 is configured to determine, based on the variable usage information of the variables accessible across procedures, whether the variable usage information in a candidate data propagation path is inconsistent; and to prune the candidate data propagation path when the variable usage information in the candidate data propagation path is determined to be inconsistent.
Further optionally, in one example, the taint analysis includes taint analysis that considers relationships between variable use points and variable definition points. The pruning processing module 1320 is configured to prune the candidate data propagation paths based on the relationship between the variable use points and the variable definition points.
FIG. 14 shows a block diagram of an implementation example of a pruning processing module 1400 according to an embodiment of the present description. As shown in FIG. 14, the pruning processing module 1400 includes an alias variable search submodule 1410, a variable continuation determination submodule 1420, and a pruning processing submodule 1430.
The alias variable search submodule 1410 is configured to, when a first variable in the program code is assigned to a heap variable, search for an alias variable of the first variable based on the relationship between variable use points and variable definition points, the first variable being a local variable used within a procedure.
The variable continuation determination submodule 1420 is configured to determine whether the alias variable is used in subsequent program code.
Pruning processing submodule 1430 is configured to prune the candidate data propagation paths associated with an alias variable when the alias variable is not used in subsequent program code.
FIG. 15 shows a block diagram of another implementation example of the pruning processing module 1500 according to an embodiment of the present specification. As shown in FIG. 15, the pruning processing module 1500 includes a variable mapping submodule 1510, an alias variable search submodule 1520, a variable continuation determination submodule 1530, and a pruning processing submodule 1540.
The variable mapping submodule 1510 is configured to, when a first variable in the program code is assigned to a heap variable, map the first variable to a second variable in the calling context or the calling method, the first variable being a local variable used across procedures.
The alias variable search submodule 1520 is configured to search for an alias variable of the second variable based on a relationship between the variable use point and the variable definition point.
The variable continuation determination submodule 1530 is configured to determine whether an alias variable is used in subsequent program code.
Pruning processing submodule 1540 is configured to prune the candidate data propagation paths associated with an alias variable when the alias variable is not used in subsequent program code.
Further, optionally, in one example, the taint analysis apparatus may further include a data propagation path determination unit and a saving unit. The data propagation path determination unit is configured to obtain a data propagation path list of the application system from the data flow graph. The saving unit is configured to save the data propagation path list of the application system in a relational database so that, when there are a plurality of application systems, it can be used together with the data propagation path lists of the other application systems to create a cross-application-system data propagation path list.
Further, it is noted that optionally, in one example, the taint analysis apparatus may further include a call relation graph construction unit. The call relation graph construction unit is configured to construct a call relation graph from application layer code in program code of an application system by using a first call relation construction algorithm. In another example, the call relation graph construction unit may also be implemented together with the edge relationship extension unit.
Further, optionally, in another example, the taint analysis apparatus may not include one, more, or all of the transcoding unit, the element determination unit, the slicing processing unit, and the dataflow graph generation unit.
The taint analysis method and the taint analysis apparatus according to the embodiments of the present specification are described above with reference to FIGS. 1 to 15. The above taint analysis apparatus can be implemented in hardware, in software, or in a combination of hardware and software.
FIG. 16 shows a schematic diagram of an electronic device 1600 for implementing taint analysis for an application system, according to an embodiment of the present description. As shown in FIG. 16, the electronic device 1600 may include at least one processor 1610, a storage (e.g., non-volatile storage) 1620, a memory 1630, and a communication interface 1640, and the at least one processor 1610, the storage 1620, the memory 1630, and the communication interface 1640 are connected together via a bus 1660. The at least one processor 1610 executes at least one computer-readable instruction (i.e., the elements described above as being implemented in software) stored or encoded in the memory.
In one embodiment, computer-executable instructions are stored in the memory that, when executed, cause the at least one processor 1610 to: generating a control flow graph from a call relation graph, the call relation graph being constructed from application layer code in program code of the application system using a first call relation construction algorithm; performing taint analysis on program code of the application system using the control flow graph; and when the taint analysis result indicates that the calling statement does not have an edge relationship in the calling relationship graph, using a second calling relationship construction algorithm to expand the edge relationship for the calling statement in the calling relationship graph.
It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1610 to perform the various operations and functions described above in connection with fig. 1-15 in the various embodiments of the present description.
According to one embodiment, a program product, such as a machine-readable medium (e.g., a non-transitory machine-readable medium), is provided. A machine-readable medium may have instructions (i.e., elements described above as being implemented in software) that, when executed by a machine, cause the machine to perform various operations and functions described above in connection with fig. 1-15 in the various embodiments of the present specification. Specifically, a system or apparatus may be provided which is provided with a readable storage medium on which software program code implementing the functions of any of the above embodiments is stored, and causes a computer or processor of the system or apparatus to read out and execute instructions stored in the readable storage medium.
In this case, the program code itself read from the readable medium can realize the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code constitute a part of the present invention.
Examples of the readable storage medium include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or from the cloud via a communications network.
It will be understood by those skilled in the art that various changes and modifications may be made in the above-disclosed embodiments without departing from the spirit of the invention. Accordingly, the scope of the invention should be determined from the following claims.
It should be noted that not all steps and units in the above flows and system structure diagrams are necessary, and some steps or units may be omitted according to actual needs. The execution order of the steps is not fixed, and can be determined as required. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by a plurality of physical entities, or some units may be implemented by some components in a plurality of independent devices.
In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may comprise permanently dedicated circuitry or logic (such as a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware units or processors may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The specific implementation (mechanical, permanently dedicated circuitry, or temporarily configured circuitry) may be determined based on cost and time considerations.
The detailed description set forth above in connection with the appended drawings describes exemplary embodiments but does not represent all embodiments that may be practiced or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous" over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (30)

1. A method for taint analysis of an application system, comprising:
generating a control flow graph from a call relation graph, the call relation graph being constructed from application layer code in program code of the application system using a first call relation construction algorithm;
performing taint analysis on program code of an application system using the control flow graph; and
when the taint analysis result indicates that a call statement does not have an edge relationship in the call relation graph, using a second call relation construction algorithm to expand the edge relationship for the call statement in the call relation graph,
wherein the first call relation construction algorithm comprises a high-precision call relation construction algorithm, the second call relation construction algorithm has lower precision than the first call relation construction algorithm, and the second call relation construction algorithm has better performance than the first call relation construction algorithm.
2. The method of claim 1, wherein using a second call relation construction algorithm to expand an edge relationship for the call statement in the call relation graph comprises:
using the second call relation construction algorithm to expand the edge relationship for the call statement in the call relation graph and the control flow graph,
and wherein the method further comprises:
generating a data flow graph according to the expanded control flow graph.
3. The method of claim 2, wherein generating a data flow graph according to the expanded control flow graph comprises:
constructing candidate data propagation paths according to the expanded control flow graph;
pruning the candidate data propagation paths; and
generating the data flow graph based on the candidate data propagation paths after pruning.
4. The method of claim 3, wherein the nodes of the data flow graph include code fields or database fields, and edge relationships represent data flow directions between fields.
5. The method of claim 4, wherein the taint analysis comprises taint analysis that considers call site context information, and pruning the candidate data propagation paths comprises:
pruning the candidate data propagation paths based on call site context information.
6. The method of claim 5, wherein the call site context information includes variable usage information for variables accessible across processes.
7. The method of claim 6, wherein pruning the candidate data propagation paths based on call site context information comprises:
determining whether there is a variable usage information inconsistency in the candidate data propagation paths based on variable usage information for the cross-range accessible variables; and
pruning the candidate data propagation path when it is determined that the variable usage information is inconsistent in the candidate data propagation path.
8. The method of claim 4, wherein the taint analysis comprises taint analysis that considers relationships between variable usage points and variable definition points, and pruning the candidate data propagation paths comprises:
pruning the candidate data propagation paths based on relationships between the variable use points and the variable definition points.
9. The method of claim 8, wherein pruning the candidate data propagation paths based on relationships between variable usage points and variable definition points comprises:
when a first variable in the program code is assigned as a heap variable, searching for an alias variable of the first variable based on a relationship between a variable usage point and a variable definition point, wherein the first variable is a local variable used in a process;
determining whether the alias variable is used in subsequent program code; and
pruning the candidate data propagation paths associated with the alias variable when the alias variable is not used in subsequent program code.
10. The method of claim 8, wherein pruning the candidate data propagation paths based on relationships between variable usage points and variable definition points comprises:
when a first variable in a program code is assigned as a heap variable, mapping the first variable to a second variable in a calling context or a calling method, wherein the first variable is a local variable used in a process;
searching for an alias variable of the second variable based on a relationship between a variable usage point and a variable definition point;
determining whether the alias variable is used in subsequent program code; and
pruning the candidate data propagation paths associated with the alias variable when the alias variable is not used in subsequent program code.
11. The method of claim 1, wherein the accuracy of the first call relation construction algorithm is higher than the accuracy of the second call relation construction algorithm, and the performance of the first call relation construction algorithm is lower than the performance of the second call relation construction algorithm.
12. The method of claim 1, wherein the program code comprises transcoded program code.
13. The method of claim 12, wherein the transcoding comprises:
performing packet supplementing processing, class replacement and/or method replacement on the program codes;
inserting a statement near the method call; and/or
inserting statements at the beginning or end of a method.
14. The method of claim 1, wherein the taint analysis comprises a slice-based parallel taint analysis.
15. The method of claim 14, wherein the slices comprise slices sliced based on program entry points or slices sliced based on contamination start points.
16. The method of claim 1, further comprising:
determining a program entry point, a contamination start point, and a contamination end point of the program code.
17. An apparatus for taint analysis of an application system, comprising:
a control flow graph generating unit that generates a control flow graph from a call relation graph that is constructed from application layer code in program code of the application system by using a first call relation construction algorithm;
a taint analysis unit that performs taint analysis on program codes of an application system using the control flow graph; and
an edge relationship extension unit that uses a second call relation construction algorithm to expand an edge relationship for a call statement in the call relation graph when the taint analysis result indicates that the call statement does not have an edge relationship in the call relation graph,
wherein the first call relation construction algorithm comprises a high-precision call relation construction algorithm, the second call relation construction algorithm has lower precision than the first call relation construction algorithm, and the second call relation construction algorithm has better performance than the first call relation construction algorithm.
18. The apparatus of claim 17, wherein the edge relationship extension unit uses the second call relation construction algorithm to expand an edge relationship for a call statement in the call relation graph and the control flow graph when the taint analysis result indicates that the call statement does not have an edge relationship in the call relation graph,
and wherein the apparatus further comprises:
a data flow graph generating unit that generates a data flow graph according to the expanded control flow graph.
19. The apparatus of claim 18, wherein the data flow graph generating unit includes:
a data propagation path construction module that constructs candidate data propagation paths according to the expanded control flow graph;
a pruning processing module that prunes the candidate data propagation paths; and
a data flow graph generating module that generates the data flow graph based on the candidate data propagation paths after pruning.
20. The apparatus of claim 19, wherein the nodes of the data flow graph include code fields or database fields, and edges represent data flow directions between fields.
21. The apparatus of claim 20, wherein the taint analysis comprises taint analysis that considers call site context information, and the pruning processing module prunes the candidate data propagation paths based on the call site context information.
22. The apparatus of claim 21, wherein the call site context information includes variable usage information for variables accessible across processes, and the pruning processing module:
determines whether there is a variable usage information inconsistency in the candidate data propagation paths based on the variable usage information for the cross-range accessible variables; and
prunes the candidate data propagation path when it is determined that the variable usage information is inconsistent in the candidate data propagation path.
23. The apparatus of claim 20, wherein the taint analysis comprises taint analysis that considers relationships between variable usage points and variable definition points, and the pruning processing module prunes the candidate data propagation paths based on the relationships between variable usage points and variable definition points.
24. The apparatus of claim 23, wherein the pruning processing module:
when a first variable in the program code is assigned as a heap variable, searches for an alias variable of the first variable based on a relationship between a variable usage point and a variable definition point, wherein the first variable is a local variable used in a process;
determines whether the alias variable is used in subsequent program code; and
prunes the candidate data propagation paths associated with the alias variable when the alias variable is not used in subsequent program code.
25. The apparatus of claim 23, wherein the pruning processing module:
when a first variable in the program code is assigned as a heap variable, maps the first variable to a second variable in a calling context or a calling method, wherein the first variable is a local variable used in a process;
searches for an alias variable of the second variable based on a relationship between a variable usage point and a variable definition point;
determines whether the alias variable is used in subsequent program code; and
prunes the candidate data propagation paths associated with the alias variable when the alias variable is not used in subsequent program code.
26. The apparatus of claim 17, further comprising:
a transcoding unit that transcodes an application framework of the application system,
wherein the taint analysis unit performs taint analysis on the transcoded program code using the control flow graph.
27. The apparatus of claim 17, further comprising:
an element determination unit that determines a program entry point, a contamination start point, and a contamination end point of the program code.
28. The apparatus of claim 27, further comprising:
a slicing processing unit that performs slicing processing on the analysis task of the taint analysis,
wherein the taint analysis comprises a slice-based parallel taint analysis.
29. An electronic device, comprising:
at least one processor, and
a memory coupled with the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 1-16.
30. A machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform the method of any one of claims 1 to 16.
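The pruning recited in claims 3 and 9 can be illustrated with a minimal sketch. It is only an example under simplifying assumptions, not the claimed implementation: statements are modeled as hypothetical `(op, target, operands)` tuples in program order, an alias is created only by a direct `copy`, redefinitions of a variable are ignored, and a candidate data propagation path is a list of variable names.

```python
def find_aliases(statements, first_var):
    """Return {alias: index of the statement defining it}.

    Assumption: `b = a` (op == "copy") makes `b` an alias of `a`; this is a
    simplified stand-in for a definition/usage point lookup.
    """
    aliases = {}
    known = {first_var}
    for index, (op, target, operands) in enumerate(statements):
        if op == "copy" and operands[0] in known:
            known.add(target)
            aliases[target] = index
    return aliases


def is_used_later(statements, var, defined_at):
    """True if `var` appears as an operand after its definition point."""
    return any(var in operands for _, _, operands in statements[defined_at + 1:])


def prune_paths(candidate_paths, statements, first_var):
    """Drop candidate paths that pass through an alias that is never used."""
    aliases = find_aliases(statements, first_var)
    dead = {
        alias
        for alias, index in aliases.items()
        if not is_used_later(statements, alias, index)
    }
    return [path for path in candidate_paths if not dead.intersection(path)]
```

For instance, if `a` holds a heap object, `b` and `c` are both copies of `a`, and only `b` is used afterwards, then a candidate path through `c` is removed while the path through `b` is kept.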
CN202010938928.XA 2020-09-09 2020-09-09 Taint analysis method and device of application system Active CN111966346B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010938928.XA CN111966346B (en) 2020-09-09 2020-09-09 Taint analysis method and device of application system

Publications (2)

Publication Number Publication Date
CN111966346A CN111966346A (en) 2020-11-20
CN111966346B CN111966346B (en) 2022-05-10


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113176990B (en) * 2021-03-25 2022-10-18 中国人民解放军战略支援部队信息工程大学 Taint analysis framework and method supporting correlation analysis among data
CN113392404B (en) * 2021-06-15 2023-04-07 浙江网商银行股份有限公司 Vulnerability detection method and device and electronic equipment
CN117272331B (en) * 2023-11-23 2024-02-02 北京安普诺信息技术有限公司 Cross-thread vulnerability analysis method, device, equipment and medium based on code vaccine

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1508681A (en) * 2002-12-17 2004-06-30 国际商业机器公司 Method and system for searching reduction variable quantity in assignment statement
CN105550594A (en) * 2015-12-17 2016-05-04 西安电子科技大学 Security detection method for android application file
CN106940773A (en) * 2017-01-10 2017-07-11 西安电子科技大学 Privacy compromise Hole Detection confirmation method based on static stain data analysis
CN107229867A (en) * 2017-06-12 2017-10-03 北京奇虎科技有限公司 Kernel bug excavation method, device, computing device and computer-readable storage medium
CN111611586A (en) * 2019-02-25 2020-09-01 上海信息安全工程技术研究中心 Software vulnerability detection method and device based on graph convolution network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7509298B2 (en) * 2006-03-31 2009-03-24 International Business Machines Corporation System and method for a logical-model based application understanding and transformation
US8584246B2 (en) * 2009-10-13 2013-11-12 International Business Machines Corporation Eliminating false reports of security vulnerabilities when testing computer software
US20140130153A1 (en) * 2012-11-08 2014-05-08 International Business Machines Corporation Sound and effective data-flow analysis in the presence of aliasing
US9811322B1 (en) * 2016-05-31 2017-11-07 Oracle International Corporation Scalable provenance generation from points-to information



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant