CN111966718A

CN111966718A - System and method for data propagation tracking of application systems

Info

Publication number: CN111966718A
Application number: CN202010938679.4A
Authority: CN
Inventors: 吴云广; 王杰; 王丹; 周刚
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-09-09
Filing date: 2020-09-09
Publication date: 2020-11-20
Anticipated expiration: 2040-09-09
Also published as: CN111966718B

Abstract

Embodiments of the present specification provide a system and method for data propagation tracking for application systems. In the system, a code compiling device compiles the program source code of an application system to obtain a code compilation result. A code modeling device performs code modeling using the code compilation result to construct the element information required for taint analysis, the element information including a contamination start point, a contamination end point, and a program entry point. A taint analysis device then performs taint analysis on the code compilation result using the constructed element information to obtain data propagation path information of the application system, wherein the data propagation path information indicates the data flow direction relationship between the contamination start point and the contamination end point.

Description

System and method for data propagation tracking of application systems

Technical Field

Embodiments of the present description relate generally to the field of security, software engineering, software compilation or program analysis, and more particularly, to systems and methods for data propagation tracking for application systems.

Background

In recent years, the industry has increasingly demanded static taint analysis techniques, particularly taint analysis tools with high scalability and accuracy. Taint analysis techniques can help the industry track data propagation links, thereby solving data problems in many complex scenarios, such as privacy disclosure, asset analysis, change management and control, data consistency, and the like. How to realize data propagation tracking in an application system becomes a problem to be solved urgently.

Disclosure of Invention

In view of the foregoing, embodiments of the present disclosure provide a data propagation tracking system and method for an application system. With the data propagation tracking system and method, taint analysis of inter-procedural calls of an application system can be performed, data propagation path information of the accessed data can be obtained, and data flow tracking of the accessed data can be achieved.

According to an aspect of an embodiment of the present specification, there is provided a system for data propagation tracking of an application system, including: a code compiling device for compiling the program source code of the application system to obtain a code compilation result; a code modeling means for performing code modeling using the code compilation result to construct the element information required for taint analysis, the element information including a contamination start point, a contamination end point, and a program entry point; and a taint analysis device for performing taint analysis on the code compilation result using the constructed element information to obtain data propagation path information of the application system, the data propagation path information indicating the data flow direction relationship between the contamination start point and the contamination end point.

Optionally, in an example of the above aspect, the data propagation path information is a data flow direction relationship between a pair of fields, and the fields include a code field or a database field.

Optionally, in one example of the above aspect, the system further comprises: a data storage device that stores the data propagation path information of the application system in a database.

Optionally, in one example of the above aspect, the stored data propagation path information is constructed as a dataflow graph.

Optionally, in one example of the above aspect, the application system includes a plurality of application systems, and the dataflow graph includes a dataflow graph across the application systems that is constructed by linking data propagation path information of the plurality of application systems.

Optionally, in one example of the above aspect, the system further comprises: a path information query device that, in response to a data propagation path information query request, queries the data propagation path information in the database and provides a data propagation path information query result.

Optionally, in an example of the above aspect, the path information query device includes: a path information query interface through which a user inputs a path information query request; and a visual presentation unit for presenting the queried data propagation path information to the user in a visual manner.

Optionally, in one example of the above aspect, the system further comprises: a distributed scheduling device for performing distributed scheduling of the taint analysis tasks of the application system.

Optionally, in an example of the above aspect, the code compiling device further performs a complementary packaging process on the code compilation result.

Optionally, in one example of the above aspect, the application framework of the application system is a Sofa framework, and the code modeling means includes: a configuration file scanning unit for scanning the configuration files of the code compilation result to obtain an SQL configuration file and class files, and organizing the class files according to a topological structure to obtain an SOA model topology; an SQL conversion unit for converting SQL-like statements in the SQL configuration file into parsable SQL statements; an SQL parsing unit for parsing the parsable SQL statements in the converted SQL configuration file into tables and fields; an element construction unit for performing code analysis on the data access layer and the application framework to construct element information of the code layer; and an association mapping unit for performing association mapping between the element information constructed based on the data access layer and the fields in the parsed SQL statements of the SQL configuration file.

Optionally, in an example of the above aspect, when the application system is implemented in Java, the code modeling means further includes: a bytecode modification unit for performing bytecode modification on the code compilation result.

Optionally, in one example of the above aspect, the taint analysis apparatus includes: a control flow graph generating unit that generates a control flow graph from a call relation graph, the call relation graph being constructed from application layer code in the program code of the application system using a first call relation construction algorithm; a taint analysis unit that uses the control flow graph to traverse the program code of the application system for taint analysis; an edge relation expansion unit that, when the taint analysis result indicates that a call statement has no edge relation in the call relation graph, uses a second call relation construction algorithm to expand an edge relation for the call statement in the call relation graph and the control flow graph; and a data propagation path information determination unit that determines data propagation path information of the application system from the expanded control flow graph.

According to another aspect of embodiments herein, there is provided a method for data propagation tracking for an application system, comprising: compiling the program source code of an application system to obtain a code compilation result; performing code modeling using the code compilation result to construct the element information required for taint analysis, the element information including a contamination start point, a contamination end point, and a program entry point; and performing taint analysis on the code compilation result using the constructed element information to obtain data propagation path information of the application system, wherein the data propagation path information indicates the data flow direction relationship between the contamination start point and the contamination end point.

Optionally, in one example of the above aspect, the method further comprises: storing the data propagation path information of the application system in a database.

Optionally, in one example of the above aspect, the data propagation path information is constructed as a dataflow graph.

Optionally, in one example of the above aspect, the method further comprises: in response to a data propagation path information query request, querying the data propagation path information in the database and providing a data propagation path information query result.

Optionally, in an example of the above aspect, before performing taint analysis on the code compilation result, the method further includes: performing distributed scheduling of the taint analysis tasks of the application system.

Optionally, in one example of the above aspect, the method further comprises: performing a complementary packaging process on the code compilation result before constructing the element information required for taint analysis from the code compilation result.

Optionally, in an example of the above aspect, constructing the element information required for taint analysis from the code compilation result includes: scanning the configuration files of the code compilation result to obtain an SQL configuration file and class files, and organizing the class files according to a topological structure to obtain an SOA model topology; converting SQL-like statements in the SQL configuration file into parsable SQL statements; parsing the parsable SQL statements in the converted SQL configuration file into tables and fields; performing code analysis on the data access layer and the application framework to construct element information of the code layer; and performing association mapping between the element information constructed based on the data access layer and the fields in the parsed SQL statements of the SQL configuration file.

Optionally, in one example of the above aspect, performing taint analysis on the code compilation result using the constructed element information includes: generating a control flow graph from a call relation graph, the call relation graph being constructed from application layer code in the program code of the application system using a first call relation construction algorithm; traversing the program code of the application system for taint analysis using the control flow graph; when the taint analysis result indicates that a call statement has no edge relation in the call relation graph, using a second call relation construction algorithm to expand an edge relation for the call statement in the call relation graph and the control flow graph; and determining data propagation path information of the application system from the expanded control flow graph.

According to another aspect of embodiments of the present specification, there is provided an electronic apparatus including: at least one processor, and a memory coupled with the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform a data propagation tracking method as described above.

According to another aspect of embodiments herein, there is provided a machine-readable storage medium having stored thereon executable instructions that, when executed, cause the machine to perform a data propagation tracking method as described above.

Drawings

A further understanding of the nature and advantages of the present disclosure may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.

FIG. 1 shows an example schematic diagram of a privacy data disclosure process.

FIG. 2 illustrates an example block diagram of a system for implementing data propagation tracking for an application system in accordance with embodiments of this specification.

FIG. 3 shows a block diagram of an implementation example of a code modeling apparatus according to an embodiment of the present specification.

FIG. 4 illustrates an example flow diagram of a code modeling process in accordance with an embodiment of the present description.

FIG. 5 shows a block diagram of one implementation example of a taint analysis apparatus according to an embodiment of the present description.

FIG. 6 illustrates an example flow diagram of a process for data propagation analysis of code compilation results in accordance with an embodiment of the present description.

FIG. 7 illustrates an example schematic diagram of a process for performing taint analysis on program code of an application according to an embodiment of the present description.

FIG. 8 illustrates an example schematic diagram of data propagation path information in accordance with an embodiment of the present description.

FIG. 9 illustrates an example schematic diagram of a dataflow graph across application systems in accordance with an embodiment of the present specification.

FIG. 10 illustrates an example flow diagram for a method of implementing data propagation tracking for an application system in accordance with an embodiment of the present description.

FIG. 11 illustrates a schematic diagram of an electronic device for implementing data propagation tracking for an application system in accordance with embodiments of the present description.

Detailed Description

The subject matter described herein will now be discussed with reference to example embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and thereby implement the subject matter described herein, and are not intended to limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as needed. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may also be combined in other examples.

As used herein, the term "include" and its variants are open-ended terms meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same object. Other definitions, whether explicit or implicit, may be included below. The definition of a term is consistent throughout the specification unless the context clearly dictates otherwise.

In industrial applications, a single application contains a large number of inter-procedural calls (e.g., a service-layer method calling a DAO-layer interface to obtain data from a database), and there are also service calls between applications (e.g., service calls made via RPC or REST); that is, data in a single application can be propagated to other applications by means of service calls. If the data is used illegally by the calling application, data security problems such as privacy disclosure and asset loss may result. Data propagation during procedure calls of the application system therefore needs to be tracked and analyzed to obtain the data propagation path information of the accessed data, so that data security risks can be discovered and handled in time.

FIG. 1 shows an example schematic diagram of a privacy data disclosure process. As shown in FIG. 1, it is assumed that the data column IDCard in the database owned by application app_1 is labeled as private data. In response to a remote procedure call from application app_2, the private data of the data column IDCard is retrieved from the database and sent to application app_2 via several translation layers (POJO translation layers). Application app_2 further exposes the private data to other applications. Finally, application app_n obtains this private data, stores it as the data column IDinfo in its own database, and shows it to the user. In this case, if the user of application app_n does not know that the data column IDinfo is derived from the private data IDCard, there is a security risk that the private data is misused.

Taint analysis techniques are widely used for data propagation tracking analysis. Taint analysis refers to techniques that analyze how data propagates through a program. It is an important means of analyzing privacy disclosure and code bugs in the field of data security, and is very widely applied in security and software engineering. The taint analysis process mainly comprises three aspects: taint source marking, taint propagation rule specification, and taint propagation. A taint source is untrusted data, such as user-sensitive data or untrusted external input. A taint propagation rule is an inference rule that specifies how tainted data spreads according to the semantics of program instructions and functions. For example, given a = source, b = a, and sink = b, the data at sink is affected by the taint of the variable source. Taint analysis techniques include static taint analysis and dynamic taint analysis.
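
The assignment example above can be sketched in a few lines of Python. This is only an illustrative model of explicit propagation through direct assignment, not the analysis described in this specification; the (target, source) statement encoding and the function name propagate_taint are assumptions made for the sketch.

```python
# Minimal sketch of explicit taint propagation through direct assignments.
# The (target, source) statement encoding is a hypothetical simplification.

def propagate_taint(assignments, tainted):
    """Return the set of variables that are tainted after the assignments run in order."""
    tainted = set(tainted)
    for target, source in assignments:
        if source in tainted:
            tainted.add(target)      # taint flows from the right-hand side to the left-hand side
        else:
            tainted.discard(target)  # overwriting with clean data removes the taint
    return tainted

# a = source; b = a; sink = b
program = [("a", "source"), ("b", "a"), ("sink", "b")]
result = propagate_taint(program, {"source"})
print(sorted(result))
```

A real static analysis would additionally have to handle procedure calls and aliases, which the following paragraphs discuss.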

Taint analysis involves three elements: a contamination start point (Source), a contamination end point (Sink), and a program analysis entry (Entry Point). During taint analysis, a call relation graph (Call Graph) of the calls between procedures (functions) needs to be built starting from the program analysis entry. The Call Graph represents the call relationships between procedures (functions) in a computer program. Nodes in the Call Graph are the methods in the program code, and edges in the Call Graph represent the calling relationships among the methods. Examples of taint analysis tools include the static taint analysis tools FlowDroid and P/Taint (Doop-based). In FlowDroid-based taint analysis of an application system, the object of the taint analysis is the source code or an intermediate representation of the program, so that the static analysis of explicit flows in taint propagation can be converted into an analysis of static data dependencies in the program.
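
As a rough illustration of the Call Graph structure described above, the following Python sketch stores methods as nodes and call relationships as edges, and collects the methods reachable from a program entry point. The method names and the dictionary representation are hypothetical and chosen only for the example.

```python
from collections import deque

def reachable_methods(call_graph, entry_point):
    """Breadth-first walk over call edges, starting at the program entry point."""
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# Nodes are methods; each edge caller -> callee is one call relationship.
call_graph = {
    "UserService.query": ["UserDao.selectById", "Converter.toDto"],
    "UserDao.selectById": [],
    "Converter.toDto": [],
    "AdminService.purge": ["UserDao.deleteAll"],
}
reached = reachable_methods(call_graph, "UserService.query")
print(sorted(reached))
```

Only methods reachable from the chosen entry point take part in the analysis, which is why the entry point is listed as one of the three elements.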

When performing taint analysis, first, a Call Graph (Call Graph) is constructed from the function Call relationship between programs for all program codes of an application program. Then, specific taint analysis is performed between functions or within functions according to different program characteristics. Examples of explicit taint propagation may include, but are not limited to, direct assignment propagation, propagation through function (procedure) calls, propagation through aliases (pointers), and the like.

The term "taint analysis" refers in a narrow sense to taint analysis on data of interest. In this specification, the term "taint analysis" should be interpreted broadly as taint analysis with respect to all data involved in program code or all accessed data. Furthermore, in this specification, the term "contamination" may be used interchangeably with "data propagation". In addition, in this specification, the term "application system" may also be understood as "application", "application program", or "system in which application programs are installed".

A system and method for implementing data propagation tracking for an application system according to embodiments of the present specification will be described in detail below with reference to the accompanying drawings.

FIG. 2 illustrates an example block diagram of a system 200 for implementing data propagation tracking for an application system (hereinafter "data propagation tracking system") in accordance with embodiments of this specification.

As shown in FIG. 2, the data propagation tracking system 200 includes a code compiling apparatus 210. The code compiling apparatus 210 is configured to compile the program source code of an application system (e.g., in a code repository) into a code compilation result. In this specification, data propagation path information is constructed by performing static taint analysis on the program code of the application system. The object of the static taint analysis is the code compilation result (such as intermediate code) obtained by compiling the program source code of the application system. For example, for program code implemented in Java, the code compilation result is a compiled JAR package, which contains the bytecode obtained by compiling the Java source code. In static taint analysis based on the FlowDroid framework, the object of the static taint analysis is Jimple code, an intermediate code between source code and bytecode. In the FlowDroid framework, Java bytecode is converted into Jimple code by using the Soot framework.

Optionally, in an example, when the program source code of the application system is compiled, the code compilation result may be subjected to a complementary packaging process, so as to supplement the code compilation result with program code that the static taint analysis requires. In one example, the compiling behavior of the code compiling apparatus 210 can be modified so that the JAR packages required by the static taint analysis are added to the code compilation result. For example, the static taint analysis may find that CE (the underlying container of the Sofa framework) is called, but CE is not normally packed into the JAR package. A complementary packaging process is therefore required to force such code, on which the static taint analysis strongly depends, to be packaged, thereby preventing the data flow analysis from breaking.

Further optionally, in one example, the code compiling apparatus 210 may perform on-demand compilation to convert the program code into intermediate code or bytecode. Here, the term "on-demand compilation" means that the compilation object of the code compiling apparatus 210 is specified on demand; that is, when the code compiling apparatus 210 performs code compilation, it is specified which system's program source code, and at which commit ID (COMMITID), is to be compiled.

The data propagation tracking system 200 also includes a code modeling apparatus 220. The code modeling apparatus 220 is configured to construct, by code modeling using the code compilation result, the element information required for taint analysis, including a contamination start point (Source), a contamination end point (Sink), and a program entry point. All the element information required for the taint analysis can be constructed using the code modeling apparatus 220.

In one example of the present description, input data may be regarded as contamination start points and output data may be regarded as contamination end points. Examples of input data may include: input parameters of program entry points, return values of remote procedure calls, and fields that can be retrieved from the database. Examples of output data may include: return values of program entry points, parameters of remote procedure calls, and fields that can be saved to a database. In one example of the present specification, the field includes a code field or a database field. Further, in one example, after processing by the code modeling apparatus 220, an association mapping relationship is established between the element information of the code field type and the element information of the database field type.
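
The classification above can be written down as a small lookup. The sketch below is only a restatement of the listed examples in Python; the kind labels such as "entry_parameter" and the function name classify are hypothetical names introduced for illustration.

```python
# Input data are treated as contamination start points (sources),
# output data as contamination end points (sinks). Labels are hypothetical.
SOURCE_KINDS = {"entry_parameter", "rpc_return_value", "db_read_field"}
SINK_KINDS = {"entry_return_value", "rpc_parameter", "db_write_field"}

def classify(kind):
    """Return 'source', 'sink', or 'neither' for one kind of data element."""
    if kind in SOURCE_KINDS:
        return "source"
    if kind in SINK_KINDS:
        return "sink"
    return "neither"

for kind in ("rpc_return_value", "db_write_field", "local_variable"):
    print(kind, "->", classify(kind))
```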

FIG. 3 shows a block diagram of an implementation example of a code modeling apparatus 300 according to an embodiment of the present description. As shown in fig. 3, the code modeling apparatus 300 includes a profile scanning unit 310, an SQL converting unit 320, an SQL parsing unit 330, an element constructing unit 340, and an association mapping unit 350.

The configuration file scanning unit 310 is configured to scan the configuration files of the code compilation result to obtain an SQL configuration file and class files, and to organize the class files according to a topological structure to obtain an SOA model topology. For example, under a Java-based Sofa framework, the configuration file scanning unit 310 may be configured to scan all XML configuration files and annotations in the code compilation result, thereby obtaining all SQLmap configuration files and all Java Beans (i.e., class files). The resulting Java Beans can be organized together according to the bean topology, resulting in a Service Oriented Architecture (SOA) model topology. SOA is a component model that splits an application into different functional units (called services) and ties them together through well-defined interfaces and protocols between the services. The obtained SQLmap configuration files and Java Beans can be loaded into memory for subsequent analysis. In one example of the present specification, the application framework of the application system may be a Sofa framework. The Sofa framework is a modified version of the Spring framework and is backward compatible with it.

The SQL conversion unit 320 is configured to convert SQL-like statements in the SQL configuration file into parsable SQL statements. In embodiments of the present specification, the term "SQL-like statement" refers to an SQL statement that cannot be directly parsed or executed.

Optionally, in one example, under a Java-based Sofa framework, the SQLmap configuration file may be converted into an iBATIS/MyBatis in-memory object using the iBATIS/MyBatis API. Subsequently, dynamic tags in the SQLmap configuration file, such as if, are processed, thereby extracting the variables that are used. A parameter set is then built by transforming the parameter objects passed by the data access layer (DAO layer), thereby converting the SQL-like statements in the SQLmap configuration file into SQL statements that can be directly executed or parsed in the database.
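
To make the conversion step concrete, the following Python sketch turns one SQL-like statement into a parsable statement. It does not use the iBATIS/MyBatis API mentioned above; as a loudly simplified assumption, it uses regular expressions, keeps the body of every dynamic tag (as if every branch were taken), and replaces #name# or #{name} placeholders with ?. The statement text and the function name to_parsable_sql are hypothetical.

```python
import re

# Assumption: dynamic tags look like XML elements; placeholders are #name# or #{name}.
TAG = re.compile(r"</?[A-Za-z][^>]*>")
PLACEHOLDER = re.compile(r"#\{(\w+)\}|#(\w+)#")

def to_parsable_sql(sql_like):
    """Return (parsable SQL, list of parameter names used) for one SQL-like statement."""
    without_tags = TAG.sub(" ", sql_like)  # keep tag bodies, drop the tags themselves
    params = []

    def bind(match):
        params.append(match.group(1) or match.group(2))  # record the variable that is used
        return "?"

    sql = PLACEHOLDER.sub(bind, without_tags)
    return " ".join(sql.split()), params  # collapse the whitespace left by removed tags

sql_like = ('SELECT id, id_card FROM user_info WHERE 1 = 1 '
            '<isNotEmpty property="name"> AND name = #name# </isNotEmpty>')
sql, params = to_parsable_sql(sql_like)
print(sql)
print(params)
```

The recorded parameter names correspond to the variable usage extraction described above; a production implementation would rely on the framework's own parser rather than regular expressions.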

After the SQL conversion described above, the SQL parsing unit 330 is configured to parse the parsable SQL statements in the converted SQL configuration file into tables and fields.
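
A minimal sketch of this parsing step, under the assumption that only simple single-table SELECT and INSERT statements occur, might look as follows in Python. The regular expressions, the result dictionary layout, and the function name parse_tables_and_fields are illustrative choices, not part of this specification; a real implementation would use a full SQL parser.

```python
import re

# Assumption: single-table statements with an explicit column list.
SELECT = re.compile(r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)",
                    re.IGNORECASE | re.DOTALL)
INSERT = re.compile(r"^\s*INSERT\s+INTO\s+(?P<table>\w+)\s*\((?P<cols>[^)]*)\)",
                    re.IGNORECASE)

def parse_tables_and_fields(sql):
    """Split one parsable SQL statement into its table and the fields it touches."""
    for access, pattern in (("read", SELECT), ("write", INSERT)):
        match = pattern.match(sql)
        if match:
            fields = [c.strip() for c in match.group("cols").split(",") if c.strip()]
            return {"access": access, "table": match.group("table"), "fields": fields}
    raise ValueError("unsupported statement: " + sql)

read_stmt = parse_tables_and_fields("SELECT id, id_card FROM user_info WHERE name = ?")
write_stmt = parse_tables_and_fields("INSERT INTO audit_log (user_id, id_info) VALUES (?, ?)")
print(read_stmt)
print(write_stmt)
```

Recording whether a statement reads or writes is useful later, because fields read from the database act as sources and fields written to it act as sinks.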

The element construction unit 340 is configured to perform code analysis on the data access layer and the application framework, and to construct element information of the code layer. For example, under a Java-based Sofa framework, the DAO layer code can be analyzed to obtain the taint sources/sinks of the DAO layer code. In addition, the program Entry Points and the sources/sinks of the interface layer (called by upstream application systems) and of the calling layer code (which calls downstream application systems) are obtained by scanning the services published and the services referenced in the Sofa framework (a published service is called a sofa service, and a referenced service is called a sofa reference), thereby constructing the element information of the code layer.

The association mapping unit 350 is configured to perform association mapping on the element information constructed based on the data access layer and the fields in the parsed SQL statements in the SQL configuration file, so as to establish an association relationship between the source/sink fields of the DAO layer and the database fields.
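
The association mapping can be pictured as joining two small tables: the DAO-layer fields on one side and the parsed table and columns on the other. The Python sketch below assumes a property-to-column mapping is already known (for example from a result map) and that the parsed statement has the dictionary shape shown; all names are hypothetical.

```python
def associate(dao_method, property_to_column, parsed_sql):
    """Map DAO-layer source/sink fields onto the database fields of one statement."""
    # Fields read from the database are sources; fields written to it are sinks.
    role = "source" if parsed_sql["access"] == "read" else "sink"
    mapping = {}
    for prop, column in property_to_column.items():
        if column in parsed_sql["fields"]:
            code_field = f"{dao_method}.{prop}"
            db_field = f"{parsed_sql['table']}.{column}"
            mapping[code_field] = (role, db_field)
    return mapping

# Hypothetical output of the SQL parsing step for one SELECT statement.
parsed = {"access": "read", "table": "user_info", "fields": ["id", "id_card"]}
links = associate("UserDao.selectById", {"idCard": "id_card", "nickName": "nick_name"}, parsed)
print(links)
```

Properties whose columns do not appear in the statement are simply left unmapped in this sketch.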

Further, optionally, when the application system is implemented in Java, the code modeling apparatus 300 may further include a bytecode modification unit (not shown). The bytecode modification unit is used for performing bytecode modification on the code compilation result, so that breaks in the data flow analysis are avoided.

FIG. 4 illustrates an example flow diagram of a code modeling process 400 according to an embodiment of this specification.

As shown in FIG. 4, at 410, the configuration files of the code compilation result are scanned to obtain an SQL configuration file and class files, and the class files are organized according to a topological structure to obtain an SOA model topology.

At 420, the SQL-like statements in the SQL configuration file are converted into parsable SQL statements; and at 430, the parseable SQL statements in the converted SQL configuration file are parsed into tables and fields.

At 440, code analysis is performed on the data access layer and the application framework to construct element information of the code layer. Then, at 450, the element information constructed based on the data access layer is associated and mapped with the fields in the parsed SQL statements of the SQL configuration file, so that an association relationship is established between the source/sink fields of the DAO layer code and the database fields.

Returning to FIG. 2, the data propagation tracking system 200 also includes a taint analysis apparatus 230. The taint analysis apparatus 230 is configured to perform taint analysis on the code compilation result using the constructed element information, thereby obtaining data propagation path information of the application system. Here, the data propagation path information is used to indicate a data flow direction relationship between the contamination start point and the contamination end point. In one example of the present specification, the data propagation path information is a data flow relationship between a pair of fields, and the fields include a code field or a database field. The code field is a field in an object of the program code, and is composed of "app_name", "service_name", "method_signature", "class_name", and "field_name", for example, a.service1.method1.requestclass. The database field is composed of "app_name", "db_name", "table_name", and "column_name", for example, c.db. In the example shown in FIG. 1, the data propagation path information may be, for example, db.
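
One possible way to represent such field pairs is sketched below in Python. The component names follow the text; the dot-joined rendering, the class names, and all concrete values (such as user_db or UserService) are assumptions made for illustration and are not the examples of this specification.

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class CodeField:
    app_name: str
    service_name: str
    method_signature: str
    class_name: str
    field_name: str

@dataclass(frozen=True)
class DatabaseField:
    app_name: str
    db_name: str
    table_name: str
    column_name: str

def qualified_name(field):
    """Join the components of a field with dots (an assumed rendering)."""
    return ".".join(astuple(field))

@dataclass(frozen=True)
class PropagationPath:
    start: object  # contamination start point (source field)
    end: object    # contamination end point (sink field)

    def __str__(self):
        return f"{qualified_name(self.start)} -> {qualified_name(self.end)}"

# Hypothetical values: a database column flowing into a field of a returned object.
source = DatabaseField("app_1", "user_db", "user_info", "IDCard")
sink = CodeField("app_1", "UserService", "query", "UserDto", "idCard")
path = PropagationPath(source, sink)
print(path)
```

Because the records are frozen dataclasses, they are hashable and can serve directly as nodes when the stored pairs are later linked into a dataflow graph.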

In one example of the present description, taint analysis can be implemented using a Flowdroid-based taint analysis tool. In another example, a taint analysis method based on a modified Flowdroid may be used to implement.

FIG. 5 shows a block diagram of one implementation example of a taint analysis apparatus 500 according to an embodiment of the present description. As shown in FIG. 5, the taint analysis apparatus 500 includes a control flow graph generation unit 510, a taint analysis unit 520, an edge relation extension unit 530, and a data propagation path information determination unit 540.

The control flow graph generation unit 510 is configured to generate a control flow graph from a call relation graph that is built from application layer code in the program code of an application system by using a first call relation construction algorithm. A control flow graph is an abstract representation of a procedure or program, typically used in compilers and static analysis, and represents all the paths that a program may traverse during its execution. In embodiments of the present description, the control flow graph may also include inter-procedural control flows, such as call flow and return flow. Nodes in a control flow graph may be statements or basic blocks in the program code, and edges represent the flow of control between the nodes. In addition, when the first call relation construction algorithm is selected, only the precision of the algorithm needs to be considered and its performance does not; for example, a high-precision algorithm such as the Spark algorithm can be selected.

The taint analysis unit 520 is configured to perform taint analysis using a control flow graph to traverse program code of an application system.

In performing taint analysis, an inter-procedural control flow graph (ICFG) is first constructed based on the initial Call Graph. Subsequently, taint propagation (that is, data propagation) is calculated based on the ICFG. When a call statement is encountered, it is checked whether the call statement has an edge relationship in the initial Call Graph. If an edge relationship exists, the calculation continues.

The edge relation extension unit 530 is configured to, when a call statement is encountered that has no edge relationship in the Call Graph, use a second call relation construction algorithm to extend an edge relationship for that call statement in the Call Graph and the control flow graph. The precision of the second call relation construction algorithm is lower than that of the first call relation construction algorithm, but its performance is better. An example of the second call relation construction algorithm is the CHA algorithm.

The data propagation path information determining unit 540 is configured to determine data propagation path information of the application system from the extended control flow graph. In an embodiment of the present specification, the data propagation path is a path from a contamination start point to a contamination end point, such as the data propagation path x.f = source() -> sink(b.f) shown in fig. 7.

FIG. 6 illustrates an example flow diagram of a process 600 for data propagation analysis of code compilation results in accordance with an embodiment of the present description.

As shown in fig. 6, at 610, a Call relation Graph (i.e., initial Call Graph) is constructed from application layer code in the program code of the application system using a first Call relation construction algorithm. Subsequently, at 620, a control flow graph is generated from the initial call relationship graph.

After the control flow graph is generated as above, the control flow graph is used to traverse the program code of the application system for taint analysis at 630. At 640, when the taint analysis result indicates that the call statement does not have an edge relationship in the call relationship graph, a second call relationship construction algorithm is used to extend the edge relationship for the call statement in the call relationship graph and the control flow graph.

Then, at 650, data propagation path information for the application system is determined from the extended control flow graph.
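The on-demand edge extension of steps 630 and 640 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the embodiment: the call graph is modeled as a dictionary from call-site identifiers to callee sets, `cha_targets` is a reduced class-hierarchy style resolver that ignores inherited methods, and every class, method, and call-site name is hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical representation: call-site id -> set of callee method names.
CallGraph = Dict[str, Set[str]]


def cha_targets(declared_type: str, method: str,
                subclasses: Dict[str, List[str]],
                declares: Dict[str, Set[str]]) -> Set[str]:
    """Cheap, imprecise resolution: every class in the hierarchy below the
    declared type that declares the method is treated as a possible target."""
    targets: Set[str] = set()
    seen: Set[str] = set()
    worklist = [declared_type]
    while worklist:
        cls = worklist.pop()
        if cls in seen:
            continue
        seen.add(cls)
        if method in declares.get(cls, set()):
            targets.add(f"{cls}.{method}")
        worklist.extend(subclasses.get(cls, []))
    return targets


def reachable_methods(entry: str,
                      call_sites_of: Dict[str, List[str]],
                      call_graph: CallGraph,
                      fallback: Callable[[str], Iterable[str]]) -> Set[str]:
    """Traverse from the entry method. When a call site has no edge in the
    precise call graph, extend the graph with the fallback resolver."""
    reachable = {entry}
    worklist = [entry]
    while worklist:
        method = worklist.pop()
        for site in call_sites_of.get(method, []):
            callees = call_graph.get(site)
            if not callees:
                callees = set(fallback(site))
                call_graph[site] = callees  # edge extension, done in place
            for callee in callees:
                if callee not in reachable:
                    reachable.add(callee)
                    worklist.append(callee)
    return reachable


# Hypothetical program: main has two call sites, only one of which was
# resolved by the precise (first) algorithm.
call_sites_of = {"Main.main": ["main:1", "main:2"]}
precise_graph: CallGraph = {"main:1": {"Main.foo"}}
subclasses = {"Handler": ["FastHandler", "SafeHandler"]}
declares = {"FastHandler": {"run"}, "SafeHandler": {"run"}}
site_info = {"main:2": ("Handler", "run")}

result = reachable_methods(
    "Main.main", call_sites_of, precise_graph,
    lambda site: cha_targets(*site_info[site], subclasses, declares))
print(sorted(result))
```

The fallback runs only for call sites that the precise graph missed, so its lower precision affects a small part of the graph while the missing edges are still recovered.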

In fig. 7, the leftmost diagram is the initial Call Graph constructed based on main() and foo(). In this Call Graph, main(), foo(), source() and sink() are nodes, and each connecting line between nodes represents an edge. As shown in fig. 7, main() has edge relationships with foo() and sink(), and foo() has an edge relationship with source().

The middle diagram is a control flow graph, also called an inter-procedural control flow graph (ICFG). In this example, x = new X(), x.f = source(), return x, b = foo(a), and sink(b.f) are nodes. b = foo(a) has edge relationships with x = new X(), return x, and sink(b.f); x = new X() has an edge relationship with x.f = source(); and x.f = source() has an edge relationship with return x.

The rightmost diagram is a data flow graph for the procedure calls of the application system. In one example of this specification, nodes in the data flow graph are fields, and edges are the data flow directions between fields, i.e., the data propagation directions. In one example, the fields may include a code field or a database field.
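To make the path x.f = source() -> sink(b.f) from the Fig. 7 example concrete, the following sketch tracks tainted access paths along one straight-line walk of the ICFG. It is a deliberately reduced illustration and not the algorithm used by FlowDroid or by the embodiment; the statement encoding (`"new"`, `"source"`, `"return_to"`, `"sink"`) is an assumption made for this example.

```python
from typing import List, Set, Tuple

Statement = Tuple[str, ...]


def _rooted_at(path: str, var: str) -> bool:
    """True if the access path is the variable itself or one of its fields."""
    return path == var or path.startswith(var + ".")


def find_leaks(statements: List[Statement]) -> List[str]:
    """Walk statements in execution order and report tainted sink arguments."""
    tainted: Set[str] = set()
    leaks: List[str] = []
    for op, *args in statements:
        if op == "new":
            # A fresh object: drop any taint previously attached to the variable.
            tainted = {p for p in tainted if not _rooted_at(p, args[0])}
        elif op == "source":
            # The assigned access path becomes a contamination start point.
            tainted.add(args[0])
        elif op == "return_to":
            # Return flow: the callee's returned variable maps onto the
            # caller's receiving variable, carrying tainted fields with it.
            returned, receiver = args
            tainted |= {receiver + p[len(returned):]
                        for p in tainted if _rooted_at(p, returned)}
        elif op == "sink":
            # Contamination end point: record it if its argument is tainted.
            if args[0] in tainted:
                leaks.append(args[0])
    return leaks


# The Fig. 7 statements in execution order: body of foo(), then the caller.
program: List[Statement] = [
    ("new", "x"),             # x = new X()
    ("source", "x.f"),        # x.f = source()
    ("return_to", "x", "b"),  # return x;  b = foo(a)
    ("sink", "b.f"),          # sink(b.f)
]
print(find_leaks(program))
```

The return flow is the step that connects the two procedures: without mapping x onto b, the taint on x.f would never reach sink(b.f).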

With the taint analysis method described above, a smaller-scale Call Graph is constructed only for the application layer code and the necessary part of the library file code in the program code, and taint analysis is performed on this smaller-scale Call Graph, which greatly reduces the workload of taint analysis and ensures its performance. An efficient taint analysis scheme with high accuracy and recall can therefore be provided for large-scale enterprise applications, especially where a large number of native methods, libraries and frameworks introduce implicit dependencies.

In addition, in the taint analysis method, the Call Graph for the application layer code is constructed using the high-precision first call relation construction algorithm, which improves the overall accuracy of the constructed Call Graph. Furthermore, the second call relation construction algorithm, which has relatively lower precision but better performance, is used to extend edge relationships in the Call Graph and the control flow graph, so that missed edges can be supplemented efficiently, thereby ensuring the recall rate.

Fig. 8 illustrates an example schematic of data propagation path information in accordance with an embodiment of the present description. The data propagation path information numbered 001 is obtained by performing taint analysis on application system A, the data propagation path information numbered 002 is obtained by performing taint analysis on application system B, and the data propagation path information numbered 003 is obtained by performing taint analysis on application system C.

Further optionally, in one example, the data propagation tracking system 200 may also include a data storage device 240. The data storage device 240 stores the data propagation path information of the application system in a database. Further optionally, in one example, the stored data propagation path information may be constructed as a data flow graph.

For example, after the data propagation path information of each single application system is obtained as described above, it is stored in a relational database. It is then synchronized to an offline data warehouse and finally to a graph database, thereby obtaining a data flow graph. A data flow graph is graph data made up of data propagation path information. When there are a plurality of application systems, the data flow graph includes a cross-application data flow graph constructed by linking the data propagation path information of the plurality of application systems. The resulting cross-application data flow graph can be applied to scenarios such as data leakage detection, change management, and data consistency checking.

FIG. 9 illustrates an example schematic diagram of a dataflow graph across application systems in accordance with an embodiment of the present specification. After analyzing the obtained data propagation path information for the application systems A, B and C in fig. 8, the obtained data propagation path information may be linked, thereby obtaining a data flow graph across the application systems. The cross-application data flow graph can reveal the data flow relationship of data in each application system.

The data flow graph example shown in FIG. 9 includes four nodes N₁ to N₄, where node N₁ represents "a.service1.method1.requestclass.nodal", node N₂ represents "b.service2.method2.requestclass.principal", node N₃ represents "c.service3.method3.requestclass.id", and node N₄ represents "c.db.table1.id".
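Linking per-application path information into a cross-application data flow graph can be sketched as below. The sketch assumes that paths are linked wherever the end field of one path has the same identifier as the start field of another; the specification does not spell out the linking rule, and the function name `link_paths` and all field identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# A path is a (start field, end field) pair of field identifiers.
Path = Tuple[str, str]


def link_paths(paths_by_app: Dict[str, List[Path]]) -> Dict[str, Set[str]]:
    """Merge every application's paths into one adjacency map.

    Fields with identical identifiers collapse into a single node, which is
    what joins the paths of different applications together."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for paths in paths_by_app.values():
        for start, end in paths:
            graph[start].add(end)
            graph.setdefault(end, set())  # keep terminal fields as nodes
    return dict(graph)


# Hypothetical per-application results, one path each.
paths_by_app = {
    "appA": [("appA.svc1.m1.Req.userId", "appB.svc2.m2.Req.userId")],
    "appB": [("appB.svc2.m2.Req.userId", "appC.svc3.m3.Req.id")],
    "appC": [("appC.svc3.m3.Req.id", "appC.db.table1.id")],
}
graph = link_paths(paths_by_app)
for node, successors in graph.items():
    print(node, "->", sorted(successors))
```

In a deployment that uses a graph database, the same merge would presumably be expressed as node and edge upserts keyed by the field identifier.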

Further optionally, in one example, the data propagation tracking system 200 may further include a path information querying device 250. The path information query device 250 is configured to perform a data propagation path information query in the database in response to the data propagation path information query request, and provide a data propagation path information query result, for example, to a user who issued the query request.

Optionally, in an example, the path information query device 250 may include a path information query interface and a visualization presenting unit. The path information query interface is used by a user to input a path information query request. For example, the path information query interface may be implemented as an API interface. The visualization presenting unit is configured to present the queried data propagation path information to the user in a visualization manner.
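One kind of query such an interface might serve is "which fields can the data in a given field reach". The sketch below answers it with a breadth-first search over an in-memory adjacency map. The function name `query_downstream`, the graph contents, and the choice of an in-memory structure rather than a database query are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def query_downstream(graph: Dict[str, Set[str]], field: str) -> List[str]:
    """Return every field reachable from `field`, in breadth-first order."""
    seen = {field}
    order: List[str] = []
    queue = deque([field])
    while queue:
        node = queue.popleft()
        # Sort successors so the result is deterministic.
        for successor in sorted(graph.get(node, ())):
            if successor not in seen:
                seen.add(successor)
                order.append(successor)
                queue.append(successor)
    return order


# Hypothetical data flow graph with a cycle to show that traversal terminates.
flow_graph = {
    "appA.Req.userId": {"appB.Req.userId"},
    "appB.Req.userId": {"appC.Req.id", "appA.Req.userId"},
    "appC.Req.id": {"appC.db.table1.id"},
    "appC.db.table1.id": set(),
}
print(query_downstream(flow_graph, "appA.Req.userId"))
```

The `seen` set guards against cycles, which can occur when data flows back into an upstream application.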

Further optionally, in one example, the data propagation tracking system 200 may further include a distributed scheduling apparatus 260. The distributed scheduling apparatus 260 is configured to perform distributed scheduling on the taint analysis task of the application system.

Optionally, in an example, the distributed scheduling apparatus 260 may adopt a two-layer distributed scheduling policy, in which the first layer is the application and the second layer is the slice. For example, suppose the number of application systems to be analyzed is 1000. The distributed scheduling apparatus 260 first takes N applications and performs the first-layer distribution, distributing the applications to N servers for taint analysis so that each server is assigned one application. Each server then runs the code modeling process, and a slicing process is performed at the final stage of code modeling.

The purpose of slicing is to divide the code into multiple smaller parts, thereby reducing complexity and avoiding hot spots as far as possible. Static taint analysis consumes a large amount of memory, and when a hot spot occurs, that is, when the path tracing of some fields is very complicated, memory is easily exhausted (memory overflow). Each application system is therefore cut into a number of slices that depends on its code complexity. The distributed scheduling apparatus 260 then uses the slices to perform the second-layer distribution, dispatching them to the N servers so that each server is assigned one slice as far as possible. The taint analysis device 230 is then started to perform field-based static taint analysis and static code tracking. When a server completes its analysis, the distributed scheduling apparatus 260 schedules the next slice for analysis, and continues in this manner until all analyses are completed.

In one example, two queues, an application queue and a slice queue, may be generated. The applications and slices to be analyzed are loaded into the application queue and the slice queue, respectively, in FIFO order, and the distributed scheduling apparatus 260 takes applications and slices from these queues to perform distributed scheduling. Furthermore, the distributed scheduling of the distributed scheduling apparatus 260 may employ a pipeline mechanism.
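The two-layer policy with FIFO queues can be sketched as a single-process simulation. Round-robin server selection, the numeric "complexity" score, the `max_slice_cost` threshold, and all names are assumptions made for illustration; the specification does not prescribe how slices are counted or how servers are chosen.

```python
from collections import deque
from typing import Dict, List, Tuple

Assignment = Tuple[str, str, str]  # (layer, server, task)


def slice_application(app: str, complexity: int, max_slice_cost: int) -> List[str]:
    """Cut an application into more slices the more complex it is."""
    count = max(1, -(-complexity // max_slice_cost))  # ceiling division
    return [f"{app}#slice{i}" for i in range(count)]


def schedule(apps: Dict[str, int], servers: List[str],
             max_slice_cost: int = 10) -> List[Assignment]:
    app_queue = deque(apps.items())   # first-layer FIFO queue
    slice_queue: deque = deque()      # second-layer FIFO queue
    assignments: List[Assignment] = []

    # Layer 1: one application per server; code modeling ends with slicing.
    turn = 0
    while app_queue:
        app, complexity = app_queue.popleft()
        assignments.append(("app", servers[turn % len(servers)], app))
        slice_queue.extend(slice_application(app, complexity, max_slice_cost))
        turn += 1

    # Layer 2: distribute slices for taint analysis, again in FIFO order.
    turn = 0
    while slice_queue:
        task = slice_queue.popleft()
        assignments.append(("slice", servers[turn % len(servers)], task))
        turn += 1
    return assignments


# Hypothetical workload: application "A" is five times as complex as "B".
plan = schedule({"A": 25, "B": 5}, ["s1", "s2"])
for layer, server, task in plan:
    print(layer, server, task)
```

Because the slice count grows with complexity, a hot application is spread across servers instead of exhausting the memory of a single one.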

Further optionally, in one example, the distributed scheduling of the distributed scheduling apparatus 260 employs a balancing policy, i.e., the distributed scheduling apparatus 260 ensures that the applications and slices are distributed across the servers as evenly as possible. However, two applications or two slices are sometimes dispatched to the same server. Because the analysis consumes memory and CPU, a server cannot be allowed to start two tasks at the same time, so a single-lock mechanism is adopted in the distributed scheduling process: if a new analysis task is distributed to a server whose current analysis has not finished, the server skips the new task, and the skipped application or slice is placed in a cache queue. As a result, many tasks may be skipped without taint analysis being performed on them. In view of this, a backoff thread may also be added to the distributed scheduling process. Specifically, if a server has been idle for a period of time (i.e., the distributed scheduling apparatus has not distributed analysis tasks to the idle server), the server actively calls the backoff thread to pull the skipped tasks out of the cache queue one by one, in order, and analyze them until the cache queue is empty.
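The single-lock mechanism, the cache queue, and the backoff step can be sketched as follows. To keep the example deterministic it uses a plain flag and direct function calls instead of real locks and threads; the class `Server`, the functions `dispatch` and `backoff_drain`, and the task names are assumptions, not the patent's implementation.

```python
from collections import deque
from typing import List, Optional


class Server:
    """A server that runs at most one analysis task at a time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.current: Optional[str] = None
        self.done: List[str] = []

    @property
    def busy(self) -> bool:
        return self.current is not None

    def try_start(self, task: str) -> bool:
        # Single-lock behaviour: refuse a second task while one is running.
        if self.busy:
            return False
        self.current = task
        return True

    def finish(self) -> None:
        if self.current is not None:
            self.done.append(self.current)
            self.current = None


def dispatch(server: Server, task: str, cache_queue: deque) -> bool:
    """Try to start the task; if the server is busy, park it in the cache queue."""
    if server.try_start(task):
        return True
    cache_queue.append(task)
    return False


def backoff_drain(server: Server, cache_queue: deque) -> None:
    """Called by an idle server: run skipped tasks in order until none remain."""
    while cache_queue and not server.busy:
        server.try_start(cache_queue.popleft())
        server.finish()  # simulated: the analysis completes immediately


server = Server("s1")
cache_queue: deque = deque()
first_started = dispatch(server, "slice-1", cache_queue)   # starts
second_started = dispatch(server, "slice-2", cache_queue)  # skipped, cached
server.finish()                     # slice-1 completes, server becomes idle
backoff_drain(server, cache_queue)  # picks up slice-2
print(server.done)
```

In a real deployment the flag would be a lock shared by the dispatcher and the worker, and the drain would run on a timer-driven thread once the server has been idle for the configured period.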

A data propagation tracking system according to embodiments of the present specification is described above with reference to fig. 1 to 9. By utilizing the data propagation tracking system, the taint analysis aiming at the inter-process call of the application system can be realized, the data propagation path information of the accessed data is obtained, and the data flow transition tracking aiming at the accessed data is realized.

FIG. 10 illustrates an example flow diagram of a method 1000 for implementing data propagation tracking for an application system (hereinafter "data propagation tracking method"), according to embodiments of the present description.

As shown in fig. 10, at 1010, a code compilation is performed on a program source code of an application system to obtain a code compilation result.

At 1020, code modeling is performed using the code compilation result to construct element information required for taint analysis, the element information including a contamination start point, a contamination end point, and a program entry point.

At 1030, taint analysis is performed on the code compilation result using the constructed element information to obtain data propagation path information of the application system, where the data propagation path information indicates the data flow direction relationship between the contamination start point and the contamination end point. Optionally, in one example, the data propagation path information is a data flow direction relationship between a pair of fields, and the fields include a code field or a database field.

At 1040, the data propagation path information for the application system is stored in a database. In one example, the stored data propagation path information may be constructed as a dataflow graph. Where the application system includes a plurality of application systems, the constructed dataflow graph may include a dataflow graph across application systems constructed by linking data propagation path information of the plurality of application systems.

At 1050, in response to the data propagation path information query request, a data propagation path information query is performed in the database and a data propagation path information query result is provided.

Further, optionally, before performing taint analysis on the code compilation result, the data propagation tracking method may further include: performing distributed scheduling on the taint analysis tasks of the application system.

Further, optionally, before constructing the element information required for taint analysis from the code compilation result, the data propagation tracking method may further include: performing package supplementing processing on the code compilation result.

Further, it is noted that what is shown in fig. 10 is merely an exemplary embodiment, and in other embodiments of this description, one or both of the operations of 1040 and 1050 may not be included.

The data propagation tracking method and the data propagation tracking system according to embodiments of the present specification have been described above with reference to fig. 1 to 10. The above data propagation tracking system may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 11 illustrates a schematic diagram of an electronic device 1100 for implementing data propagation tracking for an application system in accordance with embodiments of the present description. As shown in fig. 11, the electronic device 1100 may include at least one processor 1110, a storage (e.g., non-volatile storage) 1120, a memory 1130, and a communication interface 1140, and the at least one processor 1110, the storage 1120, the memory 1130, and the communication interface 1140 are connected together via a bus 1160. The at least one processor 1110 executes at least one computer-readable instruction (i.e., the elements described above as being implemented in software) stored or encoded in the memory.

In one embodiment, computer-executable instructions are stored in the memory that, when executed, cause the at least one processor 1110 to: compile program source code of an application system to obtain a code compilation result; perform code modeling using the code compilation result to construct element information required for taint analysis, the element information including a contamination start point, a contamination end point, and a program entry point; and perform taint analysis on the code compilation result using the constructed element information to obtain data propagation path information of the application system, where the data propagation path information indicates the data flow direction relationship between the contamination start point and the contamination end point.

It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1110 to perform the various operations and functions described above in connection with fig. 1-10 in the various embodiments of the present description.

According to one embodiment, a program product, such as a machine-readable medium (e.g., a non-transitory machine-readable medium), is provided. A machine-readable medium may have instructions (i.e., elements described above as being implemented in software) that, when executed by a machine, cause the machine to perform various operations and functions described above in connection with fig. 1-10 in the various embodiments of the present specification. Specifically, a system or apparatus may be provided which is provided with a readable storage medium on which software program code implementing the functions of any of the above embodiments is stored, and causes a computer or processor of the system or apparatus to read out and execute instructions stored in the readable storage medium.

In this case, the program code itself read from the readable medium can realize the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.

Examples of the readable storage medium include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or from the cloud via a communications network.

It will be understood by those skilled in the art that various changes and modifications may be made in the above-disclosed embodiments without departing from the spirit of the invention. Accordingly, the scope of the invention should be determined from the following claims.

It should be noted that not all steps and units in the above flows and system structure diagrams are necessary, and some steps or units may be omitted according to actual needs. The execution order of the steps is not fixed, and can be determined as required. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by a plurality of physical entities, or some units may be implemented by some components in a plurality of independent devices.

In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may comprise permanently dedicated circuitry or logic (such as a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware units or processors may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The specific implementation (mechanical, or dedicated permanent, or temporarily set) may be determined based on cost and time considerations.

The detailed description set forth above in connection with the appended drawings describes exemplary embodiments but does not represent all embodiments that may be practiced or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous" over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A system for data propagation tracking of an application system, comprising:

a code compiling device configured to compile program source code of the application system to obtain a code compilation result;

a code modeling means configured to perform code modeling using the code compilation result to construct element information required for taint analysis, the element information including a contamination start point, a contamination end point, and a program entry point; and

a taint analysis device configured to perform taint analysis on the code compilation result using the constructed element information to obtain data propagation path information of the application system, the data propagation path information indicating a data flow direction relationship between the contamination start point and the contamination end point.

2. The system of claim 1, wherein the data propagation path information is a data flow relationship between pairs of fields, and the fields comprise code fields or database fields.

3. The system of claim 1 or 2, further comprising:

a data storage device configured to store the data propagation path information of the application system in a database.

4. The system of claim 3, wherein the stored data propagation path information is constructed as a dataflow graph.

5. The system of claim 4, wherein the application system includes a plurality of application systems and the dataflow graph includes a dataflow graph across application systems that is constructed by linking data propagation path information of the plurality of application systems.

6. The system of claim 3, further comprising:

a path information query means configured to, in response to a data propagation path information query request, query the data propagation path information in the database and provide a data propagation path information query result.

7. The system of claim 6, wherein the path information query means comprises:

a path information query interface used by a user to input a path information query request; and

a visualization presenting unit configured to present the queried data propagation path information to the user in a visual manner.

8. The system of claim 1, further comprising:

a distributed scheduling device configured to perform distributed scheduling on taint analysis tasks of the application system.

9. The system of claim 1, wherein the code compiling device further performs package supplementing processing on the code compilation result.

10. The system of claim 1, wherein the code modeling means comprises:

a model topology creating unit configured to scan the configuration files of the code compilation result to obtain an SQL configuration file and class files, and to organize the class files according to a topological structure to obtain an SOA model topology;

an SQL conversion unit configured to convert SQL-like statements in the SQL configuration file into parsable SQL statements;

an SQL parsing unit configured to parse the parsable SQL statements in the converted SQL configuration file into tables and fields;

an element construction unit configured to perform code analysis on the data access layer and the application framework and to construct element information of the code layer; and

an association mapping unit configured to perform association mapping between the element information constructed based on the data access layer and the fields in the parsed SQL statements in the SQL configuration file.

11. The system of claim 10, wherein when the application system is a Java-based application system, the code modeling means further comprises:

a bytecode modification unit configured to perform bytecode modification on the code compilation result.

12. The system of claim 1, wherein the taint analysis device comprises:

a control flow graph generating unit configured to generate a control flow graph from a call relation graph, the call relation graph being constructed from application layer code in the program code of the application system by using a first call relation construction algorithm;

a taint analysis unit configured to use the control flow graph to traverse the program code of the application system for taint analysis;

an edge relation extension unit configured to, when the taint analysis result indicates that a call statement has no edge relationship in the call relation graph, use a second call relation construction algorithm to extend an edge relationship for the call statement in the call relation graph and the control flow graph; and

a data propagation path information determining unit configured to determine the data propagation path information of the application system from the extended control flow graph.

13. A method for data propagation tracking for an application system, comprising:

performing code compiling on a program source code of an application system to obtain a code compiling result;

performing code modeling using the code compilation result to construct element information required for taint analysis, the element information including a contamination start point, a contamination end point, and a program entry point; and

performing taint analysis on the code compilation result using the constructed element information to obtain data propagation path information of the application system, the data propagation path information indicating a data flow direction relationship between the contamination start point and the contamination end point.

14. The method of claim 13, wherein the data propagation path information is a data flow relationship between pairs of fields, and the fields comprise code fields or database fields.

15. The method of claim 13 or 14, further comprising:

storing the data propagation path information of the application system in a database.

16. The method of claim 15, wherein the data propagation path information is constructed as a dataflow graph.

17. The method of claim 16, wherein the application system comprises a plurality of application systems and the dataflow graph comprises a dataflow graph across application systems that is constructed by linking data propagation path information of the plurality of application systems.

18. The method of claim 15, further comprising:

in response to a data propagation path information query request, performing a data propagation path information query in the database and providing a data propagation path information query result.

19. The method of claim 13, further comprising, prior to performing taint analysis on the code compilation result:

performing distributed scheduling on taint analysis tasks of the application system.

20. The method of claim 13, further comprising:

performing package supplementing processing on the code compilation result before constructing the element information required for taint analysis from the code compilation result.

21. The method of claim 13, wherein constructing the element information required for taint analysis from the code compilation result comprises:

scanning the configuration files of the code compilation result to obtain an SQL configuration file and class files, and organizing the class files according to a topological structure to obtain an SOA model topology;

converting SQL-like statements in the SQL configuration file into parsable SQL statements;

parsing the parsable SQL statements in the converted SQL configuration file into tables and fields;

performing code analysis on the data access layer and the application framework to construct element information of the code layer; and

performing association mapping between the element information constructed based on the data access layer and the fields in the parsed SQL statements in the SQL configuration file.

22. The method of claim 13, wherein performing taint analysis on the code compilation result using the constructed element information comprises:

generating a control flow graph from a call relation graph, the call relation graph being constructed from application layer code in the program code of the application system using a first call relation construction algorithm;

traversing the program code of the application system for taint analysis using the control flow graph;

when the taint analysis result indicates that a call statement has no edge relationship in the call relation graph, using a second call relation construction algorithm to extend an edge relationship for the call statement in the call relation graph and the control flow graph; and

determining the data propagation path information of the application system from the extended control flow graph.

23. An electronic device, comprising:

at least one processor, and

a memory coupled with the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 13-22.

24. A machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform the method of any of claims 13 to 22.