CN112800425A

CN112800425A - Code analysis method and device based on graph calculation

Info

Publication number: CN112800425A
Application number: CN202110145882.0A
Authority: CN
Inventors: 左志强; 张奕裕; 王林章; 李宣东
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-02-03
Filing date: 2021-02-03
Publication date: 2021-05-14
Anticipated expiration: 2041-02-03
Also published as: CN112800425B

Abstract

The invention discloses a code analysis method and device based on graph computation. The method first converts the program code into a global control flow graph that contains no function call nodes, and converts the code statement of each node into data flow direction information. It then partitions the global control flow graph, each partition being a sub-control flow graph, and initializes a node set to be analyzed for each partition. The partitions are analyzed in a bulk synchronous parallel (BSP) manner with the partition as the unit: during partition analysis, a user-implemented analysis interface is called for each node, and a node is deleted from the node set to be analyzed once the output of that interface for the node is stable; the analysis ends when the node set to be analyzed of every partition is empty. During the synchronous parallel computation, disk data is scheduled with the partition as the unit. The invention enables flow-sensitive and context-sensitive analysis of large-scale system software code on a single machine.

Description

Code analysis method and device based on graph calculation

Technical Field

The invention relates to code analysis technology, in particular to flow-sensitive and context-sensitive analysis.

Background

Static program analysis plays an important role in a wide range of applications, including automatic vulnerability discovery, compiler optimization, and security vulnerability detection. Existing static analysis techniques make different trade-offs between analysis precision and scalability, and their algorithms are mostly implemented for one specific usage scenario. Generally speaking, algorithms that differentiate results based on program attributes (e.g., calling context or control flow) are more useful than algorithms that do not. For example, such precise algorithms can discover more real bugs and reduce the number of false alarms. Context-sensitive, field-sensitive, flow-sensitive and path-sensitive analysis techniques have therefore emerged in the field of program analysis. Although these techniques are more precise than their insensitive counterparts, they are much more computationally expensive and may require more memory than a computer can provide.

Because these techniques must run with limited resources, they are difficult to scale to the analysis of large-scale system code. To alleviate this problem, existing work uses complex processing methods, such as adjusting the level of context sensitivity, exploring different forms of sensitivity, or pre-processing with inexpensive techniques, to find a balance between scalability, generality, and usefulness. However, each of these improved methods is effective only for its specific application and is not universally applicable. Their implementations are also so complicated that they require careful design by a domain expert, which makes the techniques almost impossible for an ordinary industrial developer to use.

Disclosure of Invention

The problem to be solved by the invention is as follows: sophisticated static analysis techniques are often very complex to implement, because much of the optimization and scalability logic is implemented on top of the underlying analysis functionality. This tight coupling between the basic analysis function and the special handling for scalability makes it difficult to ensure the correctness of a static analysis, to understand and use its code, and to reuse it in other clients.

In order to solve the problems, the invention adopts the following scheme:

A method of code analysis based on graph computation according to the present invention comprises the following steps:

S1: acquiring the program code to be analyzed;

S2: converting the program code to be analyzed into a global control flow graph; during the conversion, function call nodes in the global control flow graph are expanded from top to bottom according to the control flow graphs of the called functions, so that the global control flow graph contains no function call nodes; at the same time, the data flow direction information contained in the code statement corresponding to each node of the global control flow graph is extracted, so that each node of the global control flow graph carries its corresponding data flow direction information;

S3: dividing the global control flow graph into a plurality of sub-control flow graphs as partitions, and initializing the node set to be analyzed of each partition;

S4: based on a user-implemented analysis interface, iteratively analyzing the partitions in a bulk synchronous parallel (BSP) manner with the partition as the synchronization unit, until the node set to be analyzed of every partition is emptied;

The analysis of a partition comprises the following steps:

S421: based on the user-implemented analysis interface, iteratively analyzing the nodes in the node set to be analyzed of a target partition together with the data flow direction information corresponding to those nodes; once the output data flow information corresponding to a node no longer changes, the node is deleted from the node set to be analyzed of the target partition; this continues until no node remains in the node set to be analyzed, thereby obtaining the output data flow information corresponding to each node in the target partition; the output data flow information is data flow direction information obtained by calling the user-implemented analysis interface with the data flow direction information of the node as a parameter; the target partition is the partition currently being analyzed;

S422: according to the edge relationships between the nodes of the global control flow graph, merging the output nodes of the target partition and the output data flow information of those output nodes into the message queues of the subsequent partitions corresponding to the output nodes; an output node is a node of the target partition that has a subsequent node belonging to another partition;

S423: performing deduplication on the data flow direction information corresponding to each node of the target partition.
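Steps S421 and S422 can be read as a worklist fixpoint iteration over one partition. The following is a minimal Python sketch under stated assumptions, not the patented implementation: the names `analyze_partition`, `transfer`, `succs` and `in_facts` are hypothetical, data flow direction information is modeled as a frozen set of variable-to-variable edges, and `transfer` stands in for the user-implemented analysis interface.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Fact = Tuple[str, str]      # one data-flow edge, e.g. ("b", "a") means b flows to a
Facts = FrozenSet[Fact]


def analyze_partition(
    nodes: Set[int],
    succs: Dict[int, List[int]],
    in_facts: Dict[int, Facts],
    transfer: Callable[[int, Facts], Facts],
) -> Tuple[Dict[int, Facts], Dict[int, Facts]]:
    """Analyze one partition until its node set to be analyzed is empty."""
    worklist = set(nodes)                  # node set to be analyzed
    out_facts: Dict[int, Facts] = {}       # output data flow information per node
    messages: Dict[int, Facts] = {}        # facts addressed to nodes of other partitions
    while worklist:
        n = worklist.pop()                 # the node leaves the set to be analyzed
        new_out = transfer(n, in_facts.get(n, frozenset()))
        if new_out == out_facts.get(n):
            continue                       # output is stable, nothing to propagate
        out_facts[n] = new_out
        for s in succs.get(n, []):
            if s in nodes:
                # Successor inside the partition: merge and re-analyze if changed.
                merged = in_facts.get(s, frozenset()) | new_out
                if merged != in_facts.get(s):
                    in_facts[s] = merged
                    worklist.add(s)
            else:
                # Successor in another partition: n is an output node (S422).
                messages[s] = messages.get(s, frozenset()) | new_out
    return out_facts, messages
```

The loop terminates when `transfer` is monotone over a finite set of facts, which is an assumption of this sketch rather than something the text states.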

Further, according to the method of code analysis based on graph computation of the present invention, step S421 and step S422 as a whole are processed in a bulk synchronous parallel manner.

Further, according to the method for code analysis based on graph computation of the present invention, in step S3, when the global control flow graph is divided into a plurality of partitions, it is divided into K partitions according to a node-count balancing principle, where K is a preset number of partitions.

Further, according to the method for code analysis based on graph computation of the present invention, step S3 further includes: for each edge that crosses partitions, creating and marking a mirror input node and a mirror output node for the corresponding partitions on the two sides of the edge; in step S422, the mirror output nodes are used as the output nodes.

Further, according to the method of the present invention for code analysis based on graph calculation, step S423 includes the steps of:

S4231: constructing item sets from the data flow direction information of each node of the target partition;

S4232: finding the item sets that appear at least T times in the target partition as high-frequency item sets;

S4233: calculating a priority value for each high-frequency item set; the priority value is the product of the number of items in the item set and its frequency;

S4234: checking whether the data flow direction information of each node of the target partition contains a high-frequency item set; if the data flow direction information of a node contains only one high-frequency item set, replacing that part of the node's data flow direction information with a reference to the high-frequency item set; if the data flow direction information of a node contains a plurality of high-frequency item sets, taking them as candidate high-frequency item sets and using a greedy algorithm, in descending order of the priority values of the candidates, to replace the corresponding parts of the node's data flow direction information with references to the corresponding high-frequency item sets;

wherein T is preset.
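The deduplication of steps S4231 to S4234 can be illustrated with a small Python sketch. It rests on assumptions the text does not fix: the candidate item sets are simply the complete fact sets of individual nodes, and an item set "appears" in a node when it is a subset of that node's facts. The names `deduplicate`, `table` and `compressed` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

ItemSet = FrozenSet[str]


def deduplicate(
    facts: Dict[int, ItemSet], T: int
) -> Tuple[List[ItemSet], Dict[int, Tuple[List[int], ItemSet]]]:
    """Replace frequent item sets in each node's facts with shared references."""
    # S4231: candidate item sets (assumption: the whole fact set of each node).
    candidates = {s for s in facts.values() if s}
    # S4232: frequency = number of nodes whose facts contain the candidate.
    freq = {c: sum(1 for s in facts.values() if c <= s) for c in candidates}
    frequent = [c for c in candidates if freq[c] >= T]
    # S4233: priority value = number of items * frequency, highest first.
    # The sorted items are only a tie-breaker that keeps the order deterministic.
    frequent.sort(key=lambda c: (-len(c) * freq[c], sorted(c)))
    table = list(frequent)                 # shared storage, referenced by index
    # S4234: greedy replacement in descending priority order.
    compressed: Dict[int, Tuple[List[int], ItemSet]] = {}
    for node, s in facts.items():
        remaining = set(s)
        refs: List[int] = []
        for idx, c in enumerate(table):
            if c <= remaining:
                refs.append(idx)           # store a reference instead of the items
                remaining -= c
        compressed[node] = (refs, frozenset(remaining))
    return table, compressed
```

Each node then keeps only a list of references into `table` plus the items that no high-frequency item set covers, so facts shared by many nodes are stored once.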

Further, according to the method of code analysis based on graph computation of the present invention, step S4231 is replaced with: constructing item sets from the data flow direction information of each node of the global control flow graph; step S4232 is replaced with: finding the item sets that appear at least T times in the global control flow graph as high-frequency item sets.

Further, according to the method for code analysis based on graph computation of the present invention, the step S4 further includes the steps of:

S41: loading a synchronous partition set; the synchronous partition set is the set of partitions required by the bulk synchronous parallel computation;

S42: based on the user-implemented analysis interface, analyzing each partition in the synchronous partition set in a bulk synchronous parallel manner with the partition as the synchronization unit;

S43: merging the nodes in the message queues of the partitions, and the output data flow information corresponding to those nodes, into the partitions corresponding to the nodes;

S44: repeatedly executing steps S41 to S44 until the node sets to be analyzed of all partitions are emptied.
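The loop formed by steps S41 to S44 can be sketched as a sequence of supersteps separated by a barrier. The Python sketch below is illustrative only: `Partition`, `run_bsp`, `owner` and `analyze` are hypothetical names, a lower priority value is treated as a higher priority level, and `analyze` is assumed to empty the partition's node set to be analyzed and return the facts addressed to nodes in other partitions.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Set

Facts = FrozenSet[str]


@dataclass
class Partition:
    pid: int
    priority: int
    nodes: Set[int]
    worklist: Set[int] = field(default_factory=set)          # node set to be analyzed
    in_facts: Dict[int, Facts] = field(default_factory=dict)
    inbox: Dict[int, Facts] = field(default_factory=dict)    # message queue


def run_bsp(
    partitions: Dict[int, Partition],
    owner: Dict[int, int],                       # node id -> id of owning partition
    analyze: Callable[[Partition], Dict[int, Facts]],
    top_n: int,
) -> int:
    """Run supersteps until every worklist is empty; return the superstep count."""
    supersteps = 0
    while True:
        # S41: pick the top_n highest-priority partitions that still have work.
        pending = [p for p in partitions.values() if p.worklist]
        active = sorted(pending, key=lambda p: p.priority)[:top_n]
        if not active:
            return supersteps
        # S42: analyze the selected partitions in parallel. Collecting all
        # results from the pool acts as the synchronization barrier.
        with ThreadPoolExecutor(max_workers=top_n) as pool:
            outboxes = list(pool.map(analyze, active))
        for outbox in outboxes:
            for node, facts in outbox.items():
                queue = partitions[owner[node]].inbox
                queue[node] = queue.get(node, frozenset()) | facts
        # S43: merge each message queue into its partition; changed nodes are
        # put back into the node set to be analyzed.
        for p in partitions.values():
            for node, facts in p.inbox.items():
                merged = p.in_facts.get(node, frozenset()) | facts
                if merged != p.in_facts.get(node):
                    p.in_facts[node] = merged
                    p.worklist.add(node)
            p.inbox.clear()
        supersteps += 1                          # S44: repeat
```

Each call to `analyze` touches only its own partition, so the threads in one superstep share no mutable state; cross-partition data moves only through the message queues after the barrier.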

Further, according to the method for code analysis based on graph computation of the present invention, in step S3, when the global control flow graph is divided into a plurality of partitions, a priority is set for each partition according to the data flow direction relationships and/or the edge relationships between nodes; in step S41, when the synchronous partition set is loaded, the topN partitions that have the highest priority and whose node sets to be analyzed are not yet empty are selected to form the synchronous partition set; wherein topN is preset.

Further, according to the method for code analysis based on graph computation of the present invention, step S43 is performed in a bulk synchronous parallel manner.

Further, according to the method for code analysis based on graph computation of the present invention, in step S2, the node information of the global control flow graph is stored on disk; the node information comprises at least a node identification code and data flow direction information; in step S3, when the global control flow graph is divided into a plurality of partitions, the node identification codes are used as the nodes of the global control flow graph for the division, so that the nodes stored in the partitions are node identification codes; step S41 further includes: for each node of each partition in the synchronous partition set, loading the data flow direction information of the corresponding node from the disk according to its node identification code, and loading the data flow direction information of the corresponding node from the disk for each node that does not belong to the partitions in the synchronous partition set.

A device for code analysis based on graph computation according to the present invention comprises the following modules:

M1, used for: acquiring the program code to be analyzed;

M2, used for: converting the program code to be analyzed into a global control flow graph; during the conversion, function call nodes in the global control flow graph are expanded from top to bottom according to the control flow graphs of the called functions, so that the global control flow graph contains no function call nodes; at the same time, the data flow direction information contained in the code statement corresponding to each node of the global control flow graph is extracted, so that each node of the global control flow graph carries its corresponding data flow direction information;

M3, used for: dividing the global control flow graph into a plurality of sub-control flow graphs as partitions, and initializing the node set to be analyzed of each partition;

M4, used for: based on a user-implemented analysis interface, iteratively analyzing the partitions in a bulk synchronous parallel (BSP) manner with the partition as the synchronization unit, until the node set to be analyzed of every partition is emptied;

The analysis of a partition comprises the following modules:

M421, used for: based on the user-implemented analysis interface, iteratively analyzing the nodes in the node set to be analyzed of a target partition together with the data flow direction information corresponding to those nodes; once the output data flow information corresponding to a node no longer changes, the node is deleted from the node set to be analyzed of the target partition; this continues until no node remains in the node set to be analyzed, thereby obtaining the output data flow information corresponding to each node in the target partition; the output data flow information is data flow direction information obtained by calling the user-implemented analysis interface with the data flow direction information of the node as a parameter; the target partition is the partition currently being analyzed;

M422, used for: according to the edge relationships between the nodes of the global control flow graph, merging the output nodes of the target partition and the output data flow information of those output nodes into the message queues of the subsequent partitions corresponding to the output nodes; an output node is a node of the target partition that has a subsequent node belonging to another partition;

M423, used for: performing deduplication on the data flow direction information corresponding to each node of the target partition.

Further, according to the apparatus for code analysis based on graph computation of the present invention, the module M421 and the module M422 as a whole are processed in a bulk synchronous parallel manner.

Further, according to the apparatus for code analysis based on graph computation of the present invention, in the module M3, when the global control flow graph is divided into a plurality of partitions, it is divided into K partitions according to a node-count balancing principle, where K is a preset number of partitions.

Further, according to the apparatus for code analysis based on graph computation of the present invention, the module M3 is further used for: for each edge that crosses partitions, creating and marking a mirror input node and a mirror output node for the corresponding partitions on the two sides of the edge; in the module M422, the mirror output nodes are used as the output nodes.

Further, according to the apparatus for code analysis based on graph computation of the present invention, the module M423 includes the following modules:

M4231, used for: constructing item sets from the data flow direction information of each node of the target partition;

M4232, used for: finding the item sets that appear at least T times in the target partition as high-frequency item sets;

M4233, used for: calculating a priority value for each high-frequency item set; the priority value is the product of the number of items in the item set and its frequency;

M4234, used for: checking whether the data flow direction information of each node of the target partition contains a high-frequency item set; if the data flow direction information of a node contains only one high-frequency item set, replacing that part of the node's data flow direction information with a reference to the high-frequency item set; if the data flow direction information of a node contains a plurality of high-frequency item sets, taking them as candidate high-frequency item sets and using a greedy algorithm, in descending order of the priority values of the candidates, to replace the corresponding parts of the node's data flow direction information with references to the corresponding high-frequency item sets;

wherein T is preset.

Further, according to the apparatus for code analysis based on graph computation of the present invention, the module M4231 is replaced with: constructing item sets from the data flow direction information of each node of the global control flow graph; the module M4232 is replaced with: finding the item sets that appear at least T times in the global control flow graph as high-frequency item sets.

Further, according to the apparatus for code analysis based on graph computation of the present invention, the module M4 further includes the following modules:

M41, used for: loading a synchronous partition set; the synchronous partition set is the set of partitions required by the bulk synchronous parallel computation;

M42, used for: based on the user-implemented analysis interface, analyzing each partition in the synchronous partition set in a bulk synchronous parallel manner with the partition as the synchronization unit;

M43, used for: merging the nodes in the message queues of the partitions, and the output data flow information corresponding to those nodes, into the partitions corresponding to the nodes;

M44, used for: repeatedly executing the modules M41 to M44 until the node sets to be analyzed of all partitions are emptied.

Further, according to the apparatus for code analysis based on graph computation of the present invention, in the module M3, when the global control flow graph is divided into a plurality of partitions, a priority is set for each partition according to the data flow direction relationships and/or the edge relationships between nodes; in the module M41, when the synchronous partition set is loaded, the topN partitions that have the highest priority and whose node sets to be analyzed are not yet empty are selected to form the synchronous partition set; wherein topN is preset.

Further, according to the apparatus for code analysis based on graph computation of the present invention, the module M43 operates in a bulk synchronous parallel manner.

Further, according to the apparatus for code analysis based on graph computation of the present invention, in the module M2, the node information of the global control flow graph is stored on disk; the node information comprises at least a node identification code and data flow direction information; in the module M3, when the global control flow graph is divided into a plurality of partitions, the node identification codes are used as the nodes of the global control flow graph for the division, so that the nodes stored in the partitions are node identification codes; the module M41 is further used for: for each node of each partition in the synchronous partition set, loading the data flow direction information of the corresponding node from the disk according to its node identification code, and loading the data flow direction information of the corresponding node from the disk for each node that does not belong to the partitions in the synchronous partition set.

The invention has the following technical effects:

1. The invention performs pointer/alias analysis and data flow analysis in a flow-sensitive and context-sensitive manner, can effectively eliminate infeasible alias relations and unreachable data flows, and improves the accuracy of defect detection.

2. The invention formulates flow-sensitive analysis as an evolving-graph computation problem, and uses disk-based out-of-core computation to efficiently handle the memory explosion caused by context-sensitive analysis, so that it can support fully flow-sensitive and context-sensitive analysis of large-scale system code.

3. The invention can be deployed in a single-machine environment, so developers can run it in their own working environment without access to a cluster, which provides scalable analysis support for their daily development tasks.

4. The invention provides developers with a platform for flow-sensitive and context-sensitive analysis, so that they can implement the program analysis tasks they need in a customized way without expert knowledge of program analysis.

Drawings

FIG. 1 is a schematic overall flow chart of an embodiment of the present invention.

FIG. 2 is a flow chart of partition analysis according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

This embodiment relates to a code analysis platform system that requires a code analyst or software developer to implement a custom interface. The platform system calls this custom-implemented interface during code analysis to carry out the specific analysis of the code. The custom-implemented interface is referred to as the user-implemented analysis interface. The method implemented by the platform system is the code analysis method based on graph computation.

As shown in fig. 1, in this embodiment, the method includes the following steps:

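S1: acquiring the program code to be analyzed;

S2: converting the program code to be analyzed into a global control flow graph that contains no function call nodes, and extracting the data flow direction information of the code statement corresponding to each node;

S3: dividing the global control flow graph into a plurality of sub-control flow graphs as partitions, and initializing the node set to be analyzed of each partition;

S4: based on the user-implemented analysis interface, iteratively analyzing the partitions in a bulk synchronous parallel (BSP) manner with the partition as the synchronization unit, until the node set to be analyzed of every partition is emptied.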

Here, step S1 indicates that the program code is the input of the present invention; how the program code is input and obtained is outside the scope of the present invention and is not described further. The program code may be written in various languages. For example, the code analysis platform system of this embodiment targets the C++ language, and those skilled in the art will understand that it can be extended from C++ to any programming language, such as C, Java, Python, Go, or Ada.

In step S2, those skilled in the art will understand that when code is converted into a control flow graph, a function call is usually a node in the control flow graph. For example, in this embodiment, a class-based in-process compiler Pass is used to generate the control flow graphs, and each function call is a node in a control flow graph. To obtain the global control flow graph, this embodiment first converts every function into its own control flow graph, then traverses the nodes of the control flow graph of the main program from top to bottom, replacing each function call node with the control flow graph of the called function, and continues the traversal into the inserted control flow graph, until no remaining node is a function call node. This termination condition corresponds to the requirement in step S2 that the global control flow graph contain no function call nodes. It should be noted that, to avoid infinite expansion of recursive function calls, a call to a function that has already been expanded in the main control flow graph is represented directly by an edge connecting the nodes.
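The top-down expansion can be sketched in Python on a deliberately simplified model. The assumptions here are not from the text: each function body is a straight-line list of statements, a call is written as the tuple `("call", name)`, an explicit `enter` node marks the start of each expanded function, and "already expanded" is read as "currently on the chain of calls being expanded". The name `inline` is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple, Union

Stmt = Union[str, Tuple[str, str]]     # plain statement, or ("call", callee)


def inline(
    functions: Dict[str, List[Stmt]], entry: str = "main"
) -> Tuple[List[str], Set[Tuple[int, int]]]:
    """Expand all calls from the entry function; return (nodes, edges)."""
    nodes: List[str] = []              # node id = index into this list
    edges: Set[Tuple[int, int]] = set()

    def add(label: str, pred: Optional[int]) -> int:
        nodes.append(label)
        nid = len(nodes) - 1
        if pred is not None:
            edges.add((pred, nid))
        return nid

    def expand(fname: str, pred: Optional[int], active: Dict[str, int]) -> int:
        cur = add(f"enter {fname}", pred)
        active = {**active, fname: cur}        # functions on the current call chain
        for stmt in functions[fname]:
            if isinstance(stmt, tuple):
                callee = stmt[1]
                if callee in active:
                    # Recursive call: connect to the existing expansion with an
                    # edge instead of expanding the callee again.
                    edges.add((cur, active[callee]))
                else:
                    cur = expand(callee, cur, active)
            else:
                cur = add(stmt, cur)
        return cur

    expand(entry, None, {})
    return nodes, edges
```

For `{"main": ["x = 1", ("call", "f"), "y = x"], "f": ["z = 2", ("call", "f")]}` the result contains no call node, and the recursive call in `f` appears only as an edge back to the `enter f` node.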

In the above process, each node of the control flow graph is a program code statement. The program code statement of each node is therefore also converted into data flow direction information. The data flow direction information represents data flow relationships. For example, the program code statement "a = b + c" may be regarded as b and c flowing to a, so that one-way edges from variable node b and variable node c to variable node a can be constructed. These one-way edges can be viewed as a graph, so the data flow direction information is a variable relationship graph. The one-way edges can also be collected into a set, so the data flow direction information is equally a set of variable one-way edges. A variable one-way edge here means an edge whose two end nodes are both variables.
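The "a = b + c" example can be reproduced with a few lines of Python. This is only an illustration: it parses Python-syntax assignments with the standard `ast` module as a stand-in for the C++ front end of the embodiment, and `flow_edges` is a hypothetical helper name.

```python
import ast
from typing import Set, Tuple


def flow_edges(statement: str) -> Set[Tuple[str, str]]:
    """Return the variable one-way edges (source, target) of an assignment."""
    edges: Set[Tuple[str, str]] = set()
    for node in ast.walk(ast.parse(statement)):
        if isinstance(node, ast.Assign):
            # Every variable read on the right-hand side is a source.
            sources = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            for target in node.targets:
                for t in ast.walk(target):
                    if isinstance(t, ast.Name):
                        edges.update((src, t.id) for src in sources)
    return edges


print(flow_edges("a = b + c"))   # the two edges b -> a and c -> a
```

The returned set is exactly the "set of variable one-way edges" view of the data flow direction information; reading the same pairs as a graph gives the variable relationship graph.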

In addition, the explosion in data volume when the control flow graphs of function calls are expanded from top to bottom, which memory cannot accommodate, must be addressed. In this embodiment, when nodes are traversed during the top-down expansion of the control flow graphs of function calls, the program code statement of each node is converted into data flow direction information, and the node information is then stored on disk. When the node information is stored on disk, a node identification code is created for the node, the node information is stored on disk under that node identification code, and the node in the global control flow graph stores only the node identification code. The node information comprises the node identification code, the program code statement of the node, the file in which the program code statement is located, the line and column numbers of the program code statement in that file, the function, class and module in which the node is located, and the data flow direction information representing the variable relationship graph of the program code statement. It should be noted that the program code statement, the file, the line and column numbers, and the function, class and module information are stored only for convenience of indexing; they are optional and not essential to the present invention, whereas the node identification code and the data flow direction information are essential.
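A minimal sketch of such a disk-backed node store is shown below. It is an assumption-laden illustration, not the storage format of the embodiment: it uses Python's standard `sqlite3` and `json` modules, and the class name `NodeStore`, the table layout and the optional fields `file` and `line` are all hypothetical.

```python
import json
import sqlite3
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]


class NodeStore:
    """Keeps node information on disk, keyed by node identification code."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path to keep the data on disk; ":memory:" is for demos.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS node (id INTEGER PRIMARY KEY, info TEXT NOT NULL)"
        )

    def put(self, node_id: int, flow: Iterable[Edge], **optional: object) -> None:
        # The identification code and the data flow direction information are
        # required; file, line, function and similar fields are optional.
        info = {"flow": sorted(flow), **optional}
        self.db.execute(
            "INSERT OR REPLACE INTO node VALUES (?, ?)", (node_id, json.dumps(info))
        )
        self.db.commit()

    def flow(self, node_id: int) -> Set[Edge]:
        row = self.db.execute(
            "SELECT info FROM node WHERE id = ?", (node_id,)
        ).fetchone()
        return {tuple(edge) for edge in json.loads(row[0])["flow"]}


store = NodeStore()
store.put(7, {("b", "a"), ("c", "a")}, file="m.cpp", line=3)
```

With this layout the in-memory control flow graph only needs the integer `7`; the edges are fetched with `store.flow(7)` when the node's partition is scheduled.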

Step S3 divides the global control flow graph into partitions, so that the partitions can serve as synchronization units in the bulk synchronous parallel computation of the subsequent step S4. After the global control flow graph is divided, each partition forms a sub-control flow graph. During the division, the node identification codes are used as the nodes of the global control flow graph; that is, at this point only the node identification codes are needed, not the complete node information. There are many ways to partition a global control flow graph. For example, the partitions may be determined by the number of nodes of the sub-control flow graph in each partition, so that this number stays within a preset range. As another example, the partitions may follow the function calls, in which case the number of partitions is related to the number of functions in the program code. In this embodiment, the global control flow graph is divided into K partitions according to a node-count balancing principle, where K is a preset number of partitions. The node-count balancing principle requires that the number of nodes in each partition be kept as equal as possible. More specifically, in this embodiment, the partitioning of the global control flow graph includes the following steps:

Step S31: calculating the average number of nodes per partition, Na = (Nt + K - 1) / K, where Na is the average number of nodes per partition and Nt is the number of nodes in the global control flow graph.

Step S32: starting from each entry node of the global control flow graph, traversing the nodes in the topological order of the global control flow graph and adding the traversed nodes and edges to a sub-control flow graph; when the number of nodes of the sub-control flow graph reaches the average number Na, a partition is generated from that sub-control flow graph, and the traversal then continues with a new sub-control flow graph until all nodes in the global control flow graph have been traversed, which yields the K partitions.

In step S32, when traversing the nodes according to the topological order of the global control flow graph, the present embodiment preferably adopts a breadth-first traversal manner. Those skilled in the art will appreciate that depth-first traversal may also be employed.
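Steps S31 and S32 can be sketched as follows. The sketch makes several assumptions: the division in the formula for Na is integer division (so Na is Nt / K rounded up), every node is reachable from an entry node, and a plain breadth-first traversal is used without enforcing a strict topological order. The name `partition_cfg` is hypothetical, and only node lists are returned, not the edges of each sub-control flow graph.

```python
from collections import deque
from typing import Dict, Iterable, List


def partition_cfg(
    succs: Dict[int, List[int]], entries: Iterable[int], k: int
) -> List[List[int]]:
    """Split the nodes of a control flow graph into balanced partitions."""
    all_nodes = set(succs) | {s for targets in succs.values() for s in targets}
    na = (len(all_nodes) + k - 1) // k          # S31: Na = ceil(Nt / K)
    partitions: List[List[int]] = []
    current: List[int] = []
    queue = deque(dict.fromkeys(entries))       # entry nodes, duplicates removed
    seen = set(queue)
    while queue:                                # S32: breadth-first traversal
        n = queue.popleft()
        current.append(n)
        if len(current) == na:                  # partition is full, start a new one
            partitions.append(current)
            current = []
        for s in succs.get(n, []):
            if s not in seen:
                seen.add(s)
                queue.append(s)
    if current:
        partitions.append(current)              # last, possibly smaller partition
    return partitions
```

Under these assumptions the function returns at most K partitions; when Nt is not close to a multiple of K it can return fewer, for example three partitions of three nodes each for Nt = 9 and K = 4.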

The partitions generated by dividing the global control flow graph contain only sub-control flow graphs. To facilitate subsequent processing, some partition data must also be initialized for each generated partition. In this embodiment, the partition data includes: a priority, a node set to be analyzed, a data flow direction information set, and a message queue.

The priority of a partition is used for partition scheduling when the bulk synchronous parallel computation is performed with the partitions as synchronization units. "Setting a priority for each partition according to the data flow direction relationships and/or the edge relationships between nodes" means that the partition priority may be set according to the data flow direction relationships, according to the edge relationships between nodes, or according to both. The edge relationships between nodes give the dependency relationships between the sub-control flow graphs of the partitions. In this embodiment, the partition priority is determined as follows: the priority of a partition is the maximum, over its forward-dependent partitions, of the sum of the priority of the forward-dependent partition and the number of input nodes from that forward-dependent partition to the partition.

In the invention, for any node edge N1 -> N2 in a control flow graph, node N1 is a preceding node of node N2, and node N2 is a subsequent node of node N1. If the node edge N1 -> N2 crosses partitions, then the partition containing N1 is a forward-dependent partition of the partition containing N2, and the partition containing N2 is a subsequent partition of the partition containing N1. Node N1 is an output node of its partition, and node N2 is an input node of its partition. More specifically, N2 is an input node from the partition containing N1 to the partition containing N2, and N1 is an output node from the partition containing N1 to the partition containing N2; the partition containing N2 is the subsequent partition of output node N1, and the partition containing N1 is the forward-dependent partition of input node N2.

Obviously, under this definition, the lower the partition priority value, the higher the priority level.
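The priority rule can be written down as a short recursive computation. The Python sketch below adds assumptions that the text leaves open: a partition without forward-dependent partitions gets the value 0, the number of input nodes counts distinct nodes of the partition that have a predecessor in the forward-dependent partition, and dependencies that would close a cycle between partitions are ignored. The name `partition_priorities` is hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def partition_priorities(
    edges: Iterable[Tuple[int, int]], owner: Dict[int, int]
) -> Dict[int, int]:
    """Compute a priority value per partition (lower value = higher priority)."""
    # inputs[(q, p)] = input nodes of partition p that come from partition q
    inputs: Dict[Tuple[int, int], Set[int]] = {}
    for u, v in edges:
        q, p = owner[u], owner[v]
        if q != p:                               # the edge crosses partitions
            inputs.setdefault((q, p), set()).add(v)
    preds: Dict[int, List[Tuple[int, int]]] = {}
    for (q, p), nodes in inputs.items():
        preds.setdefault(p, []).append((q, len(nodes)))

    prio: Dict[int, int] = {}

    def visit(p: int, stack: FrozenSet[int]) -> int:
        if p in prio:
            return prio[p]
        best = 0                                 # assumption: entry partitions get 0
        for q, n_inputs in preds.get(p, []):
            if q in stack:
                continue                         # assumption: skip cyclic dependencies
            best = max(best, visit(q, stack | {p}) + n_inputs)
        prio[p] = best
        return best

    for p in set(owner.values()):
        visit(p, frozenset())
    return prio
```

For a chain of three partitions in which each partition receives one input node from its predecessor, the values are 0, 1 and 2, so the partition nearest the program entry is scheduled first.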

The node set to be analyzed and the data flow direction information set are temporary variables for the partition analysis in the subsequent steps. The node set to be analyzed represents the nodes of the partition that remain to be analyzed; its nodes are removed one by one as the partition analysis progresses. During initialization, all nodes in the sub-control flow graph of the partition are added to the node set to be analyzed, indicating that all nodes require iterative analysis. The data flow direction information set stores the data flow direction information of each node of the sub-control flow graph of the partition. During initialization, all nodes in the sub-control flow graph of the partition are added to the data flow direction information set, while the data flow direction information corresponding to the nodes is left on disk, to be scheduled and read in when the partition is to be analyzed.

The message queue is used for storing the output results of the partition analysis. These output results are passed on to the subsequent partitions, as described in more detail below.

In this embodiment, step S4 includes the following steps:

S44: repeatedly executing steps S41 to S44 until the node sets to be analyzed of all the partitions in the global partition set are emptied.

Step S41 represents a scheduling policy. For example, in some embodiments the program code is not large and the number of partitions is determined, at partitioning time, according to the number of CPUs, cores, and threads of the computer; in that case the synchronous partition set includes all the partitions. In this embodiment, the synchronous partition set is a subset of all partitions: when the synchronous partition set is called, the topN partitions that have the highest priority and whose nodes are not emptied are selected to form the synchronous partition set, where topN is preset. topN is usually determined by the number of CPUs, cores, and threads of the computer, so it can also be determined directly by reading the number of CPUs or the number of CPU cores. "Nodes not emptied" means that the node set to be analyzed of the partition is not empty.
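The selection of the synchronous partition set can be sketched as follows. This is only an illustration: the function and parameter names are hypothetical, and breaking ties by partition identifier is an assumption added to make the result deterministic.

```python
import heapq


def select_sync_partitions(priority, pending, top_n):
    """priority: partition -> priority value (lower value = higher priority).
    pending: partition -> node set to be analyzed.
    Returns up to top_n partitions whose node set to be analyzed is not empty."""
    candidates = [p for p in priority if pending.get(p)]
    # Assumption: ties between equal priority values are broken by partition id.
    return heapq.nsmallest(top_n, candidates, key=lambda p: (priority[p], p))
```

In practice `top_n` could be taken from `os.cpu_count()`, in line with the remark that topN can be derived from the number of CPUs or cores.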

In addition, in this embodiment the data flow direction information set of each partition in the synchronous partition set stores only the index of the data flow direction information, that is, the node identification code. Step S41 therefore also needs to call the corresponding data flow direction information from the disk for the data flow direction information set of each partition in the synchronous partition set. That is, the data flow direction information of each node in each partition of the synchronous partition set is called in from the disk according to the node identification code of the node, and the data flow direction information of each node not belonging to any partition of the synchronous partition set is swapped out, so as to ensure that the data flow direction information held in memory is limited to that of the nodes in the partitions of the synchronous partition set.
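The disk scheduling described here can be sketched as below. The sketch reads the passage as "load the data flow direction information of nodes inside the synchronous partition set and release that of all other nodes", which is an interpretation based on the stated goal of keeping only the former in memory. A plain dictionary stands in for the disk, and all names are hypothetical.

```python
def refresh_memory(in_memory, disk, sync_partitions, nodes_of):
    """in_memory, disk: node id -> data flow direction information.
    nodes_of: partition -> iterable of node ids stored in that partition."""
    wanted = {n for p in sync_partitions for n in nodes_of[p]}

    # Swap out information of nodes outside the synchronous partition set.
    for n in list(in_memory):
        if n not in wanted:
            disk[n] = in_memory.pop(n)  # write back, then release from memory

    # Call in information of nodes inside the synchronous partition set.
    for n in wanted:
        if n not in in_memory:
            in_memory[n] = disk[n]
    return in_memory
```

A real implementation would read and write files per partition rather than per node; the per-node dictionary is used only to keep the example short.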

Step S42, that is, the partition analysis step (the step of analyzing the partitions described above), includes the following steps, with reference to fig. 2:

The target partition in the above process is the partition currently being analyzed.

Step S421 is a loop iteration process. Referring to fig. 2, in this embodiment it is implemented by the following steps:

S4211: initializing the output data flow information of each node;

S4213: extracting a node from the node set to be analyzed; if no node is extracted, the node set to be analyzed has no nodes and the loop iteration process is exited, which corresponds to step S4212, i.e., "until the node set to be analyzed has no node" in step S421;

S4214: calling the user-implemented analysis interface;

S4215: judging whether the interface output is the same as the output data flow information; if the two are the same, deleting the node from the node set to be analyzed, namely step S4217; otherwise, updating the output data flow information with the interface output and adding the subsequent nodes to the node set to be analyzed, namely step S4216; then returning to step S4213.
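The loop of steps S4211 to S4217 can be sketched as a worklist iteration. The sketch below is illustrative only: it folds the user-implemented analysis interface into a single callback `analyze(node, output)`, represents data flow information as frozensets, and only re-queues subsequent nodes that belong to the same partition. These choices and all names are assumptions.

```python
def analyze_partition(nodes, successors, analyze, queued_output=None):
    """nodes: nodes of the target partition.
    successors: node -> iterable of subsequent nodes.
    analyze: callback (node, output) -> new output data flow information.
    queued_output: output data flow information taken from the message queue."""
    nodes = set(nodes)
    # S4211: initialize outputs from the message queue, or to empty.
    output = dict(queued_output or {})
    for n in nodes:
        output.setdefault(n, frozenset())

    pending = set(nodes)                 # node set to be analyzed
    while pending:                       # S4212/S4213: stop when no node is left
        node = next(iter(pending))       # S4213: extract a node without deleting it
        result = analyze(node, output)   # S4214: call the analysis interface
        if result == output[node]:       # S4215: is the output stable?
            pending.discard(node)        # S4217: delete the node
        else:
            output[node] = result        # S4216: replace the output ...
            # ... and add the subsequent nodes (restricted to this partition here)
            pending.update(s for s in successors.get(node, ()) if s in nodes)
    return output
```

The loop terminates when the callback is monotone over a finite set of facts, because every iteration either removes a node or strictly grows one output.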

In this embodiment, two interfaces are defined as the user-implemented analysis interfaces: Combine and Transfer. The input parameter of the Combine interface is a node, and its output is a node set. The Transfer interface takes the node set output by Combine and the data flow information corresponding to each node of that set as input, and outputs the data flow information after alias calculation. The Transfer interface analyzes the input data flow information and outputs it after merging, extraction, or connection. Those skilled in the art will appreciate that the two interfaces may also be merged directly into one interface. For example, when a developer needs to perform data flow analysis on a certain variable of the program code, the developer only needs to extract the data flow information related to that variable as the output.
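A minimal sketch of a user implementation of the two interfaces is given below for the single-variable example. The text does not specify which nodes Combine returns; the sketch assumes the node itself plus its preamble nodes. Data flow information is modelled as a set of (source, target) variable pairs, and the class and method names are hypothetical.

```python
class TrackVariable:
    """Hypothetical user analysis that keeps only edges involving one variable."""

    def __init__(self, predecessors, variable):
        self.predecessors = predecessors  # node -> iterable of preamble nodes
        self.variable = variable

    def combine(self, node):
        # Assumption: the combined node set is the node and its preamble nodes.
        return {node, *self.predecessors.get(node, ())}

    def transfer(self, nodes, facts):
        # Merge the data flow information of all combined nodes ...
        merged = set().union(*(facts.get(n, ()) for n in nodes))
        # ... and extract only the edges related to the tracked variable.
        return frozenset(edge for edge in merged if self.variable in edge)
```

A driver loop would call `transfer(combine(node), facts)` for each extracted node, which is also how the two interfaces could be merged into one.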

In step S4211, the output data flow information of each node is initialized from the output data flow information of the corresponding node in the message queue; if the message queue contains no output data flow information for the node, the output data flow information of the node is initialized to be empty.

In step S4216, "updating the output data flow information with the interface output" means that the output data flow information of the node is replaced with the interface output.

It should be noted that the loop iteration process of step S421 can also be implemented in other ways that are substantially the same. For example, in another embodiment, a node is deleted from the node set to be analyzed as soon as it is extracted, so that step S4215 no longer needs to perform "if the two are the same, deleting the node from the node set to be analyzed". Compared with that variant, the process of step S421 in this embodiment calls the user-implemented analysis interface one additional time.
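The alternative embodiment, in which a node is deleted at the moment it is extracted, might look like the sketch below. As before, the single `analyze` callback, the frozenset representation, and the restriction of re-queued nodes to the partition are assumptions, and the call counter is added only so that the number of interface calls can be inspected.

```python
def analyze_partition_pop(nodes, successors, analyze):
    """Variant worklist: extracting a node also deletes it from the set."""
    nodes = set(nodes)
    output = {n: frozenset() for n in nodes}
    pending = set(nodes)
    calls = 0
    while pending:
        node = pending.pop()             # extract and delete in one step
        result = analyze(node, output)   # call the analysis interface
        calls += 1
        if result != output[node]:
            output[node] = result        # replace the output
            # re-queue subsequent nodes that belong to this partition
            pending.update(s for s in successors.get(node, ()) if s in nodes)
    return output, calls
```

A node whose output depends on itself must list itself among its subsequent nodes so that it is re-queued after a change.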

The output data flow information of the node initialized in step S4211 is what step S4215 compares with the interface output, that is, it implements the condition "the output data flow information corresponding to the node does not change" in step S421; the output data flow information is the output obtained by calling the user-implemented analysis interface. Therefore, the output data flow information of the node in step S4211 is a temporary variable used in step S4215 to determine whether the output of the user-implemented analysis interface is stable.

In addition, in another embodiment, the output data flow information corresponding to a node may be used as an input parameter when a subsequent node calls the user-implemented analysis interface.

Step S422 shows that the message queue is the collection of the output data flow information of the output nodes of the partition.
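Building the message queue from the output nodes of an analyzed partition can be sketched as follows. Keying the messages by destination partition is an assumption made for the illustration, and all names are hypothetical.

```python
def build_message_queue(output, edges, part_of, partition):
    """output: node -> output data flow information of the analyzed partition.
    Returns destination partition -> {output node: output data flow information}."""
    queue = {}
    for n1, n2 in edges:
        # n1 is an output node of `partition` if the edge leaves the partition.
        if part_of[n1] == partition and part_of[n2] != partition:
            queue.setdefault(part_of[n2], {})[n1] = output[n1]
    return queue
```

Each subsequent partition can then read its own entry when it initializes the output data flow information of its nodes.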

Step S423 serves to reduce the amount of computation on the data flow information in subsequent calculations. In this embodiment, FCS deduplication is adopted, which specifically includes the following steps:

S4232: finding the item sets that appear at least T times in the target partition as high-frequency item sets, where T is preset;

Furthermore, in some embodiments, step S4231 may also be replaced with: constructing an item set for the data flow direction information of each node of the global control flow graph; and step S4232 may also be replaced with: finding the item sets that appear at least T times in the global control flow graph as high-frequency item sets.

As previously mentioned, the data flow information of a node is a set of variable unidirectional edges. In step S4231, when the item set is constructed, each item of the item set is a variable unidirectional edge; that is, the item set is a set whose items are variable unidirectional edges. Therefore, determining in step S4234 whether the data flow information of a node includes a high-frequency item set is a set operation.
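The idea of replacing frequently repeated edge sets by shared references can be sketched as below. This is not the FCS procedure of the embodiment, whose steps are not fully reproduced here; it is a simplified stand-in that assumes the candidate item sets are the complete edge sets of the nodes and that an item set "appears" in a node when it is a subset of the node's edges. All names are hypothetical.

```python
def deduplicate(facts, threshold):
    """facts: node -> set of variable unidirectional edges (src, dst).
    Returns (frequent, compact): frequent is a list of shared item sets,
    compact maps node -> (indices into frequent, remaining edges)."""
    sets = {n: frozenset(f) for n, f in facts.items()}
    candidates = {f for f in sets.values() if f}
    # High-frequency item sets: contained in at least `threshold` nodes.
    frequent = [c for c in candidates
                if sum(c <= f for f in sets.values()) >= threshold]
    frequent.sort(key=lambda c: (-len(c), sorted(c)))  # prefer larger sets

    compact = {}
    for node, f in sets.items():
        rest, refs = set(f), []
        for i, c in enumerate(frequent):
            if c <= rest:          # set inclusion test
                refs.append(i)
                rest -= c
        compact[node] = (refs, frozenset(rest))
    return frequent, compact


def restore(frequent, entry):
    """Rebuild the original edge set of a node from its compact form."""
    refs, rest = entry
    return frozenset(rest).union(*(frequent[i] for i in refs))
```

The compact form is lossless: `restore` returns exactly the edges that were passed in for the node.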

In step S423, the nodes in the message queue of each partition and the output data flow information corresponding to those nodes are merged into the partition corresponding to each node; that is, the nodes in the message queue and their subsequent nodes are added to the node set to be analyzed of the corresponding partition.

Obviously, in the method of this embodiment, the sub-control flow graphs corresponding to the partitions are mutually disjoint. A node in the message queue is a node of the sub-control flow graph of a preceding partition, more specifically an output node of a forward-dependent partition of the partition. Therefore, the output node normally does not belong to the partition where the message queue is located; when such a node is added to the node set to be analyzed of that partition, it may be regarded as a mirror node from another partition.

In another embodiment, the mirror nodes are constructed in advance in step S3. That is, step S3 further includes: for each edge that crosses partitions, creating and marking corresponding mirror input nodes and mirror output nodes for the partitions on the two sides of the edge. In other words, the subsequent node of an output node of a partition is added, as a mirror output node, to the node set to be analyzed of that partition, and the preamble node of an input node of a partition is added, as a mirror input node, to the node set to be analyzed of that partition. In step S422, the mirror output node is used as the output node. Thus, the nodes in the message queue of each partition in step S423 necessarily correspond to the mirror input nodes of the corresponding partition, and merging them into the partition becomes a natural process.
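The pre-construction of mirror nodes for cross-partition edges can be sketched as follows. The function name and the representation of mirror nodes as per-partition sets of node identifiers are assumptions made for the illustration.

```python
from collections import defaultdict


def build_mirrors(edges, part_of):
    """Returns (mirror_in, mirror_out): partition -> set of mirrored node ids."""
    mirror_in = defaultdict(set)
    mirror_out = defaultdict(set)
    for n1, n2 in edges:
        p1, p2 = part_of[n1], part_of[n2]
        if p1 != p2:  # the edge crosses partitions
            # The subsequent node of the output node n1 is mirrored into p1.
            mirror_out[p1].add(n2)
            # The preamble node of the input node n2 is mirrored into p2.
            mirror_in[p2].add(n1)
    return mirror_in, mirror_out
```

The node set to be analyzed of a partition would then be initialized with its own nodes together with the mirrored identifiers returned for that partition.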

In addition, merging into the partition is also reflected in the aforementioned step S4211, in which the output data flow information of a node in the message queue is extracted as the output data flow information of the corresponding node.

Obviously, the purpose of the aforementioned mirror nodes, mirror input nodes, mirror output nodes, and message queues is to pass the output data flow information of nodes between partitions. The sub-control flow graph of a partition is not logically independent but is part of the global control flow graph. Partitioning merely facilitates thread scheduling under overall synchronous parallel computing and reduces memory usage during computation.

In addition, in order to improve thread utilization, steps S421 and S422 may be processed as a whole in the manner of overall synchronous parallel computation. Step S43 may also be performed in the manner of overall synchronous parallel computation.
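One round of the overall synchronous parallel computation can be sketched with a thread pool, where collecting all results acts as the synchronization barrier before the merge step runs. The function name `superstep` and the callback `analyze_one` are hypothetical; the callback stands for steps S421 and S422 applied to a single partition.

```python
from concurrent.futures import ThreadPoolExecutor


def superstep(sync_partitions, analyze_one, max_workers=None):
    """Runs analyze_one(partition) for every partition of the synchronous
    partition set and returns partition -> result once all have finished."""
    sync_partitions = list(sync_partitions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(...) waits for every partition: this is the barrier.
        results = list(pool.map(analyze_one, sync_partitions))
    return dict(zip(sync_partitions, results))
```

In standard CPython builds, threads mainly illustrate the barrier structure; CPU-bound analysis would need processes or a language without a global interpreter lock to gain real parallel speedup.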

Claims

1. A method for code analysis based on graph computation, the method comprising the steps of:

S1: acquiring a program code to be analyzed;

the analysis of the partitions comprises the following steps:

2. A method of graph computation based code analysis according to claim 1, characterized in that step S421 and step S422 are processed as a whole in a way of overall synchronous parallel computation.

3. The method for code analysis based on graph computation according to claim 1, wherein in step S3, the global control flow graph is divided into a plurality of partitions and the partitions are divided into K partitions by using a node number balancing principle, wherein K is a preset number of partitions.

4. The method for graph computation based code analysis according to claim 1, wherein step S3 further comprises creating and marking respective mirror input nodes and mirror output nodes for the corresponding partitions respectively on both sides of the partition where the edge is located across the edges of the partitions; in step S422, the mirror output node is used as the output node.

5. The method of graph computation based code analysis of claim 1, wherein step S423 includes the steps of:

wherein T is preset.

6. The method of graph computation-based code analysis of claim 5, wherein step S4231 is replaced with: constructing an item set for the data flow direction information of each node of the global control flow graph; and step S4232 is replaced with: finding the item sets that appear at least T times in the global control flow graph as high-frequency item sets.

7. The method for graph computation based code analysis according to any of claims 1 to 6, wherein step S4 further comprises the steps of:

8. The method for code analysis based on graph computation of claim 7, wherein in step S3, when the global control flow graph is divided into a plurality of partitions, a priority is set for each partition according to a data flow direction relationship and/or an edge relationship between nodes; in step S41, when a synchronous partition set is called, topN partitions with the highest priority and not emptied by nodes are selected to form the synchronous partition set; wherein topN is predetermined.

9. The method for graph computation based code analysis according to claim 7, wherein step S43 is performed in a manner of global synchronous parallel computation.

10. The method for code analysis based on graph computation of claim 7, wherein in step S2, the node information in the global control flow graph is stored in a disk; the node information at least comprises a node identification code and data flow direction information; in step S3, when the global control flow graph is divided into a plurality of partitions, the node identification codes are used as the nodes of the global control flow graph for dividing the partitions; the nodes stored in the partitions are node identification codes; step S41 further includes the steps of: calling the data flow direction information of the corresponding node from the disk for each node according to the node identification code of each node in each partition in the synchronous partition set, and swapping out the data flow direction information of each node not belonging to any partition in the synchronous partition set.

11. An apparatus for code analysis based on graph computation, the apparatus comprising:

M1, used for: acquiring a program code to be analyzed;

the analysis of the partitions comprises the following modules:

12. Apparatus for graph computation based code analysis according to claim 11, wherein module M421 and module M422 as a whole are processed in a global synchronous parallel computation.

13. The apparatus for graph computation based code analysis according to claim 11, wherein in the module M3, the global control flow graph is divided into a plurality of partitions, and the partitions are divided into K partitions by using a node number balancing principle, where K is a preset number of partitions.

14. The apparatus for graph computation based code analysis according to claim 11, wherein in module M3, further comprises creating and marking respective mirror input nodes and mirror output nodes for the corresponding partitions respectively on both sides of the partition where the edge is located across the edges of the partitions; in module M422, the mirror output node is used as the output node.

15. The apparatus for graph computation-based code analysis according to claim 11, wherein module M423 comprises the following modules:

wherein T is preset.

16. Apparatus for graph computation based code analysis according to claim 15, characterised in that module M4231 is replaced by: constructing an item set for the data flow direction information of each node of the global control flow graph; and module M4232 is replaced with: finding the item sets that appear at least T times in the global control flow graph as high-frequency item sets.

17. Apparatus for graph computation based code analysis according to any of claims 11 to 16, wherein the module M4 further comprises the following modules:

18. Apparatus for graph computation based code analysis according to claim 17, wherein in block M3, when dividing the global control flow graph into partitions, priorities are set for each partition according to data flow relationships and/or edge relationships between nodes; in a module M41, when a synchronous partition set is called, topN partitions with highest priority and not emptied by nodes are selected to form the synchronous partition set; wherein topN is predetermined.

19. Apparatus for graph computation based code analysis according to claim 17, in which module M43 is implemented as a whole synchronous parallel computation.

20. The apparatus for graph computation based code analysis according to claim 17, wherein in module M2, node information in the global control flow graph is stored to disk; the node information at least comprises a node identification code and data flow direction information; in module M3, when the global control flow graph is divided into a plurality of partitions, the node identification codes are used as the nodes of the global control flow graph for dividing the partitions; the nodes stored in the partitions are node identification codes; module M41 further includes a module for: calling the data flow direction information of the corresponding node from the disk for each node according to the node identification code of each node in each partition in the synchronous partition set, and swapping out the data flow direction information of each node not belonging to any partition in the synchronous partition set.