CN115586920B - Fragile code segment clone detection method and device, electronic equipment and storage medium

Info

Publication number: CN115586920B
Application number: CN202211592925.0A
Authority: CN
Inventors: 张涛; 宁戈; 张弛; 谭博迈
Original assignee: Beijing Anpro Information Technology Co ltd
Current assignee: Beijing Anpro Information Technology Co ltd
Priority date: 2022-12-13
Filing date: 2022-12-13
Publication date: 2023-03-14
Anticipated expiration: 2042-12-13
Also published as: CN115586920A

Abstract

The application provides a fragile code fragment clone detection method and device, electronic equipment and a storage medium. The method comprises the following steps: generating a program dependency graph of a program to be tested; locating, by using a locality sensitive hashing algorithm, all second starting nodes in the program dependency graph of the program to be tested that match first starting nodes of a plurality of fragile code fingerprint subgraphs, and determining each pair of matched first and second starting nodes as a starting node clone pair, wherein each fragile code fingerprint subgraph is a partial program dependency graph extracted from the program dependency graph of a fragile code fragment; and performing slice subgraph matching in parallel on the fragile code fingerprint subgraphs and the program dependency graph of the program to be tested based on all the starting node clone pairs, thereby detecting cloned fragile code fragments. By extracting fragile code fingerprint subgraphs, quickly matching starting node clone pairs with a locality sensitive hashing algorithm, and then executing subgraph matching in parallel, the accuracy and efficiency of fragile code clone detection are improved.

Description

Fragile code segment clone detection method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of computer application technologies, and in particular, to a fragile code fragment clone detection method and apparatus, an electronic device, and a storage medium.

Background

In software development, reuse of open source code is ubiquitous, and it is also a supply chain security threat vector that cannot be ignored. Developers frequently copy code fragments from open source code libraries and, for the sake of development efficiency, often do not analyze the introduced code in depth, so defects hidden in the open source code are introduced into the software system as well. The key code fragments that contain such defects and give rise to software vulnerabilities are referred to as "fragile code".

Existing methods for detecting fragile code clones generally perform alias replacement on the code to be detected or abstract it to a higher level, for example by generating a program dependency graph of the code; vulnerabilities caused by fragile code clones are then found by matching identical structures or features against known cases. However, these methods degrade as code modifications increase, for example when lines of code are added or deleted, which makes vulnerability detection more difficult and limits the application scenarios. In addition, because security flaws are very sensitive to context, the lack of context verification can result in false positives or false negatives.

In addition, the code clone detection technology using the program dependency graph as the code characterization mode has the limitation of low execution efficiency. That is, when the sub-graph library (fragile code fingerprint sub-graph library) as the matching target is large in capacity, or when the program dependency graph of the program to be tested contains a large number of nodes and edges, sub-graph matching faces extremely high computational load and time overhead problems.

Therefore, how to improve the accuracy and efficiency of the fragile code clone detection has become a technical problem for those skilled in the art.

Disclosure of Invention

Embodiments disclosed herein aim to provide a fragile code fragment clone detection method, apparatus, electronic device and storage medium to solve the above problems.

In a first aspect of the present disclosure, there is provided a fragile code fragment clone detection method, comprising the steps of:

generating a program dependence graph of a program to be tested;

locating all second starting nodes matched with first starting nodes of a plurality of fragile code fingerprint subgraphs in a program dependency graph of the program to be tested by using a locality sensitive hashing algorithm, and determining a pair of matched first starting nodes and second starting nodes as a starting node clone pair, wherein the fragile code fingerprint subgraphs are partial program dependency graphs extracted from the program dependency graph of the fragile code segments;

performing slice subgraph matching in parallel on the fragile code fingerprint subgraphs and the program dependency graph of the program to be tested based on all the starting node clone pairs, wherein the slice subgraph matching comprises, for any starting node clone pair:

slicing the fragile code fingerprint subgraph where the first starting node is located and the program dependency graph of the program to be tested from the first starting node and the second starting node at the same time, and judging whether all other nodes in the fragile code fingerprint subgraph where the first starting node is located can be matched with corresponding nodes in the program dependency graph of the program to be tested or not;

and if so, determining that a fragile code segment corresponding to the fragile code fingerprint subgraph where the first starting node is located exists in the program to be tested.

Optionally, the locating, by using a locality sensitive hashing algorithm, all second starting nodes matching with the first starting nodes of the plurality of fragile code fingerprint subgraphs in the program dependency graph of the program to be tested includes the following steps:

mapping nodes in a program dependence graph of the program to be tested into a plurality of hash buckets through a preset locality sensitive hash function;

acquiring first starting nodes of all the fragile code fingerprint subgraphs, and respectively calculating a to-be-queried bucket number for the first starting nodes through the preset locality sensitive hash function;

calculating the distance between the first starting node and a node in a program dependency graph of the program to be tested in the hash bucket corresponding to the to-be-queried bucket number;

and marking the node in the program dependency graph of the program to be tested, the distance of which is less than a preset threshold value, as a second starting node.

Optionally, the performing slice subgraph matching in parallel on the fragile code fingerprint subgraph and the program dependency graph of the program to be tested based on all the starting node clone pairs comprises the following steps:

creating a subtask for performing slice subgraph matching for each starting node clone pair;

sending the fragile code fingerprint subgraph where the first starting node in the starting node clone pair is located and the program dependency graph of the program to be tested to the subtasks;

the subtasks are executed in parallel.

Optionally, the extraction method of the fragile code fingerprint subgraph comprises the following steps:

acquiring a fragile code segment and a fragile code statement in the fragile code segment;

generating a program dependency graph for the fragile code fragment;

backward slicing is carried out on the program dependency graph of the fragile code fragment by taking the nodes of the fragile code statement as slicing points to obtain a sliced program dependency graph, and the sliced program dependency graph comprises all nodes and corresponding edges which have control or data dependency relation with the fragile code statement;

determining the sliced program dependency graph as a fragile code fingerprint subgraph.

Optionally, the slicing is performed on the fragile code fingerprint subgraph where the first starting node is located and the program dependency graph of the program to be tested from the first starting node and the second starting node at the same time, and whether all the other nodes in the fragile code fingerprint subgraph where the first starting node is located can be matched with corresponding nodes in the program dependency graph of the program to be tested is determined, including the following steps:

respectively taking the first starting node and the second starting node as a fragile code fingerprint subgraph where the first starting node is located and a slicing point of a program dependency graph of the program to be tested, wherein a node of a fragile code statement in the fragile code fingerprint subgraph is taken as the first starting node;

and from the slicing point, simultaneously carrying out backward slicing on the fragile code fingerprint subgraph where the first starting node is located and the program dependency graph of the program to be tested, respectively obtaining a first precursor node which has a control or data dependency relationship with the first starting node and a second precursor node which has a control or data dependency relationship with the second starting node, judging whether the first precursor node and the second precursor node are matched, if so, repeating the step, and if not, finishing the slicing, thereby judging whether all the other nodes in the fragile code fingerprint subgraph where the first starting node is located can be matched with the corresponding nodes in the program dependency graph of the program to be tested.

Optionally, before locating, by using a locality sensitive hashing algorithm, all second starting nodes that match the first starting nodes of the plurality of fragile code fingerprint subgraphs in the program dependency graph of the program to be tested, the method further includes the following steps:

acquiring a node set of the program dependency graph of the program to be tested and all node sets of the fragile code fingerprint subgraphs, and calculating an intersection;

based on the intersection, removing nodes which do not exist in the intersection in the program dependence graph of the program to be tested;

based on the intersection, removing those fragile code fingerprint subgraphs, among the plurality of fragile code fingerprint subgraphs, that contain nodes not present in the intersection.
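The optional pre-filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: nodes are compared by exact equality of a normalized statement label (the text does not say how node identity is decided), and the function name `prune_by_intersection` and the sample labels are hypothetical.

```python
def prune_by_intersection(program_nodes, fingerprints):
    """Drop program nodes and fingerprint subgraphs that cannot take part in a match.

    program_nodes: dict mapping node id -> normalized statement label (assumed form)
    fingerprints:  dict mapping fingerprint name -> set of node labels
    """
    # Union of the node sets of all fingerprint subgraphs.
    fingerprint_labels = set().union(*fingerprints.values())
    # Intersection with the node set of the program under test.
    common = set(program_nodes.values()) & fingerprint_labels
    # Remove program nodes that are not in the intersection.
    kept_nodes = {nid: label for nid, label in program_nodes.items() if label in common}
    # Remove fingerprints containing any node that is not in the intersection.
    kept_fingerprints = {
        name: labels for name, labels in fingerprints.items() if labels <= common
    }
    return kept_nodes, kept_fingerprints


# Hypothetical data: fpB needs "use(p)", which the program does not contain.
program_nodes = {1: "a = read()", 2: "if a > 0", 3: "free(p)", 4: "log(a)"}
fingerprints = {
    "fpA": {"a = read()", "if a > 0"},
    "fpB": {"free(p)", "use(p)"},
}
kept_nodes, kept_fingerprints = prune_by_intersection(program_nodes, fingerprints)
```

With exact label equality the filter is cheap but strict; if node matching elsewhere is similarity-based, the same idea would need a tolerant membership test so that near-identical nodes are not discarded early.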

In a second aspect of the present disclosure, there is provided a fragile code fragment clone detection device comprising:

the graph generating unit is used for generating a program dependence graph of the program to be tested;

a starting node clone pair determining unit, configured to locate, by using a locality sensitive hashing algorithm, all second starting nodes that match first starting nodes of a plurality of fragile code fingerprint subgraphs in a program dependency graph of the program to be tested, and determine a pair of the matched first starting nodes and second starting nodes as a starting node clone pair, where the fragile code fingerprint subgraph is a partial program dependency graph extracted from the program dependency graph of the fragile code segment;

a slice subgraph matching unit for performing slice subgraph matching in parallel on the basis of all the starting node clone pairs to the fragile code fingerprint subgraph and the program dependency graph of the program to be tested, wherein the slice subgraph matching comprises the following steps for any starting node clone pair:

The technical scheme provided by the embodiment of the disclosure has the following beneficial effects:

in addition, the granularity and accuracy of vulnerability detection based on code segment clone detection are improved by extracting the fragile code statements and their corresponding context information into the fragile code fingerprint subgraph.

It should be understood that the statements in this section are not intended to identify key or essential features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

In order to illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present disclosure; those skilled in the art can obtain other drawings from the structures shown in these drawings without creative effort.

FIG. 1 is a flow diagram illustrating a fragile code fragment clone detection method according to an exemplary embodiment;

FIG. 2 is a flow diagram illustrating a method for fragile code fragment clone detection in accordance with an exemplary embodiment;

FIG. 3 is a flow diagram illustrating a method for fragile code fragment clone detection in accordance with an exemplary embodiment;

FIG. 4 is a flowchart illustrating a method for extracting a fragile code fingerprint subgraph in accordance with an exemplary embodiment;

FIG. 5 illustrates an example of a program dependency graph of a vulnerable code fragment in accordance with an exemplary embodiment;

FIG. 6 illustrates an example of a sliced program dependency graph of a fragile code segment in accordance with an exemplary embodiment;

FIG. 7 illustrates an example of a program dependency graph and two fragile code fingerprint subgraphs of a program under test in accordance with an illustrative embodiment;

FIG. 8 is a schematic diagram illustrating the structure of a fragile code fragment clone detection device in accordance with an exemplary embodiment;

FIG. 9 is a schematic structural diagram of an electronic device according to an exemplary embodiment.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

In the description of the embodiments of the present disclosure, the term "include" and similar terms are to be understood as open-ended inclusions, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions may also appear below.

The first embodiment is as follows:

Referring to FIG. 1, an embodiment of the present disclosure proposes a fragile code fragment clone detection method, including steps S101-S103:

S101: generating a program dependency graph of a program to be tested;

Specifically, embodiments of the present disclosure first generate a program dependency graph for the program to be tested. A Program Dependency Graph (PDG) describes the interdependence and interaction between code instructions, and includes the control dependencies and data dependencies between statements. Each node represents a statement (such as an assignment statement or a method call statement) or a control predicate; each edge represents a control or data dependency relationship. A data dependency edge reflects the influence of one variable on the value of another variable, and a control dependency edge reflects the influence that a statement, through control flow, has on the value of a variable. For example, Joern, an open source analysis tool, may be used to generate a program dependency graph for C/C++ source code. Program dependency graphs are often used to characterize the code features of source code and are then combined with a subgraph matching algorithm for code clone detection.
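To make the node and edge terminology concrete, the following is a minimal sketch of an in-memory PDG. The class name `PDG`, its methods, and the three-statement fragment are hypothetical illustrations; the output format of a real tool such as Joern is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class PDG:
    """Minimal program dependency graph: statements as nodes, typed dependency edges."""
    nodes: dict = field(default_factory=dict)  # node id -> statement text
    edges: set = field(default_factory=set)    # (src, dst, kind) with kind "control" or "data"

    def add_node(self, node_id, statement):
        self.nodes[node_id] = statement

    def add_edge(self, src, dst, kind):
        if kind not in ("control", "data"):
            raise ValueError("kind must be 'control' or 'data'")
        self.edges.add((src, dst, kind))

    def predecessors(self, node_id):
        """Nodes that node_id depends on (sources of its incoming edges)."""
        return {src for src, dst, _ in self.edges if dst == node_id}


# Hypothetical fragment:
#   1: n = read()
#   2: if n > 0:
#   3:     buf[n] = 0
pdg = PDG()
pdg.add_node(1, "n = read()")
pdg.add_node(2, "if n > 0")
pdg.add_node(3, "buf[n] = 0")
pdg.add_edge(1, 2, "data")     # the predicate reads n
pdg.add_edge(1, 3, "data")     # the array index reads n
pdg.add_edge(2, 3, "control")  # statement 3 runs only when the predicate holds
```

Here `pdg.predecessors(3)` yields the two nodes that statement 3 depends on, which is the relation that backward slicing follows.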

S102: locating all second starting nodes matched with first starting nodes of a plurality of fragile code fingerprint subgraphs in a program dependence graph of the program to be tested by using a locality sensitive hashing algorithm, and determining a pair of matched first starting nodes and second starting nodes as a starting node clone pair, wherein the fragile code fingerprint subgraphs are partial program dependence graphs extracted from the program dependence graph of the fragile code segment;

Specifically, a plurality of fragile code fingerprint subgraphs are generated in advance, and each fragile code fingerprint subgraph is a partial program dependency graph extracted, based on preset rules, from the program dependency graph of a known fragile code fragment. A fragile code fingerprint subgraph is an abstract representation of a fragile code fragment and, like the PDG of the program to be tested, comprises a series of nodes and edges. Each fragile code fingerprint subgraph uniquely corresponds to one fragile code fragment.

Each fragile code fingerprint subgraph has a first starting node. Through a locality sensitive hashing algorithm, all nodes that are the same as or similar to the first starting node of a fragile code fingerprint subgraph can be quickly found in the program dependency graph of the program to be tested, and these nodes are determined as second starting nodes. In this way, the nodes that are the same as or similar to each of the first starting nodes, namely the second starting nodes, are obtained, and their positions in the program dependency graph of the program to be tested are determined. Each pair of matched first and second starting nodes is determined as a starting node clone pair. The positions of the second starting nodes in the program dependency graph of the program to be tested and the starting node clone pairs are the premise and basis for the subsequent subgraph matching. The first starting node of a fragile code fingerprint subgraph can be determined according to actual requirements, which is not specifically limited by the present disclosure. For example, the node of the code statement of interest in the fragile code fragment may be taken as the first starting node.

Locality-Sensitive Hashing (LSH) is a fast nearest neighbor search algorithm for massive high-dimensional data. The basic idea of LSH is that two points that are close (similar) in the original space have the same hash value with high probability after being mapped by the LSH hash function, whereas two points that are far apart (dissimilar) have the same hash value only with small probability.
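The collision property described above can be illustrated with MinHash, one common LSH family. The sketch below is only an illustration under assumptions: each node is represented as a bag of statement tokens, the seeded SHA-1 hashing scheme is an arbitrary choice, and the names `minhash_signature` and `estimated_similarity` are hypothetical.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """MinHash signature of a non-empty token collection.

    For each seeded hash function, keep the minimum hash value over the token set.
    Two sets agree on a given position with probability equal to their Jaccard similarity.
    """
    unique = set(tokens)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{tok}".encode()).digest()[:8], "big")
            for tok in unique
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical statements that differ in one token.
stmt_a = "memcpy ( dst , src , len )".split()
stmt_b = "memcpy ( dst , src , n )".split()
sig_a = minhash_signature(stmt_a)
sig_b = minhash_signature(stmt_b)
same = estimated_similarity(sig_a, sig_a)   # identical inputs always agree everywhere
close = estimated_similarity(sig_a, sig_b)  # approximates the Jaccard similarity
```

The two token sets share 6 of 8 distinct tokens, so `close` estimates a Jaccard similarity of 0.75; the estimate tightens as `num_hashes` grows.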

According to statistics, analyzing a source file of about two thousand lines of code yields a program dependency graph with about 40,000 nodes and 160,000 edges. If node similarity is computed one by one through loop traversal, or by traditional pairwise hash matching, the computational complexity is quite high. The LSH algorithm can greatly accelerate node matching.

Further, referring to FIG. 2, locating all second starting nodes matching the first starting nodes of the fragile code fingerprint subgraphs in the program dependency graph of the program to be tested by using the locality sensitive hashing algorithm includes steps S1021-S1024:

S1021: mapping nodes in a program dependency graph of the program to be tested into a plurality of hash buckets through a preset locality sensitive hash function;

Specifically, nodes in the program dependency graph of the program to be tested that are similar to one another remain similar in the new data space with high probability after the mapping transformation of the preset locality sensitive hash function, and therefore fall into the same hash bucket; dissimilar nodes remain dissimilar in the new data space with high probability after the same transformation, and therefore fall into different hash buckets. After the locality sensitive hash transformation, the large number of nodes contained in the program dependency graph of the program to be tested are thus dispersed into a plurality of hash buckets, each of which contains some of the node data.

The predetermined locality sensitive hash function refers to one or more hash functions selected in advance, such as a simhash function, a minhash function, and the like, which is not limited by the present disclosure.

S1022: acquiring first starting nodes of all the fragile code fingerprint subgraphs, and respectively calculating a to-be-queried bucket number for the first starting nodes through the preset locality sensitive hash function;

specifically, first start nodes of all fragile code fingerprint subgraphs are obtained, and the data of the nodes are mapped and transformed by using the same preset locality sensitive hash function as that in step S1021, so that a hash bucket number corresponding to each first start node, that is, a bucket number to be queried, is obtained.

S1023: calculating the distance between the first starting node and a node in a program dependency graph of the program to be tested in the hash bucket corresponding to the to-be-queried bucket number;

Specifically, for each first starting node, all nodes of the program dependency graph of the program to be tested that lie in the hash bucket corresponding to its to-be-queried bucket number are obtained, and the distance between the first starting node and each of these nodes is calculated. Common distance measures include Euclidean distance, Jaccard distance, Hamming distance, cosine distance, Manhattan distance, and the like, which are not limited by the present disclosure.
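For reference, three of the distance measures named above can be written in a few lines. This is a sketch under the assumption that a node is encoded either as a numeric feature vector or as a set of tokens; the disclosure does not fix the node encoding, and the function names are illustrative.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_distance(s, t):
    """One minus the ratio of shared elements to all elements of two sets."""
    s, t = set(s), set(t)
    union = s | t
    return 0.0 if not union else 1.0 - len(s & t) / len(union)


def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


d_euclid = euclidean_distance((0, 0), (3, 4))                     # 3-4-5 triangle
d_jaccard = jaccard_distance({"a", "b", "c"}, {"b", "c", "d"})    # 2 shared of 4 total
d_cosine = cosine_distance((1, 0), (0, 1))                        # orthogonal vectors
```

Which measure is appropriate depends on the hash family: MinHash pairs naturally with Jaccard distance, while SimHash-style fingerprints are usually compared by Hamming distance.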

S1024: and marking the node in the program dependence graph of the program to be tested with the distance smaller than the preset threshold value as a second starting node.

Specifically, the similarity between the nodes can be determined by calculating the distance between the nodes, and the node in the program dependency graph of the program to be tested, which has the distance between the node and the first start node smaller than the preset threshold, is determined as the node matched with the first start node and is marked as the second start node. And the preset threshold is used for evaluating the similarity between the nodes, and when the distance between a certain node and the first starting node in the program dependence graph of the program to be tested is smaller than the preset threshold, the node is considered to be matched with the first starting node.

By using the locality sensitive hashing algorithm, the massive number of nodes in the program dependency graph of the program to be tested are dispersed into different hash buckets, and the same locality sensitive hash transformation is applied to the first starting nodes of all fragile code fingerprint subgraphs to obtain the corresponding to-be-queried bucket numbers. Similarity matching is therefore performed only between each first starting node and the nodes of the program dependency graph of the program to be tested in the hash bucket with its to-be-queried bucket number, which greatly reduces the number of node comparisons and allows the second starting nodes to be located quickly in the program dependency graph of the program to be tested.
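Steps S1021-S1024 can be sketched end to end as follows. This is a simplified illustration, not the disclosed implementation: it assumes nodes are token lists, uses a SimHash-style 64-bit fingerprint built from MD5 token hashes, takes the top 8 bits as the bucket number, and compares candidates by Hamming distance. All function names, the bucket width, and the threshold value are assumptions.

```python
import hashlib
from collections import defaultdict

BITS = 64


def simhash(tokens):
    """SimHash-style fingerprint: per-bit majority vote over token hashes."""
    counts = [0] * BITS
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def bucket_of(fingerprint, prefix_bits=8):
    """Bucket number: the top prefix_bits bits of the fingerprint."""
    return fingerprint >> (BITS - prefix_bits)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def locate_second_start_nodes(pdg_nodes, first_start_nodes, threshold=4):
    # S1021: map every node of the program's PDG into a hash bucket.
    buckets = defaultdict(list)
    for node_id, tokens in pdg_nodes.items():
        fp = simhash(tokens)
        buckets[bucket_of(fp)].append((node_id, fp))

    clone_pairs = []
    for first_id, tokens in first_start_nodes.items():
        # S1022: compute the to-be-queried bucket number with the same hash function.
        fp = simhash(tokens)
        # S1023: compute distances only against nodes in that bucket.
        for node_id, other_fp in buckets.get(bucket_of(fp), []):
            # S1024: nodes closer than the threshold become second starting nodes.
            if hamming(fp, other_fp) < threshold:
                clone_pairs.append((first_id, node_id))
    return clone_pairs


# Hypothetical data: node 10 is an exact copy of the fingerprint's starting statement.
pdg_nodes = {
    10: ["strcpy", "(", "buf", ",", "src", ")"],
    11: ["return", "0"],
    12: ["len", "=", "strlen", "(", "src", ")"],
}
first_start_nodes = {"fingerprint-1": ["strcpy", "(", "buf", ",", "src", ")"]}
pairs = locate_second_start_nodes(pdg_nodes, first_start_nodes)
```

A single prefix bucket misses near-duplicates whose fingerprints differ in one of the prefix bits; practical LSH schemes typically query several hash tables or bands to reduce such misses, at the cost of more candidate comparisons.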

S103: performing slice subgraph matching in parallel on the fragile code fingerprint subgraphs and the program dependency graph of the program to be tested based on all the starting node clone pairs, wherein the slice subgraph matching comprises, for any starting node clone pair:

Specifically, step S103 performs slice subgraph matching in parallel based on the starting node clone pairs determined in step S102, and, starting from each starting node clone pair, searches for the remaining matching nodes in the program dependency graph of the program to be tested and the fragile code fingerprint subgraph. Slice subgraph matching slices the program dependency graph of the program to be tested and the fragile code fingerprint subgraph from the second starting node and the first starting node respectively, based on a preset slicing rule, and judges whether each of the remaining nodes in the fragile code fingerprint subgraph can be matched with an identical or similar node in the program dependency graph of the program to be tested, so as to determine whether the program to be tested reuses the fragile code fragment corresponding to the fragile code fingerprint subgraph.
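The simultaneous slicing described above can be sketched as a lock-step backward walk over both graphs. The code below is a simplified illustration with several assumptions: graphs are given as predecessor maps, nodes match when their labels are exactly equal (a similarity test could be substituted), and candidates are chosen greedily without backtracking. The function name `slice_match` and the sample graphs are hypothetical.

```python
from collections import deque


def slice_match(fp_preds, fp_labels, pg_preds, pg_labels, first_start, second_start):
    """Return True if every fingerprint node can be paired with a program node.

    fp_preds / pg_preds:   node -> list of predecessor nodes (control or data dependencies)
    fp_labels / pg_labels: node -> statement label
    """
    mapping = {first_start: second_start}  # fingerprint node -> program node
    queue = deque([first_start])
    while queue:
        f_node = queue.popleft()
        p_node = mapping[f_node]
        for f_pred in fp_preds.get(f_node, ()):
            if f_pred in mapping:
                continue  # already paired via another dependency path
            used = set(mapping.values())
            # Greedy choice: first unused program predecessor with a matching label.
            candidate = next(
                (q for q in pg_preds.get(p_node, ())
                 if q not in used and fp_labels[f_pred] == pg_labels[q]),
                None,
            )
            if candidate is None:
                return False  # a fingerprint predecessor has no counterpart: stop slicing
            mapping[f_pred] = candidate
            queue.append(f_pred)
    return len(mapping) == len(fp_labels)


# Hypothetical fingerprint: c depends on a and b, b depends on a.
fp_preds = {"c": ["a", "b"], "b": ["a"]}
fp_labels = {"a": "n = read()", "b": "if n > 0", "c": "buf[n] = 0"}

# Hypothetical program with one extra, unrelated statement (node 2).
pg_labels = {1: "n = read()", 2: "log(n)", 3: "if n > 0", 4: "buf[n] = 0"}
pg_preds = {4: [1, 3], 3: [1], 2: [1]}
found = slice_match(fp_preds, fp_labels, pg_preds, pg_labels, "c", 4)

# Same program without the guard: node 4 no longer depends on the predicate.
pg_preds_unguarded = {4: [1], 2: [1]}
found_unguarded = slice_match(fp_preds, fp_labels, pg_preds_unguarded, pg_labels, "c", 4)
```

The extra `log(n)` statement does not prevent the first match, which reflects the tolerance to inserted lines that slicing-based matching aims for. A greedy pairing can reject a valid match when several predecessors share a label; a fuller implementation would backtrack over candidates.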

Further, referring to FIG. 3, performing slice subgraph matching in parallel on the fragile code fingerprint subgraphs and the program dependency graph of the program to be tested based on all the starting node clone pairs may include the following steps:

S1031: creating a subtask for performing slice subgraph matching for each of the starting node clone pairs;

Specifically, each starting node clone pair comprises a first starting node and a second starting node, and each first starting node corresponds to one fragile code fingerprint subgraph. For each starting node clone pair, the fragile code fingerprint subgraph corresponding to its first starting node therefore needs to undergo slice subgraph matching once with the program dependency graph of the program to be tested, so a subtask for performing slice subgraph matching is created for each starting node clone pair. If there are m starting node clone pairs, m subtasks are created.

S1032: sending the fragile code fingerprint subgraph where the first starting node in the starting node clone pair is located and the program dependency graph of the program to be tested to the subtasks;

Specifically, each subtask is issued the fragile code fingerprint subgraph where the first starting node of its starting node clone pair is located, together with the program dependency graph of the program to be tested in which the second starting node of that clone pair is marked. That is, if there are m starting node clone pairs corresponding to m subtasks, the m fragile code fingerprint subgraphs corresponding to the first starting nodes of the m starting node clone pairs are allocated to the m subtasks one by one.

In an optional implementation manner, the program dependency graph of the program to be tested is copied into m copies, and the m copies are also sent to the m subtasks one by one, and each copy marks a second start node in the start node clone pair corresponding to the subtask. In this embodiment, the data processed between the sub-tasks are independent and isolated from each other.

In another optional implementation manner, only the information identifying the corresponding second starting node is sent to each of the m subtasks, and the m subtasks share a single copy of the program dependency graph data of the program to be tested during execution.

The form of data distribution among subtasks can be selected by those skilled in the art according to practical situations, and the present disclosure is not limited thereto.

S1033: all of the subtasks are executed in parallel.

Specifically, a single machine or a distributed system is employed to execute all subtasks in parallel. On a single machine, parallel computing can be realized with multiple virtual machines, multiple containers, multiple processors, or multiple cores in one processor; with a distributed system, the subtasks can be distributed to different nodes in the cluster to realize parallel computation. The present disclosure is not limited to particular ways of implementing parallel computing.

In an alternative embodiment, a separate thread is created for each subtask, and the subtasks are executed in parallel by multiple threads.

In another alternative embodiment, a distributed system is used to perform the multiple subtasks.

After the second starting nodes are quickly located in the program dependency graph of the program to be tested in step S102, step S103 creates a subtask for slice subgraph matching for each starting node clone pair and issues to each subtask the fragile code fingerprint subgraph and the program dependency graph of the program to be tested, so that subgraph matching is executed in parallel, which greatly improves the efficiency of fragile code fragment clone detection.
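Steps S1031-S1033 in the multi-threaded, shared-graph variant can be sketched with Python's standard `concurrent.futures` module. This is an illustration under assumptions: a clone pair is a tuple `(fingerprint id, first starting node, second starting node)`, `match_fn` stands for any slice subgraph matching routine, and `toy_match` is a deliberately trivial stand-in so the example is self-contained. All of these names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subtasks(clone_pairs, fingerprints, program_pdg, match_fn, max_workers=4):
    """Create one subtask per starting node clone pair and run them in parallel.

    All subtasks read the same program_pdg object (the shared-data variant).
    Returns the clone pairs for which match_fn reported a full match.
    """
    def subtask(pair):
        fp_id, first_start, second_start = pair
        return match_fn(fingerprints[fp_id], program_pdg, first_start, second_start)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with clone_pairs.
        results = list(pool.map(subtask, clone_pairs))
    return [pair for pair, matched in zip(clone_pairs, results) if matched]


def toy_match(fingerprint, program_pdg, first_start, second_start):
    """Stand-in matcher: the fingerprint's labels must all appear at the program node."""
    return fingerprint <= program_pdg[second_start]


# Hypothetical data: only fpA is fully contained at its second starting node.
fingerprints = {"fpA": {"x", "y"}, "fpB": {"z"}}
program_pdg = {7: {"x", "y", "w"}, 9: {"q"}}
clone_pairs = [("fpA", "a0", 7), ("fpB", "b0", 9)]
detected = run_subtasks(clone_pairs, fingerprints, program_pdg, toy_match)
```

Because the subtasks only read the shared graph, no locking is needed in this sketch. In CPython, threads do not speed up CPU-bound matching because of the global interpreter lock, so a process pool or a distributed scheduler would be the likelier choice for heavy workloads, with the graph copied or shared accordingly.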

In an alternative embodiment, referring to FIG. 4, the present disclosure proposes a method for extracting a fragile code fingerprint subgraph, including steps S201 to S204:

S201: acquiring a fragile code segment and a fragile code statement in the fragile code segment;

Specifically, a fragile code fragment is a code fragment that contains a security flaw (such as a vulnerability); it contains a plurality of code statements, and its length is not limited. A fragile code statement is a code statement that is directly related to the generation of the security flaw. Known fragile code fragments can be collected in advance from public websites and public databases such as CVE and GitHub, and the fragile code statements directly related to the security flaw can be located within them.

Optionally, for some known open-source component vulnerabilities, the complete source code containing those known open-source component vulnerabilities and their corresponding repair patches may be obtained. By comparing vulnerability fix patches, the vulnerable code segments and vulnerable code statements that cause the vulnerability can be determined. Optionally, a function body of the vulnerability function is determined as a vulnerable code segment, and a statement in the vulnerability function directly related to the vulnerability trigger is determined as a vulnerable code statement.

Optionally, if the patch file is not provided in the public database, the repaired version number of the software where the vulnerability is located may be obtained according to the relevant vulnerability description, and the vulnerable code segment and the vulnerable code statement may be located by comparing the difference between the two software versions before and after the repair.

The present disclosure does not specifically limit the manner in which the fragile code fragments and the fragile code statements are obtained.
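As one illustration of locating candidate fragile code statements by comparing the code before and after a fix, the sketch below uses Python's standard `difflib` module. It is a heuristic under assumptions: lines removed or rewritten by the fix are treated as candidates only, and the function name `removed_lines` and the C snippets are hypothetical examples, not taken from any real vulnerability.

```python
import difflib


def removed_lines(vulnerable_src, patched_src):
    """Lines present before the fix but removed or rewritten by it."""
    diff = difflib.ndiff(vulnerable_src.splitlines(), patched_src.splitlines())
    # ndiff prefixes lines unique to the first input with "- ".
    return [line[2:].strip() for line in diff if line.startswith("- ")]


# Hypothetical before/after versions of a function body.
vulnerable = """char buf[8];
strcpy(buf, src);
return 0;"""
patched = """char buf[8];
strncpy(buf, src, sizeof(buf) - 1);
return 0;"""
candidates = removed_lines(vulnerable, patched)
```

A fix that only adds a check removes no lines, so in practice the statements adjacent to or guarded by the added lines would also need to be reviewed before a fragile code statement is chosen.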

S202: generating a program dependency graph for the fragile code fragment;

Specifically, a program dependency graph of the fragile code fragment is generated, which contains a series of nodes and edges. Each node represents a statement (such as an assignment statement or a method call statement) or a control predicate; each edge represents a control or data dependency relationship. A data dependency edge reflects the influence of one variable on the value of another variable, and a control dependency edge reflects the influence that a statement, through control flow, has on the value of a variable. For example, Joern, an open source analysis tool, may be used to generate a program dependency graph for C/C++ source code. FIG. 5 is an example of a program dependency graph of a fragile code fragment shown in the present disclosure, which contains several nodes and edges, where solid line edges represent control dependencies and dashed line edges represent data dependencies.

S203: backward slicing the program dependence graph of the fragile code segment by taking the nodes of the fragile code statement as slicing points to obtain a sliced program dependence graph, wherein the sliced program dependence graph comprises all nodes and corresponding edges which have control or data dependence relation with the fragile code statement;

Specifically, program slicing is performed on the program dependency graph obtained in step S202. Program slicing is a program decomposition technique that aims to extract the needed or interesting parts from program code. The slicing point is the starting point of the program slice. Backward slicing obtains the set of statements and conditions that may affect the slicing point. Backward slicing is therefore performed with the node of the fragile code statement as the slicing point, in order to acquire all the nodes that affect the fragile code statement. The nodes and edges obtained by backward slicing contain information such as the execution path leading to the fragile code statement and the data transfer path, and form the context that is important for triggering the vulnerability.

Illustratively, fig. 6 is the sliced program dependency graph obtained by backward slicing the program dependency graph of the fragile code segment shown in fig. 5, with the code statement p++ = c as the slicing point, according to the present disclosure.
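The backward slicing of step S203 can be sketched as a reverse traversal over dependency edges. This is a minimal sketch under stated assumptions: the function name `backward_slice`, the edge tuple layout `(source, target, kind)`, and the five-node graph are hypothetical and do not reproduce the graphs of fig. 5 or fig. 6.

```python
from collections import deque


def backward_slice(edges, slicing_point):
    """Return the nodes and edges that the slicing point transitively depends on.

    edges: iterable of (source, target, kind); target depends on source.
    """
    preds = {}
    for source, target, kind in edges:
        preds.setdefault(target, []).append((source, kind))

    kept_nodes = {slicing_point}
    kept_edges = set()
    queue = deque([slicing_point])
    while queue:
        node = queue.popleft()
        for source, kind in preds.get(node, []):
            kept_edges.add((source, node, kind))
            if source not in kept_nodes:
                kept_nodes.add(source)
                queue.append(source)
    return kept_nodes, kept_edges


# Hypothetical PDG: node 4 is the fragile code statement; node 5 only depends on it.
edges = [
    (1, 2, "control"), (1, 4, "control"), (2, 4, "data"),
    (3, 5, "data"), (4, 5, "data"),
]
nodes, sliced = backward_slice(edges, 4)
print(sorted(nodes))
```

In this sketch nodes 3 and 5 are dropped because the fragile code statement does not depend on them, and the slicing point ends up with no outgoing edge inside the slice.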

S204: determining the sliced program dependency graph as a fragile code fingerprint subgraph.

Specifically, the sliced program dependency graph obtained in step S203 is determined as the fragile code fingerprint subgraph corresponding to the respective fragile code segment.

By backward slicing the program dependency graph of the fragile code segment with the node of the fragile code statement as the slicing point, the important information about the fragile code statement is retained on the one hand, and the nodes irrelevant to vulnerability triggering are removed on the other. Using the sliced program dependency graph obtained in this way as the fragile code fingerprint subgraph represents the characteristics of the fragile code segment in a lightweight manner, which can improve the accuracy of security defect detection and reduce the complexity of subgraph matching.

Optionally, the node of the fragile code statement may be taken as the first starting node of the fragile code fingerprint subgraph.

Further, the slicing is performed on the fragile code fingerprint sub-graph where the first start node is located and the program dependency graph of the program to be tested from the first start node and the second start node at the same time, and whether all the other nodes in the fragile code fingerprint sub-graph where the first start node is located can be matched with corresponding nodes in the program dependency graph of the program to be tested is determined, including steps S301 to S302:

S301: taking the first starting node as the slicing point of the fragile code fingerprint subgraph where the first starting node is located, and taking the second starting node as the slicing point of the program dependency graph of the program to be tested, wherein a node of a fragile code statement in the fragile code fingerprint subgraph is taken as the first starting node;

Specifically, program slicing is a program decomposition technique that aims to extract a desired or interesting part from program code, and the slicing point is the starting point of the program slice. Embodiments of the present disclosure take the first starting node as the slicing point for slicing the fragile code fingerprint subgraph where the first starting node is located, and the second starting node as the slicing point for slicing the program dependency graph of the program to be tested. The node of the fragile code statement in the fragile code fingerprint subgraph is taken as the first starting node.

In an embodiment provided by the present disclosure, since the fragile code fingerprint subgraph is obtained by backward slicing the program dependency graph of the fragile code segment with the node of the fragile code statement as the slicing point, the node of the fragile code statement is the only node in the fragile code fingerprint subgraph whose out-degree is zero. Taking this node as the first starting node makes it possible to traverse all the nodes in the fragile code fingerprint subgraph from it when backward slicing is performed in step S302, so that slice subgraph matching is completed more quickly.

S302: and from the slicing point, simultaneously carrying out backward slicing on the fragile code fingerprint subgraph where the first starting node is located and the program dependency graph of the program to be tested, respectively obtaining a first precursor node which has a control or data dependency relationship with the first starting node and a second precursor node which has a control or data dependency relationship with the second starting node, judging whether the first precursor node and the second precursor node are matched, if so, repeating the step, and if not, finishing the slicing, thereby judging whether all the other nodes in the fragile code fingerprint subgraph where the first starting node is located can be matched with the corresponding nodes in the program dependency graph of the program to be tested.

Specifically, starting from the slicing points, a backward slicing operation is performed synchronously on the fragile code fingerprint subgraph where the first starting node is located and on the program dependency graph of the program to be tested. Backward slicing gradually acquires a first predecessor node that has a control or data dependency relationship with the first starting node and a second predecessor node that has a control or data dependency relationship with the second starting node; the similarity of the first predecessor node and the second predecessor node is then compared to judge whether the two nodes match. The first predecessor node refers to the previous node acquired along a control or data dependency edge when backward slicing is performed from the first starting node, and the second predecessor node is defined analogously for the second starting node.

If the first predecessor node matches the second predecessor node, backward slicing continues to obtain the predecessor nodes of the first predecessor node and of the second predecessor node, and similarity matching is performed again. These steps are repeated until all the nodes in the fragile code fingerprint subgraph have been matched, at which point slicing ends. If no matching pair of predecessor nodes is found, the nodes in the fragile code fingerprint subgraph cannot all be matched to corresponding nodes in the program dependency graph of the program to be tested, and slicing ends.

Based on the combination of synchronous backward slicing and similarity comparison, the nodes of the fragile code fingerprint subgraph and the program dependency graph of the program to be tested can be matched step by step, so that whether all the other nodes in the fragile code fingerprint subgraph where the first starting node is located can be matched with the corresponding nodes in the program dependency graph of the program to be tested is judged.

Optionally, the method of similarity comparison is based on a pre-stored node hash value.
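One way such a hash-based comparison could look is sketched below. The disclosure does not specify how the node hash value is computed, so the whitespace normalisation, the use of SHA-256, and the names `node_hash` and `nodes_match` are all assumptions made for illustration.

```python
import hashlib


def node_hash(statement):
    """Hash a statement after collapsing whitespace (the normalisation is an assumption)."""
    normalised = " ".join(statement.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def nodes_match(hash_a, hash_b):
    """Compare two pre-stored node hash values."""
    return hash_a == hash_b


# Hashes are computed once and stored with each node, then only compared during matching.
stored = {
    1: node_hash("len = strlen(src);"),
    2: node_hash("len  =  strlen(src);"),  # same statement, different spacing
    3: node_hash("len = sizeof(src);"),
}
print(nodes_match(stored[1], stored[2]), nodes_match(stored[1], stored[3]))
```

Pre-storing the hash means each comparison during slicing is a constant-time equality check rather than a fresh analysis of the statement text.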

In an alternative embodiment, fig. 7 shows an example of a program dependency graph of a program to be tested and two fragile code fingerprint subgraphs, where (a) is the program dependency graph of the program to be tested and (b) and (c) are the two fragile code fingerprint subgraphs, each containing several nodes and several edges. Assume that the node labeled 3 is the first starting node of both (b) and (c), and that an identical node 3 is matched in the program dependency graph (a) of the program to be tested and determined as the second starting node. Performing steps S301 to S302 on the program dependency graph and the fragile code fingerprint subgraphs shown in fig. 7 includes:

Node 3 in the program dependency graph (a) of the program to be tested is taken as the slicing point of graph (a), and node 3 in the fragile code fingerprint subgraph (b) is taken as the slicing point of subgraph (b). Backward slicing is then performed on graph (a) and subgraph (b), each starting from its node 3. In the fragile code fingerprint subgraph (b), the first predecessor nodes 2 and 4 are obtained along the control dependency edges pointing to node 3; in the program dependency graph (a), the second predecessor nodes 2, 4 and 8 are obtained along the control dependency edges pointing to node 3. Nodes 2 and 4 in graph (a) match nodes 2 and 4 in subgraph (b). Continuing, predecessor node 1 is obtained in graph (a) along the control dependency edge pointing to node 2, and predecessor node 1 is obtained in subgraph (b) along the control dependency edge pointing to node 2; the two predecessor nodes 1 match. At this point, slicing of the fragile code fingerprint subgraph (b) ends, and a subgraph isomorphic to the fragile code fingerprint subgraph (b) has been matched in the program dependency graph (a) of the program to be tested.

By combining program slicing with similarity comparison, similar nodes and the corresponding edges between the program dependency graph of the program to be tested and a fragile code fingerprint subgraph can be extracted, thereby realizing fragile code clone detection.
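The synchronous backward slicing of steps S301 to S302 can be sketched as follows. This is a greedy, simplified sketch and not the disclosure's implementation: the function name `slice_match` and the dictionary layout are hypothetical, node labels double as hash values (an assumption), only the edges named in the walkthrough of subgraph (b) and graph (a) are encoded, and no backtracking is attempted when several candidates share a hash.

```python
from collections import deque


def slice_match(fp_preds, fp_hash, tg_preds, tg_hash, first_start, second_start):
    """Greedy synchronous backward slicing.

    fp_preds / tg_preds: node -> list of predecessor nodes (the nodes it depends on).
    fp_hash / tg_hash: node -> pre-stored hash used for the similarity comparison.
    Returns a fingerprint-to-target node mapping, or None if some node has no match.
    """
    if fp_hash[first_start] != tg_hash[second_start]:
        return None
    mapping = {first_start: second_start}
    queue = deque([first_start])
    while queue:
        fp_node = queue.popleft()
        candidates = tg_preds.get(mapping[fp_node], [])
        for fp_pred in fp_preds.get(fp_node, []):
            if fp_pred in mapping:
                # Already matched: the same dependency must exist in the target graph.
                if mapping[fp_pred] not in candidates:
                    return None
                continue
            used = set(mapping.values())
            match = next((c for c in candidates
                          if c not in used and tg_hash[c] == fp_hash[fp_pred]), None)
            if match is None:
                return None  # slicing ends: the fingerprint cannot be fully matched
            mapping[fp_pred] = match
            queue.append(fp_pred)
    return mapping


# Edges taken from the walkthrough: subgraph (b) and graph (a), both sliced from node 3.
fp_preds = {3: [2, 4], 2: [1]}
tg_preds = {3: [2, 4, 8], 2: [1]}
fp_hash = {n: n for n in (1, 2, 3, 4)}     # assumption: label equals hash value
tg_hash = {n: n for n in (1, 2, 3, 4, 8)}
print(slice_match(fp_preds, fp_hash, tg_preds, tg_hash, 3, 3))
```

Node 8 of graph (a) is simply never selected, since no node of subgraph (b) carries its hash; the match succeeds as soon as every fingerprint node has a counterpart.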

Further, before the step S102 of locating all second start nodes matching with the first start nodes of the fragile code fingerprint subgraphs in the program dependency graph of the program to be tested by using the locality sensitive hashing algorithm, the method further includes steps S401-S403:

S401: acquiring the node set of the program dependency graph of the program to be tested and the node sets of all the fragile code fingerprint subgraphs, and calculating their intersection;

Specifically, the node set of the program dependency graph of the program to be tested and the union of the node sets of the plurality of fragile code fingerprint subgraphs are obtained, and the intersection of these two sets is calculated.

S402: based on the intersection, removing nodes which do not exist in the intersection in the program dependence graph of the program to be tested;

Specifically, since the nodes in the intersection are the nodes contained in both the program dependency graph of the program to be tested and the fragile code fingerprint subgraphs, a node that is not in the intersection cannot form a clone node pair in the subsequent slice subgraph matching and can therefore be excluded. Based on the intersection result, the nodes that are not in the intersection are removed from the program dependency graph of the program to be tested, which can greatly reduce the number of nodes in that graph.

S403: based on the intersection, removing ones of the number of fragile code fingerprint subgraphs that contain nodes not present in the intersection.

Specifically, since the nodes in the intersection are the nodes included in both the program dependency graph and the fragile code fingerprint subgraph of the program to be tested, for the nodes not existing in the intersection, it is impossible to form clone node pairs in the subsequent slice subgraph matching, and therefore the nodes can be excluded. Based on the intersection result, the fragile code fingerprint subgraphs containing nodes which are not in the intersection are removed from the fragile code fingerprints, so that the number of the fragile code fingerprint subgraphs to be matched can be greatly reduced.

By computing the intersection of the node set of the program dependency graph of the program to be tested and the node sets of all the fragile code fingerprint subgraphs, and filtering both the program dependency graph and the fragile code fingerprint subgraphs based on that intersection, the program dependency graph of the program to be tested is reduced and the number of fragile code fingerprint subgraphs to be matched is also reduced. The clone comparison space therefore shrinks considerably, which reduces the computational load and shortens the computation time.
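Steps S401 to S403 amount to a few set operations, sketched below. The sketch assumes that nodes are identified by their hash values so that equal nodes compare equal across graphs; the function name `prune_by_intersection` and the sample hash strings are hypothetical.

```python
def prune_by_intersection(program_nodes, fingerprints):
    """Filter the program's nodes and the fingerprint subgraphs by their common nodes.

    program_nodes: set of node hashes in the PDG of the program to be tested.
    fingerprints: dict mapping a fingerprint id to its set of node hashes.
    """
    # S401: intersect the program's node set with all fingerprint node sets
    fingerprint_nodes = set().union(*fingerprints.values())
    common = program_nodes & fingerprint_nodes

    # S402: drop program nodes that are not in the intersection
    kept_program_nodes = {n for n in program_nodes if n in common}

    # S403: drop fingerprint subgraphs that contain a node outside the intersection
    kept_fingerprints = {fid: nodes for fid, nodes in fingerprints.items()
                         if nodes <= common}
    return common, kept_program_nodes, kept_fingerprints


program_nodes = {"h1", "h2", "h3", "h7"}
fingerprints = {"A": {"h1", "h2"}, "B": {"h2", "h9"}}
common, kept_nodes, kept_fps = prune_by_intersection(program_nodes, fingerprints)
print(sorted(kept_nodes), sorted(kept_fps))
```

Fingerprint "B" is discarded because its node "h9" never occurs in the program to be tested, so it could not be fully matched in any case.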

The first embodiment of the disclosure provides a fragile code segment clone detection method, which takes a program dependency graph as a code representation mode, quickly locates an initial node of subgraph matching through a locality sensitive hash algorithm, and then executes the subgraph matching in parallel, thereby greatly improving the fragile code clone detection efficiency.
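The parallel execution mentioned above, with one subtask per starting node clone pair, can be sketched with a worker pool. This is a minimal sketch under stated assumptions: `run_in_parallel` and `demo_match` are hypothetical names, `demo_match` is only a stand-in for slice subgraph matching, and the disclosure does not say which concurrency mechanism is used.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(clone_pairs, match_pair, max_workers=4):
    """Create one subtask per starting node clone pair and run the subtasks concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as clone_pairs
        results = list(pool.map(match_pair, clone_pairs))
    return [pair for pair, matched in zip(clone_pairs, results) if matched]


def demo_match(pair):
    """Stand-in for slice subgraph matching on one (first_start, second_start) pair."""
    first_start, second_start = pair
    return first_start == second_start


detected = run_in_parallel([(3, 3), (5, 6), (7, 7)], demo_match)
print(detected)
```

Because each clone pair is matched independently, the subtasks share no mutable state. For CPU-bound matching in CPython, a process pool such as `concurrent.futures.ProcessPoolExecutor` may be the better fit, since threads are limited by the global interpreter lock.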

Example two:

Based on the same inventive concept, the second embodiment of the present disclosure provides a fragile code fragment clone detection device. For its specific implementation, refer to the related description of the method in the first embodiment; repeated parts are not described again. As shown in fig. 8, the fragile code fragment clone detection device 800 mainly includes:

a graph generating unit 810, configured to generate a program dependency graph of a program to be tested;

an initial node clone pair determining unit 820, configured to locate, by using a locality sensitive hashing algorithm, all second initial nodes that match first initial nodes of a plurality of fragile code fingerprint subgraphs in a program dependency graph of the program to be tested, and determine a pair of the first initial nodes and the second initial nodes that match as an initial node clone pair, where the fragile code fingerprint subgraph is a partial program dependency graph extracted from the program dependency graph of the fragile code segment;

a slice subgraph matching unit 830, configured to perform slice subgraph matching in parallel on the fragile code fingerprint subgraph and the program dependency graph of the program to be tested based on all the starting node clone pairs, wherein the slice subgraph matching includes, for any starting node clone pair:

Example three:

referring to fig. 9, an embodiment of the present disclosure also proposes an electronic device 900, the electronic device 900 comprising at least one processor 910; and a memory 920 communicatively coupled to the at least one processor 910; wherein the memory 920 stores instructions executable by the at least one processor 910, the instructions being executable by the at least one processor 910 to enable the at least one processor 910 to perform the fragile code fragment clone detection method of embodiment one of the present disclosure.

The above elements in the electronic device 900 may be connected to each other by a bus, such as one of a data bus, an address bus, a control bus, an expansion bus, and a local bus, or any combination thereof.

Example four:

embodiments of the present disclosure also provide a storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the fragile code segment clone detection method according to the first embodiment of the present disclosure.

As will be appreciated by one of skill in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, devices and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims

1. A fragile code fragment clone detection method is characterized by comprising the following steps:

generating a program dependence graph of a program to be tested;

locating all second starting nodes matched with first starting nodes of a plurality of fragile code fingerprint subgraphs in a program dependence graph of the program to be tested by using a locality sensitive hashing algorithm, and determining a pair of matched first starting nodes and second starting nodes as a starting node clone pair, wherein the fragile code fingerprint subgraphs are partial program dependence graphs extracted from the program dependence graph of the fragile code segment;

2. The fragile code segment clone detection method of claim 1, wherein said locating all second start nodes matching with first start nodes of a plurality of fragile code fingerprint subgraphs in the program dependency graph of the program under test by using a locality sensitive hashing algorithm comprises the following steps:

acquiring first starting nodes of all the fragile code fingerprint subgraphs, and respectively calculating a to-be-queried bucket number for the first starting nodes through the preset locality sensitive hash function;

and marking the node in the program dependence graph of the program to be tested with the distance smaller than the preset threshold value as a second starting node.

3. The fragile code fragment clone detection method of claim 1, wherein said performing slice subgraph matching in parallel based on all said starting node clone pairs on said fragile code fingerprint subgraph and a program dependency graph of said program under test comprises the steps of:

creating a subtask for performing slice subgraph matching for each of the starting node clone pairs;

all of the subtasks are executed in parallel.

4. The fragile code fragment clone detection method of claim 1, wherein said fragile code fingerprint subgraph extraction method comprises the steps of:

generating a program dependency graph of the fragile code fragments;

backward slicing the program dependence graph of the fragile code segment by taking the nodes of the fragile code statement as slicing points to obtain a sliced program dependence graph, wherein the sliced program dependence graph comprises all nodes and corresponding edges which have control or data dependence relation with the fragile code statement;

determining the sliced program dependence graph as a fragile code fingerprint subgraph.

5. The fragile code fragment clone detection method of claim 4, wherein said slicing the fragile code fingerprint subgraph in which the first starting node is located and the program dependency graph of the program under test from the first starting node and the second starting node simultaneously, and determining whether all the remaining nodes in the fragile code fingerprint subgraph in which the first starting node is located can be matched to corresponding nodes in the program dependency graph of the program under test, comprises the following steps:

6. The fragile code segment clone detection method of claim 1, wherein before locating all second start nodes matching with the first start nodes of the plurality of fragile code fingerprint subgraphs in the program dependency graph of the program under test by using the locality sensitive hashing algorithm, further comprising the steps of:

7. A fragile code fragment clone detection device, comprising:

slicing the fragile code fingerprint subgraph where the first starting node is located and the program dependency graph of the program to be tested from the first starting node and the second starting node at the same time, and judging whether all other nodes in the fragile code fingerprint subgraph where the first starting node is located can be matched with corresponding nodes in the program dependency graph of the program to be tested;

8. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the fragile code fragment clone detection method of any one of claims 1 to 6.

9. A storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the fragile code fragment clone detection method of any one of claims 1 to 6.