CN109918128B - Code similarity detection method and system based on relation variable graph - Google Patents

Publication number
CN109918128B
Authority
CN
China
Prior art keywords
directed
relation
relationship
graph
variable graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910225678.2A
Other languages
Chinese (zh)
Other versions
CN109918128A (en
Inventor
邹娟
伍兵
侯长民
聂奇隆
安晨
陈勇
邓彬
沈鹏
郑金华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Application filed by Xiangtan University
Priority to CN201910225678.2A
Publication of CN109918128A
Application granted
Publication of CN109918128B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a code similarity detection method and system based on a relation variable graph, in the field of code similarity detection. The method comprises: removing comment information from an original code and a detection code; determining the identifiers that have a data transmission relationship in the processed original code and the processed detection code; determining a first corresponding relation and a second corresponding relation, the first corresponding relation being the correspondence among the identifiers in the processed original code and the second corresponding relation being the correspondence among the identifiers in the processed detection code; establishing a first directed relation variable graph according to the first corresponding relation and a second directed relation variable graph according to the second corresponding relation; and calculating the similarity of the first directed relation variable graph and the second directed relation variable graph so as to determine the similarity of the original code and the detection code. With the method or system provided by the invention, the similarity between a code with many changed parts and the original code can be detected more efficiently.

Description

Code similarity detection method and system based on relation variable graph
Technical Field
The invention relates to the field of code similarity detection, and in particular to a code similarity detection method and system based on a relation variable graph.
Background
In the current information age, many students in the education field copy programs found on the internet, or modify a classmate's program, to complete the assignments set by their teachers, which makes teaching and assessment much harder; in the business field, many enterprises and individuals privately reuse programs from other people's projects in their own, which gives rise to various rights disputes. Effective duplicate checking of code is therefore increasingly important.
At present, code duplicate checking relies on statistical methods based on character string matching, Token-based methods, and methods based on the program structure graph, each of which has drawbacks. The statistical method based on character string matching has low detection precision and cannot cope with renamed variables or inserted redundant code. The Token-based method has low detection accuracy; it can resist obfuscations such as renaming variables and moving functions, but its overall resistance to obfuscation is weak and it copes poorly with inserted redundant code. The method based on the program structure graph is more precise, but its time and space complexity is high, which makes it inconvenient for large-scale duplicate-checking work.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a code similarity detection method and system based on a relation variable graph, which can efficiently detect the similarity between a code with many changed parts and the original code.
In order to achieve the purpose, the invention provides the following scheme:
a code similarity detection method based on a relation variable graph comprises the following steps:
removing the annotation information in the original code and the detection code;
determining identifiers with a data transmission relationship in the processed original code and the processed detection code;
determining a first corresponding relation and a second corresponding relation; the first corresponding relation is the corresponding relation among all the identifiers in the processed original codes; the second corresponding relation is the corresponding relation between all the identifiers in the processed detection codes;
establishing a first directed relation variable graph according to the first corresponding relation, and establishing a second directed relation variable graph according to the second corresponding relation;
and calculating the similarity of the first directed relation variable graph and the second directed relation variable graph, thereby determining the similarity of the original code and the detected code.
Optionally, the removing of the annotation information in the original code and the detection code specifically includes:
and removing the annotation information and the header file information in the original code and the detection code.
Optionally, the determining of the identifiers with a data transmission relationship in the processed original code and the processed detection code specifically includes:
extracting the first identifier and the second identifier; the first identifier is an identifier with a data transmission relationship in the processed original code; the second identifier is an identifier with a data transmission relationship in the processed detection code;
judging whether the first identifier is outside the main function and the user-defined functions to obtain a first judgment result;
if the first judgment result shows that the first identifier is outside the main function and the user-defined functions, directly storing the first identifier into a first variable table;
if the first judgment result shows that the first identifier is inside the main function or a user-defined function, storing the first identifier together with the function name into the first variable table;
judging whether the second identifier is outside the main function and the user-defined functions to obtain a second judgment result;
if the second judgment result shows that the second identifier is outside the main function and the user-defined functions, directly storing the second identifier into a second variable table;
and if the second judgment result shows that the second identifier is inside the main function or a user-defined function, storing the second identifier together with the function name into the second variable table.
Optionally, the determining the first corresponding relationship and the second corresponding relationship specifically includes:
determining the corresponding relation between any two first identifiers through the assignment relation and the array relation so as to obtain a plurality of first corresponding relations;
and determining the corresponding relation between any two second identifiers through the assignment relation and the array relation so as to obtain a plurality of second corresponding relations.
Optionally, the establishing a first directed relationship variable graph according to the first corresponding relationship, and the establishing a second directed relationship variable graph according to the second corresponding relationship specifically include:
establishing a first directed relation variable graph by taking all the first identifiers as vertexes of the directed relation variable graph and taking all the first corresponding relations as edges of the directed relation variable graph;
and establishing a second directed relation variable graph by taking all the second identifiers as the vertexes of the directed relation variable graph and taking all the second corresponding relations as the edges of the directed relation variable graph.
Optionally, the calculating the similarity between the first directed relationship variable graph and the second directed relationship variable graph to determine the similarity between the original code and the detected code specifically includes:
comparing the number of vertices in the first directed relation variable graph and the second directed relation variable graph, determining the graph with fewer vertices as the basic directed relation variable graph, and determining the graph with more vertices as the search directed relation variable graph;
for a first vertex in the basic directed relation variable graph, counting the vertices in the search directed relation variable graph whose out-degree is greater than the out-degree of the first vertex and whose in-degree is greater than the in-degree of the first vertex, storing these vertices into a first set, and proceeding in the same way until the first set corresponding to each vertex in the basic directed relation variable graph is determined;
searching the basic directed relation variable graph and the search directed relation variable graph for the largest identical subgraph containing the first vertex and a first point, and storing the number of edges of that subgraph into a second set, wherein the first vertex is the first vertex in the basic directed relation variable graph and the first point is the first point in the first set corresponding to the first vertex, and repeating the operation until all the points in the first set corresponding to the first vertex have been searched;
when all the points in the first set corresponding to the first vertex have been searched, repeating the operation for the remaining vertices until all the vertices in the basic directed relation variable graph have been searched; wherein the number of second sets is the same as the number of vertices of the basic directed relation variable graph;
and according to all the second sets, calculating the similarity of the first directed relation variable graph and the second directed relation variable graph, thereby determining the similarity of the original code and the detection code.
Optionally, the calculating, according to all the second sets, a similarity between the first directed relationship variable graph and the second directed relationship variable graph, so as to determine a similarity between the original code and the detected code, specifically includes:
judging whether the number of the second set is equal to the number of edges of the basic directed relationship variable graph or not to obtain a third judgment result;
if the third judgment result indicates that the number of the second set is equal to the number of edges of the basic directed relationship variable graph, determining that the similarity between the first directed relationship variable graph and the second directed relationship variable graph is 100%, and thus determining that the similarity between the original code and the detected code is 100%;
And if the third judgment result shows that the number of the second sets is not equal to the number of the edges of the basic directed relationship variable graph, storing the maximum value in each second set into a third set, and calculating the similarity between the first directed relationship variable graph and the second directed relationship variable graph according to the third set and the number of the vertexes of the basic directed relationship variable graph, so as to determine the similarity between the original code and the detection code.
Optionally, the calculating, according to the third set and the number of vertices of the base directed-relationship variable graph, a similarity between the first directed-relationship variable graph and the second directed-relationship variable graph, so as to determine a similarity between the original code and the detected code, specifically includes:
calculating the similarity of the first directed relation variable graph and the second directed relation variable graph according to the following formula;
the formula is
[Formula image BDA0002005100850000041, not reproduced in this text]
Wherein sim represents the similarity between the first directed-relational variable graph and the second directed-relational variable graph or the similarity between the original code and the detection code, n represents the number of vertices of the basic directed-relational variable graph, and Σ max [ ] represents the sum of all elements in the third set.
A code similarity detection system based on a relation variable graph comprises:
the removing module is used for removing the annotation information in the original code and the detection code;
the identifier determining module is used for determining an identifier with a data transmission relationship in the processed original code and the processed detection code;
a corresponding relation determining module for determining a first corresponding relation and a second corresponding relation; the first corresponding relation is the corresponding relation among all the identifiers in the processed original codes; the second corresponding relation is the corresponding relation between all the identifiers in the processed detection codes;
the directed relation variable graph establishing module is used for establishing a first directed relation variable graph according to the first corresponding relation and establishing a second directed relation variable graph according to the second corresponding relation;
and the similarity calculation module is used for calculating the similarity of the first directed relation variable graph and the second directed relation variable graph so as to determine the similarity of the original code and the detection code.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention provides a code similarity detection method and system based on a relation variable graph that analyzes the data flow of a program. As long as the data flow in two programs is the same, the directed relation variable graphs built from them will be similar, so the similarity between a code with many changed parts and the original code can be detected efficiently.
In addition, the invention is robust against plagiarism techniques such as changing the program structure, renaming identifiers, and adding redundant information, and therefore has strong anti-interference capability. Furthermore, when calculating the similarity of the two graphs, the method does not apply a greedy algorithm to the whole graph but only to some of the vertices in the graphs, which effectively reduces the complexity.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a code similarity detection method based on a relationship variable graph according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a code similarity detection system based on a relationship variable graph according to an embodiment of the present invention;
FIG. 3 is a diagram of a process for extracting identifiers using a code program according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating correspondence between identifiers according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the conversion between a directed relation variable graph and an adjacency matrix according to an embodiment of the present invention;
FIG. 6 shows directed relation variable graphs according to an embodiment of the present invention; FIG. 6(A) is a directed relation variable graph with fewer vertices, and FIG. 6(B) is a directed relation variable graph with more vertices.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flowchart of a code similarity detection method based on a relation variable graph according to an embodiment of the present invention. As shown in fig. 1, the code similarity detection method provided in this embodiment includes:
Step 101: removing the comment information in the original code and the detection code.
Step 102: determining the identifiers with a data transmission relationship in the processed original code and the processed detection code.
Step 103: determining a first corresponding relation and a second corresponding relation; the first corresponding relation is the correspondence among the identifiers in the processed original code; the second corresponding relation is the correspondence among the identifiers in the processed detection code.
Step 104: establishing a first directed relation variable graph according to the first corresponding relation and a second directed relation variable graph according to the second corresponding relation.
Step 105: calculating the similarity of the first directed relation variable graph and the second directed relation variable graph so as to determine the similarity of the original code and the detection code.
Preferably, step 101 specifically includes:
removing parts such as comment information and header file information from the original code and the detection code.
In a programming language, an identifier is a name with a specific meaning that the programmer chooses, such as a class name, an attribute name, or a variable name.
In this embodiment, taking a C language program as an example, step 102 preferably includes:
Step 2.1, scanning the processed original code and the processed detection code sequentially from beginning to end.
Step 2.2, extracting the first identifier and the second identifier; the first identifier is an identifier with a data transmission relationship in the processed original code; the second identifier is an identifier having a relationship of passing data in the processed detection code.
Step 2.3, if the identifier is defined outside the main function and the user-defined functions, storing the identifier directly into the variable table; if the identifier is defined inside the main function or a user-defined function, storing it in the variable table in the format "function name + identifier". This specifically includes:
judging whether the first identifier is outside the main function and the user-defined functions to obtain a first judgment result; if the first judgment result shows that the first identifier is outside the main function and the user-defined functions, storing the first identifier directly into the first variable table; and if the first judgment result shows that the first identifier is inside the main function or a user-defined function, storing the first identifier together with the function name into the first variable table.
Judging whether the second identifier is outside the main function and the user-defined functions to obtain a second judgment result; if the second judgment result shows that the second identifier is outside the main function and the user-defined functions, storing the second identifier directly into the second variable table; and if the second judgment result shows that the second identifier is inside the main function or a user-defined function, storing the second identifier together with the function name into the second variable table.
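The scoping rule of step 2.3 can be sketched in a few lines of Python. This is only an illustration, assuming the identifiers and their enclosing functions have already been extracted; the helper names `qualify` and `build_variable_table`, the "." separator, and the sample declarations are hypothetical and are not specified by the patent.

```python
from typing import Optional


def qualify(identifier: str, function: Optional[str]) -> str:
    """Return the variable-table entry for one identifier.

    Identifiers defined outside every function are stored as-is; identifiers
    defined inside main or a user-defined function are prefixed with the
    function name. The "." separator is an assumption of this sketch.
    """
    return identifier if function is None else f"{function}.{identifier}"


def build_variable_table(declarations):
    """Build an ordered, duplicate-free variable table from
    (identifier, enclosing_function_or_None) pairs."""
    table = []
    for identifier, function in declarations:
        entry = qualify(identifier, function)
        if entry not in table:
            table.append(entry)
    return table


# Hypothetical declarations: one global, then locals in main and in swap.
declarations = [("total", None), ("i", "main"), ("buf", "main"), ("i", "swap")]
variable_table = build_variable_table(declarations)
print(variable_table)
```

Prefixing with the function name keeps the two local variables named `i` apart, so they become distinct vertices when the graph is built later.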
Preferably, step 103 specifically includes:
establishing the corresponding relations between identifiers through relations such as the assignment relation and the array relation; that is, determining the corresponding relation between any two first identifiers through the assignment relation and the array relation to obtain a plurality of first corresponding relations, and determining the corresponding relation between any two second identifiers through the assignment relation and the array relation to obtain a plurality of second corresponding relations.
The identifiers serve as the vertices of the graph, and the relations between identifiers serve as its edges. For example, in an assignment relation, when the value of one identifier is assigned to another identifier, a directed edge is established from the identifier supplying the value to the identifier receiving it; when an array is used and the array index is an identifier, a directed edge is established from the index identifier to the array name; and so on.
Based on this, preferably, step 104 specifically includes:
establishing a first directed relation variable graph by taking all the first identifiers as the vertices of the directed relation variable graph and taking all the first corresponding relations as the edges of the directed relation variable graph.
Establishing a second directed relation variable graph by taking all the second identifiers as the vertices of the directed relation variable graph and taking all the second corresponding relations as the edges of the directed relation variable graph.
Preferably, step 105 specifically includes:
Step 5.1, comparing the number of vertices in the first directed relation variable graph and the second directed relation variable graph, determining the graph with fewer vertices as the basic directed relation variable graph, namely graph a, and determining the graph with more vertices as the search directed relation variable graph, namely graph b.
Step 5.2, for each vertex ai in graph a, counting the vertices bi in graph b whose out-degree is greater than the out-degree of ai and whose in-degree is greater than or equal to the in-degree of ai, and putting them into the first set ai{ }, until the first set corresponding to each vertex in the basic directed relation variable graph is determined.
Step 5.3, selecting a vertex ai of graph a and the first point in the corresponding first set ai{ }, searching the two graphs for the largest identical subgraph containing these two points, and storing the number of edges of that subgraph into a second set Leni[ ]; repeating the operation until all the points in the first set ai{ } have been searched.
Step 5.4, after all the points in the first set corresponding to the current vertex have been searched, starting on the next vertex of graph a according to step 5.3 until all the vertices in graph a have been searched; the number of second sets is the same as the number of vertices of the basic directed relation variable graph.
Step 5.5, calculating the similarity of the two graphs, and hence of the two program codes, according to the values in all the second sets Leni[ ]. This specifically includes:
judging whether the number of the second set is equal to the number of edges of the basic directed relation variable graph to obtain a third judgment result.
If the third judgment result shows that the number of the second set is equal to the number of edges of the basic directed relation variable graph, determining that the similarity of the first directed relation variable graph and the second directed relation variable graph is 100%, and thus that the similarity of the original code and the detection code is 100%.
If the third judgment result shows that the number of the second set is not equal to the number of edges of the basic directed relation variable graph, storing the maximum value in each second set into a third set, and calculating the similarity between the first directed relation variable graph and the second directed relation variable graph according to the third set and the number of vertices of the basic directed relation variable graph, thereby determining the similarity between the original code and the detection code. The calculation formula is
[Formula image BDA0002005100850000091, not reproduced in this text]
Where sim represents the similarity between the first directed-relationship variable graph and the second directed-relationship variable graph or the similarity between the original code and the detected code, n represents the number of vertices of the basic directed-relationship variable graph, and Σ max [ ] represents the sum of all elements in the third set.
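Steps 5.1 to 5.4 can be illustrated with a small Python sketch. Several points are assumptions rather than the patented procedure: the text does not say how the largest identical subgraph is searched, so `common_edges` simply extends a seed pairing greedily along edges; the degree filter uses "greater than or equal to" for both degrees, because a strict comparison would stop a vertex from matching its own copy in an identical graph; and the 100% test is read as "some entry of a second set equals the edge count of the basic graph". The similarity formula itself is not computed, since it is available only as an image. All function names and the two sample graphs are hypothetical.

```python
def degrees(graph):
    """Return (out_degree, in_degree) maps for a {vertex: successor_set} graph."""
    out_deg = {v: len(succ) for v, succ in graph.items()}
    in_deg = {v: 0 for v in graph}
    for succ in graph.values():
        for w in succ:
            in_deg[w] += 1
    return out_deg, in_deg


def candidate_sets(base, search):
    """Step 5.2: first set of each base vertex (>= on both degrees is assumed)."""
    b_out, b_in = degrees(base)
    s_out, s_in = degrees(search)
    return {a: [b for b in search if s_out[b] >= b_out[a] and s_in[b] >= b_in[a]]
            for a in base}


def common_edges(base, search, cands, seed_a, seed_b):
    """Step 5.3 (assumed procedure): greedily extend the seed pairing along
    edges and return how many base edges the pairing preserves."""
    mapping, used, frontier = {seed_a: seed_b}, {seed_b}, [seed_a]
    while frontier:
        a = frontier.pop()
        image = mapping[a]
        # Neighbours of a in the base graph: True for a -> x, False for x -> a.
        neighbours = [(x, True) for x in base[a]]
        neighbours += [(x, False) for x in base if a in base[x]]
        for x, forward in neighbours:
            if x in mapping:
                continue
            for y in cands[x]:
                same_direction = y in search[image] if forward else image in search[y]
                if same_direction and y not in used:
                    mapping[x] = y
                    used.add(y)
                    frontier.append(x)
                    break
    return sum(1 for u in mapping for v in base[u]
               if v in mapping and mapping[v] in search[mapping[u]])


# Hypothetical graphs: g1 is the chain a -> b -> c; g2 holds that chain plus 1 -> 4.
g1 = {"a": {"b"}, "b": {"c"}, "c": set()}
g2 = {1: {2, 4}, 2: {3}, 3: set(), 4: set()}

base, search = sorted((g1, g2), key=len)  # step 5.1: graph with fewer vertices first
cands = candidate_sets(base, search)
second_sets = {a: [common_edges(base, search, cands, a, b) for b in cands[a]]
               for a in base}             # steps 5.3 and 5.4
third_set = [max(lens, default=0) for lens in second_sets.values()]
base_edges = sum(len(succ) for succ in base.values())
full_match = any(n == base_edges for lens in second_sets.values() for n in lens)
print(cands, second_sets, third_set, full_match)
```

In this toy case every edge of the smaller graph is matched from at least one seed pair, so under the stated reading the two programs would be reported as 100% similar even though the larger graph has an extra edge.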
Example 2
Fig. 2 is a schematic structural diagram of a code similarity detection system based on a relationship variable graph according to an embodiment of the present invention, and as shown in fig. 2, the code similarity detection system based on a relationship variable graph according to the embodiment includes:
The removing module 100 is configured to remove the comment information in the original code and the detection code.
The identifier determining module 200 is configured to determine the identifiers with a data transmission relationship in the processed original code and the processed detection code.
The corresponding relation determining module 300 is configured to determine a first corresponding relation and a second corresponding relation; the first corresponding relation is the correspondence among the identifiers in the processed original code; the second corresponding relation is the correspondence among the identifiers in the processed detection code.
The directed relation variable graph establishing module 400 is configured to establish a first directed relation variable graph according to the first corresponding relation and a second directed relation variable graph according to the second corresponding relation.
The similarity calculation module 500 is configured to calculate the similarity between the first directed relation variable graph and the second directed relation variable graph, so as to determine the similarity between the original code and the detection code.
Example 3
This embodiment provides a code similarity detection method based on a relation variable graph, which comprises the following steps:
Step 1, removing the comment information in the code program; the code program includes an original code and a detection code.
Step 1 plays an auxiliary role in code detection: removing the comment information and header file information facilitates the subsequent extraction of identifiers and relations. The specific steps comprise:
removing the comment information in the code, such as the content between /* and */ and the content after //, and removing the header file information.
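As a rough illustration of step 1 for C source text, the following Python sketch strips block comments, line comments, and `#include` lines with regular expressions. The function name `strip_comments_and_headers` and the sample program are hypothetical, and the sketch deliberately ignores comment markers that appear inside string literals, which a real preprocessor would have to handle.

```python
import re


def strip_comments_and_headers(source: str) -> str:
    """Remove C block comments, line comments, and #include lines.

    Simplification: comment markers inside string literals are not
    recognised, so this is suitable for illustration only.
    """
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)  # /* ... */
    source = re.sub(r"//[^\n]*", "", source)                    # // ...
    source = re.sub(r"^[ \t]*#[ \t]*include[^\n]*\n?", "", source,
                    flags=re.MULTILINE)                         # header lines
    # Drop lines left empty by the removals and trim trailing spaces.
    return "\n".join(line.rstrip() for line in source.splitlines() if line.strip())


sample = """#include <stdio.h>
/* sum two
   numbers */
int main() {
    int a = 1; // first
    int b = a;
    return 0;
}
"""
cleaned = strip_comments_and_headers(sample)
print(cleaned)
```

Only the statements that can carry data relationships remain, which is what the later identifier and relation extraction operates on.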
Step 2, extracting the identifiers that can transfer data relationships in the processed code program.
Step 2 aims to extract the identifiers that can transfer data relationships in the code, which lays the foundation for the later extraction of relations. The specific steps comprise:
taking C language program code as an example, the ordinary variable names, pointer variable names, structure variable names, and array names defined in it are stored in a variable table; the process of extracting the identifiers is shown in fig. 3.
Step 3, extracting the corresponding relations among the identifiers that can transfer data relationships in the code program.
Step 3 extracts the relation between two identifiers that can transfer data relationships in the code, takes the two identifier names as vertices of the graph, and takes their relation as an edge of the graph, as shown in fig. 4. The specific steps comprise:
Step 3.1, while the identifiers are being extracted in step 2, searching for structures of the form A[B] and A = B.
Step 3.2, judging whether A and B are both in the previously established variable table.
Step 3.3, if both A and B are in the variable table, establishing the relation between A and B, represented as B -> A in the figure.
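Steps 3.1 to 3.3 can be sketched as a pattern search over the cleaned source text. The Python below is a simplified illustration, not the patent's extractor: the regular expressions, the function name `extract_relations`, and the sample statement are assumptions, function-name prefixes in the variable table are omitted for brevity, and compound forms such as `A[B] = C` or `A += B` are not treated as assignments.

```python
import re

IDENT = r"[A-Za-z_]\w*"
# A = B: a plain assignment between two identifiers (not ==, <=, >=, !=).
ASSIGN = re.compile(rf"({IDENT})\s*=(?!=)\s*({IDENT})\b")
# A[B]: an array indexed by an identifier.
INDEX = re.compile(rf"({IDENT})\s*\[\s*({IDENT})\s*\]")


def extract_relations(code, variable_table):
    """Return directed edges (B, A), meaning B -> A, for every A = B or A[B]
    in which both names are present in the variable table."""
    edges = []
    for pattern in (ASSIGN, INDEX):
        for a, b in pattern.findall(code):
            if a in variable_table and b in variable_table and (b, a) not in edges:
                edges.append((b, a))
    return edges


# Hypothetical cleaned statement sequence and variable table.
code = "sum = n; data[i] = sum; if (n == k) k = sum;"
variable_table = {"sum", "n", "data", "i", "k"}
edges = extract_relations(code, variable_table)
print(edges)
```

The comparison `n == k` produces no edge because it moves no data, while `data[i]` yields an edge from the index `i` to the array name `data`.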
Step 4: build the relationships extracted in step 3 into a directed relationship graph.
The purpose of step 4 is to facilitate the subsequent similarity calculation. The specific steps include:
the extracted relationships are stored in an adjacency matrix; the graph built from a program and its conversion to an adjacency matrix are shown in Fig. 5, where 0 indicates that there is no edge, and 1 and 2 indicate the direction of an edge.
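A possible adjacency-matrix encoding is sketched below. The text only states that 0 means no edge and that 1 and 2 encode direction; the concrete convention used here (entry `[i][j] = 1` for an edge i -> j and the mirrored entry `[j][i] = 2`) is an assumption, as is the decision to skip self-loops and to reject a pair of opposite edges between the same two vertices. The name `build_adjacency_matrix` is hypothetical.

```python
def build_adjacency_matrix(edges):
    """Return (vertex list, matrix) for a set of directed edges (src, dst)."""
    vertices = sorted({v for edge in edges for v in edge})
    index = {name: i for i, name in enumerate(vertices)}
    size = len(vertices)
    matrix = [[0] * size for _ in range(size)]
    for src, dst in edges:
        i, j = index[src], index[dst]
        if i == j:
            continue  # assumption: a self-assignment adds no edge
        if matrix[i][j] == 2:
            # The assumed 0/1/2 encoding cannot hold both i -> j and j -> i.
            raise ValueError(f"opposite edges between {src!r} and {dst!r}")
        matrix[i][j] = 1  # outgoing direction, seen from row i
        matrix[j][i] = 2  # incoming direction, seen from row j
    return vertices, matrix
```

With this convention the out-degree of a vertex is the count of 1 entries in its row and the in-degree is the count of 2 entries, which is convenient for the degree comparison in step 5.1.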
Step 5: compare the relation variable graphs built from the two program codes to obtain the similarity of the two graphs, and thereby the similarity of the corresponding program codes.
The purpose of step 5 is to reflect the similarity of two program codes through the similarity of their two relation graphs. The specific steps include:
Step 5.1: taking graph (a) and graph (b) in Fig. 6 as an example, first select the graph with fewer vertices and name it graph A, and name the other graph B. For each vertex Ai in graph A, collect the vertices in graph B whose out-degree is greater than the out-degree of Ai and whose in-degree is also greater than or equal to the in-degree of Ai, and put them into the set Ai{}; in this example, Aa = {1, 2, 4}, Ab = {2, 4}, Ac = {1, 2, 4}, and Ad = {1, 2, 3, 4, 5}.
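The candidate sets of step 5.1 can be computed as in the following sketch. The wording of the text leaves open whether the out-degree comparison is strict; the sketch assumes "greater than or equal to" for both degrees, and `>=` can be changed to `>` for a strict reading. The example graphs in the usage note are made up for illustration and are not those of Fig. 6. The function names are hypothetical.

```python
from collections import Counter


def degrees(edges):
    """Return (out-degree, in-degree) counters for directed edges (src, dst)."""
    out_deg, in_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return out_deg, in_deg


def candidate_sets(edges_a, edges_b):
    """For each vertex of graph A, list the vertices of graph B that could match.

    Graph A is expected to be the graph with fewer vertices.
    """
    vertices_a = sorted({v for edge in edges_a for v in edge})
    vertices_b = sorted({v for edge in edges_b for v in edge})
    out_a, in_a = degrees(edges_a)
    out_b, in_b = degrees(edges_b)
    return {
        va: [vb for vb in vertices_b
             if out_b[vb] >= out_a[va] and in_b[vb] >= in_a[va]]
        for va in vertices_a
    }
```

For the toy graphs A = {a->b, a->c} and B = {1->2, 1->3, 3->4}, vertex a (out-degree 2) can only be matched with vertex 1, while b and c (in-degree 1) can each be matched with 2, 3, or 4.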
Step 5.2: select the first vertex a of graph A and, at the same time, the first point 1 in the set Aa{}; starting from a in graph A and from 1 in graph B, search for the largest identical subgraph containing the two points. In this example, the subgraph found in graph A contains points a, b, c, and d, and the subgraph found in graph B contains points 1, 2, 4, and 5.
Step 5.3: store the number of edges len of the largest identical subgraph found into the set Leni[]; after all the points in the set Ai{} have been searched, start the search for the next vertex of graph A according to step 5.2.
Step 5.4: if, during the search, a value stored in the set Leni[] is found to equal the number of edges of graph A, the similarity Sim of the two program codes is 100%, and the procedure ends.
Step 5.5: if, during the search, no value in the sets Leni[] equals the number of edges of graph A, store the maximum value of each set Leni[] into the set Max[].
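The search of steps 5.2 to 5.5 can be approximated by the sketch below. It is a greedy paired traversal, not an exact maximum common subgraph search: starting from a seed pair, it walks both graphs in parallel, pairs each unmatched neighbour in graph A with the first free neighbour in graph B, and then counts the edges of A that are preserved. The function names, the greedy strategy, and the use of the candidate sets as a plain dictionary are assumptions of this illustration.

```python
from collections import deque


def neighbours(edges):
    """Return (successor map, predecessor map) for directed edges."""
    succ, pred = {}, {}
    for src, dst in edges:
        succ.setdefault(src, set()).add(dst)
        pred.setdefault(dst, set()).add(src)
    return succ, pred


def common_edges_from_seed(edges_a, edges_b, seed_a, seed_b):
    """Greedy estimate of len: shared edges in a subgraph grown from a seed pair."""
    succ_a, pred_a = neighbours(edges_a)
    succ_b, pred_b = neighbours(edges_b)
    mapping = {seed_a: seed_b}  # vertex of A -> vertex of B
    used_b = {seed_b}
    queue = deque([seed_a])
    while queue:
        x = queue.popleft()
        y = mapping[x]
        # Follow outgoing edges first, then incoming edges, in both graphs.
        for adj_a, adj_b in ((succ_a, succ_b), (pred_a, pred_b)):
            for next_a in sorted(adj_a.get(x, ())):
                if next_a in mapping:
                    continue
                free = sorted(adj_b.get(y, set()) - used_b)
                if free:
                    mapping[next_a] = free[0]
                    used_b.add(free[0])
                    queue.append(next_a)
    # Count the edges of A whose image under the mapping is an edge of B.
    return sum(1 for src, dst in edges_a
               if src in mapping and dst in mapping
               and (mapping[src], mapping[dst]) in edges_b)


def max_common_edges(edges_a, edges_b, candidates):
    """Build Max[]: the largest len over the candidate set of each vertex of A."""
    return {
        va: max((common_edges_from_seed(edges_a, edges_b, va, vb)
                 for vb in cands), default=0)
        for va, cands in candidates.items()
    }
```

Because the pairing is greedy, the value returned is a lower bound on the size of the true largest identical subgraph; an exact search would need backtracking over the alternative pairings.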
Step 5.6: obtain the similarity of the two graphs according to the following formula, and thereby the similarity of the two program codes. The calculation formula is
[Formula image BDA0002005100850000111: similarity formula, not reproduced in the extracted text]
where sim represents the similarity of the two program codes, n represents the number of vertices in graph A, and ΣMax[] represents the sum of all the elements in the set Max[].
The invention effectively counters attempts to lower the measured similarity by heavily modifying the original program: as long as the data flows of the two programs are similar, their relation graphs share large identical parts. However the structure of a program is changed, whether by adding, deleting, or replacing code, the relationships between its data cannot change much if the program is still to realize its function, and therefore the relation graph does not change much either.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for the relevant details, refer to the description of the method.
The principles and implementations of the present invention have been explained herein using specific examples; the above description of the embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims (5)

1. A code similarity detection method based on a relation variable graph is characterized by comprising the following steps:
removing the annotation information in the original code and the detection code;
determining identifiers with a data transmission relationship in the processed original code and the processed detection code;
determining a first corresponding relation and a second corresponding relation; the first corresponding relation is the corresponding relation among all the identifiers in the processed original codes; the second corresponding relation is the corresponding relation between all the identifiers in the processed detection codes; the determining the first corresponding relationship and the second corresponding relationship specifically includes: determining the corresponding relation between any two first identifiers through the assignment relation and the array relation so as to obtain a plurality of first corresponding relations; determining the corresponding relation between any two second identifiers through the assignment relation and the array relation so as to obtain a plurality of second corresponding relations;
establishing a first directed relationship variable graph according to the first corresponding relationship, and establishing a second directed relationship variable graph according to the second corresponding relationship, which specifically comprises the following steps: establishing the first directed relation variable graph by taking all the first identifiers as vertices of the directed relation variable graph and taking all the first corresponding relations as edges of the directed relation variable graph; establishing the second directed relation variable graph by taking all the second identifiers as vertices of the directed relation variable graph and taking all the second corresponding relations as edges of the directed relation variable graph;
calculating the similarity of the first directed relation variable graph and the second directed relation variable graph so as to determine the similarity of the original code and the detected code;
wherein, the determining that the processed original code and the processed detection code have the identifier of the data transmission relationship specifically includes:
extracting the first identifier and the second identifier; the first identifier is an identifier with a data transmission relationship in the processed original code; the second identifier is an identifier with a data transmission relationship in the processed detection code;
judging whether the first identifier is outside the main function and the user-defined functions, to obtain a first judgment result;
if the first judgment result shows that the first identifier is outside the main function and the user-defined functions, directly storing the first identifier into a first variable table;
if the first judgment result shows that the first identifier is inside the main function or a user-defined function, storing the first identifier together with the function name into the first variable table;
judging whether the second identifier is outside the main function and the user-defined functions, to obtain a second judgment result;
if the second judgment result shows that the second identifier is outside the main function and the user-defined functions, directly storing the second identifier into a second variable table;
if the second judgment result shows that the second identifier is inside the main function or a user-defined function, storing the second identifier together with the function name into the second variable table;
the calculating the similarity between the first directed relationship variable graph and the second directed relationship variable graph so as to determine the similarity between the original code and the detected code specifically includes:
step 1: comparing the numbers of vertices in the first directed relationship variable graph and the second directed relationship variable graph, determining the directed relationship variable graph with fewer vertices as a basic directed relationship variable graph, and determining the directed relationship variable graph with more vertices as a search directed relationship variable graph;
step 2: counting the vertices in the search directed relation variable graph whose out-degree is greater than the out-degree of a first vertex and whose in-degree is greater than the in-degree of the first vertex, storing these vertices into a first set, and so on until the first set corresponding to each vertex in the basic directed relation variable graph is determined;
step 3: searching the basic directed relationship variable graph and the search directed relationship variable graph for the largest identical subgraph containing the first vertex and a first point, and storing the number of edges of the largest identical subgraph into a second set, wherein the first vertex is the first vertex in the basic directed relationship variable graph and the first point is the first point in the first set corresponding to the first vertex; repeating step 3 until all the points in the first set corresponding to the first vertex have been searched;
step 4: when all the points in the first set corresponding to the first vertex have been searched, replacing the first vertex in step 2 with the next vertex in the basic directed relation variable graph and executing step 2, until all the vertices in the basic directed relation variable graph have been searched; wherein the number of second sets is the same as the number of vertices of the basic directed relation variable graph;
step 5: calculating, according to all the second sets, the similarity of the first directed relation variable graph and the second directed relation variable graph, thereby determining the similarity of the original code and the detection code.
2. The method for detecting code similarity according to claim 1, wherein the removing annotation information in the original code and the detected code specifically includes:
and removing the annotation information and the header file information in the original code and the detection code.
3. The method according to claim 1, wherein the calculating, according to all the second sets, a similarity between the first directed relationship variable graph and the second directed relationship variable graph to determine a similarity between the original code and the detected code specifically includes:
judging whether the number of the second set is equal to the number of edges of the basic directed relationship variable graph or not to obtain a third judgment result;
if the third judgment result indicates that the number of the second set is equal to the number of edges of the basic directed relationship variable graph, determining that the similarity between the first directed relationship variable graph and the second directed relationship variable graph is 100%, and thus determining that the similarity between the original code and the detected code is 100%;
if the third judgment result shows that the number of the second sets is not equal to the number of the edges of the basic directed relationship variable graph, storing the maximum value in each second set into a third set, and calculating the similarity between the first directed relationship variable graph and the second directed relationship variable graph according to the third set and the number of vertices of the basic directed relationship variable graph, so as to determine the similarity between the original code and the detection code.
4. The method according to claim 3, wherein the determining the similarity between the original code and the detected code by calculating the similarity between the first directed relationship variable graph and the second directed relationship variable graph according to the number of vertices of the third set and the base directed relationship variable graph specifically includes:
calculating the similarity of the first directed relation variable graph and the second directed relation variable graph according to the following formula;
the formula is
[Formula image FDA0003478158280000051: similarity formula, not reproduced in the extracted text]
wherein sim represents the similarity between the first directed relationship variable graph and the second directed relationship variable graph, or the similarity between the original code and the detection code, n represents the number of vertices of the basic directed relationship variable graph, and ΣMax[] represents the sum of all elements in the third set.
5. A code similarity detection system based on a relational variable graph, the code similarity detection system comprising:
the removing module is used for removing the annotation information in the original code and the detection code;
the identifier determining module is used for determining an identifier with a data transmission relationship in the processed original code and the processed detection code;
a corresponding relation determining module for determining a first corresponding relation and a second corresponding relation; the first corresponding relation is the corresponding relation among all the identifiers in the processed original codes; the second corresponding relation is the corresponding relation between all the identifiers in the processed detection codes; the determining the first corresponding relationship and the second corresponding relationship specifically includes: determining the corresponding relation between any two first identifiers through the assignment relation and the array relation so as to obtain a plurality of first corresponding relations; determining the corresponding relation between any two second identifiers through the assignment relation and the array relation so as to obtain a plurality of second corresponding relations;
the directed relationship variable graph establishing module is configured to establish a first directed relationship variable graph according to the first corresponding relationship, and establish a second directed relationship variable graph according to the second corresponding relationship, and specifically includes: establishing the first directed relation variable graph by taking all the first identifiers as vertices of the directed relation variable graph and taking all the first corresponding relations as edges of the directed relation variable graph; establishing the second directed relation variable graph by taking all the second identifiers as vertices of the directed relation variable graph and taking all the second corresponding relations as edges of the directed relation variable graph;
the similarity calculation module is used for calculating the similarity of the first directed relation variable graph and the second directed relation variable graph so as to determine the similarity of the original code and the detection code;
wherein, the determining that the processed original code and the processed detection code have the identifier of the data transmission relationship specifically includes:
extracting the first identifier and the second identifier; the first identifier is an identifier with a data transmission relationship in the processed original code; the second identifier is an identifier with a data transmission relationship in the processed detection code;
judging whether the first identifier is outside the main function and the user-defined functions, to obtain a first judgment result;
if the first judgment result shows that the first identifier is outside the main function and the user-defined functions, directly storing the first identifier into a first variable table;
if the first judgment result shows that the first identifier is inside the main function or a user-defined function, storing the first identifier together with the function name into the first variable table;
judging whether the second identifier is outside the main function and the user-defined functions, to obtain a second judgment result;
if the second judgment result shows that the second identifier is outside the main function and the user-defined functions, directly storing the second identifier into a second variable table;
if the second judgment result shows that the second identifier is inside the main function or a user-defined function, storing the second identifier together with the function name into the second variable table;
the calculating the similarity between the first directed relationship variable graph and the second directed relationship variable graph so as to determine the similarity between the original code and the detected code specifically includes:
step 1: comparing the numbers of vertices in the first directed relationship variable graph and the second directed relationship variable graph, determining the directed relationship variable graph with fewer vertices as a basic directed relationship variable graph, and determining the directed relationship variable graph with more vertices as a search directed relationship variable graph;
step 2: counting the vertices in the search directed relation variable graph whose out-degree is greater than the out-degree of a first vertex and whose in-degree is greater than the in-degree of the first vertex, storing these vertices into a first set, and so on until the first set corresponding to each vertex in the basic directed relation variable graph is determined;
step 3: searching the basic directed relationship variable graph and the search directed relationship variable graph for the largest identical subgraph containing the first vertex and a first point, and storing the number of edges of the largest identical subgraph into a second set, wherein the first vertex is the first vertex in the basic directed relationship variable graph and the first point is the first point in the first set corresponding to the first vertex; repeating step 3 until all the points in the first set corresponding to the first vertex have been searched;
step 4: when all the points in the first set corresponding to the first vertex have been searched, replacing the first vertex in step 2 with the next vertex in the basic directed relation variable graph and executing step 2, until all the vertices in the basic directed relation variable graph have been searched; wherein the number of second sets is the same as the number of vertices of the basic directed relation variable graph;
step 5: calculating, according to all the second sets, the similarity of the first directed relation variable graph and the second directed relation variable graph, thereby determining the similarity of the original code and the detection code.
CN201910225678.2A 2019-03-25 2019-03-25 Code similarity detection method and system based on relation variable graph Active CN109918128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910225678.2A CN109918128B (en) 2019-03-25 2019-03-25 Code similarity detection method and system based on relation variable graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910225678.2A CN109918128B (en) 2019-03-25 2019-03-25 Code similarity detection method and system based on relation variable graph

Publications (2)

Publication Number Publication Date
CN109918128A CN109918128A (en) 2019-06-21
CN109918128B true CN109918128B (en) 2022-04-08

Family

ID=66966534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910225678.2A Active CN109918128B (en) 2019-03-25 2019-03-25 Code similarity detection method and system based on relation variable graph

Country Status (1)

Country Link
CN (1) CN109918128B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984883A (en) * 2014-05-21 2014-08-13 湘潭大学 Class dependency graph based Android application similarity detection method
CN105868108A (en) * 2016-03-28 2016-08-17 中国科学院信息工程研究所 Instruction-set-irrelevant binary code similarity detection method based on neural network
CN107169358A (en) * 2017-05-24 2017-09-15 中国人民解放军信息工程大学 Code homology detection method and its device based on code fingerprint

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9110769B2 (en) * 2010-04-01 2015-08-18 Microsoft Technology Licensing, Llc Code-clone detection and analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
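Research and Application of Program Code Similarity Detection Methods; Hu Zhengjun; China Master's Theses Full-text Database, Information Science and Technology Series; 2013-02-15; main text, pp. 26-44 *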

Also Published As

Publication number Publication date
CN109918128A (en) 2019-06-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant