CN112416431B

CN112416431B - Source code segment pair comparison method based on coding sequence representation

Info

Publication number: CN112416431B
Application number: CN202011324413.7A
Authority: CN
Inventors: 黄志球; 喻垚慎; 李伟湋; 沈国华; 邵宜超
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2020-11-23
Filing date: 2020-11-23
Publication date: 2023-02-14
Anticipated expiration: 2040-11-23
Also published as: CN112416431A

Abstract

The invention discloses a source code segment pairwise comparison method based on coding sequence representation, belonging to the technical field of computer programs. A coding sequence source code representation method based on static program analysis converts source code text into a coding sequence representation; the Burrows-Wheeler transform is applied to the coding sequence of each source code segment to obtain an index of the coding sequence; seed screening finds high-similarity seeds from the index of the coding sequence; the Smith-Waterman algorithm takes the high-similarity seeds as the starting positions of subsequence comparison and extends them over the subsequent nodes for as long as the subsequences keep a given similarity threshold; and the high-similarity parts between the source code segments are located according to the source code line number information corresponding to the coding sequence. The method solves the technical problems that cross-granularity similarity matching cannot be supported and that the positioning of high-similarity segments is not accurate enough: it supports cross-granularity source code similarity comparison and does not require the source code texts being compared to have the same granularity.

Description

Source code segment pair comparison method based on coding sequence representation

Technical Field

The invention belongs to the technical field of computer programs, and relates to a source code segment pair comparison method based on coded sequence representation.

Background

The source code similarity detection has wide application in many software development tasks, for example, code plagiarism and repeated code detection are performed through clone detection, software failure is positioned through similarity matching, code recommendation is performed through high-similarity codes or repair patches are generated, and the like. In these tasks, a source code similarity matching algorithm is required to search and quantitatively analyze similar codes.

Common code similarity calculation methods generally represent source code text as text, symbols, a tree structure or a graph structure, and then calculate the similarity of two pieces of source code using the corresponding similarity definition. Text-based methods treat the source code text as a sequence or set of character strings and perform text matching. Symbol-based methods abstract the source code into a sequence or set of symbols and perform similarity matching on the symbol strings or sets. Tree-based methods convert the source code into its syntax tree and calculate similarity using algorithms such as subtree matching. Graph-based methods represent the source code by its control flow graph or data dependency graph and calculate similarity by subgraph matching.

Current common code similarity calculation methods require the two pieces of source code being compared to be of the same granularity, that is, both at the level of functions, classes or files. Such methods typically require that the source code being compared is already complete, so similarity comparison cannot be performed on source code fragments that are still being written. According to software engineering experience, the earlier a potential problem in software code is found, the lower the cost of fixing it. Therefore, for software development tasks such as duplicate code detection, fault localization and code recommendation, performing similarity comparison during programming can effectively improve the efficiency of those tasks. This requires the source code fragment comparison method to support similarity matching across granularities (matching between incomplete code fragments and completed code text).

Disclosure of Invention

The invention aims to provide a source code segment pair comparison method based on coding sequence representation, and solves the technical problems that cross-granularity similarity matching cannot be supported and high-similarity segment positioning is not accurate enough in the conventional source code similarity matching algorithm.

In order to achieve the purpose, the invention adopts the following technical scheme:

A source code segment pairwise comparison method based on coding sequence representation comprises the following steps:

Step 1: establishing a source code database for storing source code text;

establishing a code conversion module, and converting each source code fragment of the source code text into a coding sequence in the code conversion module by using a coding sequence source code representation method based on static program analysis;

Step 2: in the code conversion module, processing the coding sequence by using the Burrows-Wheeler block-sorting compression algorithm to obtain an index of the coding sequence;

Step 3: establishing a sequence alignment module, and finding high-similarity subsequence alignment seeds from the index of the coding sequence through seed screening in the sequence alignment module;

Step 4: in the sequence alignment module, using the Smith-Waterman algorithm to take the high-similarity seeds as the starting positions of subsequence alignment, and expanding over the subsequent nodes the subsequences that keep a given similarity threshold;

Step 5: locating the high-similarity parts between the source code segments according to the line number information of the source code segments corresponding to the coding sequences.

Preferably, when step 1 is executed, the coding sequence source code representation method based on static program analysis used in the code conversion module specifically includes: processing the source code in units of code segments, and converting the source code text into a coding sequence.

Preferably, when step 2 is executed, the method specifically includes using the Burrows-Wheeler block-sorting compression algorithm to index all nodes of the coding sequence, obtaining the index order of all nodes in the coding sequence.

Preferably, when step 3 is executed, the method specifically includes the following steps:

Step A1: according to the indexes of the nodes of the coding sequences of the two source code segments to be compared, selecting seed position pairs whose structure codes are both not greater than 0 and whose type codes are the same;

Step A2: for any seed position pair, if the number of identical nodes at corresponding positions in the subsequent K nodes is not less than K multiplied by r, where r is a similarity threshold, marking the seed position pair as a high-similarity subsequence candidate seed pair.

Preferably, when step 4 is executed, the method specifically includes expanding the subsequent sequence of each subsequence candidate seed pair by using the Smith-Waterman algorithm, taking K subsequent nodes as each expansion, so that after n expansions the length of the expanded subsequence is nK; if the similarity of the expanded subsequence is less than r, where r is the similarity threshold, the expansion process stops; otherwise, the expansion continues.

Preferably, when step 5 is executed, the method specifically includes obtaining the position range of the source code corresponding to each high-similarity subsequence according to the line number information of the source code segment corresponding to the coding sequence, thereby locating the high-similarity parts in the two source code segments.

The invention relates to a source code segment pair-wise comparison method based on coding sequence representation, which solves the technical problems that cross-granularity similarity matching cannot be supported and high-similarity segment positioning is not accurate enough in the existing source code similarity matching algorithm; meanwhile, because the coding sequence obtained by conversion based on the abstract syntax tree is used for matching, the similar segments can be accurately positioned to the corresponding code lines, cross-line matching can be supported, and the method has better applicability and matching performance compared with the prior art.

Drawings

FIG. 1 is an overall flow diagram of the present invention;

FIG. 2 is source code fragment text A provided in the present embodiment;

FIG. 3 is source code fragment text B provided in the present embodiment;

FIG. 4 is the coding sequence converted from source code fragment text A in the present embodiment;

FIG. 5 is the coding sequence converted from source code fragment text B in the present embodiment;

FIG. 6 is the abstract syntax tree of source code fragment A and the code of each node in the present embodiment;

FIG. 7 is the abstract syntax tree of source code fragment B and the code of each node in the present embodiment;

FIG. 8 shows the high-similarity seed pairs provided in the present embodiment;

FIG. 9 shows the matching subsequences provided in the present embodiment;

FIG. 10 shows the high-similarity parts in the two pieces of source code text provided in the present embodiment.

Detailed Description

For those skilled in the art to better understand the technical solutions in the present invention, the technical solutions in the embodiments of the present invention will be described in detail below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, belong to the scope of the present invention.

As shown in FIGS. 1-10, a source code segment pairwise comparison method based on coding sequence representation includes the following steps:

Step 1: establishing a source code database for storing source code text;

In this embodiment, a source code analysis tool is used in the code conversion module to convert the source code text into an original abstract syntax tree. Based on the node types of the abstract syntax tree, redundant information in the original tree is removed, retaining only the tree structure and the node types. The simplified abstract syntax tree is then traversed and encoded in a coding sequence format in which each node carries both a structure code and a type code, producing the coding sequence representation of the source code text.
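The conversion described above can be illustrated with a minimal sketch. It is not the patent's encoder: the embodiment parses C code with Clang, whereas this sketch uses Python's standard `ast` module as a stand-in, and the structure code shown here (the change in tree depth relative to the previously visited node in a pre-order traversal) is an assumption, since the text does not define how the structure code is computed. The names `Node` and `encode` are hypothetical.

```python
import ast
from typing import List, NamedTuple


class Node(NamedTuple):
    sc: int    # structure code (ASSUMPTION: depth change vs. previous node)
    tc: str    # type code (here: the AST node type name)
    line: int  # source line number kept for later localisation


def encode(source: str) -> List[Node]:
    """Pre-order traversal of a simplified AST into a coding sequence."""
    seq: List[Node] = []
    prev_depth = 0

    def visit(node: ast.AST, depth: int, line: int) -> None:
        nonlocal prev_depth
        # Nodes without their own line number inherit the parent's.
        line = getattr(node, "lineno", line)
        seq.append(Node(depth - prev_depth, type(node).__name__, line))
        prev_depth = depth
        for child in ast.iter_child_nodes(node):
            # Simplification step: drop Load/Store context markers,
            # keeping only tree structure and node types.
            if isinstance(child, ast.expr_context):
                continue
            visit(child, depth + 1, line)

    visit(ast.parse(source), 0, 0)
    return seq


if __name__ == "__main__":
    for n in encode("x = 1\ny = x + 2\n"):
        print(n)
```

Under this assumed encoding, a structure code not greater than 0 marks a node that starts a sibling or shallower subtree, which is one plausible reading of why such nodes are used as seed candidates.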

The original coding sequence is then transformed using the Burrows-Wheeler transform: the nodes are arranged into an increasing sequence according to the value of each coded node, and for each node in the increasing sequence its position in the original coding sequence is recorded. The starting position of a high-similarity seed is looked up through the index of the increasing sequence, and the subsequence following that starting node is retrieved through the node's original position.
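A minimal sketch of the indexing idea follows. It implements only what the paragraph above states (sort the nodes by value and remember each node's original position) and is a simplified stand-in rather than a full Burrows-Wheeler transform, which would sort whole rotations of the sequence. The function names `build_index` and `positions_of`, and the `(structure code, type code)` tuple layout, are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Code = Tuple[int, str]  # (structure code, type code); layout is assumed


def build_index(codes: List[Code]) -> Tuple[List[Code], List[int]]:
    """Return the nodes in increasing order plus their original positions."""
    order = sorted(range(len(codes)), key=lambda i: (codes[i], i))
    return [codes[i] for i in order], order


def positions_of(index: Tuple[List[Code], List[int]], value: Code) -> List[int]:
    """Binary-search the increasing sequence for all nodes equal to value."""
    keys, order = index
    lo, hi = bisect_left(keys, value), bisect_right(keys, value)
    return sorted(order[lo:hi])


if __name__ == "__main__":
    codes = [(0, "If"), (1, "Call"), (0, "If"), (-1, "Return")]
    idx = build_index(codes)
    print(positions_of(idx, (0, "If")))
```

Looking up a node value is then a binary search over the sorted keys, and the recorded positions lead back to the original sequence where the following nodes can be read off.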

The invention first selects candidate seed pairs according to the characteristics of the coding sequences. Each node $N_i$ of a coding sequence consists of a structure code $SC_i$ and a type code $TC_i$. If there is a pair of nodes $N_i^A$ and $N_j^B$ in the two coding sequences A and B whose structure codes are both not greater than 0 and whose type codes are the same, i.e.

$$SC_i^A \le 0, \quad SC_j^B \le 0, \quad TC_i^A = TC_j^B,$$

then the node pair $(N_i^A, N_j^B)$ may be referred to as a candidate seed pair.

Then the K-short sequences starting at the candidate seed pair, namely $N_i^A, \dots, N_{i+K-1}^A$ and $N_j^B, \dots, N_{j+K-1}^B$, are compared. Let $t$ be the number of identical nodes at corresponding positions in the two K-short sequences. If

$$t \ge K \cdot r_1,$$

where $r_1$ is the set similarity threshold, the seed pair $(N_i^A, N_j^B)$ can be recorded as a high-similarity seed pair.
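The two screening stages (candidate selection by structure and type code, then the K-short sequence test) can be sketched as follows. This is an illustrative brute-force version that scans all position pairs instead of using the index; the function names and the tuple layout `(structure code, type code)` are assumptions, and seeds too close to the end of a sequence to supply K nodes are simply skipped, which the text does not specify.

```python
from typing import List, Tuple

Code = Tuple[int, str]  # (structure code, type code); layout is assumed


def candidate_seeds(a: List[Code], b: List[Code]) -> List[Tuple[int, int]]:
    """Position pairs whose structure codes are <= 0 and type codes match."""
    return [
        (i, j)
        for i, (sa, ta) in enumerate(a) if sa <= 0
        for j, (sb, tb) in enumerate(b) if sb <= 0 and ta == tb
    ]


def high_similarity_seeds(
    a: List[Code], b: List[Code], k: int, r1: float
) -> List[Tuple[int, int]]:
    """Keep candidates whose K-short sequences agree in at least k * r1 nodes."""
    seeds = []
    for i, j in candidate_seeds(a, b):
        if i + k > len(a) or j + k > len(b):
            continue  # ASSUMPTION: skip seeds without K following nodes
        t = sum(1 for x, y in zip(a[i:i + k], b[j:j + k]) if x == y)
        if t >= k * r1:
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    a = [(0, "If"), (1, "BinOp"), (1, "Name"), (0, "Const"),
         (-1, "Return"), (1, "Name")]
    b = [(0, "Decl"), (0, "If"), (1, "BinOp"), (1, "Name"),
         (0, "Call"), (-1, "Return"), (1, "Name")]
    print(high_similarity_seeds(a, b, k=4, r1=0.75))
```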

Preferably, when step 4 is executed, the method specifically includes expanding the subsequent sequence of each subsequence candidate seed pair by using the Smith-Waterman algorithm, taking K subsequent nodes as each expansion, so that after n expansions the length of the expanded subsequence is nK; if the similarity of the expanded subsequence is less than r, where r is the similarity threshold, the expansion process stops; otherwise, the expansion continues until it reaches the starting position of the next seed pair.
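The stopping rule of this block-wise expansion can be sketched as below. To keep the example short, the similarity measure is a plain positional match ratio, which is an assumption standing in for the Smith-Waterman score used by the method; it can be replaced through the `sim` parameter. The names `expand`, `positional_similarity`, `limit_a` and `limit_b` are hypothetical.

```python
from typing import Callable, Optional, Sequence


def positional_similarity(x: Sequence, y: Sequence) -> float:
    """ASSUMED stand-in measure: share of positions holding equal nodes."""
    return sum(1 for p, q in zip(x, y) if p == q) / max(len(x), 1)


def expand(
    a: Sequence, b: Sequence, i: int, j: int, k: int, r: float,
    limit_a: Optional[int] = None, limit_b: Optional[int] = None,
    sim: Callable[[Sequence, Sequence], float] = positional_similarity,
) -> int:
    """Grow the match from (i, j) in blocks of k nodes.

    Returns the accepted length n*k. Growth stops when the expanded
    subsequence falls below r, or when the next block would cross
    limit_a / limit_b (e.g. the start of the next seed pair) or the
    end of either sequence.
    """
    end_a = len(a) if limit_a is None else min(limit_a, len(a))
    end_b = len(b) if limit_b is None else min(limit_b, len(b))
    length = 0
    while i + length + k <= end_a and j + length + k <= end_b:
        trial = length + k
        if sim(a[i:i + trial], b[j:j + trial]) < r:
            break
        length = trial
    return length


if __name__ == "__main__":
    print(expand("ABCDEFGHXXXX", "ABCDEFGHYYYY", 0, 0, k=4, r=0.8))
```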

Taking a high-similarity seed pair $(N_i^A, N_j^B)$ obtained by seed screening as the start position of the subsequences, the Smith-Waterman algorithm is used to expand over the following Q nodes, with $Q > K$, giving the two subsequences $N_i^A, \dots, N_{i+Q-1}^A$ and $N_j^B, \dots, N_{j+Q-1}^B$. If the similarity score of the two subsequences under the Smith-Waterman algorithm satisfies $maxS \ge r_2$, where $r_2$ is the set similarity threshold, then the two subsequences may be referred to as matching subsequences.
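For reference, a compact Smith-Waterman local alignment score over two node sequences might look like the sketch below. The scoring parameters (match +1, mismatch -1, gap -1) and the normalisation of the best score by the subsequence length before comparing it with the threshold are assumptions; the text gives neither. The helper name `is_match` is hypothetical.

```python
from typing import Sequence


def smith_waterman(x: Sequence, y: Sequence,
                   match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score (maxS) between x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def is_match(a: Sequence, b: Sequence, i: int, j: int, q: int, r2: float) -> bool:
    """ASSUMPTION: compare maxS, normalised by subsequence length, with r2."""
    sa, sb = a[i:i + q], b[j:j + q]
    if not sa or not sb:
        return False
    return smith_waterman(sa, sb) / min(len(sa), len(sb)) >= r2


if __name__ == "__main__":
    print(smith_waterman("ABCDE", "ABXDE"))
    print(is_match("ABCDE", "ABXDE", 0, 0, q=5, r2=0.6))
```

Only the previous row of the score matrix is kept, which is enough when just the maximum score is needed and no traceback is performed.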

Each node of the coding sequence has corresponding code line number information when being generated, and the corresponding code line in the source code segment can be positioned according to the line number information of each node in the high-similarity subsequence, so that the high-similarity part between the source code segments can be positioned.
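Mapping a matched subsequence back to source lines is a simple lookup once each node's line number has been kept alongside the coding sequence. The sketch below assumes the line numbers are available as one list per sequence and that a match is given as `(start in A, start in B, length)`; both conventions and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def line_range(lines: Sequence[int], start: int, length: int) -> Tuple[int, int]:
    """First and last source line covered by nodes start .. start+length-1."""
    span = lines[start:start + length]
    return min(span), max(span)


def locate(
    lines_a: Sequence[int], lines_b: Sequence[int], match: Tuple[int, int, int]
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Translate one matching subsequence into line ranges in A and B."""
    i, j, length = match
    return line_range(lines_a, i, length), line_range(lines_b, j, length)


if __name__ == "__main__":
    lines_a = [1, 1, 1, 2, 2, 3, 3, 3, 4]      # line of each node in A
    lines_b = [10, 10, 11, 11, 11, 12, 12, 13]  # line of each node in B
    print(locate(lines_a, lines_b, (3, 2, 4)))
```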

In this embodiment, as shown in FIG. 4, source code text A is converted into a coding sequence in which each line corresponds to one abstract syntax tree node; the source code line number information for each node is stored in a separate file.

As shown in FIG. 6, source code fragment A is parsed into an abstract syntax tree. Each node of the tree is labelled with its C-language abstract syntax tree node type, as defined by the Clang parser, and with the node's code.

As shown in FIG. 5, source code text B is converted into a coding sequence in which each line corresponds to one abstract syntax tree node; the source code line number information for each node is stored in a separate file.

As shown in FIG. 7, source code fragment B is parsed into an abstract syntax tree. Each node of the tree is labelled with its C-language abstract syntax tree node type, as defined by the Clang parser, and with the node's code.

FIG. 8 shows the screened high-similarity seed pairs. With K = 5 and a similarity threshold r1 = 0.8 in the seed screening process, two seed pairs are obtained. The first column gives the start position in the coding sequence of source code segment A, the second column gives the start position in the coding sequence of source code segment B, and the third column gives the K value of the K-short sequences.
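As a worked instance of the screening rule from step A2 with the values quoted for this embodiment (K = 5, r1 = 0.8), the minimum number t of identical nodes at corresponding positions is:

```latex
t \ge K \cdot r_1 = 5 \times 0.8 = 4
```

so a candidate seed pair is kept when at least 4 of its 5 corresponding nodes are identical, that is, a single mismatch within the K-short sequence is tolerated.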

As shown in FIG. 9, Smith-Waterman expansion is performed on the two seed pairs to obtain two high-similarity matching subsequences. Different colors represent different matching subsequences, and the high-similarity matching subsequences are marked by boxes.

As shown in FIG. 10, starting from the matching subsequences and using the source code line number information retained during coding sequence conversion, the high-similarity code segments in the source code text can be traced back; they are marked by boxes in FIG. 10.

The source code segment pair-wise comparison method based on coding sequence representation solves the technical problems that cross-granularity similarity matching cannot be supported and high-similarity segment positioning is not accurate enough in the existing source code similarity matching algorithm, can support cross-granularity source code similarity comparison, does not require the same granularity of source code texts to be compared, and can be applied to a similarity matching task of codes in a programming development stage; meanwhile, because the coding sequence obtained based on abstract syntax tree conversion is used for matching, the similar segments can be accurately positioned to the corresponding code lines, cross-line matching can be supported, and the method has better applicability and matching performance compared with the prior art.

In the present invention, any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.

The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Further, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method for pairwise comparison of source code segments based on coding sequence representation, characterized in that the method comprises the following steps:

Step 1: establishing a source code database for storing source code text;

Step 2: in a code conversion module, processing the coding sequence by using the Burrows-Wheeler block-sorting compression algorithm to obtain an index of the coding sequence;

when step 3 is executed, the method specifically comprises the following steps:

Step A2: for any seed position pair, if the number of identical nodes at corresponding positions in the subsequent K nodes is not less than K multiplied by r, where r is a similarity threshold, marking the seed position pair as a high-similarity subsequence candidate seed pair;

first, candidate seed pairs are selected according to the characteristics of the coding sequences, wherein each node $N_i$ of a coding sequence consists of a structure code $SC_i$ and a type code $TC_i$; if there is a pair of nodes $N_i^A$ and $N_j^B$ in the two coding sequences A and B whose structure codes are both not greater than 0 and whose type codes are the same, then the node pair $(N_i^A, N_j^B)$ may be referred to as a candidate seed pair;

then, the K-short sequences starting at the candidate seed pair, namely $N_i^A, \dots, N_{i+K-1}^A$ and $N_j^B, \dots, N_{j+K-1}^B$, are compared.

2. The method for pairwise comparison of source code segments based on coding sequence representation according to claim 1, characterized in that: when step 1 is executed, the coding sequence source code representation method based on static program analysis used in the code conversion module specifically comprises: processing the source code in units of code segments, and converting the source code text into a coding sequence.

3. The method for pairwise comparison of source code segments based on coding sequence representation according to claim 1, characterized in that: when step 2 is executed, the Burrows-Wheeler block-sorting compression algorithm is used to index all nodes of the coding sequence, obtaining the index order of all nodes in the coding sequence.

4. The method for pairwise comparison of source code segments based on coding sequence representation according to claim 3, characterized in that: when step 4 is executed, the subsequent sequence of each subsequence candidate seed pair is expanded by using the Smith-Waterman algorithm, taking K subsequent nodes as each expansion, so that after n expansions the length of the expanded subsequence is nK; if the similarity of the expanded subsequence is less than r, where r is the similarity threshold, the expansion process stops; otherwise, the expansion continues until it reaches the starting position of the next seed pair.

5. The method for pairwise comparison of source code segments based on coding sequence representation according to claim 4, characterized in that: when step 5 is executed, the position range of the source code corresponding to each high-similarity subsequence is obtained according to the line number information of the source code segment corresponding to the coding sequence, thereby locating the high-similarity parts in the two source code segments.