CN112416431B - Source code segment pair comparison method based on coding sequence representation - Google Patents

Info

Publication number
CN112416431B
Authority
CN
China
Prior art keywords
source code
similarity
coding sequence
sequence
seed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011324413.7A
Other languages
Chinese (zh)
Other versions
CN112416431A (en
Inventor
黄志球
喻垚慎
李伟湋
沈国华
邵宜超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202011324413.7A
Publication of CN112416431A
Application granted
Publication of CN112416431B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a source code segment pair comparison method based on coding sequence representation, which belongs to the technical field of computer programs. The method uses a coding sequence source code representation method based on static program analysis to convert source code text into a coding sequence representation; processes the coding sequence of each source code segment with the Burrows-Wheeler transform to obtain an index of the coding sequence; finds high-similarity seeds from the index of the coding sequence through seed screening; uses the Smith-Waterman algorithm, with the high-similarity seeds as the start positions of subsequence alignment, to expand the subsequences that stay above a given similarity threshold in the following sequence; and locates the high-similarity parts between the source code segments according to the source code line number information corresponding to the coding sequence. The method solves the technical problems that cross-granularity similarity matching cannot be supported and that high-similarity segments are not located accurately enough. It supports cross-granularity source code similarity comparison and does not require the source code texts being compared to have the same granularity.

Description

Source code segment pair comparison method based on coding sequence representation
Technical Field
The invention belongs to the technical field of computer programs, and relates to a source code segment pair comparison method based on coding sequence representation.
Background
Source code similarity detection is widely used in software development tasks: clone detection is used to find plagiarized and duplicated code, similarity matching is used to locate software faults, and highly similar code is used to recommend code or to generate repair patches. All of these tasks need a source code similarity matching algorithm to search for similar code and analyze it quantitatively.
Common code similarity calculation methods represent source code text as text, symbols, a tree structure or a graph structure, and then calculate the similarity of two pieces of source code with a corresponding similarity definition. Text-based methods treat the source code text as a sequence or set of character strings and perform text matching. Symbol-based methods abstract the source code into a sequence or set of symbols and perform similarity matching on the symbol strings or sets. Tree-based methods convert the source code into its syntax tree and calculate similarity with algorithms such as subtree matching. Graph-based methods represent the source code by its control flow graph or data dependency graph and calculate similarity by subgraph matching.
Current code similarity calculation methods require the two pieces of source code being compared to have the same granularity, that is, both must be at the function level, the class level or the file level. Such methods typically also require the compared source code to be completed code, and cannot compare source code fragments that are still being written. Software engineering experience shows that the earlier a potential problem in software code is found, the cheaper it is to fix. For software development tasks such as duplicated code detection, fault location and code recommendation, performing similarity comparison during programming can therefore make these tasks considerably more efficient. This requires a source code fragment comparison method that supports similarity matching across granularities (matching between an incomplete code fragment that is still being written and completed code text).
Disclosure of Invention
The invention aims to provide a source code segment pair comparison method based on coding sequence representation, and solves the technical problems that cross-granularity similarity matching cannot be supported and high-similarity segment positioning is not accurate enough in the conventional source code similarity matching algorithm.
In order to achieve the purpose, the invention adopts the following technical scheme:
A source code segment pair comparison method based on coding sequence representation comprises the following steps:
Step 1: establishing a source code database for storing source code text;
establishing a code conversion module, and converting, in the code conversion module, a source code fragment of the source code text into a coding sequence by using a coding sequence source code representation method based on static program analysis;
Step 2: in the code conversion module, processing the coding sequence with the Burrows-Wheeler block-sorting compression algorithm to obtain an index of the coding sequence;
Step 3: establishing a sequence alignment module, and in the sequence alignment module, finding high-similarity subsequence alignment seeds from the index of the coding sequence through seed screening;
Step 4: in the sequence alignment module, using the Smith-Waterman algorithm with the high-similarity seeds as the start positions of subsequence alignment, and expanding the subsequences that stay above a given similarity threshold in the following sequence;
Step 5: locating the high-similarity parts between the source code segments according to the line number information of the source code segments corresponding to the coding sequences.
Preferably, when step 1 is executed, applying the coding sequence source code representation method based on static program analysis in the code conversion module specifically includes: processing the source code fragment by fragment, and converting the source code text into a coding sequence.
Preferably, when step 2 is executed, the method specifically includes using the Burrows-Wheeler block-sorting compression algorithm to index all nodes of the coding sequence, obtaining the index order of all nodes in the coding sequence.
Preferably, when step 3 is executed, the method specifically includes the following steps:
step A1: selecting seed position pairs with structural codes not greater than 0 and the same type of codes according to the indexes of the nodes of the coding sequences of the two segments of source code segments to be compared;
step A2: for any seed position pair, if the number of the same nodes at the corresponding positions in the subsequent K nodes is not less than K multiplied by r, wherein r is a similarity threshold, the seed position pair is marked as a subsequence candidate seed pair with high similarity.
Preferably, when step 4 is executed, the method specifically includes: for each subsequence candidate seed pair, expanding the following sequence with the Smith-Waterman algorithm, taking K further nodes at each expansion so that the expanded subsequence has length nK; if the similarity of the expanded subsequence is less than r, where r is a similarity threshold, the expansion stops; otherwise, the expansion continues.
Preferably, when the step 5 is executed, the method specifically includes obtaining the position range of the source code segment corresponding to the high-similarity subsequence according to the line number information of the source code segment corresponding to the coding sequence, so as to locate the position of the high-similarity part in the two segments of source code segments.
The invention relates to a source code segment pair-wise comparison method based on coding sequence representation, which solves the technical problems that cross-granularity similarity matching cannot be supported and high-similarity segment positioning is not accurate enough in the existing source code similarity matching algorithm; meanwhile, because the coding sequence obtained by conversion based on the abstract syntax tree is used for matching, the similar segments can be accurately positioned to the corresponding code lines, cross-line matching can be supported, and the method has better applicability and matching performance compared with the prior art.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is the source code fragment text A provided in the present embodiment;
FIG. 3 is the source code fragment text B provided in the present embodiment;
FIG. 4 is the coding sequence converted from source code fragment text A in the present embodiment;
FIG. 5 is the coding sequence converted from source code fragment text B in the present embodiment;
FIG. 6 is the abstract syntax tree of source code fragment A, with the code of each node, in the present embodiment;
FIG. 7 is the abstract syntax tree of source code fragment B, with the code of each node, in the present embodiment;
FIG. 8 shows the high-similarity seed pairs provided in the present embodiment;
FIG. 9 shows the matching subsequences provided in the present embodiment;
FIG. 10 shows the high-similarity parts in the two source code texts provided in the present embodiment.
Detailed Description
For those skilled in the art to better understand the technical solutions in the present invention, the technical solutions in the embodiments of the present invention will be described in detail below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, belong to the scope of the present invention.
As shown in fig. 1-10, a method for comparing pairs of source code segments based on coded sequence representation includes the following steps:
step 1: establishing a source code database for storing a source code text;
establishing a code conversion module, and converting a source code fragment of a source code text into a code sequence in the code conversion module by using a code sequence source code representation method based on static program analysis;
in this embodiment, a source code analysis tool is used in the code conversion module to convert the source code text into the original abstract syntax tree; combining the node type of the abstract syntax tree, simplifying redundant information of the original abstract syntax tree, and only reserving the tree structure and the node type of the abstract syntax tree; a coding sequence format is adopted, structure coding and type coding of nodes are simultaneously contained, the simplified abstract syntax tree is subjected to traversal coding, and coding sequence representation of a source code text is generated.
Step 2: in the code conversion module, processing the coding sequence with the Burrows-Wheeler block-sorting compression algorithm to obtain an index of the coding sequence;
Step 3: establishing a sequence alignment module, and in the sequence alignment module, finding high-similarity subsequence alignment seeds from the index of the coding sequence through seed screening;
Step 4: in the sequence alignment module, using the Smith-Waterman algorithm with the high-similarity seeds as the start positions of subsequence alignment, and expanding the subsequences that stay above a given similarity threshold in the following sequence;
Step 5: locating the high-similarity parts between the source code segments according to the line number information of the source code segments corresponding to the coding sequences.
Preferably, when step 1 is executed, applying the coding sequence source code representation method based on static program analysis in the code conversion module specifically includes: processing the source code fragment by fragment, and converting the source code text into a coding sequence.
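The conversion in step 1 can be illustrated with a minimal sketch. It is not the encoding defined by the invention: it uses Python's standard `ast` module in place of the Clang parser, and it assumes, purely for illustration, that the structure code of a node is its change in tree depth relative to the previous node in a pre-order traversal. The function name `encode` and the tuple layout are hypothetical.

```python
import ast


def encode(source):
    """Encode source text as a list of (structure_code, type_code, line) tuples.

    Assumptions (not taken from the patent): the structure code is the depth
    change relative to the previously emitted node, and the type code is the
    AST node class name. Nodes without a line number (for example Load/Store
    contexts and operators) are treated as redundant and dropped.
    """
    sequence = []
    prev_depth = 0

    def visit(node, depth):
        nonlocal prev_depth
        line = getattr(node, "lineno", None)
        if line is not None:
            sequence.append((depth - prev_depth, type(node).__name__, line))
            prev_depth = depth
            depth += 1
        for child in ast.iter_child_nodes(node):
            visit(child, depth)

    visit(ast.parse(source), 0)
    return sequence


if __name__ == "__main__":
    for node in encode("x = a + b\ny = x"):
        print(node)
```

Under this assumed encoding, a node with a structure code not greater than 0 starts a sibling or a new subtree rather than descending into a child, and each tuple keeps the source line number that step 5 later needs.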
Preferably, when step 2 is executed, the method specifically includes using the Burrows-Wheeler block-sorting compression algorithm to index all nodes of the coding sequence, obtaining the index order of all nodes in the coding sequence.
The original coding sequence is transformed with the Burrows-Wheeler transform into an increasing sequence ordered by the value of each coding node, and for each node of the increasing sequence its position in the original coding sequence is recorded. The start position of a high-similarity seed is looked up through the index of the increasing sequence, and the subsequence that follows the start node is then retrieved through the node's original position.
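The lookup described above can be sketched as a sorted index that remembers original positions. This is a deliberately simplified stand-in, assuming each node is a comparable `(structure_code, type_code)` tuple; it is not a full Burrows-Wheeler transform or FM-index, and the names `build_index` and `positions_of` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(sequence):
    """Return (keys, order): node values in increasing order, and for each
    sorted entry its position in the original coding sequence."""
    order = sorted(range(len(sequence)), key=lambda p: (sequence[p], p))
    keys = [sequence[p] for p in order]
    return keys, order


def positions_of(index, node):
    """Return the original positions of every occurrence of node."""
    keys, order = index
    lo = bisect_left(keys, node)
    hi = bisect_right(keys, node)
    return order[lo:hi]


if __name__ == "__main__":
    seq = [(0, "If"), (1, "Call"), (0, "If"), (-1, "Return")]
    index = build_index(seq)
    start_positions = positions_of(index, (0, "If"))
    # The subsequence following each start node is read from the original
    # sequence through the recorded position.
    print([seq[p:p + 2] for p in start_positions])
```

Binary search over the increasing sequence finds all occurrences of a node value without scanning the whole coding sequence, which is the role the index plays in seed lookup.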
Preferably, when step 3 is executed, the method specifically includes the following steps:
step A1: selecting seed position pairs with structural codes not greater than 0 and the same type of codes according to the indexes of the nodes of the coding sequences of the two segments of source code segments to be compared;
step A2: for any seed position pair, if the number of the same nodes at the corresponding positions in the subsequent K nodes is not less than K multiplied by r, wherein r is a similarity threshold, the seed position pair is marked as a subsequence candidate seed pair with high similarity.
The invention first selects candidate seed pairs according to the characteristics of the coding sequences. Each node $N_i$ of a coding sequence consists of a structure code $SC_i$ and a type code $TC_i$. If there is a pair of nodes $N_i^A$ and $N_j^B$ in the two coding sequences A and B whose structure codes are both not greater than 0 and whose type codes are the same, i.e.
$$SC_i^A \le 0,\quad SC_j^B \le 0,\quad TC_i^A = TC_j^B,$$
then the node pair $(N_i^A, N_j^B)$ may be referred to as a candidate seed pair.
Then, taking the candidate seed pair $(N_i^A, N_j^B)$ as the start positions, the K-short sequences are $N_i^A, N_{i+1}^A, \ldots, N_{i+K-1}^A$ and $N_j^B, N_{j+1}^B, \ldots, N_{j+K-1}^B$ respectively. Let $t$ be the number of identical nodes at corresponding positions in the two K-short sequences. If
$$t \ge K \times r_1,$$
where $r_1$ is a manually set similarity threshold, the pair can be recorded as the high-similarity seed pair $(i, j, K)$.
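Steps A1 and A2 can be sketched as follows. This is a brute-force illustration under stated assumptions: each node is a `(structure_code, type_code)` tuple, the K-short sequence is taken to include the start node itself, and the function name `screen_seeds` is hypothetical. In practice the index from step 2 would narrow the candidate start positions instead of the nested loops used here.

```python
def screen_seeds(seq_a, seq_b, k, r1):
    """Return high-similarity seed pairs as (i, j, k) tuples."""
    seeds = []
    for i in range(len(seq_a) - k + 1):
        sc_a, tc_a = seq_a[i]
        if sc_a > 0:
            continue  # step A1: structure code must not be greater than 0
        for j in range(len(seq_b) - k + 1):
            sc_b, tc_b = seq_b[j]
            if sc_b > 0 or tc_a != tc_b:
                continue  # step A1: same type code required
            # Step A2: count identical nodes at corresponding positions.
            t = sum(1 for x, y in zip(seq_a[i:i + k], seq_b[j:j + k]) if x == y)
            if t >= k * r1:
                seeds.append((i, j, k))
    return seeds


if __name__ == "__main__":
    a = [(0, "If"), (1, "BinOp"), (1, "Name"), (0, "Name"), (-1, "Return")]
    b = [(0, "If"), (1, "BinOp"), (1, "Name"), (0, "Name"), (-1, "Break")]
    print(screen_seeds(a, b, k=5, r1=0.8))
```

The seed tuples follow the three-column layout described for fig. 8: start position in sequence A, start position in sequence B, and the K value.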
Preferably, when step 4 is executed, the method specifically includes: for each subsequence candidate seed pair, expanding the following sequence with the Smith-Waterman algorithm, taking K further nodes at each expansion so that the expanded subsequence has length nK; if the similarity of the expanded subsequence is less than r, where r is a similarity threshold, the expansion stops; otherwise, the expansion continues until it reaches the start position of the next seed pair.
The high-similarity seed pair obtained by seed screening, starting at node $N_i^A$ of sequence A and node $N_j^B$ of sequence B, is taken as the start position of the subsequences, and the Smith-Waterman algorithm is used to expand the following $Q$ nodes, with $Q > K$, giving the two subsequences $N_i^A, \ldots, N_{i+Q-1}^A$ and $N_j^B, \ldots, N_{j+Q-1}^B$. If the similarity score of these two subsequences under the Smith-Waterman algorithm satisfies $maxS \ge r_2$, where $r_2$ is a manually set similarity threshold, then the two subsequences may be referred to as matching subsequences.
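The expansion can be sketched with a textbook Smith-Waterman scorer. The scoring values (match +1, mismatch -1, gap -1) and the normalization of maxS by the window length before comparing it with the threshold are assumptions made for this illustration; the source does not state them. The stop condition at the start of the next seed pair is omitted for brevity, and the function names are hypothetical.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return maxS, the best local-alignment score between sequences a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def extend_seed(seq_a, seq_b, i, j, k, r2):
    """Grow the window K nodes at a time; return the last accepted length."""
    accepted = 0
    q = k
    while i + q <= len(seq_a) and j + q <= len(seq_b):
        score = smith_waterman(seq_a[i:i + q], seq_b[j:j + q])
        if score / q < r2:  # assumed normalization of maxS by window length
            break
        accepted = q
        q += k
    return accepted


if __name__ == "__main__":
    print(extend_seed("ABCDEFGH", "ABCDEFXY", 0, 0, k=2, r2=0.8))
```

Each accepted window has length nK, and the expansion stops at the first window whose normalized score falls below the threshold.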
Preferably, when the step 5 is executed, the method specifically includes obtaining the position range of the source code segment corresponding to the high-similarity subsequence according to the line number information of the source code segment corresponding to the coding sequence, so as to locate the position of the high-similarity part in the two segments of source code segments.
Each node of the coding sequence has corresponding code line number information when being generated, and the corresponding code line in the source code segment can be positioned according to the line number information of each node in the high-similarity subsequence, so that the high-similarity part between the source code segments can be positioned.
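A minimal sketch of this positioning step, assuming the line numbers are kept in a list parallel to the coding sequence (the function name `locate_lines` is hypothetical):

```python
def locate_lines(line_numbers, start, length):
    """Return the (first_line, last_line) range covered by a matching
    subsequence that begins at index start and spans length nodes."""
    lines = line_numbers[start:start + length]
    if not lines:
        raise ValueError("empty subsequence")
    return min(lines), max(lines)


if __name__ == "__main__":
    # One line number per coding-sequence node.
    line_numbers = [3, 3, 4, 4, 5, 7]
    print(locate_lines(line_numbers, start=1, length=3))
```

Because several nodes can share a line and one subsequence can span several lines, taking the minimum and maximum gives the source line range for both single-line and cross-line matches.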
In this embodiment, as shown in fig. 4, source code text A is converted into its coding sequence. Each line of the coding sequence corresponds to one abstract syntax tree node, and the source code line number information of each node is stored in a separate file.
As shown in fig. 6, source code fragment A is parsed into its abstract syntax tree. Each node carries the C language abstract syntax tree node type defined by the Clang parser, together with the code of the node.
As shown in fig. 5, source code text B is converted into its coding sequence. Each line of the coding sequence corresponds to one abstract syntax tree node, and the source code line number information of each node is stored in a separate file.
As shown in fig. 7, source code fragment B is parsed into its abstract syntax tree. Each node carries the C language abstract syntax tree node type defined by the Clang parser, together with the code of the node.
Fig. 8 lists the screened high-similarity seed pairs. Setting K = 5 and the similarity threshold r1 = 0.8 in the seed screening process yields two seed pairs (a candidate pair is kept when at least K × r1 = 4 of the 5 compared positions hold identical nodes). The first column is the start position in the coding sequence of source code segment A, the second column is the start position in the coding sequence of source code segment B, and the third column is the K value of the K-short sequences.
As shown in fig. 9, Smith-Waterman expansion is performed on the two seed pairs to obtain two high-similarity matching subsequences. Different colors represent different matching subsequences, and the high-similarity matching subsequences are marked with boxes.
As shown in fig. 10, the high-similarity code segments in the source code text can be traced back from the matching subsequences by using the source code line number information retained during coding sequence conversion; they are marked with boxes in fig. 10.
The source code segment pair-wise comparison method based on coding sequence representation solves the technical problems that cross-granularity similarity matching cannot be supported and high-similarity segment positioning is not accurate enough in the existing source code similarity matching algorithm, can support cross-granularity source code similarity comparison, does not require the same granularity of source code texts to be compared, and can be applied to a similarity matching task of codes in a programming development stage; meanwhile, because the coding sequence obtained based on abstract syntax tree conversion is used for matching, the similar segments can be accurately positioned to the corresponding code lines, cross-line matching can be supported, and the method has better applicability and matching performance compared with the prior art.
In the present invention, any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Further, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (5)

1. A method for comparing source code segments in pairs based on coded sequence representation, comprising: the method comprises the following steps:
step 1: establishing a source code database for storing a source code text;
establishing a code conversion module, and converting a source code fragment of a source code text into a code sequence in the code conversion module by using a code sequence source code representation method based on static program analysis;
step 2: in the code conversion module, processing the coding sequence with the Burrows-Wheeler block-sorting compression algorithm to obtain an index of the coding sequence;
step 3: establishing a sequence alignment module, and in the sequence alignment module, finding high-similarity subsequence alignment seeds from the index of the coding sequence through seed screening;
when step 3 is executed, the method specifically comprises the following steps:
step A1: selecting seed position pairs with structural codes not greater than 0 and the same type of codes according to the indexes of the nodes of the coding sequences of the two segments of source code segments to be compared;
step A2: for any seed position pair, if the number of the same nodes at the corresponding positions in the subsequent K nodes is not less than K multiplied by r, wherein r is a similarity threshold, marking the seed position pair as a subsequence candidate seed pair with high similarity;
first, candidate seed pairs are selected according to the characteristics of the coding sequences, each node $N_i$ of a coding sequence consisting of a structure code $SC_i$ and a type code $TC_i$; if there is a pair of nodes $N_i^A$ and $N_j^B$ in the two coding sequences A and B whose structure codes are both not greater than 0 and whose type codes are the same, i.e.
$$SC_i^A \le 0,\quad SC_j^B \le 0,\quad TC_i^A = TC_j^B,$$
then the node pair $(N_i^A, N_j^B)$ may be referred to as a candidate seed pair;
then, taking the candidate seed pair $(N_i^A, N_j^B)$ as the start positions, the K-short sequences are $N_i^A, N_{i+1}^A, \ldots, N_{i+K-1}^A$ and $N_j^B, N_{j+1}^B, \ldots, N_{j+K-1}^B$ respectively; the number of identical nodes at corresponding positions in the two K-short sequences is $t$; if
$$t \ge K \times r_1,$$
where $r_1$ is a manually set similarity threshold, the pair can be recorded as the high-similarity seed pair $(i, j, K)$;
step 4: in the sequence alignment module, using the Smith-Waterman algorithm with the high-similarity seeds as the start positions of subsequence alignment, and expanding the subsequences that stay above a given similarity threshold in the following sequence;
step 5: locating the high-similarity parts between the source code segments according to the line number information of the source code segments corresponding to the coding sequences.
2. A method for pairwise comparison of source code segments based on coded sequence representations, according to claim 1, wherein: when step 1 is executed, applying the coding sequence source code representation method based on static program analysis in the code conversion module specifically comprises: processing the source code fragment by fragment, and converting the source code text into a coding sequence.
3. A method for pairwise comparison of source code segments based on coded sequence representations, according to claim 1, wherein: when step 2 is executed, the method specifically includes using the Burrows-Wheeler block-sorting compression algorithm to index all nodes of the coding sequence, obtaining the index order of all nodes in the coding sequence.
4. A method of pairwise comparison of source code segments based on coded sequence representation according to claim 3, characterized by: when step 4 is executed, for each subsequence candidate seed pair, the following sequence is expanded with the Smith-Waterman algorithm, taking K further nodes at each expansion so that the expanded subsequence has length nK; if the similarity of the expanded subsequence is less than r, where r is a similarity threshold, the expansion stops; otherwise, the expansion continues until it reaches the start position of the next seed pair.
5. A method of comparing pairs of source code fragments based on a representation of an encoded sequence as claimed in claim 4, wherein: when the step 5 is executed, specifically obtaining the position range of the source code segment corresponding to the high-similarity subsequence according to the line number information of the source code segment corresponding to the coding sequence, thereby positioning the position of the high-similarity part in the two segments of source code segments.
CN202011324413.7A 2020-11-23 2020-11-23 Source code segment pair comparison method based on coding sequence representation Active CN112416431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011324413.7A CN112416431B (en) 2020-11-23 2020-11-23 Source code segment pair comparison method based on coding sequence representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011324413.7A CN112416431B (en) 2020-11-23 2020-11-23 Source code segment pair comparison method based on coding sequence representation

Publications (2)

Publication Number Publication Date
CN112416431A CN112416431A (en) 2021-02-26
CN112416431B CN112416431B (en) 2023-02-14

Family

ID=74777426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011324413.7A Active CN112416431B (en) 2020-11-23 2020-11-23 Source code segment pair comparison method based on coding sequence representation

Country Status (1)

Country Link
CN (1) CN112416431B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015069393A (en) * 2013-09-27 2015-04-13 株式会社東芝 Document data comparison method, document data comparison apparatus, and document data comparison program
CN105051741A (en) * 2012-12-17 2015-11-11 微软技术许可有限责任公司 Parallel local sequence alignment
CN107066837A (en) * 2017-04-01 2017-08-18 上海交通大学 One kind has with reference to DNA sequence dna compression method and system
CN107615240A (en) * 2015-04-17 2018-01-19 巴特尔纪念研究所 For analyzing the scheme based on biological sequence of binary file
CN108345468A (en) * 2018-01-29 2018-07-31 华侨大学 Programming language code duplicate checking method based on tree and sequence similarity
CN108920902A (en) * 2018-06-29 2018-11-30 郑州云海信息技术有限公司 A kind of gene order processing method and its relevant device
CN109634594A (en) * 2018-11-05 2019-04-16 南京航空航天大学 A kind of code snippet recommended method considering code statement order information
CN110310705A (en) * 2018-03-16 2019-10-08 北京哲源科技有限责任公司 Support the sequence alignment method and device of SIMD
CN110737466A (en) * 2019-10-16 2020-01-31 南京航空航天大学 Source code coding sequence representation method based on static program analysis
CN111562920A (en) * 2020-06-08 2020-08-21 腾讯科技(深圳)有限公司 Method and device for determining similarity of small program codes, server and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063488A (en) * 2010-12-29 2011-05-18 南京航空航天大学 Code searching method based on semantics
EP3161618A4 (en) * 2014-06-30 2017-06-28 Microsoft Technology Licensing, LLC Code recommendation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
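SENSORY: Leveraging Code Statement Sequence Information for Code Snippets Recommendation; Lei Ai et al.; 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC); 2019-07-19; Vol. 1; pp. 27-36 *
Data-Driven Student Program Code Recommendation; Teng Changzhi; China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology Series; 2020-02-15 (No. 02); I138-763 *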

Also Published As

Publication number Publication date
CN112416431A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112464641B (en) BERT-based machine reading understanding method, device, equipment and storage medium
JP2009512099A (en) Method and apparatus for restartable hashing in a try
US9229691B2 (en) Method and apparatus for programming assistance
US20220245056A1 (en) Automated program repair using stack traces and back translations
CN113296755A (en) Code structure tree library construction method and information push method
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN114579168A (en) Code updating method and device, electronic equipment and computer readable storage medium
CN112416431B (en) Source code segment pair comparison method based on coding sequence representation
CN113065322A (en) Code segment annotation generation method and system and readable storage medium
CN117407532A (en) Method for enhancing data by using large model and collaborative training
US20200159846A1 (en) Optimizing hash storage and memory during caching
JP6261669B2 (en) Query calibration system and method
CN114676155A (en) Code prompt information determining method, data set determining method and electronic equipment
CN112925874A (en) Similar code searching method and system based on case marks
JP5149063B2 (en) Data comparison apparatus and program
CN112509644A (en) Molecular optimization method, system, terminal equipment and readable storage medium
CN111581270A (en) Data extraction method and device
CN115543437B (en) Code annotation generation method and system
US11409806B2 (en) Apparatus and method for constructing Aho-Corasick automata for detecting regular expression pattern
US11809302B2 (en) Automated program repair using stack traces and back translations
CN116089491B (en) Retrieval matching method and device based on time sequence database
CN116991459B (en) Software multi-defect information prediction method and system
US20230138152A1 (en) Apparatus and method for generating valid neural network architecture based on parsing
CN117951221A (en) Multimode sentence processing method, multimode sentence processing device, computer equipment and storage medium
CN115718696A (en) Source code cryptography misuse detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant