CN107506204B - Code similarity comparison function reconstruction method based on cosine theorem - Google Patents


Info

Publication number
CN107506204B
Authority
CN
China
Prior art keywords
similarity
assembly
fragments
comparison
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710919613.9A
Other languages
Chinese (zh)
Other versions
CN107506204A (en)
Inventor
吴志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Sinoregal Software Co ltd
Original Assignee
Fujian Sinoregal Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-09-30
Filing date
2017-09-30
Publication date
2020-08-25
Application filed by Fujian Sinoregal Software Co ltd
Priority to CN201710919613.9A
Publication of CN107506204A
Application granted
Publication of CN107506204B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/72 Code refactoring

Abstract

The invention provides a code similarity comparison function reconstruction method based on the cosine theorem. The method comprises: obtaining and storing the assembly instructions of a compiled executable file; reading the stored assembly instructions, dividing them into method functions by means of separators, and obtaining a mapping table from the method name of each method function to its assembly instructions; segmenting each assembly instruction into words by means of segmentation characters; setting a minimum assembly instruction fragment value and a fragment similarity threshold; selecting a reference comparison fragment according to the minimum assembly instruction fragment value, traversing the assembly instructions in the mapping table of each method function, sequentially selecting assembly fragments of the same size as the reference comparison fragment, and comparing each of them with the reference comparison fragment to find similar fragments; and locating the assembly instructions that correspond to the similar fragments so that the functions can be reconstructed. The advantage of the invention is that code similarity can be compared quickly, so that duplicate code can be conveniently reconstructed and the robustness of the software code is ensured.

Description

Code similarity comparison function reconstruction method based on cosine theorem
Technical Field
The invention relates to the field of software development, in particular to a code similarity comparison function reconstruction method based on the cosine theorem.
Background
In software engineering, robust code should contain no duplicate code, which the industry regards as a source of 'vandalism'. Therefore, in actual software development, code is reconstructed (refactored) from time to time to eliminate duplicates. In the prior art, duplicate code is generally found by inspecting and comparing code manually. In large-scale software projects, however, and particularly for a novice who is new to software development, it is difficult to confirm whether similar code already exists somewhere in a vast body of code, which makes the code difficult to reconstruct.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a code similarity comparison function reconstruction method based on the cosine theorem, which is used for realizing the rapid comparison of code similarity so as to conveniently reconstruct repeated codes and ensure the robustness of software codes.
The invention is realized by the following steps: a code similarity comparison function reconstruction method based on the cosine theorem comprises the following steps:
step S1, acquiring an assembly instruction of the compiled executable file, and storing the acquired assembly instruction into an output file;
step S2, reading the stored assembly instruction from the output file, using a separator to separate the assembly instruction into each method function, and obtaining the method name of each method function and the mapping table of the assembly instruction of each method function;
step S3, performing word segmentation processing on each assembly instruction by using the segmentation character;
step S4, setting the minimum assembly instruction fragment value and setting the similarity threshold of the fragments;
step S5, selecting a reference comparison segment according to the minimum assembly instruction segment value; then traversing assembly instructions in the mapping tables of the method functions, sequentially selecting assembly fragments with the same size as the reference comparison fragments, and comparing the similarity of the sequentially selected assembly fragments with the reference comparison fragments one by one;
in the comparison process, whenever the compared similarity is greater than or equal to the set similarity threshold, the sizes of the selected assembly fragment and of the reference comparison fragment are both increased by 1 and the similarity comparison is performed again; this continues until the compared similarity falls below the set similarity threshold, at which point the selected assembly fragment and the reference comparison fragment at the time the comparison stops are judged to be similar fragments, and the two similar fragments are recorded;
and step S6, searching corresponding assembly instructions according to the recorded similar fragments to reconstruct functions.
Further, in step S5, the "comparing the similarity between the sequentially selected assembly fragments and the reference comparison fragment one by one" specifically includes:
and comparing the similarity of the sequentially selected assembly fragments with the reference comparison fragments one by one, wherein the similarity comparison algorithm comprises the following steps: taking the assembly segment as a vector, the words as vector dimensions, and the word frequency as the size of one dimension of the vector, substituting the size into an N-dimensional cosine theorem formula, wherein N is the number of the words in the assembly segment, and calculating the similarity; wherein, the formula of the N-dimensional cosine theorem is as follows:
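$$\cos\theta = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^{2}}\,\sqrt{\sum_{i=1}^{N} B_i^{2}}}$$
where A_i and B_i are the word frequencies of the i-th word in the two assembly fragments being compared.
Further, in step S5, the "recording the two similar segments" is specifically: the respective method function, starting position and segment size of these two similar segments are recorded.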
Further, in step S4, the minimum assembly instruction fragment value is set to 3.
Further, in the step S4, the similarity threshold of the segment is set to 85%.
Further, in step S3, the segmentation character is a space, a comma, or a tab.
The invention has the following advantages: by applying the cosine theorem, code similarity can be compared quickly, so that in actual software development developers can conveniently find out whether the code they submit is similar to code written by others and can reconstruct duplicate code in time. This improves the simplicity and cleanliness of the software code and ensures its robustness.
Detailed Description
The invention relates to a code similarity comparison function reconstruction method based on the cosine theorem, which comprises the following steps:
step S1, acquiring an assembly instruction of the compiled executable file, and storing the acquired assembly instruction into an output file;
step S2, reading the stored assembly instruction from the output file, using a separator to separate the assembly instruction into each method function, and obtaining the method name of each method function and the mapping table of the assembly instruction of each method function;
step S3, performing word segmentation processing on each assembly instruction by using the segmentation character;
step S4, setting the minimum assembly instruction fragment value and setting the similarity threshold of the fragments;
step S5, selecting a reference comparison segment according to the minimum assembly instruction segment value; then traversing assembly instructions in the mapping tables of the method functions, sequentially selecting assembly fragments with the same size as the reference comparison fragments, and comparing the similarity of the sequentially selected assembly fragments with the reference comparison fragments one by one;
in the comparison process, whenever the compared similarity is greater than or equal to the set similarity threshold, the sizes of the selected assembly fragment and of the reference comparison fragment are both increased by 1 and the similarity comparison is performed again; this continues until the compared similarity falls below the set similarity threshold, at which point the selected assembly fragment and the reference comparison fragment at the time the comparison stops are judged to be similar fragments, and the two similar fragments are recorded;
and step S6, searching corresponding assembly instructions according to the recorded similar fragments to reconstruct functions.
In step S5, the "comparing the similarity between the sequentially selected assembly fragments and the reference comparison fragment one by one" specifically includes:
and comparing the similarity of the sequentially selected assembly fragments with the reference comparison fragments one by one, wherein the similarity comparison algorithm comprises the following steps: taking the assembly segment as a vector, the words as vector dimensions, and the word frequency as the size of one dimension of the vector, substituting the size into an N-dimensional cosine theorem formula, wherein N is the number of the words in the assembly segment, and calculating the similarity; wherein, the formula of the N-dimensional cosine theorem is as follows:
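$$\cos\theta = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^{2}}\,\sqrt{\sum_{i=1}^{N} B_i^{2}}}$$
where A_i and B_i are the word frequencies of the i-th word in the two assembly fragments being compared.
In step S5, the "recording the two similar segments" is specifically: recording the method function, the starting position and the segment size of each of the two similar segments, so that developers can quickly and accurately find the positions of the two similar segments.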
In step S4, the minimum assembly instruction fragment value is set to 3 and the similarity threshold of the fragments is set to 85%; these values gave the best results in extensive practice and statistical analysis.
In step S3, the segmentation character is a space, a comma (,) or a tab (\t).
The present invention is further illustrated below using the C language as an example:
1. The objdump -D command available on Linux is used to obtain the assembly instructions of the compiled executable file, and the obtained assembly instructions are stored in an output file (this file can be created as required).
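Step 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `build_objdump_command` and `dump_assembly` and the file paths are hypothetical and do not come from the patent, and the sketch assumes that `objdump` is installed and on the search path.

```python
import subprocess
from pathlib import Path
from typing import List


def build_objdump_command(executable: str) -> List[str]:
    """Return the argument list that disassembles all sections of `executable`."""
    # -D asks objdump to disassemble the contents of all sections.
    return ["objdump", "-D", executable]


def dump_assembly(executable: str, output_file: str) -> None:
    """Run objdump and store its text output in `output_file` (hypothetical helper)."""
    result = subprocess.run(
        build_objdump_command(executable),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if objdump reports a failure
    )
    Path(output_file).write_text(result.stdout)
```

A call such as `dump_assembly("./a.out", "a.out.asm")` would then produce the output file that step 2 reads.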
2. The stored assembly instructions are read from the output file and divided into method functions by using <function name> as a separator, so as to obtain the method name of each method function and the mapping table of the assembly instructions of each method function;
for example,
<function A>:
MOV regA, regB;
......;
ENG;
<function B>:
MOV regC, regD;
......;
ENG.
From the above, it can be seen that this section of assembly instructions can be divided into two method functions, <function A> and <function B>, whose method names are A and B respectively, and the resulting mapping tables of the assembly instructions are [function A: [MOV regA, regB; ...; ENG]] and [function B: [MOV regC, regD; ...; ENG]].
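The splitting in step 2 can be sketched as below. The function name `split_into_functions` and the regular expression are assumptions made for illustration. The sketch treats any line of the form `<name>:` (optionally preceded by a hexadecimal address, as in typical objdump listings) as a separator and collects the following lines under that name.

```python
import re
from typing import Dict, List, Optional

# A header line such as "<functionA>:" or "0000000000001139 <main>:".
HEADER = re.compile(r"^\s*(?:[0-9a-fA-F]+\s+)?<([^>]+)>:\s*$")


def split_into_functions(listing: str) -> Dict[str, List[str]]:
    """Map each method name to the list of its assembly instructions."""
    table: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in listing.splitlines():
        match = HEADER.match(line)
        if match:
            # A new separator starts a new method function.
            current = match.group(1).strip()
            table[current] = []
        elif current is not None and line.strip():
            # Drop surrounding whitespace and a trailing ';' if present.
            table[current].append(line.strip().rstrip(";").strip())
    return table
```

Lines that appear before the first header are ignored, since they cannot be attributed to any method function.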
3. Each assembly instruction is segmented into words using a space, a comma (,) or a tab (\t) as the segmentation character. For example, for the assembly instruction MOV regA, regB; the resulting words may be MOV, regA and regB. The exact result depends on which segmentation characters are applied, and different segmentation characters may be combined in a specific implementation.
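A minimal sketch of the word segmentation in step 3 is shown here, assuming that all three segmentation characters are applied together and that a trailing semicolon is not part of the instruction. The name `tokenize` is hypothetical.

```python
import re
from typing import List


def tokenize(instruction: str) -> List[str]:
    """Split one assembly instruction into words on spaces, commas and tabs."""
    cleaned = instruction.strip().rstrip(";")
    # Runs of separators count as one split point; empty strings are dropped.
    return [word for word in re.split(r"[ ,\t]+", cleaned) if word]
```

With this sketch, `tokenize("MOV regA, regB;")` yields the three words MOV, regA and regB.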
4. The minimum assembly instruction fragment value is set to 3 and the similarity threshold of the fragments is set to 85%.
5. First, an assembly fragment with a fragment size of 3 is selected as the reference comparison fragment; then the assembly instructions in the mapping table of each method function are traversed, assembly fragments of the same size as the reference comparison fragment are selected in sequence, and each selected assembly fragment is compared with the reference comparison fragment for similarity. For example, if the reference comparison fragment consists of the 3 assembly instructions 1, 2 and 3, then fragments of 3 assembly instructions are selected in sequence, starting with instructions 2, 3 and 4, and compared with the reference comparison fragment (when selecting assembly instructions, all assembly instructions in the mapping table of each method function need to be traversed).
The similarity comparison algorithm is as follows: the assembly fragment is taken as a vector, each word as a vector dimension, and the word frequency as the magnitude along that dimension; the vectors are substituted into the N-dimensional cosine theorem formula, where N is the number of words in the assembly fragment, and the similarity is calculated. For ease of understanding, a simple example follows. Suppose the assembly fragment size is 1 and the two selected fragments are fragment A (test %eax, %eax) and fragment B (move %rbx, %rbx). According to the word segmentation rule, fragment A is divided into three words: test, %eax, %eax; similarly, fragment B is divided into three words: move, %rbx, %rbx. Fragment A therefore has two dimensions, test and %eax, whose values are the word frequencies 1 and 2 respectively; similarly, fragment B has two dimensions, move and %rbx, with values 1 and 2. A four-dimensional coordinate system (test, %eax, move, %rbx) is established, in which fragment A is the vector (1, 2, 0, 0) and fragment B is the vector (0, 0, 1, 2). The cosine value is calculated as: (1×0 + 2×0 + 0×1 + 0×2) / (sqrt(1^2 + 2^2 + 0^2 + 0^2) × sqrt(0^2 + 0^2 + 1^2 + 2^2)) = 0. Since the cosine value is 0, there is no similarity between fragment A and fragment B.
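The word-frequency cosine computation described above can be sketched as follows. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an empty fragment is an assumption of this sketch rather than something the patent specifies.

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(words_a: Iterable[str], words_b: Iterable[str]) -> float:
    """Cosine of the angle between the word-frequency vectors of two fragments."""
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    # Only words present in both fragments contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty fragment is similar to nothing
    return dot / (norm_a * norm_b)
```

Using a `Counter` avoids building the full coordinate system explicitly: dimensions that are zero in either vector add nothing to the dot product, so only the shared words need to be visited. For fragment A and fragment B of the example, no word is shared, the dot product is 0, and the function returns 0.0.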
In the comparison process, whenever the compared similarity is greater than or equal to the set similarity threshold, the sizes of the selected assembly fragment and of the reference comparison fragment are both increased by 1 and the similarity comparison is performed again. This continues until the compared similarity falls below the set similarity threshold, at which point the selected assembly fragment and the reference comparison fragment at the time the comparison stops are judged to be similar fragments and are recorded; what is recorded is the method function, the starting position and the fragment size of each of the two similar fragments. For example, if the calculated similarity is greater than or equal to the set 85% and the selected assembly fragment and the reference comparison fragment both have a fragment size of 3, the fragment size is increased to 4, that is, both fragments are extended to 4 assembly instructions, and the similarity comparison continues; if the calculated similarity is still greater than or equal to 85%, the fragment size is increased to 5, and so on, until the calculated similarity is less than 85%.
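The fragment-growing loop can be sketched as below. All names are hypothetical. The sketch assumes that the size recorded is the last one that still met the threshold and that growth also stops when either instruction list runs out; the text does not settle either point.

```python
import math
import re
from collections import Counter
from typing import List, Optional


def fragment_words(instructions: List[str], start: int, size: int) -> List[str]:
    """Collect the words of `size` consecutive instructions beginning at `start`."""
    words: List[str] = []
    for instruction in instructions[start:start + size]:
        words.extend(w for w in re.split(r"[ ,\t]+", instruction.strip()) if w)
    return words


def cosine_similarity(words_a: List[str], words_b: List[str]) -> float:
    """Cosine of the angle between two word-frequency vectors."""
    fa, fb = Counter(words_a), Counter(words_b)
    dot = sum(fa[w] * fb[w] for w in fa.keys() & fb.keys())
    na = math.sqrt(sum(c * c for c in fa.values()))
    nb = math.sqrt(sum(c * c for c in fb.values()))
    return dot / (na * nb) if na and nb else 0.0


def grow_similar_fragment(ref: List[str], ref_start: int,
                          cand: List[str], cand_start: int,
                          min_size: int = 3,
                          threshold: float = 0.85) -> Optional[int]:
    """Return the largest fragment size that stays similar, or None if none does."""
    def similar(size: int) -> bool:
        # A fragment that would run past the end of either list cannot be formed.
        if ref_start + size > len(ref) or cand_start + size > len(cand):
            return False
        score = cosine_similarity(fragment_words(ref, ref_start, size),
                                  fragment_words(cand, cand_start, size))
        return score >= threshold

    size = min_size
    if not similar(size):
        return None
    while similar(size + 1):  # keep adding 1 while the threshold is still met
        size += 1
    return size
```

A caller would record the method function names together with `ref_start`, `cand_start` and the returned size, which matches the three items the description says are recorded for each pair of similar fragments.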
6. The corresponding assembly instructions are located according to the recorded similar fragments so that the functions can be reconstructed.
In summary, the invention has the following beneficial effects: by applying the cosine theorem, code similarity can be compared quickly, so that in actual software development developers can conveniently find out whether the code they submit is similar to code written by others and can reconstruct duplicate code in time. This improves the simplicity and cleanliness of the software code and ensures its robustness.
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (5)

1. A code similarity comparison function reconstruction method based on the cosine theorem is characterized in that: the method comprises the following steps:
step S1, acquiring an assembly instruction of the compiled executable file, and storing the acquired assembly instruction into an output file;
step S2, reading the stored assembly instruction from the output file, using a separator to separate the assembly instruction into each method function, and obtaining the method name of each method function and the mapping table of the assembly instruction of each method function;
step S3, performing word segmentation processing on each assembly instruction by using the segmentation character;
step S4, setting the minimum assembly instruction fragment value and setting the similarity threshold of the fragments;
step S5, selecting a reference comparison segment according to the minimum assembly instruction segment value; then traversing the assembly instructions in the mapping table of each method function, sequentially selecting assembly fragments with the same size as the reference comparison fragments, and comparing the similarity of the sequentially selected assembly fragments with the reference comparison fragments one by one, wherein the similarity comparison algorithm is as follows: taking the assembly segment as a vector, the words as vector dimensions, and the word frequency as the size of one dimension of the vector, substituting the size into an N-dimensional cosine theorem formula, wherein N is the number of the words in the assembly segment, and calculating the similarity; wherein, the formula of the N-dimensional cosine theorem is as follows:
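$$\cos\theta = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^{2}}\,\sqrt{\sum_{i=1}^{N} B_i^{2}}}$$
where A_i and B_i are the word frequencies of the i-th word in the two assembly fragments being compared;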
in the comparison process, when the compared similarity is greater than or equal to the set similarity threshold, adding 1 to the sizes of the selected assembly fragments and the reference comparison fragments, then performing similarity comparison, stopping adding 1 until the compared similarity is smaller than the set similarity threshold, simultaneously judging the selected assembly fragments and the reference comparison fragments as similar fragments when the comparison is stopped, and recording the two similar fragments;
and step S6, searching corresponding assembly instructions according to the recorded similar fragments to reconstruct functions.
2. The method for reconstructing a function based on comparison of similarity of codes according to claim 1, wherein: in step S5, the "recording the two similar segments" is specifically: the respective method function, starting position and segment size of these two similar segments are recorded.
3. The method for reconstructing a function based on comparison of similarity of codes according to claim 1, wherein: in step S4, the minimum assembler fragment value is set to 3.
4. The method for reconstructing a function based on comparison of similarity of codes according to claim 1, wherein: in step S4, the similarity threshold of the set segment is 85%.
5. The method for reconstructing a function based on comparison of similarity of codes according to claim 1, wherein: in step S3, the separator is a space, a comma, or a tab.
CN201710919613.9A 2017-09-30 2017-09-30 Code similarity comparison function reconstruction method based on cosine theorem Active CN107506204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710919613.9A CN107506204B (en) 2017-09-30 2017-09-30 Code similarity comparison function reconstruction method based on cosine theorem

Publications (2)

Publication Number Publication Date
CN107506204A (en) 2017-12-22
CN107506204B (en) 2020-08-25

Family

ID=60700463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710919613.9A Active CN107506204B (en) 2017-09-30 2017-09-30 Code similarity comparison function reconstruction method based on cosine theorem

Country Status (1)

Country Link
CN (1) CN107506204B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109976999B (en) * 2017-12-28 2022-09-06 北京京东尚科信息技术有限公司 Method and device for measuring coverage rate of test cases
CN113238796A (en) * 2021-05-17 2021-08-10 北京京东振世信息技术有限公司 Code reconstruction method, device, equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120005202A1 (en) * 2010-06-30 2012-01-05 International Business Machines Corporation Method for Acceleration of Legacy to Service Oriented (L2SOA) Architecture Renovations
CN105488023A (en) * 2015-03-20 2016-04-13 广州爱九游信息技术有限公司 Text similarity assessment method and device
CN105426711A (en) * 2015-11-18 2016-03-23 北京理工大学 Similarity detection method of computer software source code
CN106095737A (en) * 2016-06-07 2016-11-09 杭州凡闻科技有限公司 Documents Similarity computational methods and similar document the whole network retrieval tracking

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A hybrid-token and textual based approach to find similar code segments;Akshat Agrawal 等;《2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT)》;20140130;第1-4页 *
代码相似度检测算法的研究与实现;冯振扬;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20170531;第I138-522页 *

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
    Address after: 350000 21 / F, building 5, f District, Fuzhou Software Park, 89 software Avenue, Gulou District, Fuzhou City, Fujian Province
    Applicant after: FUJIAN SINOREGAL SOFTWARE Co.,Ltd.
    Address before: 350000, No. 5, building F, zone 20-21, Fuzhou Software Park, 89 software Avenue, Gulou District, Fujian, Fuzhou
    Applicant before: FUJIAN SINOREGAL SOFTWARE Co.,Ltd.
GR01 Patent grant