CN107506204A - Code similarity comparison function reconstruction method based on cosine theorem - Google Patents
Code similarity comparison function reconstruction method based on cosine theorem
- Publication number
- CN107506204A (application CN201710919613.9A)
- Authority
- CN
- China
- Prior art keywords
- similarity
- fragments
- assembly
- comparison
- assembly instruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/72—Code refactoring
Abstract
The invention provides a code similarity comparison function reconstruction method based on the cosine theorem, comprising: obtaining the assembly instructions of a compiled executable file and saving them; reading the saved assembly instructions and dividing them into method functions with a separator, to obtain the method name of each method function and a mapping table of its assembly instructions; performing word segmentation on each assembly instruction with separators; setting a minimum assembly instruction fragment value and a fragment similarity threshold; selecting a reference comparison fragment according to the minimum assembly instruction fragment value, traversing the assembly instructions in the mapping table of each method function, sequentially selecting assembly fragments of the same size as the reference comparison fragment, and comparing each selected assembly fragment with the reference comparison fragment for similarity to find similar fragments; and looking up the corresponding assembly instructions according to the similar fragments to reconstruct the function. Advantages of the invention: code similarity can be compared quickly, which makes duplicated code easier to reconstruct and helps ensure the robustness of the software code.
Description
Technical Field
The invention relates to the field of software development, in particular to a code similarity comparison function reconstruction method based on the cosine theorem.
Background
In software engineering, robust code should contain no duplicate code, which the industry regards as a source of 'vandalism'. Therefore, in the actual software development process, code is reconstructed from time to time to eliminate duplicate code. In the prior art, duplicate code is generally eliminated by checking and comparing the code manually. In large-scale software projects, however, and particularly for a novice who is new to software development, it is difficult to confirm whether code similar to one's own already exists somewhere in a vast sea of code, so the code is difficult to reconstruct.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a code similarity comparison function reconstruction method based on the cosine theorem, which is used for realizing the rapid comparison of code similarity so as to reconstruct repeated codes conveniently and ensure the robustness of software codes.
The invention is realized by the following steps: a code similarity comparison function reconstruction method based on the cosine theorem comprises the following steps:
S1, acquiring the assembly instructions of a compiled executable file, and storing the acquired assembly instructions in an output file;
S2, reading the stored assembly instructions from the output file, and dividing the assembly instructions into method functions by using a separator, to obtain the method name of each method function and the mapping table of the assembly instructions of each method function;
S3, performing word segmentation on each assembly instruction by using separators;
S4, setting a minimum assembly instruction fragment value and setting a fragment similarity threshold;
S5, selecting a reference comparison fragment according to the minimum assembly instruction fragment value; then traversing the assembly instructions in the mapping tables of the method functions, sequentially selecting assembly fragments of the same size as the reference comparison fragment, and comparing the similarity of each selected assembly fragment with the reference comparison fragment one by one;
during the comparison, when the compared similarity is greater than or equal to the set similarity threshold, increasing the sizes of the selected assembly fragment and the reference comparison fragment by 1 and comparing the similarity again, and continuing in this way until the compared similarity is smaller than the set similarity threshold; the assembly fragment and the reference comparison fragment selected when the comparison stops are judged to be similar fragments, and the two similar fragments are recorded;
S6, looking up the corresponding assembly instructions according to the recorded similar fragments to reconstruct the function.
Further, in step S5, "comparing the similarity of each selected assembly fragment with the reference comparison fragment one by one" specifically means:
the similarity comparison algorithm takes the assembly fragment as a vector, the words as vector dimensions, and the word frequency as the magnitude of the vector along each dimension, and substitutes these into the N-dimensional cosine theorem formula, where N is the number of words in the assembly fragment, to calculate the similarity; the N-dimensional cosine theorem formula is as follows, where $A_i$ and $B_i$ denote the frequencies of the i-th word in the two fragments being compared:

$$\cos\theta = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^2}\,\sqrt{\sum_{i=1}^{N} B_i^2}}$$

Further, in step S5, "recording the two similar fragments" specifically means: recording the method function, starting position and fragment size of each of the two similar fragments.
Further, in the step S4, the minimum assembler instruction fragment value is set to 3.
Further, in the step S4, the similarity threshold of the set segment is 85%.
Further, in the step S3, the separator is a space, a comma, or a tab.
The invention has the following advantages: the code similarity is compared quickly by skillfully applying the cosine law, so that in the actual software development process, developers can conveniently find whether codes submitted by the developers are similar to codes of others, and the developers can reconstruct repeated codes in time, so that the simplicity and the cleanliness of software codes are improved, and the robustness of the software codes is ensured.
Detailed Description
The invention relates to a code similarity comparison function reconstruction method based on the cosine theorem, which comprises the following steps:
S1, acquiring the assembly instructions of a compiled executable file, and storing the acquired assembly instructions in an output file;
S2, reading the stored assembly instructions from the output file, and dividing the assembly instructions into method functions by using a separator, to obtain the method name of each method function and the mapping table of the assembly instructions of each method function;
S3, performing word segmentation on each assembly instruction by using separators;
S4, setting a minimum assembly instruction fragment value and setting a fragment similarity threshold;
S5, selecting a reference comparison fragment according to the minimum assembly instruction fragment value; then traversing the assembly instructions in the mapping tables of the method functions, sequentially selecting assembly fragments of the same size as the reference comparison fragment, and comparing the similarity of each selected assembly fragment with the reference comparison fragment one by one;
during the comparison, when the compared similarity is greater than or equal to the set similarity threshold, increasing the sizes of the selected assembly fragment and the reference comparison fragment by 1 and comparing the similarity again, and continuing in this way until the compared similarity is smaller than the set similarity threshold; the assembly fragment and the reference comparison fragment selected when the comparison stops are judged to be similar fragments, and the two similar fragments are recorded;
S6, looking up the corresponding assembly instructions according to the recorded similar fragments to reconstruct the function.
In step S5, "comparing the similarity of each selected assembly fragment with the reference comparison fragment one by one" specifically means:
the similarity comparison algorithm takes the assembly fragment as a vector, the words as vector dimensions, and the word frequency as the magnitude of the vector along each dimension, and substitutes these into the N-dimensional cosine theorem formula, where N is the number of words in the assembly fragment, to calculate the similarity; the N-dimensional cosine theorem formula is as follows, where $A_i$ and $B_i$ denote the frequencies of the i-th word in the two fragments being compared:

$$\cos\theta = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^2}\,\sqrt{\sum_{i=1}^{N} B_i^2}}$$

In step S5, "recording the two similar fragments" specifically means: recording the method function, starting position and fragment size of each of the two similar fragments, so that developers can quickly and accurately locate the two similar fragments.
In step S4, the minimum assembly instruction fragment value is set to 3 and the fragment similarity threshold is set to 85%; these values gave the best results after extensive practice and statistical analysis.
In step S3, the separator is a space, a comma (,) or a tab (\t).
The present invention is further illustrated below, taking the C language as an example:
1. The command objdump -D, provided on Linux, is used to obtain the assembly instructions of the compiled executable file, and the obtained assembly instructions are stored in an output file (this file can be created as needed).
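Step 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `build_objdump_command` and `dump_assembly` and the file paths are hypothetical, and the sketch assumes an `objdump` binary is available on the system path.

```python
import subprocess
from pathlib import Path
from typing import List


def build_objdump_command(executable: str) -> List[str]:
    """Return the argument list that disassembles all sections of `executable`."""
    return ["objdump", "-D", executable]


def dump_assembly(executable: str, output_file: str) -> None:
    """Run objdump and store its text output in `output_file` (hypothetical helper)."""
    result = subprocess.run(
        build_objdump_command(executable),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output as text
        check=True,           # raise CalledProcessError if objdump fails
    )
    Path(output_file).write_text(result.stdout)
```

A call such as `dump_assembly("./a.out", "a.out.asm")` would then produce the output file that step 2 reads.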
2. Read the stored assembly instructions from the output file and divide them into method functions using <function name> as the separator, to obtain the method name of each method function and the mapping table of the assembly instructions of each method function.
For example,
<functionA>:
MOV regA, regB;
......;
ENG;
<functionB>:
MOV regC, regD;
......;
ENG.
From the above, it can be seen that this section of assembly instructions can be divided into two method functions, <functionA> and <functionB>, whose method names are A and B respectively, and the resulting mapping tables of assembly instructions are [functionA: [MOV regA, regB; ......; ENG]] and [functionB: [MOV regC, regD; ......; ENG]].
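The splitting in step 2 might look like the following sketch. The function name `split_into_functions` and the regular expression are assumptions made for illustration; the sketch uses the full text between the angle brackets as the key of the mapping table and strips the trailing semicolons used in the example listing.

```python
import re
from typing import Dict, List, Optional

# A header line ends with "<name>:"; real objdump output puts an address before it.
HEADER = re.compile(r"<([^<>]+)>:\s*$")


def split_into_functions(listing: str) -> Dict[str, List[str]]:
    """Map each function name to the list of its assembly instructions."""
    table: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in listing.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADER.search(line)
        if match:
            # Start a new entry in the mapping table.
            current = match.group(1).strip()
            table[current] = []
        elif current is not None:
            # Drop the trailing ";" that terminates instructions in the example.
            table[current].append(line.rstrip(";").strip())
    return table
```

Lines that appear before the first header are ignored, which is a design choice of this sketch rather than something the description specifies.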
3. Each assembly instruction is split into words using a space, a comma (,) or a tab (\t) as the separator. For example, for the assembly instruction MOV regA, regB; the word segmentation result is MOV, regA, regB. Different separators may of course be used in combination in a specific implementation.
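A minimal sketch of this word segmentation, assuming a hypothetical helper named `tokenize` and assuming that runs of consecutive separators count as a single split point:

```python
import re
from typing import List


def tokenize(instruction: str) -> List[str]:
    """Split one assembly instruction into words on spaces, commas and tabs."""
    # Remove surrounding whitespace and the trailing ";" used in the example.
    cleaned = instruction.strip().rstrip(";")
    # One or more separators in a row are treated as a single split point.
    return [word for word in re.split(r"[ ,\t]+", cleaned) if word]
```

For instance, `tokenize("MOV regA, regB;")` yields the three words from the example above.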
4. The minimum assembly instruction fragment value is set to 3 and the fragment similarity threshold is set to 85%.
5. First, an assembly fragment of size 3 is selected as the reference comparison fragment; then the assembly instructions in the mapping tables of all method functions are traversed, assembly fragments of the same size as the reference comparison fragment are selected in turn, and each selected assembly fragment is compared with the reference comparison fragment for similarity. For example, if the reference comparison fragment consists of assembly instructions 1, 2 and 3, then groups of 3 assembly instructions are selected in turn, starting with instructions 2, 3 and 4, and each group is compared with the reference comparison fragment (when selecting assembly instructions, all assembly instructions in the mapping table of every method function need to be traversed).
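The traversal described in this step can be pictured as a sliding window over each function's instruction list. The sketch below is an assumption-laden illustration: the name `iter_fragments` is hypothetical, and it assumes that a fragment never crosses the boundary between two method functions, which the description does not state explicitly.

```python
from typing import Dict, Iterator, List, Tuple


def iter_fragments(
    table: Dict[str, List[str]], size: int
) -> Iterator[Tuple[str, int, List[str]]]:
    """Yield (function name, start index, instructions) for every window of `size`."""
    for name, instructions in table.items():
        # A function shorter than `size` produces no window at all.
        for start in range(len(instructions) - size + 1):
            yield name, start, instructions[start:start + size]
```

Each yielded window is a candidate to compare against the reference comparison fragment.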
The similarity comparison algorithm is as follows: the assembly fragment is treated as a vector, each word is a vector dimension, and the word frequency is the magnitude of the vector along that dimension; these are substituted into the N-dimensional cosine theorem formula, where N is the number of words in the assembly fragment, to calculate the similarity. A simple example aids understanding. Suppose the assembly fragment size is 1 and the two selected fragments are fragment A (test %eax, %eax) and fragment B (move %rbx, %rbx). According to the word segmentation rule, fragment A is divided into three words: test, %eax and %eax; similarly, fragment B is divided into three words: move, %rbx and %rbx. Fragment A therefore has two dimensions, test and %eax, whose values are their word frequencies, 1 and 2 respectively; similarly, fragment B has two dimensions, move and %rbx, with values 1 and 2. Establishing a four-dimensional coordinate system (test, %eax, move, %rbx), the vector of fragment A is (1, 2, 0, 0) and the vector of fragment B is (0, 0, 1, 2), and the cosine value is calculated as:

$$\frac{1\cdot 0 + 2\cdot 0 + 0\cdot 1 + 0\cdot 2}{\sqrt{1^2 + 2^2 + 0^2 + 0^2}\,\sqrt{0^2 + 0^2 + 1^2 + 2^2}} = 0$$

Since the cosine value is 0, there is no similarity between fragment A and fragment B.
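The word-frequency cosine calculation can be sketched in a few lines of Python. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an empty fragment is an assumption of this sketch, not something the description covers.

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(words_a: Iterable[str], words_b: Iterable[str]) -> float:
    """Cosine of the angle between the word-frequency vectors of two fragments."""
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    # Only words present in both fragments contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty fragment is similar to nothing
    return dot / (norm_a * norm_b)
```

With the words of fragment A and fragment B from the example, the two fragments share no word, so the dot product and therefore the cosine value are 0.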
During the comparison, when the compared similarity is greater than or equal to the set similarity threshold, the sizes of the selected assembly fragment and the reference comparison fragment are both increased by 1 and the similarity is compared again; this continues until the compared similarity is smaller than the set similarity threshold. The assembly fragment and the reference comparison fragment selected when the comparison stops are judged to be similar fragments, and the two similar fragments are recorded, that is, the method function, starting position and fragment size of each of the two similar fragments are recorded. For example, when the calculated similarity is greater than or equal to the set 85% and the selected assembly fragment and the reference comparison fragment both have a fragment size of 3, the fragment size is increased to 4, i.e., both fragments are extended to 4 assembly instructions, and the similarity comparison continues; if the calculated similarity is still greater than or equal to the set 85%, the fragment size is increased to 5 and the comparison continues, until the calculated similarity is less than 85%.
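The grow-by-one loop can be sketched as below. All names (`words`, `cosine`, `grow_match`) are hypothetical. The sketch makes two assumptions that the description leaves open: it reports the largest size that still met the threshold, and it also stops when either function runs out of instructions.

```python
import math
import re
from collections import Counter
from typing import List, Optional


def words(instructions: List[str]) -> List[str]:
    """Split a list of instructions into words on spaces, commas and tabs."""
    out: List[str] = []
    for ins in instructions:
        out.extend(w for w in re.split(r"[ ,\t]+", ins.strip()) if w)
    return out


def cosine(a: List[str], b: List[str]) -> float:
    """Cosine similarity of two word lists based on word frequencies."""
    fa, fb = Counter(a), Counter(b)
    dot = sum(fa[w] * fb[w] for w in fa.keys() & fb.keys())
    na = math.sqrt(sum(c * c for c in fa.values()))
    nb = math.sqrt(sum(c * c for c in fb.values()))
    return dot / (na * nb) if na and nb else 0.0


def grow_match(ref: List[str], ref_start: int, cand: List[str], cand_start: int,
               min_size: int = 3, threshold: float = 0.85) -> Optional[int]:
    """Return the largest fragment size meeting the threshold, or None if none does."""
    size, best = min_size, None
    while ref_start + size <= len(ref) and cand_start + size <= len(cand):
        sim = cosine(words(ref[ref_start:ref_start + size]),
                     words(cand[cand_start:cand_start + size]))
        if sim < threshold:
            break          # similarity dropped below the threshold: stop growing
        best = size        # remember the last size that was still similar
        size += 1
    return best
```

A caller would record the function name, start position and returned size for both fragments whenever the result is not `None`.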
6. Look up the corresponding assembly instructions according to the recorded similar fragments to reconstruct the function.
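Looking up the instructions behind a recorded fragment is a simple slice of the mapping table. The `FragmentRecord` class and `lookup_instructions` function below are hypothetical names used only for illustration; the three fields mirror the recorded method function, starting position and fragment size.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FragmentRecord:
    """Location of one similar fragment: method function, start position, size."""
    function: str
    start: int
    size: int


def lookup_instructions(
    table: Dict[str, List[str]], record: FragmentRecord
) -> List[str]:
    """Return the assembly instructions that the record points at."""
    return table[record.function][record.start:record.start + record.size]
```

A developer could print the instructions for both records of a similar pair to decide how to reconstruct the duplicated code.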
In summary, the invention has the following beneficial effects: the code similarity is compared quickly by skillfully applying the cosine law, so that in the actual software development process, developers can conveniently find whether codes submitted by the developers are similar to codes of others, and the developers can reconstruct repeated codes in time, so that the simplicity and the cleanliness of software codes are improved, and the robustness of the software codes is ensured.
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.
Claims (6)
1. A code similarity comparison function reconstruction method based on the cosine theorem is characterized in that: the method comprises the following steps:
S1, acquiring the assembly instructions of a compiled executable file, and storing the acquired assembly instructions in an output file;
S2, reading the stored assembly instructions from the output file, and dividing the assembly instructions into method functions by using a separator, to obtain the method name of each method function and the mapping table of the assembly instructions of each method function;
S3, performing word segmentation on each assembly instruction by using separators;
S4, setting a minimum assembly instruction fragment value and setting a fragment similarity threshold;
S5, selecting a reference comparison fragment according to the minimum assembly instruction fragment value; then traversing the assembly instructions in the mapping tables of the method functions, sequentially selecting assembly fragments of the same size as the reference comparison fragment, and comparing the similarity of each selected assembly fragment with the reference comparison fragment one by one;
during the comparison, when the compared similarity is greater than or equal to the set similarity threshold, increasing the sizes of the selected assembly fragment and the reference comparison fragment by 1 and comparing the similarity again, and continuing in this way until the compared similarity is smaller than the set similarity threshold; the assembly fragment and the reference comparison fragment selected when the comparison stops are judged to be similar fragments, and the two similar fragments are recorded;
S6, looking up the corresponding assembly instructions according to the recorded similar fragments to reconstruct the function.
2. The code similarity comparison function reconstruction method based on the cosine theorem according to claim 1, wherein: in step S5, "comparing the similarity of each selected assembly fragment with the reference comparison fragment one by one" specifically means:
the similarity comparison algorithm takes the assembly fragment as a vector, the words as vector dimensions, and the word frequency as the magnitude of the vector along each dimension, and substitutes these into the N-dimensional cosine theorem formula, where N is the number of words in the assembly fragment, to calculate the similarity; the N-dimensional cosine theorem formula is as follows, where $A_i$ and $B_i$ denote the frequencies of the i-th word in the two fragments being compared:

$$\cos\theta = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^2}\,\sqrt{\sum_{i=1}^{N} B_i^2}}$$

3. The code similarity comparison function reconstruction method based on the cosine theorem according to claim 1, wherein: in step S5, "recording the two similar fragments" specifically means: recording the method function, starting position and fragment size of each of the two similar fragments.
4. The method for reconstructing a function based on comparison of similarity of codes according to claim 1, wherein: in step S4, the minimum assembler instruction fragment value is set to 3.
5. The method for reconstructing a function based on comparison of similarity of codes according to claim 1, wherein: in step S4, the similarity threshold of the set segment is 85%.
6. The method for reconstructing the function based on the cosine theorem code similarity comparison according to claim 1, wherein: in step S3, the separator is a space, a comma, or a tab.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710919613.9A CN107506204B (en) | 2017-09-30 | 2017-09-30 | Code similarity comparison function reconstruction method based on cosine theorem |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107506204A | 2017-12-22 |
CN107506204B | 2020-08-25 |
Family
ID=60700463
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201710919613.9A | Active | CN107506204B (en) | 2017-09-30 | 2017-09-30 | Code similarity comparison function reconstruction method based on cosine theorem |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107506204B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120005202A1 (en) * | 2010-06-30 | 2012-01-05 | International Business Machines Corporation | Method for Acceleration of Legacy to Service Oriented (L2SOA) Architecture Renovations |
CN105488023A (en) * | 2015-03-20 | 2016-04-13 | 广州爱九游信息技术有限公司 | Text similarity assessment method and device |
CN105426711A (en) * | 2015-11-18 | 2016-03-23 | 北京理工大学 | Similarity detection method of computer software source code |
CN106095737A (en) * | 2016-06-07 | 2016-11-09 | 杭州凡闻科技有限公司 | Documents Similarity computational methods and similar document the whole network retrieval tracking |
Non-Patent Citations (2)
Title |
---|
AKSHAT AGRAWAL 等: "A hybrid-token and textual based approach to find similar code segments", 《2013 FOURTH INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATIONS AND NETWORKING TECHNOLOGIES (ICCCNT)》 * |
冯振扬: "代码相似度检测算法的研究与实现", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109976999A (en) * | 2017-12-28 | 2019-07-05 | 北京京东尚科信息技术有限公司 | The measure and measurement apparatus of test case coverage rate |
CN109976999B (en) * | 2017-12-28 | 2022-09-06 | 北京京东尚科信息技术有限公司 | Method and device for measuring coverage rate of test cases |
CN113238796A (en) * | 2021-05-17 | 2021-08-10 | 北京京东振世信息技术有限公司 | Code reconstruction method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107506204B (en) | 2020-08-25 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 350000 21 / F, building 5, f District, Fuzhou Software Park, 89 software Avenue, Gulou District, Fuzhou City, Fujian Province. Applicant after: FUJIAN SINOREGAL SOFTWARE Co.,Ltd. Address before: 350000, No. 5, building F, zone 20-21, Fuzhou Software Park, 89 software Avenue, Gulou District, Fujian, Fuzhou. Applicant before: FUJIAN SINOREGAL SOFTWARE Co.,Ltd. |
GR01 | Patent grant | |