CN110471835B - Similarity detection method and system based on code files of power information system - Google Patents


Info

Publication number
CN110471835B
CN110471835B (granted publication of application CN201910593863.7A)
Authority
CN
China
Prior art keywords
text
vector
similarity
semantic
structure vector
Prior art date
Legal status
Active
Application number
CN201910593863.7A
Other languages
Chinese (zh)
Other versions
CN110471835A (en
Inventor
钱琳
俞俊
朱广新
庞恒茂
任晓龙
胡鑫
许明杰
王琳
梅竹
陈海洋
Current Assignee
NARI Group Corp
Nari Technology Co Ltd
State Grid Shaanxi Electric Power Co Ltd
Original Assignee
NARI Group Corp
Nari Technology Co Ltd
State Grid Shaanxi Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by NARI Group Corp, Nari Technology Co Ltd, State Grid Shaanxi Electric Power Co Ltd filed Critical NARI Group Corp
Priority to CN201910593863.7A priority Critical patent/CN110471835B/en
Publication of CN110471835A publication Critical patent/CN110471835A/en
Application granted granted Critical
Publication of CN110471835B publication Critical patent/CN110471835B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — Physics
    • G06 — Computing; Calculating or Counting
    • G06F — Electric Digital Data Processing
    • G06F 11/00 — Error detection; error correction; monitoring
    • G06F 11/36 — Preventing errors by testing or debugging software
    • G06F 11/3604 — Software analysis for verifying properties of programs
    • G — Physics
    • G06 — Computing; Calculating or Counting
    • G06F — Electric Digital Data Processing
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/22 — Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a similarity detection method and system based on code files of an electric power information system. The method comprises the following steps: acquiring a first code file and a second code file whose similarity is to be judged, and preprocessing them to obtain a first text and a second text respectively; obtaining the semantic word vector of each text from the TF-IDF values of its words; searching the function call tree structure of each text from its function call entry and computing a first text structure vector and a second text structure vector; computing intermediate semantic word vectors from the text semantic word vectors, and computing a first intermediate structure vector and a second intermediate structure vector after taking the union of the two text structure vectors; and, from these, obtaining the similarity of the first text and the second text. The invention first applies a preprocessing function to simplify the code of each code file, which improves detection efficiency and reduces the detection error rate.

Description

Similarity detection method and system based on code files of power information system
Technical Field
The invention relates to the technical field of code similarity detection, in particular to a similarity detection method and system based on a code file of an electric power information system.
Background
Code similarity detection is currently applied mainly to code plagiarism detection. It is an important task in computer software development and maintenance, and is widely used in source code plagiarism detection, software component library querying, software defect detection, program understanding and other fields. It can help teachers detect copying in student programming assignments and has practical value for establishing software copyright.
In "Metrics based plagiarism monitoring" (paper presented at the 6th Annual CCSC Northeastern Conference, Middlebury, VT, 2001), Jones summarized ten means of plagiarism: (1) verbatim copying; (2) changing comments; (3) changing whitespace; (4) renaming identifiers; (5) reordering code blocks; (6) reordering statements within code blocks; (7) reordering operators and operands in expressions; (8) changing data types; (9) adding redundant statements and variables; (10) replacing the original control structure with an equivalent one.
Judging from the current state of research at home and abroad, domestic research on program similarity discrimination is relatively scarce, and most of it focuses on Chinese word segmentation and semantics. One example is the BUAASIM system, used in the teaching-support platform of the advanced programming course at Beihang University to detect whether submitted student program assignments are copied.
Several foreign software tools detect whether a source program is plagiarized, such as the MOSS system of Stanford University, the SIM system of Wichita State University, the GPLAG system of the University of Illinois, the JPlag system of the University of Karlsruhe in Germany, and the YAP3 system of the University of Sydney in Australia. Mainstream code similarity detection techniques fall into two broad categories — attribute counting and structure metrics — with the following specific methods: text-based comparison, token-based comparison, metric-based comparison, abstract syntax tree (AST) comparison, program dependence graph (PDG) comparison, and other related methods.
In practical use and research, each of the currently popular methods above has shortcomings. The token-based and text-based approaches perform only moderately, without significant difference between them. The AST-based method performs well, but its algorithm is complex, hard to implement and slow to execute, and must be changed substantially to match different languages; the PDG-based method performs poorly. Another article indicates that all five methods reject a large number of genuinely similar codes and have a low detection rate in special cases such as injected code.
In summary, currently mainstream code similarity detection methods generally suffer from poor detection accuracy, complexity of some methods, long execution times, high error rates in some cases, and difficulty in applying them to different programming languages.
Disclosure of Invention
The invention aims to: overcome the defects of the prior art by providing a similarity detection method and system based on code files of a power information system, addressing the problems of poor detection accuracy, complexity of some methods, long execution time and high error rates in some cases encountered in code similarity detection.
The technical scheme is as follows: the invention discloses a similarity detection method based on a power information system code file, which comprises the following steps:
acquiring a first code file and a second code file which need to be judged for similarity, and preprocessing the first code file and the second code file to respectively obtain a first text and a second text;
obtaining the first text semantic word vector according to the TF-IDF value of the word in the first text, and obtaining the second text semantic word vector according to the TF-IDF value of each word in the second text;
respectively searching function call tree structures of the first text and the second text from function call entries of the first text and the second text, and calculating to obtain a first text structure vector and a second text structure vector;
merging the first text semantic word vector and the second text semantic word vector and calculating a first intermediate semantic word vector and a second intermediate semantic word vector, and merging the first text structure vector and the second text structure vector and calculating a first intermediate structure vector and a second intermediate structure vector;
and calculating, with a cosine similarity algorithm, the semantic similarity of the first and second texts from the first and second intermediate semantic word vectors, and the structural similarity of the first and second texts from the first and second intermediate structure vectors, so as to obtain the similarity of the first text and the second text.
Further, it includes:
the method for searching the function call tree structures of the first text and the second text respectively from the function call entries of the first text and the second text comprises the following steps:
reading codes in a first text and a second text, and calling a code preprocessing function to respectively preprocess the first text and the second text;
respectively traversing the preprocessed first text and second text from the function call entry by hierarchical traversal, with the function call entry as the root node of the function call tree; the functions called by the previous layer's functions are searched line by line, and each newly called function becomes a child node of its parent function node, until the functions return;
and respectively acquiring the function call tree structures of the first text and the second text, and numbering each node according to layers.
Further, comprising:
the calling code preprocessing function respectively preprocesses the first text and the second text, and the calling code preprocessing function comprises the following steps:
first, the comment characters in the first and second texts are mapped to different states; then the first and second texts are each traversed, the traversal of code characters being converted into transitions between states; characters in non-comment states are retained and characters whose state marks them as comment content are deleted, thereby simplifying the code.
Further, it includes:
the calculation to obtain the first text structure vector and the second text structure vector comprises the following specific steps:
traversing the function call tree of the first text depth-first, stopping at each leaf node and recording the numbering sequence from the root node to that leaf, until all leaves have been visited, giving a set of numbering sequences whose normalized combination is taken as the first text structure vector;
and traversing the function call tree of the second text depth-first, stopping at each leaf node and recording the numbering sequence from the root node to that leaf, until all leaves have been visited, giving a set of numbering sequences whose normalized combination is taken as the second text structure vector.
Further, comprising:
the normalization process is as follows: if the numbering sequence set contains no several sequences forking from the same node, a sequence is denoted Seq_m, where m is the count of numbers in the sequence; if several numbering sequences fork from the same node, they are denoted Seq_{m,n}, where n is the number of the node at which the sequences fork.
Further, comprising:
after the first text structure vector and the second text structure vector are subjected to union set, a first intermediate structure vector and a second intermediate structure vector are calculated, and the method comprises the following steps:
traversing a union vector of the first text structure vector and the second text structure vector from left to right; if the subscript of the corresponding numbering sequence in the union vector is a single m: filling zero into the first intermediate structure vector if that sequence is absent from the first text structure vector, and otherwise filling in the count of that sequence in the first (or, for the second intermediate structure vector, the second) text structure vector;
if the subscript of the corresponding numbering sequence has both m and n, judging whether a sequence with the same subscript exists in the other text structure vector, and if it does: when the counts of all identical sequences with that subscript are the same in both text structure vectors, filling that count into the first intermediate structure vector and the second intermediate structure vector;
otherwise, for identical sequences with differing counts, filling the smaller count into the first and second intermediate structure vectors, converting each surplus sequence into the completion degree multiplied by the plain sequence with subscript m only, and filling the corresponding count into the intermediate structure vector;
and if no sequence with the same subscript exists, converting the completion degree multiplied by the sequence into the plain subscript-m form in the text structure vector containing subscripts m and n, adding the result to the count of the sequence with the same subscript m already in that text structure vector to obtain a new text structure vector, and filling the first and second intermediate structure vectors entry by entry.
Further, comprising:
the completion degree is the ratio of the first forked child node in the function call tree structure to the total number of connecting edges between the nodes in the branch.
Further, it includes:
after merging the first text semantic word vector and the second text semantic word vector, calculating a first intermediate semantic word vector and a second intermediate semantic word vector, comprising the following steps:
traversing a union vector of the first text semantic word vector and the second text semantic word vector from left to right, filling zeros into a first intermediate semantic word vector if a word in the union vector does not exist in the first text semantic word vector, otherwise, filling a TF-IDF value of the corresponding word, and obtaining a final first intermediate semantic word vector after traversing;
traversing the union vector of the first text semantic word vector and the second text semantic word vector from left to right, filling zero into the second intermediate semantic word vector if the word in the union vector does not exist in the second text semantic word vector, otherwise, filling the TF-IDF value of the corresponding word, and obtaining the final second intermediate semantic word vector after traversing.
Further, it includes:
the similarity S of the first text and the second text is expressed as: S = s1·w1 + s2·w2, where s1 is the semantic similarity value, s2 the structural similarity value, w1 the weight of the semantic similarity, and w2 the weight of the structural similarity.
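As a minimal sketch of the weighted combination (the patent does not fix the weight values; the equal split and the assumption that w1 + w2 = 1 below are illustrative choices):

```python
def combined_similarity(s1: float, s2: float, w1: float = 0.5, w2: float = 0.5) -> float:
    """Combine semantic similarity s1 and structural similarity s2 as S = s1*w1 + s2*w2."""
    # Assumption for this sketch: the two weights sum to 1.
    assert abs((w1 + w2) - 1.0) < 1e-9, "weights are assumed to sum to 1"
    return s1 * w1 + s2 * w2
```

With equal weights, a semantic similarity of 0.8 and a structural similarity of 0.6 combine to S = 0.7.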
A similarity detection system based on a power information system code file comprises:
the acquisition module is used for acquiring a first code file and a second code file which need to be judged for similarity, and respectively acquiring a first text and a second text after preprocessing;
the semantic word vector calculation module is used for obtaining the first text semantic word vector according to the TF-IDF value of the words in the first text and obtaining the second text semantic word vector according to the TF-IDF value of each word in the second text;
the text structure vector generating module is used for respectively searching function call tree structures of the first text and the second text from function call inlets of the first text and the second text and calculating to obtain a first text structure vector and a second text structure vector;
the intermediate vector calculation module is used for calculating a first intermediate semantic word vector and a second intermediate semantic word vector after merging the first text semantic word vector and the second text semantic word vector, and calculating a first intermediate structure vector and a second intermediate structure vector after merging the first text structure vector and the second text structure vector;
and the similarity calculation module is used for calculating, with a cosine similarity algorithm, the semantic similarity of the first and second texts from the first and second intermediate semantic word vectors, and the structural similarity of the first and second texts from the first and second intermediate structure vectors, so as to obtain the similarity of the first text and the second text.
Beneficial effects:
(1) the invention first applies a preprocessing function to simplify the code of each code file, which improves detection efficiency and reduces the detection error rate;
(2) the invention considers both the semantic content and the function call structure of the code, computes the similarity between codes from these two aspects to judge suspected plagiarism, and achieves higher detection accuracy, stronger comprehensiveness and high execution efficiency.
Drawings
Fig. 1 is a flowchart of a similarity detection method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a state machine transition according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a function call tree structure of a first text according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a function call tree structure of a second text according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a similarity detection system according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Lucene tokenizer: Lucene is a subproject of the Apache Software Foundation. It is an open-source full-text retrieval toolkit — not a complete full-text search engine but a full-text search engine architecture, providing a complete query engine and index engine and a partial text-analysis engine (for English and German). Lucene offers excellent tokenizers; IKAnalyzer, for example, is an open-source, lightweight tokenizer that provides two segmentation modes: fine-grained segmentation and smart segmentation.
TF-IDF algorithm: inputting: documents requiring computation, word libraries requiring computation
TF: word frequency refers to the number of times a given word appears in the document.
Calculating the formula: TF-i/L
IDF: reverse file frequency is a measure of the general importance of a word.
Calculating the formula:
Figure BDA0002116999500000061
the TF-IDF values are: TF-IDF ═ TF X IDF
And (3) outputting: the TF-IDF value for each word represents how important the word is in the document.
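The TF and IDF computations above can be sketched in a few lines (the original IDF formula was lost to an image placeholder; the log(N / (1 + d)) smoothing used below is a common variant and is an assumption here):

```python
import math

def tf(word: str, doc: list[str]) -> float:
    # TF = occurrences of the word in the document / total words in the document
    return doc.count(word) / len(doc)

def idf(word: str, docs: list[list[str]]) -> float:
    # IDF = log(N / (1 + d)); N = total documents, d = documents containing the word.
    # The +1 avoids division by zero -- an assumed smoothing choice.
    d = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / (1 + d))

def tf_idf(word: str, doc: list[str], docs: list[list[str]]) -> float:
    # TF-IDF = TF x IDF
    return tf(word, doc) * idf(word, docs)
```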
Cosine similarity algorithm. Input: two vectors.
Cosine similarity measures the similarity of two vectors by the cosine of the angle between them. The cosine of 0° is 1, the cosine of any other angle is less than 1, and its minimum value is −1. The cosine of the angle between two vectors therefore indicates whether they point in roughly the same direction: it is 1 when the vectors have the same direction, 0 when the angle between them is 90°, and −1 when they point in completely opposite directions. The result depends only on the direction of the vectors, not on their length. The calculation formula is:
cos(θ) = (a · b) / (‖a‖ ‖b‖)
where a and b are the two vectors.
Output: the similarity of the vectors.
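A minimal implementation of this formula (assuming neither input is the zero vector):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|); depends only on direction, not length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Parallel vectors give 1, orthogonal vectors give 0, and opposite vectors give −1, regardless of their lengths.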
Referring to fig. 1, a similarity detection method based on a code file of an electric power information system includes the following steps:
step 1, acquiring a first code file and a second code file which need to be judged for similarity, and preprocessing the first code file and the second code file to respectively obtain a first text and a second text;
the first code file and the second code file are converted into pure text files, the Lucene segmenter is adopted to segment the code content into a plurality of words, and then the words belonging to code keyword reserved words and the like are removed aiming at the segmented words, because the words appear in a large amount in the code, the similarity calculation result is influenced.
Step 2: obtain the first text semantic word vector α0 from the TF-IDF values of the words in the first text, and the second text semantic word vector α1 from the TF-IDF values of the words in the second text.
TF-IDF values are computed for all words remaining in the first text's word set (the IDF values must be computed jointly with the second code file); finally, the top N most important words by TF-IDF, in descending order, together with their TF-IDF values, are taken as the semantic word vector of the code to be detected.
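The top-N selection can be sketched as follows (N is a tuning parameter the patent leaves open):

```python
def semantic_word_vector(tfidf: dict[str, float], n: int) -> dict[str, float]:
    # Keep the N words with the highest TF-IDF values, with their values,
    # as the semantic word vector of the code under test.
    top = sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(top)
```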
Step 3: starting from the function call entries of the first and second texts, search their function call tree structures and compute the first text structure vector β0 and the second text structure vector β1.
Specifically, the method comprises the following steps:
step 31, reading codes in the first text and the second text, and calling a code preprocessing function to respectively preprocess the first text and the second text;
the pretreatment process comprises the following steps:
Read the code to be detected and call the code preprocessing function to remove unnecessary content such as comments, annotations and blank lines; equivalently replace complex logic in the code — for example, converting loop structures (for, while) and branch structures (if … else if, switch … case) into a sequential structure, and adding a closed-loop mark to recursive structures to avoid infinite loops — thereby simplifying the code.
The preprocessing function uses the idea of a finite state machine: the traversal of code characters is converted into transitions between states, characters in certain states are retained, and characters whose state marks them as comment content are deleted, finally realizing comment removal for the code file. The state transition diagram is shown in FIG. 2:
(00) Set the normal state to 0 and initialize to the normal state.
Each time a character is traversed, the following conditions are checked in turn; when one matches, or all checks are done, the next character is examined.
(01) If / is encountered in state 0, a comment may follow: enter state 1.
(02) If * is encountered in state 1, a multi-line comment begins: enter state 2.
(03) If / is encountered in state 1, a single-line comment begins: enter state 4.
(04) If neither * nor / is encountered in state 1, the earlier / was a path symbol or division sign: return to state 0.
(05) If * is encountered in state 2, the multi-line comment may be about to end: enter state 3.
(06) If * is not encountered in state 2, the multi-line comment continues: stay in state 2.
(07) If / is encountered in state 3, the multi-line comment ends: return to state 0.
(08) If / is not encountered in state 3, only a * was seen and the multi-line comment continues: return to state 2.
(09) If \ is encountered in state 4, a line continuation inside the single-line comment may follow: enter state 9.
(10) If \ is encountered in state 9, the possible line continuation continues: stay in state 9.
(11) If any other character is encountered in state 9, the comment continues past the line break: return to state 4.
(12) If the newline character \n is encountered in state 4, the single-line comment ends: return to state 0.
(13) If ' is encountered in state 0, a character constant begins: enter state 5.
(14) If the escape character \ is encountered in state 5, enter state 6.
(15) Any character encountered in state 6 returns to state 5.
(16) If ' is encountered in state 5, the character constant ends: return to state 0.
(17) If " is encountered in state 0, a string constant begins: enter state 7.
(18) If \ is encountered in state 7, enter state 8.
(19) Any character encountered in state 8 returns to state 7.
(20) If " is encountered in state 7, the string constant ends: return to state 0.
Corresponding actions are performed in the different states: the current character is output in states 0, 5, 6, 7 and 8; finally, the code is traversed line by line and empty lines are deleted.
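The twenty rules above can be sketched as a character-level state machine. This is a minimal interpretation, with one adjustment: rule (08) is implemented so that a run of asterisks before the closing slash (e.g. `**/`) still ends a multi-line comment, which the literal rule would miss:

```python
def strip_comments(code: str) -> str:
    """Finite-state comment stripper sketched from the patent's state table.

    States: 0 normal, 1 after '/', 2 inside /* */, 3 inside /* */ after '*',
    4 inside //, 9 after '\\' inside //, 5 inside '...', 6 after '\\' in '...',
    7 inside "...", 8 after '\\' in "...".
    """
    out, state = [], 0
    for ch in code:
        if state == 0:
            if ch == '/':
                state = 1                          # a comment may follow
            else:
                out.append(ch)
                if ch == "'":  state = 5
                elif ch == '"': state = 7
        elif state == 1:
            if ch == '*':   state = 2              # multi-line comment begins
            elif ch == '/': state = 4              # single-line comment begins
            else:                                  # it was division or a path symbol
                out += ['/', ch]
                state = 5 if ch == "'" else 7 if ch == '"' else 0
        elif state == 2:
            if ch == '*': state = 3                # comment may be about to end
        elif state == 3:
            if ch == '/':   state = 0              # multi-line comment ended
            elif ch != '*': state = 2              # stay in 3 on '*' so "**/" ends it
        elif state == 4:
            if ch == '\\':  state = 9              # possible line continuation
            elif ch == '\n':
                out.append('\n'); state = 0        # single-line comment ended
        elif state == 9:
            if ch != '\\': state = 4               # continuation: comment goes on
        elif state in (5, 7):                      # char / string constant: keep text
            out.append(ch)
            if ch == '\\':
                state += 1
            elif ch == ("'" if state == 5 else '"'):
                state = 0
        elif state in (6, 8):                      # escaped char inside a constant
            out.append(ch); state -= 1
    # Final step from the patent: delete empty lines.
    return '\n'.join(l for l in ''.join(out).split('\n') if l.strip())
```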
Step 32: traverse the preprocessed first and second texts from the function call entry by hierarchical (level-order) traversal, with the function call entry as the root node of the function call tree; the functions called by the previous layer's functions are searched line by line, and each newly called function becomes a child node of its parent function node, until the functions return.
The function call entry — in this embodiment the main function — serves as the root node of the function call tree structure. Starting from the entry, a level-order traversal searches line by line for the functions called by the previous layer's function, then locates each function's body declaration, identifying code features such as function overloading through the parameter list; each newly called function becomes a child node of its parent function node, and this step repeats until a function returns without calling further downward.
And step 33, respectively obtaining the function call tree structures of the first text and the second text, and numbering each node according to layers. The nodes include a root node and a leaf node.
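A toy sketch of steps 32–33: the call graph is assumed to have already been recovered by the line-by-line scan (this sketch takes it ready-made as a dict), and nodes are numbered breadth-first, layer by layer; the closed-loop mark for recursion is realized by not re-expanding functions that are already numbered:

```python
from collections import deque

def build_call_tree(calls: dict[str, list[str]], entry: str = "main"):
    """Level-order traversal from the call entry; returns (tree, numbering)."""
    number: dict[str, int] = {}
    tree: dict[str, list[str]] = {}
    queue = deque([entry])
    while queue:
        fn = queue.popleft()
        if fn in number:          # closed-loop mark: a recursive call is not re-expanded
            continue
        number[fn] = len(number)  # nodes numbered in layer order
        tree[fn] = calls.get(fn, [])
        queue.extend(tree[fn])
    return tree, number
```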
Step 4: after taking the union of the first text semantic word vector α0 and the second text semantic word vector α1, calculate the first intermediate semantic word vector α′0 and the second intermediate semantic word vector α′1; after taking the union of the first text structure vector β0 and the second text structure vector β1, calculate the first intermediate structure vector β′0 and the second intermediate structure vector β′1.
Specifically, the method comprises the following steps:
Step 41: traverse the function call tree of the first text depth-first, stopping at each leaf node and recording the numbering sequence from the root node to that leaf, until all leaves have been visited, giving a set of numbering sequences; the normalized combination is taken as the first text structure vector.
Step 42: traverse the function call tree of the second text depth-first, stopping at each leaf node and recording the numbering sequence from the root node to that leaf, until all leaves have been visited, giving a set of numbering sequences; the normalized combination is taken as the second text structure vector.
Further, it includes:
The normalization is as follows: if the numbering sequence set contains no several sequences forking from the same node, a sequence is denoted Seq_m, where m is the count of numbers in the sequence; if several numbering sequences fork from the same node, they are denoted Seq_{m,n}, where n is the number of the node at which they fork.
Since different trees are numbered differently, the numbering sequences must be identified. For example, 0-1-2, consisting of 3 numbers, is counted as Seq_3; if there are two sequences 0-1-2 and 0-1-3, which fork from the second node, they are counted as two Seq_{3,2}. All the Seq labels taken together form the code's structure vector β0.
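One way to realize this labeling (the patent does not spell out how sequences that share only the root are treated; counting those as plain Seq_m is an assumption of this sketch):

```python
from collections import Counter

def normalize(sequences: list[tuple[int, ...]]) -> Counter:
    """Label each root-to-leaf numbering sequence as Seq_m or Seq_{m,n}."""
    labels: Counter = Counter()
    for seq in sequences:
        fork = 0
        for other in sequences:
            if other is seq:
                continue
            k = 0  # length of the longest common prefix with another sequence
            while k < min(len(seq), len(other)) and seq[k] == other[k]:
                k += 1
            fork = max(fork, k)
        if fork > 1:  # forks below the root: record the fork node position n
            labels[f"Seq_{len(seq)},{fork}"] += 1
        else:         # no fork (or root only): plain Seq_m
            labels[f"Seq_{len(seq)}"] += 1
    return labels
```

On the patent's example, the sequences 0-1-2 and 0-1-3 fork at the second node and both become Seq_{3,2}.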
Step 43, traversing a union vector a of the first text semantic word vector and the second text semantic word vector from left to right, if a word in the union vector does not exist in the first text semantic word vector, filling zero into a first intermediate semantic word vector, otherwise, filling a TF-IDF value of a corresponding word, and obtaining a final first intermediate semantic word vector after traversing;
step 44, similar to the first text, traversing a union vector a of the first text semantic word vector and the second text semantic word vector from left to right, if the word in the union vector does not exist in the second text semantic word vector, filling zero into the second intermediate semantic word vector, otherwise, filling a TF-IDF value of the corresponding word, and obtaining a final second intermediate semantic word vector after traversing.
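Steps 43–44 can be sketched as one alignment over the union vocabulary, zero-filling the words a text lacks so both intermediate vectors share a coordinate system:

```python
def align_semantic_vectors(v1: dict[str, float], v2: dict[str, float]):
    """Zero-fill both semantic word vectors over the union vocabulary."""
    union = sorted(set(v1) | set(v2))          # fixed left-to-right order
    inter1 = [v1.get(w, 0.0) for w in union]   # first intermediate semantic word vector
    inter2 = [v2.get(w, 0.0) for w in union]   # second intermediate semantic word vector
    return union, inter1, inter2
```

The two outputs can then be compared directly with cosine similarity.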
Step 45: traverse the union vector B of the first text structure vector and the second text structure vector from left to right. If the subscript of the corresponding number sequence in the union vector is a single m, there is no fork: if the number sequence does not occur in the first text structure vector, fill a zero into the first intermediate structure vector; otherwise fill in the count of the corresponding number sequence in the first or second text structure vector;
if the subscript of the corresponding number sequence encountered in the union vector contains both m and n, check whether a number sequence with the same subscript occurs in the other text structure vector. If it does: when the counts of all identical number sequences with the same subscript are equal in the two text structure vectors, fill that count into both the first and second intermediate structure vectors;
otherwise, for identical number sequences with unequal counts, fill the smaller count into the first and second intermediate structure vectors, and convert the surplus sequences on the side with the larger count by multiplying them by the completion degree, replacing them with the number sequence carrying only the subscript m, and filling the corresponding value into that intermediate structure vector. For example, if the first code has three Seq_32 and the second has two, the Seq_32 entry in both vectors becomes 2; the one remaining Seq_32 of the first code is equivalent to 0.5 of a Seq_3 and is added at the corresponding vector position, 0.5 being the completion degree.
If no sequence with the same subscript exists in the other text structure vector, multiply the count by the completion degree, replace the sequences with the number sequence carrying only the subscript m in the text structure vector that contains the sequences with both m and n, and add the result to the count of the number sequences with the same subscript m already in that text structure vector, obtaining a new text structure vector; then fill the first and second intermediate structure vectors entry by entry.
The completion degree is the ratio of the number of connecting edges from the root node to the first fork node of a branch to the total number of connecting edges between the nodes in that branch. As shown in fig. 3, the text structure vector is β_0 = (0-1, 0-2-4-7, 0-2-5, 0-3-6-8). In the number sequence 0-2-4-7 there are three connecting edges between the nodes and the branch forks at the first child node, so the completion degree is 1/3. Similarly, in the number sequence 0-2-5 there are two connecting edges between the nodes and the branch forks at the first child node, so the completion degree is 1/2.
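The merge of step 45 together with the completion degree can be sketched as follows (a simplified Python reading of the rules above; it assumes single-digit m and n in labels such as Seq42, and takes the completion degree of Seq_mn as (n−1)/(m−1), i.e. the edges from the root to the fork node over the total edges of the branch, which reproduces the 1/3 and 1/2 of the fig. 3 example):

```python
from collections import Counter
from fractions import Fraction

def parse_label(label):
    """'Seq4' -> (4, None); 'Seq42' -> (4, 2). Single-digit m, n assumed."""
    digits = label[len("Seq"):]
    return (int(digits), None) if len(digits) == 1 else (int(digits[0]), int(digits[1]))

def merge_structure_vectors(counts_a, counts_b):
    """Align two normalized structure Counters on their union (step 45).

    Forked sequences Seq_mn without a matching partner -- or the surplus
    on the side with the larger count -- are converted into fractional
    plain Seq_m counts via the completion degree (n-1)/(m-1)."""
    labels = sorted(set(counts_a) | set(counts_b))
    a = {l: Fraction(counts_a.get(l, 0)) for l in labels}
    b = {l: Fraction(counts_b.get(l, 0)) for l in labels}
    for label in labels:
        m, n = parse_label(label)
        if n is None:
            continue
        completion = Fraction(n - 1, m - 1)
        base = f"Seq{m}"
        for mine, theirs in ((a, b), (b, a)):
            surplus = mine[label] - theirs[label]
            if surplus > 0:
                mine[label] -= surplus          # keep only the shared count
                mine[base] = mine.get(base, Fraction(0)) + surplus * completion
    order = sorted(set(a) | set(b))
    return (order,
            [a.get(l, Fraction(0)) for l in order],
            [b.get(l, Fraction(0)) for l in order])
```

On the worked example later in the text, β_0 = (Seq_2, Seq_42, Seq_32, Seq_4) against β_1 = (Seq_3, 2·Seq_4) yields β′_0 = (1, 1/2, 0, 4/3, 0) and β′_1 = (0, 1, 0, 2, 0) over the union (Seq_2, Seq_3, Seq_32, Seq_4, Seq_42).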
After normalization, the first intermediate semantic word vector α′_0 and the second intermediate semantic word vector α′_1 have equal length, namely the length of the union vector A of the first text semantic word vector and the second text semantic word vector; the first intermediate structure vector β′_0 and the second intermediate structure vector β′_1 are likewise vectors of equal length, their length being that of the union B of the first text structure vector β_0 and the second text structure vector β_1.
S140: a cosine similarity algorithm is used to calculate the semantic similarity of the first text and the second text from the first intermediate semantic word vector α′_0 and the second intermediate semantic word vector α′_1, and to calculate the structural similarity of the first text and the second text from the first intermediate structure vector β′_0 and the second intermediate structure vector β′_1; from these the similarity of the first text and the second text is obtained.
For the resulting equal-length vectors α′_0 and α′_1, β′_0 and β′_1, the similarities are calculated using the cosine similarity algorithm. The result for α′_0 and α′_1 is recorded as s_1, the semantic similarity of the codes; the result for β′_0 and β′_1 is recorded as s_2, the structural similarity of the codes.
The similarity S of the first text and the second text is expressed as S = s_1·w_1 + s_2·w_2, where w_1 is the weight corresponding to the semantic similarity and w_2 the weight corresponding to the structural similarity, with w_1 + w_2 = 1. The two weights can be adjusted for different scenarios; machine learning may also be used, training on a number of plagiarism samples from the target detection environment to which the algorithm is applied, to finally arrive at a reasonable ratio. Against this background, the present embodiment selects w_1 = 0.7, w_2 = 0.3.
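A minimal sketch of the final combination — cosine similarity on the aligned vectors, then the weighted sum — assuming the vectors have already been aligned as in steps 43-45:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def overall_similarity(a0, a1, b0, b1, w1=0.7, w2=0.3):
    """S = s1*w1 + s2*w2: semantic and structural cosine similarities
    combined with the weights chosen in this embodiment."""
    s1 = cosine_similarity(a0, a1)  # semantic similarity
    s2 = cosine_similarity(b0, b1)  # structural similarity
    return w1 * s1 + w2 * s2
```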
The following describes the implementation process of the present algorithm in detail by taking practical examples:
Input: after preprocessing, the important words of the first text are: Test1, Test2, Test3, Test4 and Test5, and the important words of the second text are: Test2, Test3, Test4, Test5 and Test6.
Assume that both codes have already been segmented into words and that the TF-IDF values have been calculated; the five highest-ranked words (keywords and reserved words excluded) and their corresponding TF-IDF values are retained. The number five is chosen for ease of calculation and demonstration and can be adjusted in an actual implementation of the algorithm. The results are shown in table 1 and table 2.
Table 1 Semantic vector table of the first text

Word     Test1   Test2   Test3   Test4   Test5
TF-IDF   0.3     0.23    0.15    0.12    0.11
Table 2 Semantic vector table of the second text

Word     Test2   Test3   Test4   Test5   Test6
TF-IDF   0.2     0.21    0.12    0.1     0.08
The union A of the semantic vectors of the two codes has 6 elements, Test1 through Test6. Thus:
α′_0 = (0.3, 0.23, 0.15, 0.12, 0.11, 0), α′_1 = (0, 0.2, 0.21, 0.12, 0.1, 0.08).
The semantic similarity of the two codes is calculated as:

s_1 = (α′_0 · α′_1) / (‖α′_0‖ ‖α′_1‖) = 0.1029 / (0.438 × 0.339) ≈ 0.693
then, assuming that the calling tree structures of the two codes are respectively as shown in fig. 3 and fig. 4, the similarity of the structure vectors of the codes is calculated.
After the code call tree is obtained, the structure vectors of the first text and the second text are respectively:
β_0 = (0-1, 0-2-4-7, 0-2-5, 0-3-6-8), β_1 = (0-1-4, 0-2-5-7, 0-3-6-8),
which, after number sequence normalization, are equivalent to:

β_0 = (Seq_2, Seq_42, Seq_32, Seq_4), β_1 = (Seq_3, 2·Seq_4).
the union of the two sets has 4 type elements, B ═ Seq2,Seq3,Seq32,Seq4,Seq42) Of only beta0In the middle has Seq42And Seq32,β1There is no double subscript numbering sequence, so both sequences will eventually be identical by a factor of threeSeq of one4And one-half Seq3The number does not remain in Seq42And Seq32Corresponding to the position of the vector of (c), 1/3 and 1/2 are the degrees of completion of the different branches.
Thus, it can be concluded that:
β′_0 = (1, 1/2, 0, 4/3, 0),
β′1=(0,1,0,2,0)。
The cosine similarity of the two structure vectors is calculated as:

s_2 = (β′_0 · β′_1) / (‖β′_0‖ ‖β′_1‖) = (1/2 + 8/3) / (1.740 × 2.236) ≈ 0.81
Because some of the forked sequences were assimilated, the result is somewhat high; it can be reduced manually if needed.
Finally, the overall similarity of the two code files is calculated as S = 0.7 × 0.693 + 0.3 × 0.81 = 0.7281.
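As a numerical check, the worked example can be replayed end to end (vectors as computed above; note that the text's 0.7281 uses the rounded intermediates 0.693 and 0.81, while the unrounded figures are s_2 ≈ 0.814 and S ≈ 0.7292):

```python
import math

def cos(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

alpha0 = [0.3, 0.23, 0.15, 0.12, 0.11, 0]   # first text, over union Test1..Test6
alpha1 = [0, 0.2, 0.21, 0.12, 0.1, 0.08]    # second text
beta0 = [1, 1 / 2, 0, 4 / 3, 0]             # over union (Seq_2, Seq_3, Seq_32, Seq_4, Seq_42)
beta1 = [0, 1, 0, 2, 0]

s1 = cos(alpha0, alpha1)    # semantic similarity, ~0.693
s2 = cos(beta0, beta1)      # structural similarity, ~0.814 (0.81 rounded to two places)
S = 0.7 * s1 + 0.3 * s2
print(round(s1, 3), round(s2, 3), round(S, 4))  # 0.693 0.814 0.7292
```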
Output: the similarity of the two code files.
In accordance with the above embodiment, the invention further provides two similar code samples together with the corresponding output of the algorithm.
Referring to fig. 5, in the embodiment of the present invention, based on the same concept as the detection method, there is also provided a similarity detection system based on a code file of an electric power information system, including:
the acquisition module is used for acquiring a first code file and a second code file which need to be judged for similarity, and respectively acquiring a first text and a second text after preprocessing;
a semantic word vector calculation module, configured to obtain the first text semantic word vector according to a TF-IDF value of a word in the first text, and obtain the second text semantic word vector according to a TF-IDF value of each word in the second text;
the text structure vector generation module is used for searching the function call tree structures of the first text and the second text from the function call entries of the first text and the second text respectively, and calculating a first text structure vector and a second text structure vector;
the intermediate vector calculation module is used for calculating a first intermediate semantic word vector and a second intermediate semantic vector after merging the first text semantic word vector and the second text semantic word vector, and calculating a first intermediate structure vector and a second intermediate structure vector after merging the first text structure vector and the second text structure vector;
and the similarity calculation module is used for calculating the semantic similarity corresponding to the first text and the second text according to the first intermediate semantic word vector and the second intermediate semantic vector by adopting a cosine similarity calculation method, and calculating the structural similarity corresponding to the first text and the second text according to the first intermediate structural vector and the second intermediate structural vector by adopting the cosine similarity calculation method, so as to obtain the similarity of the first text and the second text.
For system/apparatus embodiments, the description is relatively simple because it is substantially similar to the method embodiments, and reference may be made to some description of the method embodiments for relevant points.
Referring to fig. 6, in an embodiment of the invention, a structural schematic diagram of an electronic device is shown.
An embodiment of the present invention provides an electronic device, which may include a processor 310 (CPU), a memory 320, an input device 330, an output device 340, and the like, where the input device 330 may include a keyboard, a mouse, a touch screen, and the like, and the output device 340 may include a Display device, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), and the like.
Memory 320 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides processor 310 with program instructions and data stored in memory 320. In an embodiment of the present invention, the memory 320 may be used to store the program of the similarity detection method based on the power information system code file.
The processor 310 is configured to execute any of the above-mentioned steps of the similarity detection method based on the power information system code file according to the obtained program instructions by calling the program instructions stored in the memory 320.
Based on the above embodiments, in the embodiments of the present invention, there is provided a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the similarity detection method based on the power information system code file in any of the above method embodiments.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A similarity detection method based on a power information system code file is characterized by comprising the following steps:
acquiring a first code file and a second code file which need to be judged for similarity, and preprocessing the first code file and the second code file to respectively obtain a first text and a second text;
obtaining a first text semantic word vector according to the TF-IDF values of the words in the first text, and obtaining a second text semantic word vector according to the TF-IDF value of each word in the second text;
respectively searching function call tree structures of the first text and the second text from the function call entries of the first text and the second text, and calculating a first text structure vector and a second text structure vector;
merging the first text semantic word vector and the second text semantic word vector, calculating a first intermediate semantic word vector and a second intermediate semantic vector, merging the first text structure vector and the second text structure vector, and calculating a first intermediate structure vector and a second intermediate structure vector;
and calculating the semantic similarity corresponding to the first text and the second text according to the first intermediate semantic word vector and the second intermediate semantic vector by adopting a cosine similarity algorithm, and calculating the structural similarity corresponding to the first text and the second text according to the first intermediate structural vector and the second intermediate structural vector by adopting the cosine similarity algorithm, so as to obtain the similarity of the first text and the second text.
2. The similarity detection method based on the power information system code file according to claim 1, wherein the step of respectively searching the function call tree structures of the first text and the second text from the function call entries of the first text and the second text comprises the following steps:
reading the code in the first text and the second text, and calling a code preprocessing function to preprocess the first text and the second text respectively;
traversing the preprocessed first text and the preprocessed second text from the function call entry in a hierarchical traversal manner, the function call entry serving as the root node of the function call tree; the functions called by the functions of the previous layer are searched line by line, and each newly called function serves as a child node of its parent function node, until the functions return;
obtaining the function call tree structures of the first text and the second text respectively, and numbering each node layer by layer.
3. The similarity detection method based on the power information system code file according to claim 2, wherein the calling code preprocessing function respectively preprocesses the first text and the second text, and comprises:
the method comprises the steps of firstly setting annotation characters in a first text and a second text to be in different states, secondly traversing the first text and the second text respectively, converting code characters into conversion between states in the traversing process, reserving characters in a non-annotation state, deleting characters which represent that the characters are in the annotation state at present, and achieving code simplification.
4. The similarity detection method based on the power information system code file according to claim 2, wherein the step of calculating to obtain the first text structure vector and the second text structure vector comprises the following specific steps:
traversing the function call tree of the first text in a depth-first mode, stopping traversing to a leaf node, recording the number sequence from the root node to the leaf node until all leaves are traversed to obtain a number sequence set, and taking the normalized combination as a first text structure vector;
and traversing the function call tree of the second text in a depth-first mode, stopping traversing to the leaf nodes, recording the number sequences from the root node to the leaf nodes until all the leaves are traversed to obtain a number sequence set, and using the normalized combination as a second text structure vector.
5. The similarity detection method based on the power information system code file according to claim 4, wherein the normalization process is as follows: if a number sequence in the number sequence set does not fork from the same node as any other sequence, it is denoted Seq_m, where m is the count of numbers in the number sequence; if several number sequences fork from the same node, each is denoted Seq_mn, where n is the position of the node at which the number sequences fork.
6. The similarity detection method based on the power information system code file according to claim 5, wherein the merging the first text structure vector and the second text structure vector is performed to calculate a first intermediate structure vector and a second intermediate structure vector, and the method comprises the following steps:
traversing a union vector of the first text structure vector and the second text structure vector from left to right, if a subscript of a corresponding numbering sequence in the union vector is m, then: filling zeros in the first intermediate structure vector if the corresponding number sequence does not exist in the first text structure vector; otherwise, filling the number of the corresponding serial number sequences in the first text structure vector or the second text structure vector;
if m and n exist in the subscripts of the corresponding numbering sequences in the union vector in a traversal mode, judging whether the numbering sequences corresponding to the same subscripts exist in the other text structure vector, and if yes, then: if the number of all the same number sequences with the same subscript in the two text structure vectors is the same, filling the corresponding number into the first intermediate structure vector and the second intermediate structure vector;
otherwise, for the identical number sequences with different counts, filling the smaller count into the first intermediate structure vector and the second intermediate structure vector, replacing the remaining number sequences with the number sequence carrying only the subscript m according to the completion degree multiplied by the surplus of the larger count, and filling the corresponding value into the corresponding intermediate structure vector;
and if the index vector does not exist, replacing the completion degree multiplied by the number sequence only with the subscript m into the text structure vector with m and n, adding the completion degree multiplied by the number sequence with the same subscript m in the original text structure vector to obtain a new text structure vector, and filling the first text structure vector with the subscript m and the second text structure vector with the subscript m one by one.
7. The similarity detection method based on the power information system code file according to claim 6, wherein the completion degree is the ratio of the number of connecting edges from the root node to the first fork node of a branch in the function call tree structure to the total number of connecting edges between the nodes in that branch.
8. The method for detecting similarity of code files based on an electric power information system according to claim 1, wherein the merging the first text semantic word vector and the second text semantic word vector to calculate a first intermediate semantic word vector and a second intermediate semantic vector comprises the following steps:
traversing a union vector of the first text semantic word vector and the second text semantic word vector from left to right, filling zeros into a first intermediate semantic word vector if a word in the union vector does not exist in the first text semantic word vector, otherwise, filling a TF-IDF value of the corresponding word, and obtaining a final first intermediate semantic word vector after traversing;
traversing the union vector of the first text semantic word vector and the second text semantic word vector from left to right, filling zero into a second intermediate semantic word vector if the word in the union vector does not exist in the second text semantic word vector, otherwise, filling the TF-IDF value of the corresponding word, and obtaining the final second intermediate semantic word vector after traversing.
9. The similarity detection method based on the power information system code file according to claim 1, wherein the similarity S of the first text and the second text is expressed as S = s1·w1 + s2·w2, where s1 is the semantic similarity value, s2 is the structural similarity value, w1 is the weight corresponding to the semantic similarity, and w2 is the weight corresponding to the structural similarity.
10. A similarity detection system based on a code file of a power information system is characterized by comprising:
the acquisition module is used for acquiring a first code file and a second code file which need to be judged for similarity, and respectively acquiring a first text and a second text after preprocessing;
a semantic word vector calculation module, configured to obtain the first text semantic word vector according to a TF-IDF value of a word in the first text, and obtain the second text semantic word vector according to a TF-IDF value of each word in the second text;
the text structure vector generation module is used for searching function call tree structures of the first text and the second text from the function call entries of the first text and the second text respectively, and calculating a first text structure vector and a second text structure vector;
the intermediate vector calculation module is used for calculating a first intermediate semantic word vector and a second intermediate semantic vector after merging the first text semantic word vector and the second text semantic word vector, and calculating a first intermediate structure vector and a second intermediate structure vector after merging the first text structure vector and the second text structure vector;
and the similarity calculation module is used for calculating the semantic similarity corresponding to the first text and the second text according to the first intermediate semantic word vector and the second intermediate semantic vector by adopting a cosine similarity algorithm, calculating the structural similarity corresponding to the first text and the second text according to the first intermediate structural vector and the second intermediate structural vector by adopting the cosine similarity algorithm, and further obtaining the similarity of the first text and the second text.
CN201910593863.7A 2019-07-03 2019-07-03 Similarity detection method and system based on code files of power information system Active CN110471835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910593863.7A CN110471835B (en) 2019-07-03 2019-07-03 Similarity detection method and system based on code files of power information system

Publications (2)

Publication Number Publication Date
CN110471835A CN110471835A (en) 2019-11-19
CN110471835B true CN110471835B (en) 2022-07-19

Family

ID=68507522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910593863.7A Active CN110471835B (en) 2019-07-03 2019-07-03 Similarity detection method and system based on code files of power information system

Country Status (1)

Country Link
CN (1) CN110471835B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126031A (en) * 2019-12-12 2020-05-08 南京谦萃智能科技服务有限公司 Code text processing method and related product
CN112749131A (en) * 2020-06-11 2021-05-04 腾讯科技(上海)有限公司 Information duplicate elimination processing method and device and computer readable storage medium
CN111813444A (en) * 2020-07-10 2020-10-23 北京思特奇信息技术股份有限公司 Method, system and electronic equipment for analyzing similarity of source codes
CN113934450B (en) * 2020-07-13 2024-08-23 阿里巴巴集团控股有限公司 Method, apparatus, computer device and medium for generating annotation information
CN114065726A (en) * 2021-11-18 2022-02-18 北京迪力科技有限责任公司 Data processing method and device
CN114968778A (en) * 2022-05-24 2022-08-30 中电科网络空间安全研究院有限公司 Large-scale source code similarity detection method, system and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101697121A (en) * 2009-10-26 2010-04-21 哈尔滨工业大学 Method for detecting code similarity based on semantic analysis of program source code
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN110471835A (en) 2019-11-19

Similar Documents

Publication Publication Date Title
CN110471835B (en) Similarity detection method and system based on code files of power information system
Wang et al. Bridging pre-trained models and downstream tasks for source code understanding
CN105426711B (en) A kind of computer software source code similarity detection method
CN111124487B (en) Code clone detection method and device and electronic equipment
Huang et al. Towards automatically generating block comments for code snippets
US20130103662A1 (en) Methods and apparatuses for generating search expressions from content, for applying search expressions to content collections, and/or for analyzing corresponding search results
Li et al. A survey on tree edit distance lower bound estimation techniques for similarity join on XML data
Ge et al. Keywords guided method name generation
Wang et al. Keml: A knowledge-enriched meta-learning framework for lexical relation classification
Chen et al. Nquad: 70,000+ questions for machine comprehension of the numerals in text
Zhang et al. An accurate identifier renaming prediction and suggestion approach
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
Kuila et al. A Neural Network based Event Extraction System for Indian Languages.
Belkhouche et al. Plagiarism detection in software designs
Vu et al. Revising FUNSD dataset for key-value detection in document images
Yuan et al. Dependloc: A dependency-based framework for bug localization
CN109032946A (en) A kind of test method and device, computer readable storage medium
CN111858961B (en) Multi-language knowledge matching method and device for nodes and links in knowledge graph
Zhou et al. Big data validity evaluation based on MMTD
Chao et al. DeepCrash: deep metric learning for crash bucketing based on stack trace
Li et al. Hybrid model with multi-level code representation for multi-label code smell detection (077)
Li et al. Self‐admitted technical debt detection by learning its comprehensive semantics via graph neural networks
Raeymaekers et al. Learning (k, l)-contextual tree languages for information extraction from web pages
Li et al. Senti-EGCN: An Aspect-Based Sentiment Analysis System Using Edge-Enhanced Graph Convolutional Networks
Kuang et al. Suggesting method names based on graph neural network with salient information modelling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant