CN108446540B

CN108446540B - Program code plagiarism type detection method and system based on source code multi-label graph neural network

Info

Publication number: CN108446540B
Application number: CN201810226651.0A
Authority: CN
Inventors: 万海; 刘欣怡
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2018-03-19
Filing date: 2018-03-19
Publication date: 2022-02-25
Anticipated expiration: 2038-03-19
Also published as: CN108446540A

Abstract

The invention relates to a program code plagiarism type detection method based on a source code multi-label graph neural network, comprising the following steps: S1, for a code text, generating a plagiarized version of the code text with a custom code micro-obfuscation tool, and recording the plagiarism types at the same time; S2, extracting the code property graph feature vectors of the code text and of its plagiarized version; S3, integrating the code property graph feature vectors of the code text and of its plagiarized version to provide well-formed input for the neural network, and taking the integrated pair as a positive example; S4, integrating, by the methods of steps S2-S3, the code property graph feature vectors of code text/code text pairs, and taking them as negative examples; S5, defining a multi-task learning network model with a neural network, training 10 classifiers simultaneously on each positive/negative input example, and finally outputting a 10-dimensional vector in which each dimension represents one defined plagiarism type, thereby providing plagiarism evidence for evaluators.

Description

Program code plagiarism type detection method and system based on source code multi-label graph neural network

Technical Field

The invention relates to the field of code plagiarism detection, in particular to a program code plagiarism type detection method and system based on a source code multi-label graph neural network.

Background

Program code plagiarism means that a plagiarist's program code is obtained by directly copying another person's source code or by slightly modifying it. Code plagiarism detection refers to the process of computing the degree of similarity between two pieces of code by extracting code feature strings or fingerprints and applying a matching algorithm. Plagiarism evidence collection refers to recording, during plagiarism detection, the likelihood that particular plagiarism techniques were used, and taking it as a reference basis for judging plagiarism. With the continuous development of information technology, the practicality and necessity of program code plagiarism detection in certain settings have become increasingly evident, particularly in colleges and universities with a strong academic atmosphere.

Traditional code plagiarism detection techniques are generally based on three aspects:

(1) Attribute-counting-based methods. These methods use global statistics of the code to form an n-dimensional vector that indirectly represents the code file, such as the number of code lines, variables, functions, loops, and declarations, the number of distinct operators and distinct operands, and the total number of occurrences of operators and of operands; each dimension represents one metric of the code. However, the experiments of Verco and Wise^[1] show that, even when the feature dimension is increased, the detection performance of such methods remains inferior to that of methods based on structure metrics.

(2) Text-based matching. These methods only remove non-essential information from the code, such as spaces, blank lines, and comments, and then search for the longest common subsequence. They resemble related techniques from natural language processing and essentially do not extract the structural information of the code.

(3) Structure-metric-based methods. These methods analyze the code at the level of its syntactic structure and mostly use one or more detection algorithms to compute the similarity between two pieces of code (a value generally in the (0,1) interval). The design is roughly divided into four steps^[5]: preprocessing the code, converting it to an intermediate representation, generating comparison units, and selecting a matching algorithm. According to the intermediate representation of the code, they can be classified into token-based detection, detection based on abstract syntax trees (AST), and detection based on program dependency graphs (PDG). These three representations are progressively more abstract and carry progressively more syntactic and semantic information about the code; research shows that the later ones detect better and give more accurate results, but with lower time and space efficiency. They are introduced as follows:

a) Token-based detection. A token is a string sequence; it can be a string independent of the programming language, a basic element of the programming language, or the result of a custom word-mapping rule. The well-known Moss^[2] and JPlag^[3] systems are detection tools based on token technology.

b) Abstract syntax tree based detection^[4]. The abstract syntax tree can be regarded as a first-level intermediate representation in the program compilation stage. It is a tree representation of the abstract syntactic structure of the source code, and each node of the tree represents one construct in the source code. The method first obtains the syntax tree of a source file by calling lexical and syntax analyzers, then improves detection by removing redundant and irrelevant nodes and merging related nodes, leaving a more essential and simplified tree, and finally converts the tree into a string or token sequence by a preorder traversal.

c) Program dependency graph based detection^[5]. The program dependency graph is a graph representation of the source code: a labeled directed multigraph that represents the control dependencies and data dependencies of a program. The nodes of the graph correspond to basic statements in the source code, and each node carries information such as node type, node number, and the corresponding statement. Directed edges between nodes indicate dependencies between statements and are of two kinds, control dependency edges and data dependency edges. Because the PDG captures the dependencies and coding logic of a program at the semantic level, it is difficult for a plagiarist to alter the PDG without understanding the code, so the PDG copes with various manual obfuscation techniques more effectively than the other methods. However, constructing the graph is very expensive, subgraph isomorphism is an NP-complete problem, and graph similarity is very difficult to compute directly, so the time and space efficiency of the method is very low.
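As a rough illustration of the abstract-syntax-tree approach in item b), the sketch below parses source code, drops one class of redundant node, and flattens the tree into a token sequence by preorder traversal. It assumes Python's standard `ast` module as a stand-in parser (the systems discussed here target languages such as C/C++); the helper name `preorder_tokens` and the choice of which nodes to drop are illustrative only.

```python
import ast


def preorder_tokens(source):
    """Flatten the AST of `source` into a list of node-type names (preorder)."""
    tokens = []

    def visit(node):
        # Drop load/store context markers: they add no structural information.
        if isinstance(node, ast.expr_context):
            return
        tokens.append(type(node).__name__)      # visit the node first ...
        for child in ast.iter_child_nodes(node):
            visit(child)                        # ... then its children in order

    visit(ast.parse(source))
    return tokens


original = "total = price + tax"
renamed = "t = p + x"
# Identifier names are not part of the sequence, so renaming leaves it unchanged.
print(preorder_tokens(original) == preorder_tokens(renamed))
```

Because only node types are recorded, identifier renaming does not change the sequence, whereas a structural change (for example removing the addition) does.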

At present, the widely used detection systems and the detection techniques they employ are as follows:

(1)Moss

Moss (Measure of Software Similarity) is an online source code detection system developed by Alex Aiken of Stanford University. It can check many programming languages, such as C, C++, Java, C#, and Python. A user submits the files to be checked to the system through a script, and the system returns the detection result to the user as a web page. The detection process has two stages. The first stage extracts the code feature string: the program code is divided into a series of k-grams, each k-gram being a substring (that is, a token) of length k; each k-gram is then hashed, and the Winnowing fingerprint extraction algorithm^[6] selects a subset of the hash values as the final code feature string. The second stage computes the similarity with a matching algorithm. Aiken has not disclosed the matching algorithm used at this stage, but its idea is similar to the GST (Greedy String Tiling) algorithm, which is an order-independent matching algorithm.
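The first stage described above (k-grams, hashing, winnowing) can be sketched as follows. This is a minimal illustration, not Moss's actual implementation: the CRC32 hash, the values k=5 and w=4, and the function names are assumptions chosen for the example.

```python
import zlib


def kgram_hashes(text, k):
    """Hash every substring of length k (one hash per starting position)."""
    return [zlib.crc32(text[i:i + k].encode("utf-8"))
            for i in range(len(text) - k + 1)]


def winnow(text, k=5, w=4):
    """Return a set of (hash, position) fingerprints selected by winnowing."""
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    if not hashes:
        return fingerprints
    for start in range(max(len(hashes) - w + 1, 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, keep the rightmost minimum of the window.
        offset = len(window) - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def shared_hashes(a, b, k=5, w=4):
    """Hash values selected as fingerprints in both texts."""
    return {h for h, _ in winnow(a, k, w)} & {h for h, _ in winnow(b, k, w)}


snippet = "intsum=a+b;returnsum;"          # whitespace already stripped
print(len(shared_hashes(snippet, "x=1;" + snippet + "y=2;")) > 0)
```

With these parameters, any shared substring of at least w + k - 1 = 8 characters contributes at least one common fingerprint, which is why the embedded snippet is found regardless of the surrounding text.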

(2)JPlag

JPlag is an online source code detection system developed by L. Prechelt, G. Malpohl, and M. Philippsen. It is written in Java and currently supports detection for Java, C#, C++, Scheme, and natural language text. The detection process also has two stages. The first stage extracts the code feature string through a series of steps: deleting comments, converting letter case, deleting illegal identifiers, extracting words that carry structural information, and mapping synonyms to the same form. More syntactic information is added, for example by using the token BEGIN_METHOD in place of OPEN_BRACE, which reduces the probability of accidental false matches. The second stage computes the similarity with the RKR-GST (Running Karp-Rabin Greedy String Tiling) order-independent string matching algorithm, in which maximum and minimum match parameters are set to avoid repeated matching.
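The tiling idea behind the second stage can be sketched with a simplified Greedy String Tiling routine. This version omits the Karp-Rabin hashing speed-up of RKR-GST and is not JPlag's code; `min_match`, the function names, and the normalization 2 x coverage / (|A| + |B|) are assumptions made for the illustration.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return tiles (start_a, start_b, length) covering common token runs."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        best_len = min_match
        candidates = []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                # Extend the match over tokens not yet covered by a tile.
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > best_len:
                    best_len, candidates = k, [(i, j, k)]
                elif k == best_len:
                    candidates.append((i, j, k))
        if not candidates:
            return tiles
        for i, j, k in candidates:
            # Skip matches that overlap a tile placed earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for offset in range(k):
                marked_a[i + offset] = marked_b[j + offset] = True
            tiles.append((i, j, k))


def similarity(a, b, min_match=3):
    """Fraction of both token sequences covered by tiles."""
    if not a and not b:
        return 0.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2.0 * covered / (len(a) + len(b))


# Two blocks swapped: tiling still covers both sequences completely.
print(similarity(list("abcdefg"), list("efgabcd")))
```

Because tiles may be placed in any order, swapping two code blocks does not reduce the score, which is what makes the matching order-independent.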

(3)Plague Doctor

Plague Doctor^[7] is based on software metrics and adopts the idea of ensemble learning: the analysis results of other detection techniques (such as token-based detection) are also taken into account, so that together they form the features of the source code and serve as the input of a BP (back propagation) neural network model. The model finally outputs a value in the [0,1] interval that indicates the likelihood of plagiarism between two pieces of program code; a larger value means a higher likelihood. The network input uses 12 numerical features, including the Moss analysis result, the similarity of comments, the spelling error rate, and the difference in code length, with 7 hidden-layer units in between; the relative importance of each feature can be identified by analyzing the weights of the network connections. Experiments show that, of these 12 features, the Moss analysis result is the most important, with a normalized weight value of 0.2390, but the authors point out that this result may depend on the training set used.

Moss and JPlag each use a single token-based detection method. Both can detect simpler plagiarism techniques such as comment changes, variable renaming, and reordering of code blocks, but neither can detect more advanced techniques such as adding redundant statements or equivalent control-structure transformations. Plague Doctor incorporates machine learning and trains a neural network on 12 features between two pieces of code, but the method has many constraints: a large amount of training data must be labeled in advance by specialists, and the output is relatively limited, giving only a similarity value for the programs. A similarity threshold for deciding plagiarism must be chosen manually, and its choice directly affects the verdict: if the threshold is too high, plagiarism may be missed, and if it is too low, innocent code may be misjudged as plagiarism.

In summary, most existing code plagiarism detection systems work at the textual, lexical, and syntactic levels and provide one or two detection techniques. They achieve high accuracy for certain specific plagiarism techniques but cannot detect advanced techniques such as adding redundant statements, interchanging if and switch statements, or interchanging for and while statements. The main reason is that these detection techniques capture too few semantic features. Program code is strongly structured, and a single technique can generally target only one or two plagiarism techniques. Each technique has its own characteristics, advantages, and disadvantages, and they differ in detection accuracy, resistance to obfuscation, and time and space complexity. Detecting more plagiarism techniques means taking more structural information into account, which makes the intermediate representation of the code used as input more complex.

Another reason is that judging whether two pieces of code constitute plagiarism is inherently subjective. Existing detection systems return to the user a value in the [0,1] interval that represents the degree of similarity between two pieces of code; the larger the value, the higher the similarity and the greater the suspicion of plagiarism. Yet two pieces of code being compared may not actually involve any plagiarism, so besides the similarity score, users want the detection system to automatically provide evidence of how the two pieces of code are similar.

Disclosure of Invention

In view of the above defects of the prior art, the invention provides a program code plagiarism type detection method based on a source code multi-label graph neural network. It aims to combine multiple intermediate representations of program code and to take the syntactic and semantic information of the code into account as far as possible, so that the detection capability of the system exceeds that of tools such as Moss and JPlag and evaluators are given more valuable and convincing detection results.

To achieve this purpose, the technical scheme is as follows:

The program code plagiarism type detection method based on a source code multi-label graph neural network comprises the following steps:

S1, for a code text, generating a plagiarized version of the code text with a custom code micro-obfuscation tool, and recording the plagiarism types at the same time;

S2, extracting the code property graph feature vectors of the code text and of its plagiarized version;

S3, integrating the code property graph feature vectors of the code text and of its plagiarized version to provide well-formed input for the neural network, and taking the integrated pair as a positive example;

S4, integrating, by the methods of steps S2-S3, the code property graph feature vectors of code text/code text pairs, and taking them as negative examples;

S5, defining a multi-task learning network model with a neural network, training 10 classifiers simultaneously on each positive/negative input example, and finally outputting a 10-dimensional vector in which each dimension represents one defined plagiarism type, thereby providing plagiarism evidence for evaluators.

Preferably, the plagiarism types include: complete copying, adding or deleting comments, changing blank lines and line breaks, renaming identifiers, changing the order of function bodies, changing the order of statements inside functions, changing the order of operator operands in expressions, changing the data types of variables, adding redundant statements or variables, and equivalent transformation of control structures.
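One way to picture the multi-label target is as a 10-element 0/1 vector with one position per plagiarism type, in the order listed above. The sketch below is illustrative only: the names `PLAGIARISM_TYPES`, `encode_labels`, and `describe`, and the 0.5 reporting threshold, are assumptions and are not specified by the invention.

```python
# Index i corresponds to the i-th plagiarism type in the list above.
PLAGIARISM_TYPES = [
    "complete copying",
    "adding or deleting comments",
    "changing blank lines and line breaks",
    "renaming identifiers",
    "changing the order of function bodies",
    "changing the order of statements inside functions",
    "changing the order of operator operands in expressions",
    "changing the data types of variables",
    "adding redundant statements or variables",
    "equivalent transformation of control structures",
]


def encode_labels(applied):
    """Turn a collection of type indices into a 10-dimensional 0/1 vector."""
    vector = [0] * len(PLAGIARISM_TYPES)
    for index in applied:
        vector[index] = 1
    return vector


def describe(probabilities, threshold=0.5):
    """List the types whose predicted probability reaches the threshold."""
    return [PLAGIARISM_TYPES[i]
            for i, p in enumerate(probabilities) if p >= threshold]


print(encode_labels({3, 8}))
print(describe([0.1, 0.2, 0.0, 0.9, 0.1, 0.3, 0.2, 0.1, 0.7, 0.4]))
```

The same layout serves both as the training label recorded when a plagiarized version is generated and as the shape of the network output.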

Preferably, the specific process of generating the plagiarized version of the code text in step S1 is as follows:

S11, inputting an original code data set;

S12, specifying the plagiarism types;

S13, traversing the data set and executing the corresponding obfuscation scripts;

S14, writing a label file;

S15, outputting the plagiarized version.
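Steps S11 to S15 can be sketched as a small in-memory pipeline. This is a toy illustration under stated assumptions: only two obfuscators are shown (comment deletion and blank-line changes, indexed 1 and 2), the regular expressions ignore string literals, the label "file" is a JSON string, and every function name is hypothetical rather than part of the invention's tool.

```python
import json
import re


def delete_comments(code):
    """Remove C-style comments (naive: does not protect string literals)."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", code)


def add_blank_lines(code):
    """Double every line break, which changes blank lines only."""
    return code.replace("\n", "\n\n")


# Hypothetical mapping from plagiarism-type index to obfuscation routine.
OBFUSCATORS = {1: delete_comments, 2: add_blank_lines}


def generate_plagiarized_version(code, types):
    """S12-S13: apply the requested obfuscations and record a 0/1 label."""
    label = [0] * 10
    for t in sorted(types):
        code = OBFUSCATORS[t](code)
        label[t] = 1
    return code, label


def build_dataset(originals, types):
    """S11, S14, S15: traverse the data set, return versions and label JSON."""
    versions, labels = {}, {}
    for name, code in originals.items():
        versions[name], labels[name] = generate_plagiarized_version(code, types)
    return versions, json.dumps(labels)


versions, label_file = build_dataset({"a.c": "int x; // counter\nint y;\n"}, {1, 2})
print(versions["a.c"])
print(label_file)
```

Recording the label at generation time is what removes the need for manual annotation of the training pairs.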

Preferably, the specific implementation process of step S2 is as follows:

S21, inputting the path of the folder to be analyzed, the storage path of the database files, and the storage path of the file node IDs;

S22, parsing all the specified files with joern;

S23, starting the neo4j database;

S24, querying and traversing the CPG;

S25, outputting the feature vector representation of the code property graph.

Preferably, the specific integration process in step S3 is as follows:

S31, inputting the set of feature vectors;

S32, forming positive and negative plagiarism examples according to the file names;

S33, concatenating the vectors;

S34, normalizing;

S35, outputting the labeled feature vector pairs.
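Steps S31 to S35 can be sketched as follows. Several details are assumptions made for the example and are not stated in the source: a plagiarized file is named after its original with a `_plag` suffix, negative pairs are formed from consecutive unrelated originals and given an all-zero label, and min-max scaling per dimension stands in for the unspecified normalization.

```python
def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(row, lows, highs)]
            for row in rows]


def build_pairs(features, labels, suffix="_plag", num_types=10):
    """Return normalized concatenated pairs and their multi-label targets."""
    originals = sorted(n for n in features if not n.endswith(suffix))
    rows, targets = [], []
    # Positive examples: an original concatenated with its plagiarized version.
    for name in originals:
        plag = name + suffix
        if plag in features:
            rows.append(list(features[name]) + list(features[plag]))
            targets.append(labels[plag])
    # Negative examples: two different originals, labeled with all zeros.
    for first, second in zip(originals, originals[1:]):
        rows.append(list(features[first]) + list(features[second]))
        targets.append([0] * num_types)
    return min_max_normalize(rows), targets


features = {"a": [1, 2], "a_plag": [1, 4], "b": [3, 2], "b_plag": [5, 2]}
labels = {"a_plag": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
          "b_plag": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]}
rows, targets = build_pairs(features, labels)
print(rows)
print(targets)
```

Concatenation keeps both feature vectors intact, so the network can learn which dimensions change under each plagiarism type.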

Meanwhile, the invention also provides a system applying the above method, as follows:

The system comprises a data generation module, a feature extraction module, a data processing module and a neural network module, wherein the data generation module is used for executing step S1, the feature extraction module is used for executing step S2, the data processing module is used for executing step S3, and the neural network module is used for executing step S5.

Compared with the prior art, the invention has the following beneficial effects:

The method fully considers the syntactic and semantic features of the code, can effectively detect plagiarism that falls within the blind spots and defects of existing systems, and has high accuracy especially for advanced plagiarism techniques. Unlike existing plagiarism detection techniques, the method does not compute the similarity of the code; instead, it trains a neural network to output a 10-dimensional probability vector that expresses how likely each plagiarism technique is to have been used, a result that has more reference value and instructional significance for evaluators.

Drawings

FIG. 1 is a schematic block diagram of a system.

FIG. 2 is a schematic flow chart of a method.

FIG. 3 is a flow chart of a data generation module.

FIG. 4 is a flow diagram of a feature extraction module.

FIG. 5 is a flow chart of a data processing module.

FIG. 6 is a schematic diagram of a neural network model.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent.

The invention is further illustrated below with reference to the figures and examples.

Example 1

As shown in FIG. 2, the program code plagiarism type detection method based on a source code multi-label graph neural network provided by the invention comprises steps S1 to S5 described above.

In specific use, as shown in FIG. 1, the method is implemented through the following system structure: a data generation module, which executes step S1; a feature extraction module, which executes step S2; a data processing module, which executes step S3; and a neural network module, which executes step S5.

For the data generation module, a custom code micro-obfuscation tool was developed according to the method. It covers the 10 categories of plagiarism types: complete copying (T0), adding or deleting comments (T1), changing blank lines and line breaks (T2), renaming identifiers (T3), changing the order of function bodies (T4), changing the order of statements inside functions (T5), changing the order of operator operands in expressions (T6), changing the data types of variables (T7), adding redundant statements or variables (T8), and equivalent transformation of control structures (T9). The tool automatically obfuscates the input source code with a specified subset of the plagiarism types, generates a plagiarized version of the code, and records the label values at the corresponding positions. FIG. 3 shows the flow chart of the data generation module, which comprises the following steps:

S11, inputting an original code data set;

S12, specifying the plagiarism types;

S13, traversing the data set and executing the corresponding obfuscation scripts;

S14, writing a label file;

S15, outputting the plagiarized version.

The feature extraction module mainly uses the joern tool to perform lexical and syntactic analysis of the source code to obtain the corresponding abstract syntax tree, and uses the Gremlin language to query and traverse the graph structures stored in the Neo4j database, obtaining the tree-structure and graph-structure information of the source code at the syntactic and semantic levels. Joern is a C/C++ code analysis tool developed by Fabian Yamaguchi. It is written in Java and uses the parser generator ANTLR (ANother Tool for Language Recognition). Its main function is to generate the abstract syntax tree, control flow graph, and program dependence graph from source code, combine them into a graph defined as the Code Property Graph (CPG)^[8], and store it in the Neo4j graph database.

The encoded content of the invention comprises the count of each of the 66 node types in the AST, the number of nodes in the CFG, the number of edges in the CFG, the number of nodes with in-degree 0 to 4 and greater than 4 under each intermediate structure, the number of nodes with out-degree 0 to 4 and greater than 4, the numbers of class nodes and function nodes, and so on. Encoding these yields a 142-dimensional vector that serves as a deep feature representation of the code. The flow chart of the feature extraction module is shown in FIG. 4 and comprises the following steps:

S22, parsing all the specified files with joern;

S23, starting the neo4j database;

S24, querying and traversing the CPG;
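The counting features described for this module (node-type counts, node and edge totals, and in-degree/out-degree buckets for 0 to 4 and greater than 4) can be sketched on a plain in-memory graph. The sketch does not call joern, Neo4j, or Gremlin and does not reproduce the 142-dimensional layout; the graph encoding, the function names, and the two example node types are assumptions for illustration.

```python
from collections import Counter


def degree_histogram(degrees, nodes):
    """Count nodes with degree 0, 1, 2, 3, 4 and greater than 4 (six buckets)."""
    histogram = [0] * 6
    for node in nodes:
        histogram[min(degrees.get(node, 0), 5)] += 1
    return histogram


def graph_features(nodes, edges, node_types):
    """nodes: {node_id: type_name}; edges: list of (source_id, target_id)."""
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    type_counts = Counter(nodes.values())
    return ([type_counts.get(t, 0) for t in node_types]   # per-type node counts
            + [len(nodes), len(edges)]                    # graph size
            + degree_histogram(in_degree, nodes)          # in-degree buckets
            + degree_histogram(out_degree, nodes))        # out-degree buckets


nodes = {1: "Function", 2: "Statement", 3: "Statement"}
edges = [(1, 2), (1, 3), (2, 3)]
print(graph_features(nodes, edges, ["Function", "Statement"]))
```

Applying the same routine to each intermediate structure and appending the results gives one fixed-length vector per source file.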

The data processing module integrates the feature vector representations of the two pieces of code and provides well-formed input for the neural network. The flow chart of the data processing module is shown in FIG. 5 and comprises the following steps:

S31, inputting the set of feature vectors;

S33, concatenating the vectors;

S34, normalizing;

S35, outputting the labeled feature vector pairs.

The neural network module uses a semi-supervised learning method and takes the output of the data processing module as the input samples of the network. It trains 10 classifiers simultaneously for each sample and finally outputs a 10-dimensional vector representing the probability of each plagiarism type. The network model is shown schematically in FIG. 6.
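The shape of such a multi-label model can be sketched as a shared hidden layer followed by 10 independent sigmoid outputs, scored with a summed binary cross-entropy. This is a forward pass and loss only, written in plain Python; the layer sizes, the ReLU activation, the random initialization, and all function names are assumptions, and the architecture actually shown in FIG. 6 may differ.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def forward(x, shared, heads):
    """Shared representation, then one sigmoid classifier per plagiarism type."""
    hidden = [max(0.0, z) for z in dense(x, shared)]      # ReLU
    return [sigmoid(z) for z in dense(hidden, heads)]


def multilabel_loss(probabilities, targets, eps=1e-12):
    """Sum of the binary cross-entropies of the individual classifiers."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probabilities, targets))


rng = random.Random(0)
shared = init_layer(8, 16, rng)     # 8 input features in this toy example
heads = init_layer(16, 10, rng)     # 10 outputs, one per plagiarism type
probabilities = forward([0.5] * 8, shared, heads)
print(len(probabilities), multilabel_loss(probabilities, [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]))
```

Each output is an independent probability rather than one entry of a softmax, so several plagiarism types can be reported for the same pair of programs.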

It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments exhaustively here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

References

[1]Verco K L,Wise M J.Software for detecting suspected plagiarism:comparing structure and attribute-counting systems[C]//Australasian Conference on Computer Science Education.ACM,1996:81-88.

[2]http://theory.stanford.edu/~aiken/moss/

[3]http://jplag.ipd.kit.edu/

[4]Baxter I D,Yahin A,Moura L,et al.Clone detection using abstract syntax trees[C]//International Conference on Software Maintenance,ICSM 1998:368-377.

[5]Liu C,Chen C,Han J,et al.GPLAG:detection of software plagiarism by program dependence graph analysis[C]//ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2006:872-881.

[6]Schleimer S,Wilkerson D S,Aiken A.Winnowing:local algorithms for document fingerprinting[C]//ACM SIGMOD International Conference on Management of Data.ACM,2003:76-85.

[7]Engels S,Lakshmanan V,Craig M.Plagiarism detection using feature-based neural networks[J].Acm Sigcse Bulletin,2007,39(1):34-38.

[8]Yamaguchi F,Golde N,Arp D,et al.Modeling and Discovering Vulnerabilities with Code Property Graphs[C]//Security and Privacy.IEEE,2014:590-604.

Claims

1. A program code plagiarism type detection method based on a source code multi-label graph neural network, characterized by comprising the following steps:

S5, defining a multi-task learning network model with a neural network, training 10 classifiers simultaneously on each positive/negative input example, and finally outputting a 10-dimensional vector in which each dimension represents one defined plagiarism type, thereby providing plagiarism evidence for evaluators;

the specific integration process in step S3 is as follows:

S31, inputting the set of feature vectors;

S33, concatenating the vectors;

S34, normalizing;

S35, outputting the labeled feature vector pairs;

the plagiarism types include: complete copying, adding or deleting comments, changing blank lines and line breaks, renaming identifiers, changing the order of function bodies, changing the order of statements inside functions, changing the order of operator operands in expressions, changing the data types of variables, adding redundant statements or variables, and equivalent transformation of control structures;

the specific implementation process of step S2 is as follows:

S22, parsing all the specified files with joern;

S23, starting the neo4j database;

S24, querying and traversing the CPG;

2. The program code plagiarism type detection method based on a source code multi-label graph neural network as claimed in claim 1, wherein the specific process of generating the plagiarized version of the code text in step S1 is as follows:

S11, inputting an original code data set;

S12, specifying the plagiarism types;

S13, traversing the data set and executing the corresponding obfuscation scripts;

S14, writing a label file;

S15, outputting the plagiarized version.

3. A system applying the method of any one of claims 1 to 2, characterized in that the system comprises a data generation module, a feature extraction module, a data processing module and a neural network module, wherein the data generation module is used for executing step S1, the feature extraction module is used for executing step S2, the data processing module is used for executing step S3, and the neural network module is used for executing step S5.