CN108446540B - Program code plagiarism type detection method and system based on source code multi-label graph neural network - Google Patents

Program code plagiarism type detection method and system based on source code multi-label graph neural network

Info

Publication number
CN108446540B
Authority
CN
China
Prior art keywords
code
plagiarism
text
neural network
code text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810226651.0A
Other languages
Chinese (zh)
Other versions
CN108446540A (en)
Inventor
万海
刘欣怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201810226651.0A priority Critical patent/CN108446540B/en
Publication of CN108446540A publication Critical patent/CN108446540A/en
Application granted granted Critical
Publication of CN108446540B publication Critical patent/CN108446540B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/12 Protecting executable software
    • G06F21/121 Restricting unauthorised execution of programs
    • G06F21/125 Restricting unauthorised execution of programs by manipulating the program code, e.g. source code, compiled code, interpreted code, machine code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Abstract

The invention relates to a method for detecting program code plagiarism types based on a source code multi-label graph neural network, comprising the following steps: S1, for a code text, generating a plagiarized version of the code text with a custom code micro-obfuscation tool and recording the plagiarism types applied; S2, extracting the code property graph feature vectors of the code text and of its plagiarized version; S3, integrating the code property graph feature vectors of the code text and its plagiarized version into a form suitable as neural network input, and taking the integrated pair as a positive example; S4, integrating the code property graph feature vectors of code text/code text pairs by the method of steps S2-S3, and taking them as negative examples; S5, defining a multi-task learning network model with a neural network, training 10 classifiers simultaneously on each positive or negative example, and finally outputting a 10-dimensional vector in which each dimension represents one of the defined plagiarism types, thereby providing evaluators with evidence of plagiarism.

Description

Program code plagiarism type detection method and system based on source code multi-label graph neural network
Technical Field
The invention relates to the field of code plagiarism detection, in particular to a program code plagiarism type detection method and system based on a source code multi-label graph neural network.
Background
Program code plagiarism means that a plagiarist obtains program code by directly copying another person's source code or by making slight changes to it. Code plagiarism detection is the process of computing the degree of similarity between two programs by extracting code feature strings or fingerprints and applying a matching algorithm. Plagiarism evidence collection means recording, during detection, the likelihood that particular plagiarism techniques were used, as a reference basis for judging plagiarism. With the continuing development of information technology, the practicality and necessity of program code plagiarism detection have become increasingly apparent in certain settings, particularly in colleges and universities with a strong academic atmosphere.
Traditional code plagiarism detection techniques generally fall into three categories:
(1) Attribute counting-based. This approach uses global statistics of the code to form an n-dimensional vector that indirectly represents the code file, such as the number of code lines, variables, functions, loops, and declarations, the number of distinct operators and distinct operands, and the total numbers of operator and operand occurrences; each dimension is one metric of the code. However, the experiments of Verco and Wise [1] show that even when the feature dimension is increased, this approach still performs worse than detection based on structural metrics.
(2) Text matching-based. This approach only removes non-essential information such as spaces, blank lines, and comments from the code and then searches for the longest common subsequence, much as in natural language processing; it does not actually extract the structural information of the code.
(3) Structural metrics-based. This approach analyzes the code at the level of its syntactic structure and usually computes the similarity between two programs (a value generally in the interval (0,1)) with one or more detection algorithms. The design can be roughly divided into four steps [5]: preprocessing the code, converting it to an intermediate representation, generating comparison units, and selecting a matching algorithm. According to the intermediate representation used, these methods can be classified into detection based on identifiers (tokens), detection based on abstract syntax trees (AST), and detection based on program dependence graphs (PDG). The three representations are progressively more abstract and carry progressively more syntactic and semantic information; research shows that the later ones detect better and give more accurate results, but at lower time and space efficiency. They are introduced below:
a) Identifier-based detection. An identifier (token) sequence is a sequence of strings; a token can be a string independent of the programming language or a basic element of the programming language, and the mapping rules for tokens can also be customized. The well-known Moss [2] and JPlag [3] systems are detection tools based on identifier technology.
b) Abstract syntax tree-based detection [4]. The abstract syntax tree can be regarded as a first-level intermediate representation in the compilation stage. It is a tree representation of the abstract syntactic structure of the source code, in which each node represents a construct in the source code. The method first obtains the syntax tree of a source file by invoking a lexical and syntax analyzer, then improves detection by removing redundant and irrelevant nodes and merging related nodes, leaving a smaller tree of key nodes, and finally converts the tree into a string or token sequence by preorder traversal.
c) Program dependence graph-based detection [5]. The program dependence graph is a graph representation of the source code: a labeled directed multigraph that expresses the control dependences and data dependences of a program. Its nodes correspond to basic statements in the source code, and each node carries information such as node type, node number, and the corresponding statement. Directed edges between nodes indicate dependences between statements and are of two kinds, control dependence edges and data dependence edges. Because the PDG captures the dependences and coding logic of a program at the semantic level, a plagiarist can hardly alter the PDG without understanding the code, so the PDG copes with manual obfuscation more effectively than other methods. However, constructing the graph is expensive, subgraph isomorphism is an NP-complete problem, and graph similarity is very difficult to compute directly, so the time and space efficiency of this method is very low.
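To make the tree-to-sequence step of the AST-based approach in item b) concrete, the following minimal sketch uses Python's standard `ast` module as a stand-in parser. The detectors discussed here target languages such as C/C++, so both the language and the pruning rule (dropping load/store context nodes as "redundant") are illustrative assumptions, not the method of any cited system.

```python
import ast


def preorder_node_types(source: str) -> list[str]:
    """Flatten the AST of `source` into node type names in preorder."""
    sequence: list[str] = []

    def visit(node: ast.AST) -> None:
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            # Assumed pruning rule: Load/Store/Del context markers carry no
            # structural information, so they are treated as redundant nodes.
            if isinstance(child, ast.expr_context):
                continue
            visit(child)

    visit(ast.parse(source))
    return sequence


original = "total = price + tax"
renamed = "s = a + b"

print(preorder_node_types(original))
# Identifier names are not part of the sequence, so renaming changes nothing.
print(preorder_node_types(original) == preorder_node_types(renamed))
```

Because only node types are emitted, two programs that differ solely in identifier names map to the same sequence, which can then be compared with any string or token matching algorithm.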
At present, the widely used detection systems and the techniques they employ are as follows:
(1) Moss
Moss (Measure of Software Similarity) is an online source code detection system developed by Alex Aiken of Stanford University. It can check many programming languages, such as C, C++, Java, C#, and Python. A user submits the files to be checked to the system through a script, and the system returns the results to the user as a web page. Detection proceeds in two stages. The first stage extracts code feature strings: the program code is divided into a series of k-grams, each a substring (that is, a token) of length k; each k-gram is hashed; and the Winnowing fingerprint extraction algorithm [6] selects a subset of the hash values as the final code feature string. The second stage computes similarity with a matching algorithm. Aiken has not disclosed the matching algorithm used at this stage, but its idea is similar to the GST (Greedy String Tiling) algorithm, which is an order-independent matching.
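The first stage described above (k-gram hashing followed by winnowing) can be sketched as follows. This is a minimal illustration, not Moss itself: the hash function (`zlib.crc32`), the values of `k` and `window`, and the Jaccard comparison of fingerprints are all assumptions, since the matching stage of Moss is not disclosed.

```python
import zlib


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every substring of length k (assumed hash: CRC32)."""
    return [zlib.crc32(text[i:i + k].encode("utf-8"))
            for i in range(len(text) - k + 1)]


def winnow(hashes: list[int], window: int) -> set[tuple[int, int]]:
    """Keep the minimum hash of each window (rightmost on ties).

    Returns (hash, position) pairs; a minimum shared by consecutive
    windows is recorded only once because the result is a set.
    """
    fingerprints = set()
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        smallest = min(win)
        offset = window - 1 - win[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def fingerprint_similarity(text_a: str, text_b: str,
                           k: int = 5, window: int = 4) -> float:
    """Jaccard overlap of fingerprint hashes (a stand-in matching step)."""
    fp_a = {h for h, _ in winnow(kgram_hashes(text_a, k), window)}
    fp_b = {h for h, _ in winnow(kgram_hashes(text_b, k), window)}
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# A sample hash sequence with window size 4.
sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(sorted(winnow(sample, 4), key=lambda pair: pair[1]))
```

Selecting one minimum per window guarantees that any sufficiently long shared substring contributes at least one common fingerprint, while far fewer hashes need to be stored than the full k-gram set.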
(2) JPlag
JPlag is an online source code detection system developed by L. Prechelt, G. Malpohl, and M. Philippsen. It is written in Java and currently supports Java, C#, C++, Scheme, and natural language text. Detection is again divided into two stages. The first stage extracts code feature strings: through a series of steps (deleting comments, normalizing letter case, deleting illegal identifiers, extracting tokens that carry structural information, and mapping synonyms to the same form) more syntactic information is added, for example by using the token BEGIN_METHOD in place of OPEN_BRACE, which reduces the probability of accidental matches. The second stage computes similarity with the RKR-GST (Running Karp-Rabin Greedy String Tiling) order-independent string matching algorithm, in which maximum and minimum match parameters are set to avoid repeated matching.
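The tiling idea behind the second stage can be illustrated with a plain Greedy String Tiling sketch. It omits the Running Karp-Rabin hashing that JPlag uses to speed up the search, and the token names, the `min_match` default, and the similarity formula are assumptions chosen for illustration.

```python
def greedy_string_tiling(a, b, min_match=2):
    """Return tiles (start_a, start_b, length) of common, non-overlapping runs."""
    if min_match < 1:
        raise ValueError("min_match must be at least 1")
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        best_len = min_match
        candidates = []
        # Find the longest unmarked common runs in this round.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > best_len:
                    candidates, best_len = [(i, j, k)], k
                elif k == best_len:
                    candidates.append((i, j, k))
        if not candidates:
            break
        for i, j, k in candidates:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue  # overlaps a tile placed earlier in this round
            for d in range(k):
                marked_a[i + d] = marked_b[j + d] = True
            tiles.append((i, j, k))
    return tiles


def tiling_similarity(a, b, min_match=2):
    """Share of tokens covered by tiles, in the interval [0, 1]."""
    if not a and not b:
        return 0.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2 * covered / (len(a) + len(b))


# Hypothetical token streams: the second swaps two declaration blocks.
original = ["DECL_A", "INIT_A", "DECL_B", "INIT_B", "RETURN"]
reordered = ["DECL_B", "INIT_B", "DECL_A", "INIT_A", "RETURN"]
print(greedy_string_tiling(original, reordered))
print(tiling_similarity(original, reordered))
```

Because tiles may be matched in any order, reordering code blocks barely lowers the score, which is why this family of algorithms is described as order-independent matching.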
(3) Plague Doctor
Plague Doctor [7] is based on software metrics and adopts the idea of ensemble learning: the analysis results of other detection techniques (such as token-based detection) are taken into account so that together they form the features of a source code pair and serve as the input of a BP (back propagation) neural network model, which finally outputs a value in the interval [0,1] indicating the likelihood of plagiarism between two programs; a larger value means a higher likelihood. The network takes 12 numerical features as input, including the Moss analysis result, the similarity of comments, the spelling error rate, and the difference in code length, and has 7 hidden-layer units; the relative importance of each feature can be identified by analyzing the weights of the network connections. Experiments show that among these 12 features the Moss analysis result is the most important, with a normalized weight of 0.2390, although the authors note that this result may depend on the training set used.
Moss and JPlag both use a single token-based detection method. Both can detect simpler plagiarism techniques such as changing comments, renaming variables, and reordering code blocks, but neither can detect more advanced techniques such as adding redundant statements or equivalent control-structure transformations. Plague Doctor incorporates machine learning and trains a neural network on 12 features of a code pair, but the method has many restrictions: a large amount of training data must be labeled in advance by professionals, and the output is relatively limited, giving only a similarity value for the programs. A similarity threshold for deciding plagiarism must be chosen manually, and this choice directly affects the verdict: a threshold that is too high causes plagiarism to be missed, and one that is too low causes false accusations.
In summary, most existing code plagiarism detection systems work at the text, lexical, and syntactic levels and offer only one or two detection techniques. They achieve high accuracy for certain specific plagiarism techniques, but cannot detect advanced techniques such as adding redundant statements, interchanging if and switch statements, or interchanging for and while statements. The main reason is that these detection techniques capture too few semantic features. Program code is strongly structured, and a single technique can generally target only one or two plagiarism techniques. Each technique has its own characteristics, strengths, and weaknesses, and they differ in detection accuracy, resistance to obfuscation, and time and space complexity. Detecting more plagiarism techniques means taking more structural information into account, which makes the intermediate representation of the code used as input more complex.
Another reason is that judging whether two programs constitute plagiarism is inherently subjective. Existing detection systems return a value in the interval [0,1] that represents the degree of similarity between two programs; the larger the value, the higher the similarity and the stronger the suspicion of plagiarism, yet the two programs being compared may not actually involve any plagiarism. Therefore, beyond computing similarity, users hope that a detection system can automatically provide evidence of how the two programs are similar.
Disclosure of Invention
In view of the above shortcomings of the prior art, the invention provides a method for detecting program code plagiarism types based on a source code multi-label graph neural network. It aims to combine several intermediate representations of program code and take the syntactic and semantic information of the code into account as fully as possible, so that the detection capability of the system exceeds that of tools such as Moss and JPlag and evaluators receive more valuable and convincing detection results.
To achieve this purpose, the technical solution is as follows:
The program code plagiarism type detection method based on the source code multi-label graph neural network comprises the following steps:
S1, for a code text, generating a plagiarized version of the code text with a custom code micro-obfuscation tool, and recording the plagiarism types applied;
S2, extracting the code property graph feature vectors of the code text and of its plagiarized version;
S3, integrating the code property graph feature vectors of the code text and its plagiarized version into a form suitable as neural network input, and taking the integrated pair as a positive example;
S4, integrating the code property graph feature vectors of code text/code text pairs by the method of steps S2-S3, and taking them as negative examples;
S5, defining a multi-task learning network model with a neural network, training 10 classifiers simultaneously on each positive or negative example, and finally outputting a 10-dimensional vector in which each dimension represents one of the defined plagiarism types, thereby providing evaluators with evidence of plagiarism.
Preferably, the plagiarism types include: complete copying, adding or deleting comments, changing blank lines and line breaks, renaming identifiers, changing the order of function bodies, changing the order of statements inside functions, changing the order of operands of operators in expressions, changing the data types of variables, adding redundant statements or variables, and equivalent transformation of control structures.
Preferably, the specific process of generating the plagiarized version of the code text in step S1 is as follows:
S11, inputting an original code data set;
S12, specifying the plagiarism types;
S13, traversing the data set and executing the corresponding obfuscation scripts;
S14, writing a label file;
S15, outputting the plagiarized version.
Preferably, step S2 is implemented as follows:
S21, inputting the path of the folder to be analyzed, the database file storage path, and the file node ID storage path;
S22, parsing all specified files with Joern;
S23, starting the Neo4j database;
S24, querying and traversing the CPG;
S25, outputting the feature vector representation of the code property graph.
Preferably, the integration in step S3 proceeds as follows:
S31, inputting the feature vector set;
S32, forming positive and negative plagiarism examples according to the file names;
S33, concatenating the vectors;
S34, normalizing;
S35, outputting the labeled feature vector pairs.
Meanwhile, the invention also provides a system applying the above method, as follows:
The system comprises a data generation module, a feature extraction module, a data processing module, and a neural network module, wherein the data generation module executes step S1, the feature extraction module executes step S2, the data processing module executes step S3, and the neural network module executes step S5.
Compared with the prior art, the invention has the following beneficial effects:
The method takes the syntactic and semantic features of code fully into account, can effectively detect plagiarism that falls into the blind spots and deficiencies of existing systems, and is highly accurate especially for advanced plagiarism techniques. Unlike existing plagiarism detection techniques, the method does not compute code similarity; instead it trains a neural network to output a 10-dimensional probability vector expressing how likely each plagiarism technique is to have been used, a result with greater reference value and teaching significance for evaluators.
Drawings
Fig. 1 is a schematic block diagram of a system.
FIG. 2 is a schematic flow chart of a method.
FIG. 3 is a flow chart of a data generation module.
FIG. 4 is a flow diagram of a feature extraction module.
FIG. 5 is a flow chart of a data processing module.
FIG. 6 is a schematic diagram of a neural network model.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The invention is further illustrated below with reference to the figures and examples.
Example 1
As shown in FIG. 2, the method for detecting program code plagiarism types based on a source code multi-label graph neural network provided by the invention comprises the following steps:
S1, for a code text, generating a plagiarized version of the code text with a custom code micro-obfuscation tool, and recording the plagiarism types applied;
S2, extracting the code property graph feature vectors of the code text and of its plagiarized version;
S3, integrating the code property graph feature vectors of the code text and its plagiarized version into a form suitable as neural network input, and taking the integrated pair as a positive example;
S4, integrating the code property graph feature vectors of code text/code text pairs by the method of steps S2-S3, and taking them as negative examples;
S5, defining a multi-task learning network model with a neural network, training 10 classifiers simultaneously on each positive or negative example, and finally outputting a 10-dimensional vector in which each dimension represents one of the defined plagiarism types, thereby providing evaluators with evidence of plagiarism.
In specific use, as shown in FIG. 1, the method is implemented by the following system structure: a data generation module, which executes step S1; a feature extraction module, which executes step S2; a data processing module, which executes step S3; and a neural network module, which executes step S5.
For data generation, the system provides a custom code micro-obfuscation tool developed according to the method. The data generation module covers 10 categories of plagiarism types: complete copying (T0), adding or deleting comments (T1), changing blank lines and line breaks (T2), renaming identifiers (T3), changing the order of function bodies (T4), changing the order of statements inside functions (T5), changing the order of operands of operators in expressions (T6), changing the data types of variables (T7), adding redundant statements or variables (T8), and equivalent transformation of control structures (T9). It automatically obfuscates the input source code with a specified subset of plagiarism types, generates a plagiarized version of the code, and records the label values at the corresponding positions. FIG. 3 shows the flow chart of the data generation module, which comprises the following steps:
S11, inputting an original code data set;
S12, specifying the plagiarism types;
S13, traversing the data set and executing the corresponding obfuscation scripts;
S14, writing a label file;
S15, outputting the plagiarized version.
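Steps S11-S15 can be sketched for three of the simpler types (T1, T2, T3). Everything below is a hypothetical illustration rather than the patented tool: the regex-based transformations are deliberately naive (they ignore string literals, for example), the function names are invented, and the JSON label record is an assumed file format.

```python
import json
import re

PLAGIARISM_TYPES = [f"T{i}" for i in range(10)]  # T0..T9 as listed above


def delete_comments(source: str) -> str:  # T1 (naive: ignores string literals)
    return re.sub(r"//[^\n]*|/\*.*?\*/", "", source, flags=re.DOTALL)


def double_line_breaks(source: str) -> str:  # T2
    return source.replace("\n", "\n\n")


def rename_identifier(source: str, old: str = "sum", new: str = "total") -> str:  # T3
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


# Hypothetical registry mapping a plagiarism type to its obfuscation script.
OBFUSCATION_SCRIPTS = {
    "T0": lambda source: source,  # complete copy
    "T1": delete_comments,
    "T2": double_line_breaks,
    "T3": rename_identifier,
}


def generate_plagiarized_version(source: str, types: list[str]):
    """Apply the requested scripts in order and build the 10-dimensional label."""
    label = [0] * len(PLAGIARISM_TYPES)
    for plagiarism_type in types:
        source = OBFUSCATION_SCRIPTS[plagiarism_type](source)
        label[PLAGIARISM_TYPES.index(plagiarism_type)] = 1
    return source, label


original = "int sum = 0; // accumulator\nsum += 1;\n"
version, label = generate_plagiarized_version(original, ["T1", "T3"])
# Assumed label-file record: one JSON object per generated file.
label_record = json.dumps({"file": "example_plag.c", "label": label})
print(version)
print(label_record)
```

Recording the label at generation time is what makes the training data self-labeling: each plagiarized version comes with a multi-hot vector stating exactly which transformations produced it.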
The feature extraction module mainly uses the Joern tool to perform lexical and syntactic analysis on the source code and obtain the corresponding abstract syntax tree; it then uses the Gremlin language to query and traverse the graph structure stored in the Neo4j database, obtaining tree-structure and graph-structure information about the source code at the syntactic and semantic levels. Joern is a C/C++ code analysis tool developed by Fabian Yamaguchi. It is written in Java and uses the parser generator ANTLR (ANother Tool for Language Recognition). Its main function is to generate the abstract syntax tree, control flow graph, and program dependence graph from source code, combine them into a single graph called the Code Property Graph (CPG) [8], and store it in a Neo4j graph database.
The encoded content of the invention includes the counts of each of the 66 node types in the AST, the number of nodes and the number of edges in the CFG, the numbers of nodes with in-degree 0 to 4 and greater than 4 and with out-degree 0 to 4 and greater than 4 under each intermediate structure, the numbers of class nodes and function nodes, and so on. Encoding these yields a 142-dimensional vector that serves as a deep feature representation of the code. The flow chart of the feature extraction module is shown in FIG. 4 and comprises the following steps:
S21, inputting the path of the folder to be analyzed, the database file storage path, and the file node ID storage path;
S22, parsing all specified files with Joern;
S23, starting the Neo4j database;
S24, querying and traversing the CPG;
S25, outputting the feature vector representation of the code property graph.
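The kind of encoding described above (node type counts, node and edge counts, and in-degree and out-degree histograms with bins 0 to 4 and greater than 4) can be sketched on a graph held in plain Python data. This sketch does not call Joern, Gremlin, or Neo4j; the node type names and the three-entry vocabulary are hypothetical, whereas the encoding described in the text counts 66 AST node types and yields 142 dimensions.

```python
from collections import Counter


def graph_features(node_types: dict[int, str],
                   edges: list[tuple[int, int]],
                   type_vocabulary: list[str]) -> list[int]:
    """Encode one graph as counts: node types, sizes, and degree histograms."""
    type_counts = Counter(node_types.values())
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)

    def histogram(degree: Counter) -> list[int]:
        bins = [0] * 6  # degrees 0, 1, 2, 3, 4, and greater than 4
        for node in node_types:
            bins[min(degree[node], 5)] += 1
        return bins

    return ([type_counts[name] for name in type_vocabulary]
            + [len(node_types), len(edges)]
            + histogram(in_degree)
            + histogram(out_degree))


# Hypothetical four-node graph: one function with two assignments and a return.
nodes = {0: "FunctionDef", 1: "Assignment", 2: "Assignment", 3: "ReturnStatement"}
graph_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
vocabulary = ["FunctionDef", "Assignment", "ReturnStatement"]

print(graph_features(nodes, graph_edges, vocabulary))
```

Since every feature is a count over graph structure, transformations that leave the graph unchanged (renaming identifiers, editing comments or blank lines) leave the vector unchanged, while structural edits shift specific dimensions.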
The data processing module integrates the feature vector representations of the two programs into a form suitable as neural network input. Its flow chart is shown in FIG. 5 and comprises the following steps:
S31, inputting the feature vector set;
S32, forming positive and negative plagiarism examples according to the file names;
S33, concatenating the vectors;
S34, normalizing;
S35, outputting the labeled feature vector pairs.
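Steps S31-S35 can be sketched as follows. The file naming convention (a `__plag` suffix marking the plagiarized version), the all-zero label for negative pairs, and min-max normalization are assumptions; the text does not specify the naming scheme or the normalization method.

```python
from itertools import combinations

PLAG_SUFFIX = "__plag"  # hypothetical naming convention for plagiarized files


def min_max_normalize(rows: list[list[float]]) -> list[list[float]]:
    """Scale each column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


def build_pairs(features: dict[str, list[float]],
                labels: dict[str, list[int]]):
    """Return (normalized concatenated vectors, 10-dimensional labels)."""
    originals = sorted(name for name in features if not name.endswith(PLAG_SUFFIX))
    rows, targets = [], []
    # Positive examples: an original paired with its plagiarized version.
    for name in originals:
        plagiarized = name + PLAG_SUFFIX
        if plagiarized in features:
            rows.append(features[name] + features[plagiarized])
            targets.append(labels[plagiarized])
    # Negative examples: two different originals, no plagiarism type set.
    for first, second in combinations(originals, 2):
        rows.append(features[first] + features[second])
        targets.append([0] * 10)
    return min_max_normalize(rows), targets


feature_set = {"a": [1.0, 2.0], "a__plag": [1.0, 4.0], "b": [3.0, 2.0]}
label_set = {"a__plag": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]}
inputs, targets = build_pairs(feature_set, label_set)
print(inputs)
print(targets)
```

Concatenation keeps both programs' features visible to the network, and column-wise normalization prevents large raw counts (such as node totals) from dominating small ones.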
The neural network module uses a semi-supervised learning method. It takes the output of the data processing module as the input samples of the network, trains 10 classifiers simultaneously on each sample, and finally outputs a 10-dimensional vector giving the probability of each plagiarism type. The network model is shown schematically in FIG. 6.
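A multi-label output of this kind can be sketched as a shared hidden layer followed by 10 independent sigmoid units, one per plagiarism type. The architecture here is an assumption, since FIG. 6 is not reproduced in this text: the hidden size, ReLU activation, weight initialization, and binary cross-entropy loss are illustrative choices, and the input size of 284 assumes two concatenated 142-dimensional vectors. Only the forward pass and the loss are shown; training is omitted.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class MultiLabelNet:
    """Shared hidden layer plus 10 independent sigmoid classifiers (sketch)."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int = 10, seed: int = 0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                      for _ in range(n_out)]

    def forward(self, x: list[float]) -> list[float]:
        # Shared representation used by all classifiers (ReLU is an assumption).
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x)))
                  for row in self.w_hidden]
        # One probability per plagiarism type; outputs are not forced to sum to 1.
        return [sigmoid(sum(w * h for w, h in zip(row, hidden)))
                for row in self.w_out]


def binary_cross_entropy(probs: list[float], target: list[int]) -> float:
    """Mean of the 10 per-type losses, so all classifiers train jointly."""
    eps = 1e-12
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, target)) / len(target)


net = MultiLabelNet(n_in=284, n_hidden=16)
probabilities = net.forward([0.5] * 284)
target = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # e.g. T1 and T3 were applied
print([round(p, 3) for p in probabilities])
print(binary_cross_entropy(probabilities, target))
```

Using independent sigmoid outputs rather than a softmax lets several plagiarism types be reported at once, which matches the multi-hot labels recorded during data generation.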
It should be understood that the above embodiments of the invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the invention.
References
[1] Verco K L, Wise M J. Software for detecting suspected plagiarism: comparing structure and attribute-counting systems[C]//Australasian Conference on Computer Science Education. ACM, 1996: 81-88.
[2] http://theory.stanford.edu/~aiken/moss/
[3] http://jplag.ipd.kit.edu/
[4] Baxter I D, Yahin A, Moura L, et al. Clone detection using abstract syntax trees[C]//International Conference on Software Maintenance. ICSM, 1998: 368-377.
[5] Liu C, Chen C, Han J, et al. GPLAG: detection of software plagiarism by program dependence graph analysis[C]//ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006: 872-881.
[6] Schleimer S, Wilkerson D S, Aiken A. Winnowing: local algorithms for document fingerprinting[C]//ACM SIGMOD International Conference on Management of Data. ACM, 2003: 76-85.
[7] Engels S, Lakshmanan V, Craig M. Plagiarism detection using feature-based neural networks[J]. ACM SIGCSE Bulletin, 2007, 39(1): 34-38.
[8] Yamaguchi F, Golde N, Arp D, et al. Modeling and Discovering Vulnerabilities with Code Property Graphs[C]//Security and Privacy. IEEE, 2014: 590-604.

Claims (3)

1. A method for detecting program code plagiarism types based on a source code multi-label graph neural network, characterized by comprising the following steps:
S1, for a code text, generating a plagiarized version of the code text with a custom code micro-obfuscation tool, and recording the plagiarism types applied;
S2, extracting the code property graph feature vectors of the code text and of its plagiarized version;
S3, integrating the code property graph feature vectors of the code text and its plagiarized version into a form suitable as neural network input, and taking the integrated pair as a positive example;
S4, integrating the code property graph feature vectors of code text/code text pairs by the method of steps S2-S3, and taking them as negative examples;
S5, defining a multi-task learning network model with a neural network, training 10 classifiers simultaneously on each positive or negative example, and finally outputting a 10-dimensional vector in which each dimension represents one of the defined plagiarism types, thereby providing evaluators with evidence of plagiarism;
the integration in step S3 proceeds as follows:
S31, inputting the feature vector set;
S32, forming positive and negative plagiarism examples according to the file names;
S33, concatenating the vectors;
S34, normalizing;
S35, outputting the labeled feature vector pairs;
the plagiarism types include: complete copying, adding or deleting comments, changing blank lines and line breaks, renaming identifiers, changing the order of function bodies, changing the order of statements inside functions, changing the order of operands of operators in expressions, changing the data types of variables, adding redundant statements or variables, and equivalent transformation of control structures;
step S2 is implemented as follows:
S21, inputting the path of the folder to be analyzed, the database file storage path, and the file node ID storage path;
S22, parsing all specified files with Joern;
S23, starting the Neo4j database;
S24, querying and traversing the CPG;
S25, outputting the feature vector representation of the code property graph.
2. The program code plagiarism type detection method based on a source code multi-label graph neural network as claimed in claim 1, wherein the specific process of generating the plagiarized versions of the code text in step S1 is as follows:
S11, inputting the original code data set;
S12, specifying the plagiarism type;
S13, traversing the data set and executing the corresponding obfuscation scripts;
S14, writing the label file;
S15, outputting the plagiarized versions.
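Steps S11-S15 can be sketched as a small generator that applies one obfuscation function per requested plagiarism type and records the result in a label file. Everything concrete here is an assumption: the two regex-based obfuscators, the type indices 1 and 2, the output naming scheme, the `labels.csv` format and the function name `generate_versions` are hypothetical, and the comment stripper ignores comment markers that appear inside string literals.

```python
import csv
import re
import tempfile
from pathlib import Path


def strip_comments(source):
    """Remove /* ... */ and // ... comments from C-like source text."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)


def add_blank_lines(source):
    """Insert an empty line after every existing line."""
    return "\n\n".join(source.split("\n"))


# Hypothetical mapping from a plagiarism type index to its obfuscation script.
OBFUSCATORS = {1: strip_comments, 2: add_blank_lines}


def generate_versions(dataset_dir, out_dir, type_ids):
    dataset_dir, out_dir = Path(dataset_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for path in sorted(dataset_dir.glob("*.c")):            # S11: original data set
        for type_id in type_ids:                            # S12: chosen types
            text = OBFUSCATORS[type_id](path.read_text())   # S13: run the script
            target = out_dir / f"{path.stem}__type{type_id}.c"
            target.write_text(text)                         # S15: plagiarised version
            rows.append([path.name, target.name, type_id])
    with open(out_dir / "labels.csv", "w", newline="") as handle:  # S14: label file
        csv.writer(handle).writerows(rows)
    return rows


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "a.c").write_text("int x; // counter\n/* unused */\nint y;\n")
    rows = generate_versions(src, Path(tmp) / "out", [1, 2])
    stripped = (Path(tmp) / "out" / "a__type1.c").read_text()
    label_lines = (Path(tmp) / "out" / "labels.csv").read_text().splitlines()
```

Keeping the type index in both the file name and the label file means the later pairing step can recover the label from either source.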
3. A system implementing the method of any one of claims 1 to 2, characterized in that the system comprises a data generation module, a feature extraction module, a data processing module and a neural network module, wherein the data generation module executes step S1, the feature extraction module executes step S2, the data processing module executes step S3, and the neural network module executes step S5.
CN201810226651.0A 2018-03-19 2018-03-19 Program code plagiarism type detection method and system based on source code multi-label graph neural network Active CN108446540B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810226651.0A CN108446540B (en) 2018-03-19 2018-03-19 Program code plagiarism type detection method and system based on source code multi-label graph neural network

Publications (2)

Publication Number Publication Date
CN108446540A CN108446540A (en) 2018-08-24
CN108446540B true CN108446540B (en) 2022-02-25

Family

ID=63195776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810226651.0A Active CN108446540B (en) 2018-03-19 2018-03-19 Program code plagiarism type detection method and system based on source code multi-label graph neural network

Country Status (1)

Country Link
CN (1) CN108446540B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109542766B (en) * 2018-10-23 2020-08-18 西安交通大学 Large-scale program similarity rapid detection and evidence generation method based on code mapping and lexical analysis
CN109472310B (en) * 2018-11-12 2022-08-09 深圳八爪网络科技有限公司 Identification method and device for determining two resumes to be identical talents
CN111459788A (en) * 2019-01-18 2020-07-28 南京大学 Test program plagiarism detection method based on support vector machine
CN111459787A (en) * 2019-01-18 2020-07-28 南京大学 Test plagiarism detection method based on machine learning
CN110011986B (en) * 2019-03-20 2021-04-02 中山大学 Deep learning-based source code vulnerability detection method
CN110008344B (en) * 2019-04-16 2020-09-29 中森云链(成都)科技有限责任公司 Method for automatically marking data structure label on code
CN110287702B (en) * 2019-05-29 2020-08-11 清华大学 Binary vulnerability clone detection method and device
CN110286891B (en) * 2019-06-25 2020-09-29 中国科学院软件研究所 Program source code encoding method based on code attribute tensor
CN110489102B (en) * 2019-07-29 2021-06-18 东北大学 Method for automatically generating Python code from natural language
CN110555121B (en) * 2019-08-27 2022-04-15 清华大学 Image hash generation method and device based on graph neural network
CN110502277B (en) * 2019-08-30 2023-04-07 西安邮电大学 Code bad smell detection method based on BP neural network
CN110659723B (en) * 2019-09-03 2023-09-19 腾讯科技(深圳)有限公司 Data processing method and device based on artificial intelligence, medium and electronic equipment
CN110673840B (en) * 2019-09-23 2022-10-11 山东师范大学 Automatic code generation method and system based on tag graph embedding technology
CN110990273B (en) * 2019-11-29 2024-04-23 中国银行股份有限公司 Clone code detection method and device
US11403488B2 (en) * 2020-03-19 2022-08-02 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for recognizing image-based content presented in a structured layout
CN113138924B (en) * 2021-04-23 2023-10-31 扬州大学 Thread safety code identification method based on graph learning
CN114463567B (en) * 2022-04-12 2022-11-11 光子云(三河)网络技术有限公司 Block chain-based intelligent education operation big data plagiarism prevention method and system
CN115129364B (en) * 2022-07-05 2023-04-18 四川大学 Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network
CN115438709A (en) * 2022-07-11 2022-12-06 云南恒于科技有限公司 Code similarity detection method based on code attribute graph
CN114936099B (en) * 2022-07-25 2022-09-30 之江实验室 Graph optimization method and device for neural network calculation
US11915135B2 (en) 2022-07-25 2024-02-27 Zhejiang Lab Graph optimization method and apparatus for neural network computation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398758A (en) * 2008-10-30 2009-04-01 北京航空航天大学 Detection method of code copy
CN103729580A (en) * 2014-01-27 2014-04-16 国家电网公司 Method and device for detecting software plagiarism
CN103870721A (en) * 2014-03-04 2014-06-18 西安交通大学 Multi-thread software plagiarism detection method based on thread slice birthmarks
CN104077147A (en) * 2014-07-11 2014-10-01 东南大学 Software reusing method based on code clone automatic detection and timely prompting
CN104407872A (en) * 2014-12-04 2015-03-11 北京邮电大学 Code clone detection method
CN105868108A (en) * 2016-03-28 2016-08-17 中国科学院信息工程研究所 Instruction-set-irrelevant binary code similarity detection method based on neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015017796A2 (en) * 2013-08-02 2015-02-05 Digimarc Corporation Learning systems and methods

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
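Defect discovery method based on reused code detection; Chang Chao et al.; Systems Engineering and Electronics (《系统工程与电子技术》); 2017-09-30; Vol. 39, No. 9; pp. 2157-2164 *
Hybrid program code plagiarism detection method based on multiple techniques; Yang Chao; Computer Engineering and Applications (《计算机工程与应用》); 2016-09-30; No. 18; pp. 222-227 *
Software plagiarism detection method based on stack behavior dynamic birthmarks; Fan Ming et al.; Journal of Shandong University (Natural Science) (《山东大学学报(理学版)》); 2014-09-30; Vol. 49, No. 9; pp. 9-16 *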

Also Published As

Publication number Publication date
CN108446540A (en) 2018-08-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant