CN107169321B - Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology - Google Patents

Info

Publication number
CN107169321B
CN107169321B CN201710462952.9A CN201710462952A CN107169321B CN 107169321 B CN107169321 B CN 107169321B CN 201710462952 A CN201710462952 A CN 201710462952A CN 107169321 B CN107169321 B CN 107169321B
Authority
CN
China
Prior art keywords
similarity
program
detection
code
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710462952.9A
Other languages
Chinese (zh)
Other versions
CN107169321A (en)
Inventor
卫军超
耿楠
孔凡东
常在斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Traffic Engineering Institute
Original Assignee
Xian Traffic Engineering Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Traffic Engineering Institute
Priority to CN201710462952.9A
Publication of CN107169321A
Application granted
Publication of CN107169321B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/12 Protecting executable software
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Computer Security & Cryptography (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a program plagiarism detection method and system based on the combination of attribute counting and structure measurement technologies. The method comprises the following steps: the program to be detected is submitted to the system, and the system receives the detection task submitted by the user; the source program is preprocessed to remove useless data; similarity is calculated with the GST string matching algorithm, and a decision function judges from the similarity value obtained by the GST algorithm whether a decision condition is met; if the condition is met, the result is returned and detection ends; if it is not met, the next operation is carried out; attribute feature elements and structural feature elements are selected from the source program according to the features of the C language, and similarity is then calculated by a method combining the attribute counting method with the structure measurement technology; finally, the two similarity measurement results are combined to give a similarity evaluation grade. By extracting code attribute features and structural features, the invention reduces the time complexity of the detection system and improves its detection precision.

Description

Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology
Technical Field
The invention relates to the field of computers, in particular to a program plagiarism detection method and a system based on the combination of an attribute counting technology and a structure measuring technology.
Background
To curb the spread of plagiarism and code copying in everyday C language programming courses, and to identify plagiarized code quickly and accurately among a large volume of source code, various plagiarism detection systems have emerged in the prior art. However, their detection results are generally not accurate enough, and the run-time complexity of these systems is high.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method and a system for detecting plagiarism based on the combination of attribute counting and structure measurement technologies.
To achieve this purpose, the invention adopts the following technical scheme:
The program plagiarism detection method based on the combination of attribute counting and structure measurement technology comprises the following steps:
S1, submitting program code
The program to be detected is submitted to the system, and the system receives the detection task submitted by the user;
S2, preprocessing
The source program is preprocessed, that is, useless data such as comments, blank lines, redundant spaces and header files are removed;
S3, similarity is calculated with the GST string matching algorithm, which can accurately detect the large number of low-level plagiarism cases that occur in teaching; a decision function judges from the similarity value obtained by the GST algorithm whether the decision condition is met; if it is met, the result is returned and detection ends; if it is not met, step S4 is performed;
S4, attribute feature elements and structural feature elements are selected from the source program according to the features of the C language, and similarity is then calculated by a method combining the attribute counting method with the structure measurement technology;
S5, the two similarity measurement results are combined to give a similarity evaluation level, that is, to determine the degree of plagiarism of the program code.
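Step S2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `preprocess_c_source` and the regular expressions are assumptions made for illustration, and the sketch does not handle comment markers that appear inside string literals.

```python
import re

def preprocess_c_source(source: str) -> str:
    """Illustrative S2 step: drop comments, #include lines, blank lines, extra spaces."""
    # Remove /* ... */ block comments, then // line comments.
    # Simplification (assumption): comment markers inside string literals are not protected.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)

    cleaned = []
    for line in source.splitlines():
        line = re.sub(r"\s+", " ", line).strip()  # collapse redundant whitespace
        if not line:                               # skip blank lines
            continue
        if re.match(r"#\s*include\b", line):       # skip header-file inclusions
            continue
        cleaned.append(line)
    return "\n".join(cleaned)
```

Normalizing both programs this way before comparison means that edits limited to comments or spacing no longer affect the later similarity calculation.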
To solve the technical problem, the invention also provides a program plagiarism detection system based on the combination of attribute counting and structure measurement technology, which comprises a user interface, a background management module, a code similarity detection module and a database module. The user submits a code detection request; the background management module calls the code similarity detection module; the code similarity detection module reads the source code submitted by the user from the source code database, completes the similarity calculation by the above method, and feeds the calculated result back to the background management module; the background management module then feeds the detection result back to the user interface for the user to check.
The invention has the following beneficial effects:
(1) To address the high run-time complexity of existing detection systems, a program similarity calculation method combining the attribute counting method with the structure measurement technology is proposed; extracting code attribute features and structural features reduces the time complexity of the detection system. Experiments show that the run-time complexity of the detection system is reduced by 15.1% compared with the longest common subsequence algorithm.
(2) To address the low accuracy of existing detection systems, the detection process is optimized by analyzing the characteristics of code plagiarism: the GST algorithm is used first to detect low-level code plagiarism; a decision function then determines whether similarity should be further calculated by the method combining the attribute counting method with the structure measurement technology; and the final detection conclusion is obtained by a comprehensive evaluation of the two detection results. Experimental results show that, for low-level plagiarism means, the accuracy of the detection result reaches 95% on the constructed samples, students' daily assignments, and program code submitted in on-computer tests; for more advanced plagiarism means, such as adding redundancy and replacing equivalent structures, the precision of the detection system designed by this method is 5.6% higher than that of the JPlag system.
(3) The invention implements a complete automatic detection system for program code, covering code preprocessing, effective selection of source-program features, efficient extraction of feature elements, and implementation of the similarity detection algorithm. To verify the effectiveness of the research method and of the plagiarism detection system built on it, three groups of typical samples were run through both the JPlag system and the plagiarism detection system built in this research, and the detection results were compared. A comprehensive analysis of the results for the five plagiarism means commonly found in the three groups of samples shows that detection precision is 7.3 percent higher than that of the JPlag system. Software tests show that the system works stably and reliably and meets its design goals well.
Drawings
Fig. 1 is a system diagram of a program plagiarism detection system based on the combination of attribute counting and structure measurement technologies according to an embodiment of the present invention.
Fig. 2 is a flowchart of a program plagiarism detection method based on the combination of attribute counting and structure measurement technologies according to an embodiment of the present invention.
Fig. 3 is a flowchart of measuring code similarity with the GST algorithm according to an embodiment of the present invention.
Fig. 4 is a flowchart of the similarity calculation based on the combination of the attribute counting method and the structure measurement technology according to an embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described in detail below with reference to examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in Fig. 1, an embodiment of the present invention provides a program plagiarism detection system based on the combination of attribute counting and structure measurement technologies, which comprises a user interface, a background management module, a code similarity detection module and a database module. After the user submits code, the business logic checks whether the submitted code is legal; if it meets the requirements, the code is passed to the background management module. The background management module forwards the user request to the code similarity detection module, which reads the source code to be detected and the configuration information from the database and performs similarity detection. After detection, the similarity calculation result is fed back to the evaluation program, which evaluates it according to a built-in piecewise function and gives an evaluation grade, that is, the evaluation result. Finally, the result is output to the user's browser.
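The built-in piecewise function used by the evaluation program is not specified in this document. The following Python sketch only illustrates the general shape such a function could take; the cut-off values 0.9 and 0.6 and the grade labels are hypothetical and are not taken from the patent.

```python
def evaluation_grade(similarity: float) -> str:
    """Map a similarity in [0, 1] to a grade using a piecewise function.

    The cut-offs and labels below are hypothetical placeholders.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if similarity >= 0.9:   # hypothetical cut-off
        return "high"
    if similarity >= 0.6:   # hypothetical cut-off
        return "medium"
    return "low"
```

In a deployment, the cut-offs would presumably be read from the configuration information stored in the database rather than hard-coded.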
As shown in fig. 2, the code similarity detection module completes similarity detection by the following steps:
S1, submitting program code
The program to be detected is submitted to the system, and the system receives the detection task submitted by the user;
S2, preprocessing
The source program is preprocessed, that is, useless data such as comments, blank lines, redundant spaces and header files are removed;
S3, similarity is calculated with the GST string matching algorithm, which can accurately detect the large number of low-level plagiarism cases that occur in teaching. A decision function judges from the similarity value obtained by the GST algorithm whether the decision condition is met; if it is met, the result is returned and detection ends; if it is not met, step S4 is performed. Specifically, as shown in Fig. 3, measuring similarity with the GST algorithm requires the program code to be tokenized, that is, represented as a character string. The source program must be scanned when the code is converted into a string, and the conversion rules between tokens and strings that must be defined differ with the language and version of the source code. For this reason, an XML-based scanning and syntax analysis tool for the C language was designed. Before the similarity formula is applied, each token in the code is converted into a corresponding letter or number, so that the program becomes a character string that reflects its characteristics. Finally, the GST algorithm calculates the similarity of the character strings, that is, it compares the strings.
S4, attribute feature elements and structural feature elements are selected from the source program according to the features of the C language, and similarity is then calculated by a method combining the attribute counting method with the structure measurement technology. Specifically, as shown in Fig. 4, to improve the accuracy of the detection results, program pairs whose GST detection result lies between 60% and 90% undergo a second detection using the similarity calculation method based on the combination of the attribute counting method and the structure measurement technology. First, the attribute counting method is used to select the attribute feature elements of the program. When the feature elements are selected, the characteristics of the source language are fully considered; the source program to be detected is processed with the well-known lexical analysis tool Lex and converted into identifiers, and the number of each corresponding identifier is counted. Next, the structure measurement technology is used to determine and obtain the commonly used structures of the program; grammar rules are written according to the characteristics of the source program, and the grammatical structure of the program is generated so that Lex can extract the structural feature elements of the source program. Finally, the feature elements extracted in the two passes are combined into a feature vector, and the similarity of the feature vectors is calculated with the cosine coefficient method.
S5, the two similarity measurement results are combined to give a similarity evaluation level, that is, to determine the degree of plagiarism of the program code.
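The string comparison in step S3 can be sketched in Python as follows, on the assumption that GST refers to Greedy String Tiling, the string-matching technique also used by JPlag. The token-to-letter table, the minimum match length of 3, and the normalization 2 × covered / (len(a) + len(b)) are assumptions for illustration; the document does not give its own conversion rules or similarity formula, and the regular-expression tokenizer here merely stands in for the XML-based scanning tool.

```python
import re

# Hypothetical token-to-letter mapping; the patent does not list its conversion rules.
KEYWORD_CODES = {"int": "a", "for": "b", "while": "c", "if": "d", "else": "e", "return": "f"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def to_token_string(code: str) -> str:
    """Convert C source text into a string of one-character token codes."""
    symbols = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORD_CODES:
            symbols.append(KEYWORD_CODES[tok])   # keyword -> its letter
        elif tok[0].isalpha() or tok[0] == "_":
            symbols.append("V")                  # any identifier -> V
        elif tok[0].isdigit():
            symbols.append("N")                  # any number -> N
        else:
            symbols.append(tok)                  # punctuation kept as-is
    return "".join(symbols)

def gst_similarity(a: str, b: str, min_match: int = 3) -> float:
    """Greedy String Tiling: cover both strings with non-overlapping common tiles."""
    if not a and not b:
        return 0.0
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    covered = 0
    while True:
        max_len = min_match
        matches = []
        # Find the longest common substrings made only of unmarked characters.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    matches, max_len = [(i, j, k)], k
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            break
        # Turn each maximal match into a tile unless an earlier tile overlaps it.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for t in range(k):
                marked_a[i + t] = True
                marked_b[j + t] = True
            covered += k
    return 2 * covered / (len(a) + len(b))
```

Because every identifier maps to the same symbol, renaming variables leaves the token string unchanged, and because tiles may be matched in any order, moving whole blocks of code has little effect on the score. This direct version is cubic or worse in the string length, which is consistent with the document's remark that GST has higher time complexity.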
Examples
Constructed sample data acquisition
Ten students from Class 1 of the grade-15 Computer Application Technology major in the Zhongxing Communication Department of Xian Traffic Engineering Institute modified a given C language sorting program as required, making the following five types of modification to the source code:
(1) adding or altering comments;
(2) adding blank space and redundancy;
(3) changing the layout, for example by redistributing function code or shuffling the order of statements;
(4) changing variable names;
(5) replacing equivalent control structures with one another.
The students were required to make these five types of modification while ensuring that the modified programs still ran correctly and produced results. Representative samples were selected from the modified programs submitted by the students for the experiment. The types of the constructed samples are as follows:
(1) the modification type of C_code11.c and C_code12.c is: adding or altering comments;
(2) the modification type of C_code21.c and C_code22.c is: adding blank space and redundancy;
(3) the modification type of C_code31.c and C_code32.c is: changing the layout, for example by redistributing function code or shuffling the order of statements;
(4) the modification type of C_code41.c and C_code42.c is: changing variable names;
(5) the modification type of C_code51.c and C_code52.c is: replacing equivalent control structures with one another.
Analysis of constructed-sample test results
Table 1. Similarity measurement results for the ten constructed samples from the experimental system and the JPlag system
[Table image not reproduced: Figure GSB0000187232400000061]
Table 1 gives the similarity measurements between the constructed samples and the original program. The table shows that the measurements given by the experimental system are more accurate. In particular, the system accurately detects low-level plagiarism such as adding or changing comments and changing variable names. In detecting redistributed function code and shuffled statement order it lags behind the JPlag system; in detecting the mutual replacement of equivalent control structures it performs better than the JPlag system. Generally, if a student who is just beginning a programming course can understand other students' code and replace control statements with equivalent ones, that student has grasped basic programming knowledge and reached the basic learning goal of the course. In fact, modifying a program in this way is a good learning method for beginners and deserves recognition. It is therefore understandable not to regard it as plagiarism, so the system is not sensitive to this type of plagiarism. For software designers and developers in industry, however, such behavior can be considered plagiarism. The experimental system still differs from the JPlag system on layout changes such as redistributing function code and shuffling statement order, which is an aspect of the system to be improved. Redundancy added to a program is not easy to detect, and as the amount of inserted redundant code increases, the accuracy of the measurement results gradually decreases. Even some well-known foreign detection systems have very limited ability to recognize this category of plagiarism, and the detection accuracy of the experimental system is 6.3% higher than that of the JPlag system.
Combining the above analysis, the constructed-sample test results show that the experimental system identifies elementary plagiarism means accurately; for advanced plagiarism means its accuracy is 6.3 percent higher than that of the JPlag system; and across all five types of plagiarism means the experimental system designed in this application improves on the JPlag system by 8.2 percent.
In the specific implementation, the GST algorithm performs a first similarity calculation on the code. To improve the accuracy of the detection results, a decision function is added to the system to determine whether further detection is needed using the method proposed in this application, which combines the attribute counting method with the structure measurement technology; the similarity measurement result is then given by combining the two detections. The GST algorithm can accurately detect low-level plagiarism means, but its time complexity is higher; the method combining the attribute counting method with the structure measurement technology has low time complexity, so its advantages complement those of the GST algorithm. In terms of both calculation precision and execution efficiency, the code plagiarism detection system designed in this application performs well and can meet the code plagiarism detection needs of a programming course.
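The two-stage control flow can be sketched in Python as follows. The 60% to 90% band comes from the detailed description; treating the band as inclusive, the function names, and combining the two scores by a weighted average are assumptions, since the document only states that the two results are combined.

```python
from typing import Callable

# Band taken from the description: pairs scoring 60%-90% under GST get a second check.
LOW, HIGH = 0.60, 0.90

def needs_second_stage(gst_sim: float) -> bool:
    """Decision function: is the GST result too ambiguous to report directly?"""
    return LOW <= gst_sim <= HIGH   # inclusive bounds are an assumption

def combined_similarity(gst_sim: float,
                        second_stage: Callable[[], float],
                        weight: float = 0.5) -> float:
    """Return the GST score directly, or blend it with the second-stage score.

    `second_stage` is a callback so the attribute/structure comparison
    runs only when the decision function asks for it.
    The weighted average is a hypothetical combination rule.
    """
    if not needs_second_stage(gst_sim):
        return gst_sim
    return weight * gst_sim + (1.0 - weight) * second_stage()
```

Passing the second stage as a callback keeps the extra feature extraction off the path for pairs that GST already classifies clearly.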
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (1)

1. A program plagiarism detection method based on the combination of attribute counting and structure measurement technology, characterized in that the system comprises a user interface, a background management module, a code similarity detection module and a database module; a user submits a code detection request, the background management module calls the code similarity detection module, the code similarity detection module reads the source code submitted by the user from the source code database, completes the similarity calculation and feeds the calculation result back to the background management module, and the background management module feeds the detection result back to the user interface for the user to check;
the detection method comprises the following steps:
S1, submitting program code
The program to be detected is submitted to the system, and the system receives the detection task submitted by the user;
S2, preprocessing
The source program is preprocessed, that is, useless data are removed;
S3, similarity is calculated with the GST string matching algorithm, and a decision function judges from the similarity value obtained by the GST algorithm whether the decision condition is met; if it is met, the result is returned and detection ends; if it is not met, step S4 is performed;
specifically, measuring similarity with the GST algorithm requires the program code to be tokenized, that is, represented as a character string; the source program must be scanned when the code is converted into a string, and the conversion rules between tokens and strings that must be defined differ with the language and version of the source code; for this reason, an XML-based scanning and syntax analysis tool for the C language is designed; before the similarity formula is applied, each token in the code is converted into a corresponding letter or number, so that the program becomes a character string that reflects its characteristics; finally, the GST algorithm calculates the similarity of the character strings, that is, it compares the strings;
S4, attribute feature elements and structural feature elements are selected from the source program according to the features of the C language, and similarity is then calculated by a method combining the attribute counting method with the structure measurement technology;
specifically, the attribute counting method is first used to select the attribute feature elements of the program; when the feature elements are selected, the source program to be detected is processed with the lexical analysis tool Lex and converted into identifiers, and the number of each corresponding identifier is counted; the structure measurement technology is then used to determine and obtain the commonly used structures of the program; grammar rules are written with the open-source Ucc according to the characteristics of the source program, and the grammatical structure of the program is generated so that Lex can extract the structural feature elements of the source program; finally, the feature elements extracted in the two passes are combined into a feature vector, and the similarity of the feature vectors is calculated with the cosine coefficient method;
S5, the two similarity measurement results are combined to give a similarity evaluation level, that is, to determine the degree of plagiarism of the program code.
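The feature-vector comparison in step S4 can be sketched in Python as follows. This is a simplified stand-in: the claim extracts features with Lex and grammar rules, whereas the sketch counts tokens with a regular expression, and the two feature lists are hypothetical examples rather than the feature set actually selected by the inventors. The input is assumed to be already preprocessed (comments removed).

```python
import math
import re
from collections import Counter

# Hypothetical feature sets; the patent does not enumerate the selected elements.
ATTRIBUTE_TOKENS = ("int", "float", "char", "return", "=", "+", "-", "*", "/")
STRUCTURE_TOKENS = ("if", "else", "for", "while", "do", "switch")
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\S")

def feature_vector(code: str) -> list:
    """Count attribute and structure tokens and join them into one vector."""
    counts = Counter(TOKEN_RE.findall(code))
    return [counts[t] for t in ATTRIBUTE_TOKENS + STRUCTURE_TOKENS]

def cosine_similarity(u, v) -> float:
    """Cosine coefficient of two equal-length count vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

Counting and one dot product take time linear in the program length, which matches the document's point that this stage is cheaper than GST. Counts alone ignore ordering, so this measure is best read together with the string-matching result rather than on its own.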
CN201710462952.9A 2017-06-10 2017-06-10 Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology Expired - Fee Related CN107169321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710462952.9A CN107169321B (en) 2017-06-10 2017-06-10 Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710462952.9A CN107169321B (en) 2017-06-10 2017-06-10 Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology

Publications (2)

Publication Number Publication Date
CN107169321A (en) 2017-09-15
CN107169321B (en) 2020-07-28

Family

ID=59818873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710462952.9A Expired - Fee Related CN107169321B (en) 2017-06-10 2017-06-10 Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology

Country Status (1)

Country Link
CN (1) CN107169321B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679567B (en) * 2017-09-22 2021-04-27 江苏海事职业技术学院 Code copying behavior identification method, device and system
CN108399193B (en) * 2018-01-29 2022-03-04 华侨大学 Program code clustering method based on sequence structure
CN109062792A (en) * 2018-07-21 2018-12-21 东南大学 A kind of Open Source Code detection method based on String matching and characteristic matching
CN110442847B (en) * 2019-07-26 2023-05-12 南京邮电大学 Code similarity detection method and device based on code warehouse process management
CN110704308B (en) * 2019-09-11 2022-09-09 无锡江南计算技术研究所 Multistage feature extraction method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976318A (en) * 2010-11-15 2011-02-16 北京理工大学 Detection method of code similarity based on digital fingerprints

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101254247B1 (en) * 2007-01-18 2013-04-12 중앙대학교 산학협력단 Apparatus and method for detecting program plagiarism through memory access log analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
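Research and Application of Program Code Similarity Detection Methods; Hu Zhengjun (胡正军); China Master's Theses Full-text Database, Information Science and Technology; 2013-02-15; pp. 9-46 *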

Also Published As

Publication number Publication date
CN107169321A (en) 2017-09-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20200728; termination date: 20210610)