CN107169321B - Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology - Google Patents
- Publication number
- CN107169321B
- Authority
- CN
- China
- Prior art keywords
- similarity
- program
- detection
- code
- attribute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/10—Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
- G06F21/12—Protecting executable software
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3604—Software analysis for verifying properties of programs
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Multimedia (AREA)
- Technology Law (AREA)
- Computer Security & Cryptography (AREA)
- Debugging And Monitoring (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a program plagiarism detection method and system based on the combination of attribute counting and structure measurement technologies. The method comprises the following steps: the program to be detected is submitted to the system, and the system receives the detection task submitted by the user; the source program is preprocessed, i.e. useless data is removed; similarity is calculated with the GST (Greedy String Tiling) string matching algorithm, and a decision function judges, from the similarity value obtained by the GST algorithm, whether the decision condition is met; if it is met, the result is returned and detection ends; if it is not met, the next operation is carried out; attribute feature elements and structural feature elements are selected from the source program according to the features of the C language, and similarity is then calculated by a method combining attribute counting with the structure measurement technique; and the two similarity measurement results are combined to give a similarity evaluation grade. By extracting code attribute features and structural features, the invention reduces the time complexity of the detection system and improves its detection precision.
Description
Technical Field
The invention relates to the field of computers, in particular to a program plagiarism detection method and a system based on the combination of an attribute counting technology and a structure measuring technology.
Background
To curb the spread of plagiarism in everyday C language programming courses, and to identify plagiarized code quickly and accurately among large amounts of source code, various plagiarism detection systems have emerged in the prior art. However, their detection results are generally not accurate enough, and their running-time complexity is high.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method and a system for detecting plagiarism based on the combination of attribute counting and structure measurement technologies.
In order to achieve the purpose, the invention adopts the technical scheme that:
The program plagiarism detection method based on the combination of attribute counting and structure measurement technology comprises the following steps:
S1, submitting program code
The program to be detected is submitted to the system, and the system receives the detection task submitted by the user;
S2, preprocessing
The source program is preprocessed, i.e. useless data such as comments, blank lines, redundant spaces and header files is removed;
S3, similarity is calculated with the GST string matching algorithm, which can accurately detect the large amount of low-level plagiarism found in teaching; a decision function judges, from the similarity value obtained by the GST algorithm, whether the decision condition is met; if it is met, the result is returned and detection ends; if it is not met, step S4 is performed;
S4, attribute feature elements and structural feature elements are selected from the source program according to the features of the C language, and similarity is then calculated by a method combining attribute counting with the structure measurement technique;
S5, the two similarity measurement results are combined to give a similarity evaluation grade, i.e. the degree of plagiarism of the program code is determined.
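The preprocessing of step S2 can be illustrated with a minimal Python sketch. The function name `preprocess_c_source` and the regular expressions are assumptions made for illustration; the patent does not specify how the removal is implemented, and this simplified version does not protect comment markers that appear inside string literals.

```python
import re


def preprocess_c_source(source: str) -> str:
    """Strip comments, #include lines, blank lines and redundant spaces.

    Hypothetical helper: a simplified stand-in for step S2, not the
    patent's implementation.
    """
    # Remove /* ... */ block comments, then // line comments.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)

    cleaned = []
    for line in source.splitlines():
        # Collapse runs of whitespace and trim both ends.
        line = re.sub(r"\s+", " ", line).strip()
        # Drop blank lines and header-file inclusions.
        if not line or re.match(r"#\s*include\b", line):
            continue
        cleaned.append(line)
    return "\n".join(cleaned)
```

Running it on a small C file leaves only the statements that carry program logic, which is what the later similarity steps operate on.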
To solve the above technical problem, the invention also provides a program plagiarism detection system based on the combination of attribute counting and structure measurement technology, comprising a user interface, a background management module, a code similarity detection module and a database module. The user submits a code detection request; the background management module calls the code similarity detection module; the code similarity detection module reads the source code submitted by the user from the source code database and completes the similarity calculation by the above method; the result is fed back to the background management module, which returns the detection result to the user interface for the user to view.
The invention has the following beneficial effects:
(1) To address the high running-time complexity of existing detection systems, a program similarity calculation method combining attribute counting with the structure measurement technique is proposed; extracting code attribute features and structural features reduces the time complexity of the detection system. Experiments show that the running-time complexity of the detection system is 15.1% lower than that of the longest common subsequence algorithm.
(2) To address the low accuracy of existing detection systems, the detection process is optimized by analyzing the characteristics of code plagiarism: the GST algorithm is used first to detect low-level code plagiarism; a decision function then determines whether similarity should be further calculated by the method combining attribute counting with the structure measurement technique; and the final conclusion is obtained by jointly evaluating the two detection results. Experimental results show that, for low-level plagiarism in the constructed samples, students' routine assignments and program code submitted in computer-based tests, the accuracy of the detection result reaches 95%; for more advanced plagiarism techniques, such as adding redundancy and replacing equivalent structures, the precision of the detection system designed by this method is 5.6% higher than that of the JPlag system.
(3) A complete automatic detection system for program code is implemented, covering code preprocessing, effective selection of source program features, efficient extraction of feature elements and implementation of the similarity detection algorithm. To verify the effectiveness of the method and of the resulting plagiarism detection system, three groups of typical samples were run through both the JPlag system and the system built in this work, and the detection results were compared. A comprehensive analysis of the results for five commonly used plagiarism techniques across the three groups of samples shows that the precision of the detection results is 7.3% higher than that of the JPlag system. Software testing shows that the system works stably and reliably and meets its design goals.
Drawings
Fig. 1 is a system diagram of a plagiarism detection system based on the combination of attribute counting and structure metric techniques according to an embodiment of the present invention.
Fig. 2 is a flowchart of a program plagiarism detection method based on the combination of attribute counting and structure measurement techniques according to an embodiment of the present invention.
Fig. 3 is a flowchart of measuring code similarity with the GST algorithm according to an embodiment of the present invention.
Fig. 4 is a similarity calculation flow chart based on the combination of the attribute counting method and the structure measurement technique in the embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described in detail below with reference to examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in Fig. 1, an embodiment of the present invention provides a program plagiarism detection system based on the combination of attribute counting and structure measurement technologies, comprising a user interface, a background management module, a code similarity detection module and a database module. After the user submits code, the business logic checks whether the submitted code is valid; if it meets the requirements, the code is passed to the background management module. The background management module forwards the user request to the code similarity detection module, which reads the source code to be detected and the configuration information from the database and performs similarity detection. After detection, the similarity calculation result is fed back to the evaluation program, which evaluates it with a built-in piecewise function and gives an evaluation grade, i.e. the evaluation result. Finally, the result is output to the user's browser.
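The built-in piecewise function used by the evaluation program is not spelled out in the text. A minimal sketch of what such a function might look like is given below; the function name `similarity_grade`, the cut-off values 0.6 and 0.9 and the grade labels are all illustrative assumptions, not values taken from the patent.

```python
def similarity_grade(similarity: float) -> str:
    """Map a similarity score in [0, 1] to an evaluation grade.

    Hypothetical thresholds and labels, chosen only to illustrate a
    piecewise mapping from score to grade.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if similarity >= 0.9:
        return "high"      # very likely plagiarized
    if similarity >= 0.6:
        return "medium"    # suspicious, worth a closer look
    return "low"           # probably independent work
```

Keeping the mapping in one small function makes it easy to adjust the boundaries for a particular course without touching the similarity calculation itself.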
As shown in fig. 2, the code similarity detection module completes similarity detection by the following steps:
S1, submitting program code
The program to be detected is submitted to the system, and the system receives the detection task submitted by the user;
S2, preprocessing
The source program is preprocessed, i.e. useless data such as comments, blank lines, redundant spaces and header files is removed;
S3, similarity is calculated with the GST string matching algorithm, which can accurately detect the large amount of low-level plagiarism found in teaching; a decision function judges, from the similarity value obtained by the GST algorithm, whether the decision condition is met; if it is met, the result is returned and detection ends; if it is not met, step S4 is performed. Specifically, as shown in Fig. 3, measuring similarity with the GST algorithm requires the program code to be tokenized, i.e. represented as a string. Converting code into a string requires scanning the source program, and, depending on the language and version of the source code, the rules that map tokens to strings differ between programming languages. For this reason, an XML-based scanning and syntax analysis tool for the C language was designed. Before the similarity formula is applied, the tokens in the code are converted into corresponding letters or digits, which turns the program into a string that reflects its characteristics. Finally, the GST algorithm calculates the similarity by comparing the strings.
S4, attribute feature elements and structural feature elements are selected from the source program according to the features of the C language, and similarity is then calculated by a method combining attribute counting with the structure measurement technique. Specifically, as shown in Fig. 4, to improve the accuracy of the detection result, program pairs whose GST result lies between 60% and 90% undergo a second detection using the similarity calculation method that combines attribute counting with the structure measurement technique. First, the attribute counting method is used to select the feature elements of the program and obtain the attribute elements; the characteristics of the source language are fully taken into account when the feature elements are selected. The well-known lexical analysis tool Lex processes the source program to be detected and converts it into identifiers, and the number of occurrences of each identifier is then counted. Next, the structure measurement technique is used to determine the commonly used structures of the program; grammar rules are written according to the characteristics of the source program to generate the grammatical structure of the program, so that Lex can extract the structural feature elements of the source program. Finally, the feature elements extracted in the two passes are combined into a feature vector, and the similarity of the feature vectors is calculated with the cosine coefficient method.
S5, the two similarity measurement results are combined to give a similarity evaluation grade, i.e. the degree of plagiarism of the program code is determined.
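The string comparison in step S3 can be sketched with a basic version of Greedy String Tiling. This is a simplified, unoptimized rendering of the well-known algorithm, not the patent's implementation: the function names, the `min_match` parameter and the similarity formula (twice the covered length divided by the combined length, as commonly used with GST) are assumptions, since the text only refers to "the formula" without stating it.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return non-overlapping common tiles as (start_a, start_b, length)."""
    if min_match < 1:
        raise ValueError("min_match must be at least 1")
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match
        matches = []
        # Phase 1: find the longest common runs of still-unmarked tokens.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    matches, max_len = [(i, j, k)], k
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            break
        # Phase 2: mark maximal matches as tiles, skipping any match that
        # overlaps a tile placed earlier in this same pass.
        for i, j, k in matches:
            if not any(marked_a[i:i + k]) and not any(marked_b[j:j + k]):
                marked_a[i:i + k] = [True] * k
                marked_b[j:j + k] = [True] * k
                tiles.append((i, j, k))
    return tiles


def gst_similarity(a, b, min_match=3):
    """Similarity in [0, 1]: share of both sequences covered by tiles."""
    if not a and not b:
        return 0.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2.0 * covered / (len(a) + len(b))
```

Both functions accept any sequences of comparable items, so they work on the letter-and-digit strings described above as well as on lists of tokens. The nested scan is cubic in the worst case, which is consistent with the text's remark that GST is accurate for low-level plagiarism but comparatively expensive.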
Examples
Construction sample data acquisition
Ten students from Class 1 of the 2015 intake, Computer Application Technology major, Zhongxing Communication Department, Xi'an Traffic Engineering College, were given a C language sorting program and asked to modify it, making the following five types of modification to the source code:
(1) adding or altering comments;
(2) adding whitespace and redundancy;
(3) changing the layout, e.g. rearranging function code or reordering statements;
(4) changing variable names;
(5) replacing control structures with equivalent ones.
The students were required to make all five types of modification while ensuring that the modified programs still ran correctly and produced results. Representative samples were selected from the students' submissions for the experiment; the types of constructed samples are as follows:
(1) the modification type of C_code11.c and C_code12.c is: adding or altering comments;
(2) the modification type of C_code21.c and C_code22.c is: adding whitespace and redundancy;
(3) the modification type of C_code31.c and C_code32.c is: changing the layout, e.g. rearranging function code or reordering statements;
(4) the modification type of C_code41.c and C_code42.c is: changing variable names;
(5) the modification type of C_code51.c and C_code52.c is: replacing control structures with equivalent ones.
Analysis of constructed sample test results
Table 1. Similarity measurement results for the ten constructed samples from the experimental system and the JPlag system
Table 1 gives the similarity measurement results between the constructed samples and the original program. The table shows that the results given by the experimental system are more accurate overall. In particular, it accurately detects low-level plagiarism such as adding or altering comments and changing variable names; it lags behind the JPlag system in detecting layout changes such as redistributing function code and reordering statements, which is an aspect the system still needs to improve; and it outperforms the JPlag system in detecting the mutual replacement of equivalent control structures. Generally, for a student taking a first programming course, being able to understand another student's code and replace control statements with equivalent ones shows that the student has grasped basic programming knowledge and reached the basic learning goal of the course. Modifying a program in this way is in fact a good learning method for beginners and deserves recognition. It is therefore understandable not to treat it as plagiarism, so the system is not sensitive to detecting this type of plagiarism. For professional software designers and developers in industry, however, the same behavior can be considered plagiarism. Adding redundancy to a program is not easy to detect, and as the amount of inserted redundant code increases, the accuracy of the measurement gradually decreases. Some well-known foreign detection systems have very limited ability to recognize this category of plagiarism, and the detection accuracy of the experimental system is 6.3% higher than that of the JPlag system.
Taken together, the test results on the constructed samples show that the experimental system identifies elementary plagiarism techniques accurately, that its accuracy on advanced plagiarism is 6.3% higher than that of the JPlag system, and that across all five types of plagiarism the experimental system designed in this application improves on the JPlag system by 8.2%.
In a specific implementation, the GST algorithm performs the initial similarity calculation on the code. To improve the accuracy of the detection result, a decision function is added to the system to determine whether further detection is needed using the method proposed in this application, which combines attribute counting with the structure measurement technique; the similarity measurement result is then given by combining the two detections. The GST algorithm detects low-level plagiarism accurately but has higher time complexity; the combined attribute counting and structure measurement method has low time complexity, so the two complement each other. In terms of both calculation precision and execution efficiency, the code plagiarism detection system designed in this application performs well and can meet the code plagiarism detection needs of programming courses.
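The decision function and the second-stage measure can be sketched as follows. This is a rough illustration under stated assumptions: the feature lists are hypothetical (the patent does not enumerate the feature elements it selects), a regular-expression tokenizer stands in for Lex, and counting control-flow keywords is only a crude substitute for the grammar-based extraction of structural features. The 60% to 90% band is the one given in the description of step S4; how the two results are merged into a final grade is not specified in the text, so it is left out.

```python
import math
import re
from collections import Counter

# Hypothetical feature elements, chosen only for illustration.
ATTRIBUTE_FEATURES = ["int", "float", "char", "return", "=", "+", "-", "*", "/"]
STRUCTURE_FEATURES = ["if", "else", "for", "while", "do", "switch"]

# Simplified tokenizer: identifiers/keywords and single-character operators.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[=+\-*/]")


def feature_vector(source):
    """Count attribute and structural features and join them in one vector."""
    counts = Counter(TOKEN_RE.findall(source))
    return [counts[f] for f in ATTRIBUTE_FEATURES + STRUCTURE_FEATURES]


def cosine_similarity(u, v):
    """Cosine coefficient of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def needs_second_stage(gst_sim, low=0.60, high=0.90):
    """Decision function: only the ambiguous GST band is re-examined."""
    return low <= gst_sim <= high


def second_stage_similarity(gst_sim, source_a, source_b):
    """Return the feature-vector similarity, or None when GST alone decides."""
    if not needs_second_stage(gst_sim):
        return None
    return cosine_similarity(feature_vector(source_a), feature_vector(source_b))
```

Because the second stage only counts features and compares two short vectors, its cost grows linearly with program length, which matches the text's point that it is cheaper than GST and is applied only where the GST result is inconclusive.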
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (1)
1. A program plagiarism detection method based on the combination of attribute counting and structure measurement technology, characterized in that the system comprises a user interface, a background management module, a code similarity detection module and a database module; a user submits a code detection request, the background management module calls the code similarity detection module, the code similarity detection module reads the source code submitted by the user from the source code database and completes the similarity calculation, the calculation result is fed back to the background management module, and the background management module feeds the detection result back to the user interface for the user to view;
the detection method comprises the following steps:
S1, submitting program code
The program to be detected is submitted to the system, and the system receives the detection task submitted by the user;
S2, preprocessing
The source program is preprocessed, i.e. useless data is removed;
S3, similarity is calculated with the GST string matching algorithm; a decision function judges, from the similarity value obtained by the GST algorithm, whether the decision condition is met; if it is met, the result is returned and detection ends; if it is not met, step S4 is performed;
specifically, measuring similarity with the GST algorithm requires the program code to be tokenized, i.e. represented as a string; converting code into a string requires scanning the source program, and, depending on the language and version of the source code, the rules that map tokens to strings differ between programming languages; for this reason, an XML-based scanning and syntax analysis tool for the C language is designed; before the similarity formula is applied, the tokens in the code are converted into corresponding letters or digits, which turns the program into a string that reflects its characteristics; finally, the GST algorithm calculates the similarity by comparing the strings;
S4, attribute feature elements and structural feature elements are selected from the source program according to the features of the C language, and similarity is then calculated by a method combining attribute counting with the structure measurement technique;
specifically, the attribute counting method is first used to select the attribute feature elements of the program; when the feature elements are selected, the lexical analysis tool Lex processes the source program to be detected and converts it into identifiers, and the number of occurrences of each identifier is then counted; the structure measurement technique is then used to determine the commonly used structures of the program; grammar rules are written with the open-source Ucc according to the characteristics of the source program to generate the grammatical structure of the program, so that Lex can extract the structural feature elements of the source program; finally, the feature elements extracted in the two passes are combined into a feature vector, and the similarity of the feature vectors is calculated with the cosine coefficient method;
S5, the two similarity measurement results are combined to give a similarity evaluation grade, i.e. the degree of plagiarism of the program code is determined.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710462952.9A CN107169321B (en) | 2017-06-10 | 2017-06-10 | Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710462952.9A CN107169321B (en) | 2017-06-10 | 2017-06-10 | Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107169321A (en) | 2017-09-15 |
CN107169321B (en) | 2020-07-28 |
Family
ID=59818873
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710462952.9A Expired - Fee Related CN107169321B (en) | 2017-06-10 | 2017-06-10 | Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107169321B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107679567B (en) * | 2017-09-22 | 2021-04-27 | 江苏海事职业技术学院 | Code copying behavior identification method, device and system |
CN108399193B (en) * | 2018-01-29 | 2022-03-04 | 华侨大学 | Program code clustering method based on sequence structure |
CN109062792A (en) * | 2018-07-21 | 2018-12-21 | 东南大学 | A kind of Open Source Code detection method based on String matching and characteristic matching |
CN110442847B (en) * | 2019-07-26 | 2023-05-12 | 南京邮电大学 | Code similarity detection method and device based on code warehouse process management |
CN110704308B (en) * | 2019-09-11 | 2022-09-09 | 无锡江南计算技术研究所 | Multistage feature extraction method |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101976318A (en) * | 2010-11-15 | 2011-02-16 | 北京理工大学 | Detection method of code similarity based on digital fingerprints |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101254247B1 (en) * | 2007-01-18 | 2013-04-12 | 중앙대학교 산학협력단 | Apparatus and method for detecting program plagiarism through memory access log analysis |
Non-Patent Citations (1)
Title |
---|
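Research and Application of Program Code Similarity Detection Methods; Hu Zhengjun (胡正军); China Master's Theses Full-text Database, Information Science and Technology; 2013-02-15; pp. 9-46 * |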
Also Published As
Publication number | Publication date |
---|---|
CN107169321A (en) | 2017-09-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200728; Termination date: 20210610 |