CN109284485B - Paper originality detection method based on citation

Info

Publication number: CN109284485B
Application number: CN201810870256.6A
Authority: CN
Inventors: 刘刚; 王贺飞; 杨笑笑
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2018-08-02
Filing date: 2018-08-02
Publication date: 2023-04-07
Anticipated expiration: 2038-08-02
Also published as: CN109284485A

Abstract

A citation-based paper originality detection method, relating to the field of paper retrieval and comparison. The invention studies plagiarism from the perspective of citations and designs citation features of a text for citation analysis. The reference section at the end of the text is separated from the body, the separated reference strings are split into entries, and a bibliography list is created. Citations are located by bibliography author and publication year and extracted with a parser. The bibliography lists of the experimental texts are analyzed first, and a text proceeds to the next stage only if its shared citations exceed a certain threshold. The longest common citation sequence of the screened documents is then analyzed, and a document is removed from the experimental text set if the value of this sequence is below a certain threshold. Texts that pass the first two stages undergo citation block analysis, and the degree of plagiarism is measured by the maximum number of overlapping citations in the blocks. The invention is significant for the detection of academic misconduct and helps standardize academic practice and raise the level of scientific research.

Description

Paper originality detection method based on citation

Technical Field

The invention relates to the field of paper retrieval and comparison, and in particular to a citation-based paper originality detection method.

Background

The concept of bibliographic coupling is of great practical significance as a measure of topic similarity. Two documents are considered bibliographically coupled if they share at least one reference. The coupling strength is the number of shared references.

Bibliographic coupling characterizes the relationship between documents through the earlier works that the authors chose to cite. This relationship is static and inherent to the coupled documents, because it depends only on the works each document cites and does not change over time.

Some researchers have questioned the effectiveness of bibliographic coupling as a measure of similarity. Bibliographic coupling can only indicate the probability that documents are related, so its value is uncertain. One analysis found that texts that are coupled but do not share a similar subject account for 15%-19% of the whole text set, and on this basis the effectiveness of bibliographic coupling has been disputed.

In addition, scholars have criticized absolute coupling strength for lacking a common unit of similarity, which makes it not comparable across texts. Review articles tend to have higher coupling strengths because they typically contain more references. Considering the fraction of coupled references, i.e., shared versus unshared references, offers some remedy for this problem but does not solve it completely. The static nature of bibliographic coupling is also a drawback when characterizing changes in concepts and ideas, which hinders the mapping of emerging trends and developments in research fields.

To overcome the static nature of bibliographic coupling, the concept of co-citation was proposed. Two documents are considered co-cited if they are both cited by at least one later work. The number of later publications that cite both documents determines the strength of their co-citation relationship and their co-citation score.

Bibliographic coupling is a static link shared by two documents: its strength can be determined immediately after publication and does not change over time. Co-citation, by contrast, reflects how the relationship between documents shifts over time, depending on how often subsequent papers cite the earlier papers.

Bibliographic coupling and co-citation have received considerable attention from researchers and are widely used for purposes such as document retrieval, pre-research analysis, science mapping, measuring the influence of scientists, and evaluating various properties of articles and journals.

Comparing the references cited by articles is the main obstacle to originality detection. Extracting references from text documents is problematic: after text conversion the references are poorly rendered, and names and titles contain many deviations, which complicates the comparison of the references of two texts and leads to a relatively large gap between the detection result and the correct result. At present, citation analysis is mainly used to identify semantically related documents rather than to detect originality, so there is little existing work to draw on directly, and almost all citation-based similarity measures analyze the citation relationships between documents at a global level. It is therefore desirable to design and evaluate algorithms to verify the suitability of citation-based detection of text originality.

Disclosure of Invention

The invention aims to solve the problem of inaccurate detection results and provides a citation-based paper originality detection method.

The purpose of the invention is realized as follows:

a citation-based paper originality detection method comprises the following steps:

(1) Corpus processing

Articles are searched for and downloaded using a web search engine and heuristic rules. The downloaded articles are uniformly converted to UTF-8 encoded plain text. Each plain text is first checked to determine whether it is a valid scientific document, that is, whether it contains a reference section; texts that contain no reference section, or that contain incomplete or erroneous references, are removed from the experimental document set, and the remaining texts are normalized. Citations pointing to the same article are identified with a simple baseline approach: all bibliography entries are traversed and ranked by length from longest to shortest; for each citation, its maximum match count against the preceding citations is determined; if this number exceeds a threshold, the citation is considered to refer to the same article as the matching preceding citation and is placed in the same group; otherwise it is treated as a new citation.

(2) Reference bibliographic segmentation and extraction

Given a plain UTF-8 file, the bibliography is located by a series of heuristics. The text is searched for a labeled reference section, where the label is 'References', 'Bibliography', 'Reference' or a common variation of these strings, and the text is repeatedly segmented at these labels. If a label is found too early in the document, a subsequent match is sought according to a parameter, which by default treats a label within the first 40% of the text as premature; the last match is taken as the starting point of the reference section. The program then searches for a subsequent section label, such as an appendix, figures, tables, acknowledgments, or the end of the file, to find the end of the reference section. The reference section at the end of the text is thus separated from the body, the separated reference strings are split into entries, and a bibliography list is created.

(3) Citation identification and extraction: citations are located by bibliography author and publication year, and extracted with a parser.

(4) The citation features used in the candidate document generation stage are bibliographic coupling, the longest common reference sequence, and citation blocks; plagiarism is judged by combining these three citation features to obtain the final plagiarism result. First, bibliographic coupling detection is performed on the reference sequence of a text; if the number of shared references is below a set threshold, the text is considered free of plagiarism and is removed from the experimental text set; otherwise, longest common reference sequence detection is performed. If the longest common reference sequence result is below a set threshold, the text is removed from the experimental text set; otherwise, the citations of the text are divided into blocks, the similarity is calculated from the number of shared references within the blocks, and the degree of plagiarism is assessed from that number.
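The three-stage screening in step (4) can be sketched as a simple cascade. This is a minimal illustration, not the patented implementation: the names `PairScores` and `cascade_decision` and the default threshold values are hypothetical, since the text only speaks of "a set threshold" for each stage.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PairScores:
    """Pre-computed citation features for one document pair (hypothetical container)."""
    shared_references: int   # bibliographic coupling strength
    lccs_length: int         # longest common reference sequence length
    max_block_overlap: int   # largest number of shared references in one block


def cascade_decision(scores: PairScores,
                     bc_threshold: int = 4,      # assumed value, not from the source
                     lccs_threshold: int = 3     # assumed value, not from the source
                     ) -> Tuple[str, Optional[int]]:
    """Return the stage at which the pair stopped and, if it passed
    both filters, the block overlap used as the plagiarism degree."""
    if scores.shared_references < bc_threshold:
        return ("bibliographic_coupling", None)    # removed in stage 1
    if scores.lccs_length < lccs_threshold:
        return ("longest_common_sequence", None)   # removed in stage 2
    return ("citation_blocks", scores.max_block_overlap)
```

A pair that is removed early never reaches the more expensive block comparison, which is the point of ordering the stages from coarse to fine.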

The invention has the beneficial effects that:

The method takes into account the particular role of references in academic writing: it relates documents to one another through their citations, an important feature of academic documents, and judges whether an article involves plagiarism by comparing how two documents use their citations. The invention is significant for the detection of academic misconduct and helps standardize academic practice and raise the level of scientific research.

Drawings

FIG. 1 is a schematic diagram of the originality detection technique;

FIG. 2 is a roadmap for a method of invasive detection;

FIG. 3 is a diagram of the citation-based originality detection module;

FIG. 4 is a block diagram of the overall framework of citation-based originality detection;

FIG. 5 is a system data flow diagram;

FIG. 6 is a plot of document pairs against bibliographic coupling strength;

FIG. 7 is a plot of relative bibliographic coupling strength;

FIG. 8 is a scatter plot of the document pairs with the highest similarity scores;

FIG. 9 is a plot of the longest common reference sequence length as a function of bibliographic coupling strength;

FIG. 10 is a plot of the maximum block length distribution;

FIG. 11 is a distribution plot of files with a citation block length greater than or equal to 4;

FIG. 12 is a scatter plot of the document pairs for which the citation block algorithm produces high similarity scores.

Detailed Description

The invention is described in more detail below with reference to the accompanying drawings.

With reference to fig. 1, 2, 3, 4 and 5, the present invention includes the following steps:

1. Corpus processing. Articles are located using a web search engine and heuristic rules. The downloaded articles require format conversion; for the convenience of the experiment, we convert them to UTF-8 encoded plain text. Each plain text is first checked to determine whether it is a valid scientific document, i.e., whether it contains a reference section. Files containing incomplete or erroneous references are also removed from the experimental corpus. The text is then normalized, and references pointing to the same article are identified and grouped using a simple baseline approach. This approach traverses all bibliography entries and ranks them by length from longest to shortest. For each citation, we determine its maximum match count against the preceding citations; if this number exceeds a threshold, the citation is considered to refer to the same article as the matching preceding citation and is grouped with it; otherwise it is treated as a new citation.
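The baseline grouping described above might look like the following sketch. The text does not say what is counted when a citation is "matched" against earlier ones, so this example assumes the count is the number of shared lowercase word tokens; the function names and the default threshold are likewise illustrative.

```python
import re


def tokens(reference):
    """Lowercase alphanumeric tokens of a reference string (assumed match unit)."""
    return set(re.findall(r"[a-z0-9]+", reference.lower()))


def group_references(references, threshold=5):
    """Group reference strings that appear to cite the same article.

    References are visited from longest to shortest. Each one is compared
    with the representative (first, longest) member of every existing group;
    if the best token overlap exceeds `threshold` it joins that group,
    otherwise it starts a new group.
    """
    groups = []
    for ref in sorted(references, key=len, reverse=True):
        ref_tokens = tokens(ref)
        best_group, best_overlap = None, 0
        for group in groups:
            overlap = len(ref_tokens & tokens(group[0]))
            if overlap > best_overlap:
                best_group, best_overlap = group, overlap
        if best_group is not None and best_overlap > threshold:
            best_group.append(ref)
        else:
            groups.append([ref])
    return groups
```

Sorting by length first means the most complete form of a reference becomes the group representative, so abbreviated variants are compared against the richest available string.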

2. Segmentation and extraction of the bibliography. Given a plain UTF-8 file, the bibliography is first located by a series of heuristics. The program searches the text for a labeled reference section. Labels include strings such as "References", "Bibliography", "Reference", or common variations of these strings. The text is repeatedly segmented at these labels. If a label is found too early in the document, subsequent matches are sought according to a configurable parameter, which by default treats a label within the first 40% of the text as premature. The final match is taken as the starting point of the reference section. The program then looks for the end of the reference section by searching for a subsequent section label, such as an appendix, figures, tables, acknowledgments, or the end of the file. Extracting information from the reference list is not trivial. Academic references contain many fields, such as author, title, journal, volume, year, and pages, and punctuation marks usually serve as the main field separators, which makes them semantically highly overloaded. For example, a period follows an author name or an abbreviated name and also marks the boundary between reference fields. Each reference string can be regarded as a collection of fields such as title, author, year, and journal. These fields are surface forms of strings, and the encoded data is recovered from clues such as punctuation marks.
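A minimal sketch of the section-finding heuristic follows. The regular expressions, the function names, and the reading of the 40% parameter (a heading in the first 40% of the text is ignored when a later one exists) are assumptions made for illustration; the one-entry-per-line split is also a simplification.

```python
import re

# Headings that open the reference section (assumed to stand alone on a line).
START = re.compile(r"^\s*(references?|bibliography)\s*:?\s*$",
                   re.IGNORECASE | re.MULTILINE)
# Headings that may follow the reference section.
END = re.compile(r"^\s*(appendix|appendices|figures?|tables?|acknowledge?ments?)\b.*$",
                 re.IGNORECASE | re.MULTILINE)


def extract_reference_section(text, min_position=0.4):
    """Return the text of the reference section, or None if no heading is found."""
    matches = list(START.finditer(text))
    # Prefer headings that are not "premature", i.e. not in the first 40%.
    late = [m for m in matches if m.start() >= min_position * len(text)]
    candidates = late or matches
    if not candidates:
        return None
    start = candidates[-1].end()          # the last match opens the section
    end_match = END.search(text, start)   # next section heading, if any
    end = end_match.start() if end_match else len(text)
    return text[start:end].strip()


def split_entries(section):
    """Split a reference section into entries, assuming one entry per line."""
    return [line.strip() for line in section.splitlines() if line.strip()]
```

Real reference lists often wrap entries over several lines, so a production splitter would need to key on entry markers such as "[1]" or on author-year patterns rather than on line breaks.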

3. Citation identification and extraction. Each bibliography entry has at least one corresponding citation in the body, and the citation contains the author's name. Although citations and bibliography entries both contain instances of author names, they occur in different contexts: in a citation the author names are part of a sentence, while in the reference list they are part of a bibliography entry. Different instances of the same entity, such as an author's name, constrain and complement each other. When identifying citations, the corresponding citations can therefore be found by locating them through the bibliography entries.
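Locating in-text citations by author and year could be sketched as below. The pattern covers only a few common author-year styles ("Smith (2010)", "Smith, 2010", "Lee et al., 2015", "Lee and Park (2015)"); the function name and the `(key, surname, year)` tuple layout are hypothetical.

```python
import re


def find_citations(body, bibliography):
    """Return bibliography keys in the order their citations appear in `body`.

    `bibliography` is a list of (key, first_author_surname, year) tuples.
    """
    hits = []
    for key, surname, year in bibliography:
        pattern = re.compile(
            rf"\b{re.escape(surname)}\b"
            r"(?:\s+et\s+al\.?|\s+(?:and|&)\s+[A-Z][\w-]+)?"  # optional co-authors
            rf",?\s*\(?{year}\b"                              # year, with or without "("
        )
        for match in pattern.finditer(body):
            hits.append((match.start(), key))
    hits.sort()  # order by position in the body
    return [key for _, key in hits]
```

The resulting list is the document's reference sequence, which is the input to the coupling, sequence, and block features described in the next step.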

4. Plagiarism detection based on citation features. Features are designed to correspond to the characteristics of citations.

Bibliographic coupling is one such feature: two documents are bibliographically coupled if they cite at least one common reference. The bibliographic coupling strength, i.e., the number of references two documents have in common, is a well-known similarity feature. The higher the coupling strength, the closer the citation relationship between the two documents. It is a global similarity measure and cannot localize similarity to a specific position in the text. Although citing the same reference does not necessarily mean that two authors draw on the same content of a work, documents with a higher bibliographic coupling strength are more likely to involve plagiarism than documents that share no references.

The order of and distance between citations are indicators that the text segments containing the corresponding citations are semantically similar. If the same references appear in a similar order in two texts, or at similar distances from one another, potential plagiarism is indicated.

Citation blocking divides the citations in a file into blocks. A citation block is a substring of a file's reference sequence and has variable size. Citation blocking uses shared citations as text anchors for heuristic detection, so as to reveal local citation patterns regardless of potential transposition and scaling of the citations.

The longest common subsequence, traditionally a similarity measure for text strings, is adapted to reference sequences. The longest common reference sequence is defined as the longest sequence of references that match in the same order in both documents, possibly interrupted by non-matching references. Each document pair has at most one longest common reference sequence. Combining these three approaches addresses various forms of plagiarism and yields the final plagiarism result.

The citation features used in the candidate document generation stage are bibliographic coupling, the longest common reference sequence, and citation blocks. The details are as follows.

1) Bibliographic coupling

First, a fair criterion is established to restrict the experimental texts, which makes citation-based similarity analysis easier. Establishing the criterion also reduces the experimental text set and thus speeds up the computation. We therefore analyze the documents using both absolute and relative coupling strength. FIG. 6 shows the relationship between document pairs and bibliographic coupling strength, where the mean absolute coupling strength is μ = 1.21 and the standard deviation is σ = 0.95. The distribution is strongly skewed toward low values: about 84% of the document pairs have a bibliographic coupling strength of 1.

The ultimate goal of the plagiarism detection system is to determine document similarity based on citations, and a higher bibliographic coupling strength can point to potentially suspicious documents. In general, the higher the coupling strength, the greater the likelihood of plagiarism between documents. We can set a threshold σ and select from the experimental text collection those documents whose absolute coupling strength is greater than or equal to σ, which narrows the experiment and excludes documents with fewer shared references. This practice, however, is somewhat flawed: documents that cite only a few references may still have a high relative bibliographic coupling strength, so we should analyze the distribution of bibliographic coupling strengths across all document pairs. See FIG. 7 for details.
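The absolute coupling strength is simply the size of the intersection of two reference sets. The text does not define how the relative strength is normalized, so the sketch below assumes division by the size of the smaller reference list; both function names are illustrative.

```python
def coupling_strength(refs_a, refs_b):
    """Absolute bibliographic coupling strength: number of shared references."""
    return len(set(refs_a) & set(refs_b))


def relative_coupling_strength(refs_a, refs_b):
    """Shared references as a fraction of the smaller reference list.

    The normalization is an assumption; the source does not specify one.
    """
    shared = coupling_strength(refs_a, refs_b)
    smaller = min(len(set(refs_a)), len(set(refs_b)))
    return shared / smaller if smaller else 0.0
```

With this normalization a short paper whose three references include two shared with another paper scores about 0.67, even though its absolute strength of 2 would fall below most absolute thresholds, which is the flaw the paragraph above describes.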

In the given detection setting, we treat all documents as potentially suspicious and compare them with one another. Bibliographic coupling provides a relatively coarse measure of similarity that can identify highly related documents that are identical or share many references.

Because the text collection is large, inspecting every document manually would be inefficient. By analyzing the matching texts, we found that about 50% of the documents share at most 4 references and 70% share at most 12. Therefore, within the experimental text set, only documents sharing 4-12 references need to be checked manually for potentially suspicious similarity.

2) Longest common reference sequence

The longest common reference sequence is a detection algorithm that, when matching references, tolerates slight transpositions and skips unmatched references. It measures global document similarity as a single value. To test the detection capability of the longest common reference sequence method, we used CF-Score and Cont-Score as the dimensions of a scatter plot.
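One way to compute such a sequence is the standard dynamic-programming longest common subsequence algorithm applied to lists of reference identifiers, as sketched here. Using the textbook algorithm is an assumption; the patent does not give its own procedure, and the function name is illustrative.

```python
def longest_common_reference_sequence(seq_a, seq_b):
    """Longest sequence of references appearing in the same order in both
    documents, allowing non-matching references in between."""
    rows, cols = len(seq_a), len(seq_b)
    # table[i][j] = LCS length of seq_a[:i] and seq_b[:j]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if seq_a[i] == seq_b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back through the table to recover one longest sequence.
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if seq_a[i - 1] == seq_b[j - 1]:
            result.append(seq_a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]
```

The length of the returned list is the single similarity value; the running time is proportional to the product of the two sequence lengths.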

Both CF-Score and Cont-Score depend on the number of matched references. Unlike most files, which have a significant number of matching references, files that share fewer references exhibit significant outliers in one or both dimensions. To prevent the scores of files with more references from masking the extreme values of files with fewer references, it would be reasonable to analyze the two groups separately. For the purposes of this evaluation, however, we did not perform this separation; instead we considered the files with the highest CF and Cont scores, regardless of the number of references they share. As shown in FIG. 8, we isolated the most significant outliers by thresholding the two dimensions of the scatter plot. We excluded the documents already detected in the bibliographic coupling analysis and then selected the files with CF-Score ≥ 480 and max(length, Cont-Score) ≥ 310.

To analyze how the number of shared references affects similarity based on the longest common reference sequence, we investigated the relationship between the absolute bibliographic coupling strength and the length of the longest common reference sequence. We ignored documents detected in the previous analysis stage and drew a scatter plot with absolute bibliographic coupling strength and longest common reference sequence length as dimensions, see FIG. 9. By visually inspecting the most prominent outliers, we defined heuristic thresholds for selecting files for manual inspection. A file pair is selected from the text collection if its longest common reference sequence length l > 64. We also select pairs with a shorter longest common reference sequence, l > 39, provided that their absolute bibliographic coupling strength is low, s_BC < 16.
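Expressed as code, the selection rule with the thresholds stated above reads as follows; the function and parameter names are illustrative.

```python
def selected_for_inspection(lccs_length, coupling_strength):
    """Apply the heuristic thresholds: l > 64, or l > 39 with s_BC < 16."""
    return lccs_length > 64 or (lccs_length > 39 and coupling_strength < 16)
```

The second condition picks out pairs in which a relatively long common sequence arises from comparatively few shared references.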

The analysis shows that the longest common reference sequence is a reliable method that effectively limits the search space of potential candidate files. Before establishing a particular threshold for the number of common references, it is useful to investigate further the relationship between bibliographic coupling strength and the length of the longest common reference sequence. Once this relationship is established, the threshold is a strong similarity indicator. Both bibliographic coupling strength and the longest common reference sequence repeatedly flag articles that share co-authors. Such articles tend to be newer versions of earlier studies, or identical or similar articles published in different journals. Both methods can accurately identify a high level of global document similarity.

3) Citation block

We segment the references of a text into blocks. A shared reference is added to the block under construction if it is separated from the last matching reference by n non-matching references, where n ≤ 1 or 1 < n ≤ s, and s is the number of references already contained in the block; otherwise a new block is started. Once the algorithm has divided a file into blocks, it compares each block of the file with each block of the other file, disregarding the order of the references. Like the other similarity algorithms, this algorithm requires a limit on the number of files with matching references. To define a suitable file exclusion threshold, we analyze the distribution of maximum block lengths, as shown in FIG. 10.
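The block-building rule can be sketched as follows. This is one reading of the rule, not a definitive implementation: it assumes that the non-matching references bridged by a block are kept inside it and counted in s, and the function names are hypothetical.

```python
def build_blocks(sequence, shared):
    """Split a reference sequence into citation blocks anchored on shared references.

    A shared reference joins the current block if the gap of non-matching
    references since the last shared one is <= 1 or <= the block's current size.
    """
    blocks, current, gap = [], [], []
    for ref in sequence:
        if ref not in shared:
            if current:            # gaps before the first shared reference are ignored
                gap.append(ref)
            continue
        if current and (len(gap) <= 1 or len(gap) <= len(current)):
            current.extend(gap)    # bridged non-matching references stay in the block
            current.append(ref)
        else:
            if current:
                blocks.append(current)
            current = [ref]        # start a new block at this shared reference
        gap = []
    if current:
        blocks.append(current)
    return blocks


def max_block_overlap(seq_a, seq_b):
    """Largest number of shared references between any block pair, ignoring order."""
    shared = set(seq_a) & set(seq_b)
    blocks_a = build_blocks(seq_a, shared)
    blocks_b = build_blocks(seq_b, shared)
    return max((len(set(a) & set(b)) for a in blocks_a for b in blocks_b),
               default=0)
```

Because the allowed gap grows with the block size, blocks in documents with many shared references can absorb ever longer runs of non-matching references.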

The distribution in the figure indicates that the citation blocks capture more matching citation patterns, which is the expected behavior. A significant number of text segments contain references that match at close distance, though not necessarily in the same order. Many files also have long citation blocks, which is a weakness of the citation block algorithm. When a block under construction already contains many matching references, the algorithm allows non-matching references to be appended to the block incrementally. This has a negative effect on files with many shared bibliography entries and citations, because in such cases the algorithm tends to form many long blocks.

In view of the distribution of files in FIG. 10, we exclude files whose shared citation blocks are short. We analyzed the distribution of files with a shared citation block length l ≥ 4, see FIG. 11.

The analysis shows that about 75% of the files share citation blocks of length 5. To narrow the subset further, we performed similarity measurements using the maximum block length and the sum of CF-Score and Cont-Score as the dimensions of the scatter plot. To avoid overlap with the results of the bibliographic coupling analysis, we only consider files whose CF-Score or block length is greater than 1. FIG. 12 shows the resulting scatter plot.

Consistent with the criteria used earlier to evaluate the similarity functions, we selected the outliers in FIG. 12. The analysis of maximum block length shows that, for files with many shared references, the citation block algorithm tends to degrade into a global similarity measure. The algorithm works well for highly related works that have no co-authors in common, which the longest common reference sequence algorithm cannot achieve. If a document has many common references, the citation blocks expand until most of the references in the document are contained in a block. Documents with higher similarity can instead be detected by bibliographic coupling or the longest common reference sequence.

The analysis of FIG. 12 shows that the shorter the citation block length of a text, the higher the CF-Score. For example, the CF-Score is higher when l < 16. The advantage of the citation block algorithm is that it can accurately locate local similarity. In summary, we found that citation blocks of length 3 or 4 localize local similarity with high accuracy. The citation block algorithm works best for texts with a low or average number of citations.

The invention takes into account the particular role of references in academic writing: it relates documents to one another through their citations, an important feature of academic documents, and judges whether an article involves plagiarism by comparing how two documents use their citations. The invention is significant for the detection of academic misconduct and helps standardize academic practice and raise the level of scientific research.

Claims

1. A citation-based paper originality detection method, characterized by comprising the following steps:

(1) Processing a corpus;

(2) Segmenting and extracting the bibliography;

(3) Identifying and extracting citations, locating the citations by bibliography author and publication year, and extracting them with a parser;

(4) The citation features used in the candidate document generation stage comprise: bibliographic coupling, the longest common reference sequence, and citation blocks; plagiarism is judged by combining the three citation features to obtain a final plagiarism result;

the corpus processing specifically comprises: searching for and downloading articles using a web search engine and heuristic rules; uniformly converting the downloaded articles to UTF-8 encoded plain text; for each plain text, first checking whether it is a valid scientific document, namely whether it contains a reference section, removing from the experimental document set any text that contains no reference section or that contains incomplete or erroneous references, and normalizing the remaining plain texts; identifying citations pointing to the same article with a simple baseline method, namely traversing all bibliography entries, ranking them by length from longest to shortest, and determining for each citation its maximum match count against the preceding citations; if this number exceeds a threshold, considering the citation to refer to the same article as the matching preceding citation and placing both in the same group, and otherwise treating the citation as a new citation;

the bibliography segmentation and extraction specifically comprises: given a plain UTF-8 file, locating the bibliography by a series of heuristics; searching the text for a labeled reference section, wherein the label is 'References', 'Bibliography', 'Reference' or a common variation of these strings, and repeatedly segmenting the text at these labels; if a label is found too early in the document, seeking a subsequent match according to a parameter, which by default treats a label within the first 40% of the text as premature; taking the last match as the starting point of the reference section; searching for a subsequent section label to find the end of the reference section, wherein the subsequent section label is an appendix, figures, tables, acknowledgments, or the end of the file, so that the reference section is separated from the body text, the separated reference strings are split into entries, and a bibliography list is created;

the plagiarism detection based on citation features specifically comprises: first, performing bibliographic coupling detection on the reference sequence of a text; if the number of shared references is below a set threshold, considering the text free of plagiarism and removing it from the experimental text set, and otherwise performing longest common reference sequence detection; if the longest common reference sequence detection result is below a set threshold, removing the text from the experimental text set, and otherwise dividing the citations of the text into blocks, calculating the similarity from the number of shared references within the blocks, and assessing the degree of plagiarism of the text from that number.