CN104050299A

CN104050299A - Method for paper duplicate checking

Info

Publication number: CN104050299A
Application number: CN201410319183.3A
Authority: CN
Inventors: 严敏; 林文荟; 杨华; 刘志程
Original assignee: JIANGSU WISEDU INFORMATION TECHNOLOGY Co Ltd
Current assignee: JIANGSU WISEDU INFORMATION TECHNOLOGY Co Ltd
Priority date: 2014-07-07
Filing date: 2014-07-07
Publication date: 2014-09-17

Abstract

The invention discloses a method for paper duplicate checking. In the method, the fingerprints of the sentences of a paper to be checked are compared with the fingerprints of the sentences of the papers in a text library to obtain the duplicated sentences and their positions in the original papers. The method then judges whether the gaps between the duplicated sentences in an original paper are smaller than M; if so, the paper to be checked is determined to duplicate content from the text library. The method judges duplicates and responds quickly, and because comparison is performed at the sentence level, the original papers can be identified even when the paper to be checked contains multiple excerpts taken from multiple original papers.

Description

A method for paper duplicate checking

Technical field

The present invention relates to paper duplicate-checking technology.

Background

At present there are three main kinds of paper duplicate-checking methods: methods based on string matching, methods based on document fingerprints, and methods based on semantic knowledge.

The method based on string matching is a statistical method. It first uses a string matching algorithm to count the strings in the document under test that match documents in the database, and then applies a similarity formula to obtain the result. This method places high demands on how the strings are chosen, and the time complexity of string matching algorithms is high, so it requires considerable resources and long computing time.

The method based on document fingerprints uses text that represents the semantics of a document as its "fingerprint" and identifies plagiarism by comparing fingerprints. When the fingerprints are selected, the hierarchical structure of the article may affect the selection and cause plagiarism to be missed.

The method based on semantic knowledge identifies plagiarism by analyzing the natural-language semantics of the article under test and of the database articles and comparing their degree of similarity. This method depends on computing natural-language similarity, and because of the complexity of the Chinese language, the correctness of its judgments is difficult to guarantee.

With current duplicate-checking technology, if the author of a paper selects as many documents as possible and excerpts some sentences from each reference into the same paragraph, the paper duplicate-checking system cannot detect this quickly.

Summary of the invention

Problem to be solved by the invention: if the author of a paper selects many documents and excerpts some sentences from each reference, current paper duplicate-checking systems cannot detect this quickly.

To solve the above problem, the present invention adopts the following scheme:

A method for paper duplicate checking, comprising the following steps:

S1: split each original text in the text library into sentences, and calculate the fingerprint of each sentence of the original text;

S2: split the paper to be checked into sentences, and calculate the fingerprint of each sentence of the paper to be checked;

S3: compare the fingerprint of each sentence of the paper to be checked with the fingerprint of each sentence of the original text, and determine the sentences whose fingerprints are identical in the original text and in the paper to be checked, together with their positions, thereby obtaining the repeated sentences and their positions in the original text;

S4: according to the positions of the repeated sentences in the original text, judge whether the interval between repeated sentences in the original text is less than M; if it is, the paper to be checked has content repeated from the original text; M is a predefined constant.

Further, the method for paper duplicate checking according to the present invention also comprises a step of building a sentence fingerprint library: each original text in the text library is split into sentences, and the fingerprint of each sentence of each original text is calculated to obtain the sentence fingerprint library. The sentence fingerprint library stores a mapping table of the fingerprints and positions of the sentences of each original text in the text library.

The technical effects of the present invention are as follows:

1. The present invention works by comparing fingerprints, so the computational cost is low and both duplicate judging and response are fast.

2. Judging at the level of individual sentences allows plagiarism to be identified more accurately.

3. The plagiarized paragraphs and sentences can be precisely recovered, providing strong evidence for paper duplicate checking.

4. The original papers can be identified even when excerpts are taken from multiple places in multiple original papers.

Brief description of the drawings

Fig. 1 is a flow chart of the paper duplicate-checking method of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawing.

The present invention compares the sentence fingerprints of the paper to be checked with those of the articles in the text library to obtain the repeated sentences and their positions in the original text, and then judges whether the interval between repeated sentences in the original text is less than M; if it is, the paper to be checked has content repeated from the text library. As shown in Fig. 1, the method comprises the following steps:

S1: calculate the fingerprint of each sentence of each original text in the text library;

S2: calculate the fingerprint of each sentence of the paper to be checked;

S3: find the repeated sentences and their positions in the original text;

S4: judge whether the interval between repeated sentences in the original text is less than M.

The original text here refers to a document text in the text library. In steps S1 and S2, calculating the fingerprints actually comprises two steps: splitting the text into sentences, and calculating the fingerprint of each sentence. Splitting the text into sentences means dividing the text into multiple sentences at separators. A separator can be a period, an exclamation mark, a question mark, a semicolon, a paragraph mark, and so on. The sentences obtained after the text is split are the units that are fingerprinted. Combining all the sentences of a text in order reproduces the original text. Calculating the sentence fingerprint means applying a hash function to the sentence. The hash function here is a one-way hash function, such as MD5, SHA-1, SHA-2 or SHA-3. Applying the hash function to a sentence yields the hash value of the sentence, and this hash value serves as the fingerprint of the sentence.
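The two operations described above, sentence splitting and fingerprint calculation, can be sketched in Python as follows. This is only an illustrative sketch: the function names `split_sentences` and `fingerprint`, the exact separator set, and the choice of SHA-256 (one member of the SHA-2 family named above) are assumptions of the example, not details fixed by the description.

```python
import hashlib
import re

# Assumed separator set: Chinese and Western period, exclamation mark,
# question mark and semicolon, plus the newline as a paragraph mark.
SEPARATOR_PATTERN = re.compile(r"[。！？；.!?;\n]+")


def split_sentences(text):
    """Split a text into sentences at the separators, dropping empty pieces."""
    pieces = SEPARATOR_PATTERN.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


def fingerprint(sentence):
    """Return the hex digest of a one-way hash of the sentence (SHA-256 here)."""
    return hashlib.sha256(sentence.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    for sentence in split_sentences("A is B. C is D! E?"):
        print(sentence, fingerprint(sentence)[:12])
```

Note that this simple pattern also splits at the dot of a decimal number or an abbreviation; a production splitter would likely need additional rules.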

The overall process in Fig. 1 is one embodiment of the present invention. More commonly, step S1 belongs to an initialization step, which can also be called the step of building the sentence fingerprint library. In this step, each original text in the text library is split into sentences, and the fingerprint of each sentence of each original text is calculated to obtain the sentence fingerprint library. The sentence fingerprint library stores a mapping table of the fingerprints and positions of the sentences of each original text in the text library. Once the sentence fingerprint library has been built during initialization, checking a given paper only requires performing steps S2, S3 and S4. The sentence fingerprint library can be stored in a database or held in memory. When it is stored in a database, it can use an independent database, or the sentence fingerprint information of each original text can be saved in the text library as an attribute of the text.
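As an illustration of the initialization step, the following Python sketch builds an in-memory sentence fingerprint library. The name `build_fingerprint_library`, the input shape (a dict from document ID to an already split list of sentences), and the use of SHA-256 are assumptions of this sketch. It keys the table by fingerprint so that a later lookup is a single dictionary access; the description itself only requires that fingerprints and positions be stored together.

```python
import hashlib
from collections import defaultdict


def fingerprint(sentence):
    """One-way hash of a sentence (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(sentence.encode("utf-8")).hexdigest()


def build_fingerprint_library(text_library):
    """Map each sentence fingerprint to the (doc_id, position) pairs where it occurs.

    text_library: dict of doc_id -> list of sentences, already split.
    Positions are 1-based sentence numbers within each original text.
    """
    library = defaultdict(list)
    for doc_id, sentences in text_library.items():
        for position, sentence in enumerate(sentences, start=1):
            library[fingerprint(sentence)].append((doc_id, position))
    return dict(library)


if __name__ == "__main__":
    demo = build_fingerprint_library({"p1": ["a", "b"], "p2": ["b", "c"]})
    print(demo[fingerprint("b")])
```

The same table could equally be written to a database table with columns for document ID, fingerprint and position, which corresponds to the independent-database option mentioned above.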

In step S3, the fingerprint of each sentence of the paper to be checked is compared with the fingerprint of each sentence of the original text to determine the sentences whose fingerprints are identical, together with their positions, which yields the repeated sentences and their positions in the original text. In step S4, based on those positions, it is judged whether the interval between repeated sentences in the original text is less than M; if it is, the paper to be checked has content repeated from the original text. M is a predefined constant and may be 2, 3 or 5. Steps S3 and S4 form a continuous process: the output of step S3 serves directly as the input on which step S4 bases its judgment. Steps S3 and S4 can be implemented in two ways. In the first, the sentence fingerprints of the paper to be checked are compared with each original text in the text library one at a time; as shown in Fig. 1, the judgment for the next original text is performed after the judgment for the current one finishes. In the second, step S3 first finds all sentences in the text library whose fingerprints match those of the paper to be checked, and step S4 then finds, in a single pass, every original text that satisfies the condition "the interval between repeated sentences in the original text is less than M". The first implementation suits the cases described above in which the sentence fingerprint information of each original text is saved in the text library as an attribute of the text, or in which no sentence fingerprint library is built. The second suits the cases in which the sentence fingerprint library is stored in an independent database or held in memory. The present invention prefers the second implementation. Note that the method may find more than one original text that shares content with the paper to be checked.
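Step S3 of the second implementation, where all matches are collected in one pass, might look like the following Python sketch. The function name `find_repeated_sentences` and the in-memory library shape (a dict from fingerprint to a list of `(doc_id, position)` pairs) are assumptions of the example; fingerprints are treated as opaque strings.

```python
def find_repeated_sentences(query_fingerprints, library):
    """Collect every (query_index, doc_id, position) whose fingerprints match.

    query_fingerprints: fingerprints of the paper to be checked, in order.
    library: dict of fingerprint -> list of (doc_id, position) pairs.
    Query indices are 1-based sentence numbers in the paper to be checked.
    """
    repeats = []
    for query_index, fp in enumerate(query_fingerprints, start=1):
        # A fingerprint absent from the library simply yields no matches.
        for doc_id, position in library.get(fp, []):
            repeats.append((query_index, doc_id, position))
    return repeats


if __name__ == "__main__":
    library = {"h13": [("p1", 3)], "h14": [("p1", 4)], "h26": [("p2", 6)]}
    print(find_repeated_sentences(["k1", "h13", "h14", "h26"], library))
```

Because each lookup is a dictionary access, the cost of this step grows with the number of sentences in the paper to be checked rather than with the size of the text library, which is one plausible reason for preferring this implementation.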

The process of the present invention is illustrated below with concrete data. Let the texts in the text library be $p_1, p_2, p_3, \dots, p_n$, and let the text of the paper to be checked be $r$. After each text in the text library is split into sentences, the result is as follows:

$P_1 = \{P_{1,1}, P_{1,2}, P_{1,3}, \dots, P_{1,m_1}\}$;

$P_2 = \{P_{2,1}, P_{2,2}, P_{2,3}, \dots, P_{2,m_2}\}$;

$P_3 = \{P_{3,1}, P_{3,2}, P_{3,3}, \dots, P_{3,m_3}\}$;

$\dots$

$P_n = \{P_{n,1}, P_{n,2}, P_{n,3}, \dots, P_{n,m_n}\}$.

Here $m_1, m_2, m_3, \dots, m_n$ are the numbers of sentences in texts $p_1, p_2, p_3, \dots, p_n$ respectively. After the fingerprints are calculated, the fingerprints of each text are as follows:

$P_1 = \{h_{1,1}, h_{1,2}, h_{1,3}, \dots, h_{1,m_1}\}$;

$P_2 = \{h_{2,1}, h_{2,2}, h_{2,3}, \dots, h_{2,m_2}\}$;

$P_3 = \{h_{3,1}, h_{3,2}, h_{3,3}, \dots, h_{3,m_3}\}$;

$\dots$

$P_n = \{h_{n,1}, h_{n,2}, h_{n,3}, \dots, h_{n,m_n}\}$.

The sentence fingerprint library, that is, the mapping table of the fingerprints and positions of the sentences of each original text in the text library, is as follows:

$\{P_1, h_{1,1}, 1\}$,

$\{P_1, h_{1,2}, 2\}$,

$\{P_1, h_{1,3}, 3\}$,

$\dots$

$\{P_1, h_{1,m_1}, m_1\}$,

$\{P_2, h_{2,1}, 1\}$,

$\dots$

$\{P_n, h_{n,m_n}, m_n\}$.

The text of the paper to be checked is split into the sentences $r = s_1, s_2, s_3, \dots, s_r$, and the fingerprints calculated for these sentences are $k_1, k_2, k_3, \dots, k_r$. Step S3 yields the following sequence of repeated sentences: $\{s_2, p_1, 3\}$, $\{s_3, p_1, 4\}$, $\{s_4, p_2, 6\}$, $\{s_8, p_2, 8\}$, $\{s_9, p_1, 7\}$. In each $\{\}$ structure of this sequence, the first item is the sentence number in the paper to be checked, the second is the ID of the original text in the text library, and the third is the number of the sentence in the original text. Among these repeated sentences, $s_2$ and $s_3$ are at an interval of 1 in original text $p_1$ (positions 3 and 4), $s_3$ and $s_9$ are at an interval of 3 in original text $p_1$ (positions 4 and 7), and $s_4$ and $s_8$ are at an interval of 2 in original text $p_2$ (positions 6 and 8). If M is 2, only the interval of 1 is less than M, so original text $p_1$ has content identical to text $r$. If M is 3, the interval of 2 in $p_2$ is also less than M, so original texts $p_1$ and $p_2$ both have content identical to text $r$.
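The judgment of step S4 on the repeated-sentence sequence of this worked example can be reproduced with a short Python sketch. It assumes, based on the intervals quoted in the example, that the interval is the difference between the positions of adjacent repeated sentences within the same original text, and that an original text is flagged when at least one such interval is less than M. The function name `originals_with_repetition` is hypothetical.

```python
from collections import defaultdict


def originals_with_repetition(repeats, m):
    """Return the IDs of original texts having an interval smaller than m.

    repeats: iterable of (query_index, doc_id, position) triples from step S3.
    Assumption: the interval is the position difference between adjacent
    repeated sentences of one original text, after sorting by position.
    """
    positions = defaultdict(set)
    for _query_index, doc_id, position in repeats:
        positions[doc_id].add(position)

    flagged = set()
    for doc_id, position_set in positions.items():
        ordered = sorted(position_set)
        intervals = [b - a for a, b in zip(ordered, ordered[1:])]
        if any(interval < m for interval in intervals):
            flagged.add(doc_id)
    return flagged


# Repeated-sentence sequence from the worked example.
EXAMPLE_REPEATS = [(2, "p1", 3), (3, "p1", 4), (4, "p2", 6), (8, "p2", 8), (9, "p1", 7)]

if __name__ == "__main__":
    print(sorted(originals_with_repetition(EXAMPLE_REPEATS, 2)))  # ['p1']
    print(sorted(originals_with_repetition(EXAMPLE_REPEATS, 3)))  # ['p1', 'p2']
```

Under this reading, an original text with only a single repeated sentence has no interval and is never flagged, whatever the value of M.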

Claims

1. A method for paper duplicate checking, characterized in that it comprises the following steps:

2. The method for paper duplicate checking according to claim 1, characterized in that it also comprises a step of building a sentence fingerprint library; in the step of building the sentence fingerprint library, each original text in the text library is split into sentences, and the fingerprint of each sentence of each original text is calculated to obtain the sentence fingerprint library; the sentence fingerprint library stores a mapping table of the fingerprints and positions of the sentences of each original text in the text library.