CN104050299A - Method for paper duplicate checking - Google Patents

Method for paper duplicate checking

Info

Publication number
CN104050299A
Authority
CN
China
Prior art keywords
subordinate sentence
fingerprint
original text
text
checked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410319183.3A
Other languages
Chinese (zh)
Inventor
严敏
林文荟
杨华
刘志程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU WISEDU INFORMATION TECHNOLOGY Co Ltd
Original Assignee
JIANGSU WISEDU INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU WISEDU INFORMATION TECHNOLOGY Co Ltd filed Critical JIANGSU WISEDU INFORMATION TECHNOLOGY Co Ltd
Priority to CN201410319183.3A priority Critical patent/CN104050299A/en
Publication of CN104050299A publication Critical patent/CN104050299A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for checking papers for duplication. The fingerprints of the sentences of a paper to be checked are compared with the fingerprints of the sentences of the papers in a text library, yielding the duplicated sentences and their positions in the original papers. The method then judges whether the interval between duplicated sentences in an original paper is smaller than M; if it is, the paper to be checked is determined to duplicate content from the text library. Because the method relies on fingerprint comparison, duplicate judgment and response are fast; because the comparison is made at the sentence level, the source papers can be identified even when the paper excerpts several passages from several original papers.

Description

A method for paper duplicate checking
Technical field
The present invention relates to paper duplicate-checking technology.
Background art
At present there are three main approaches to paper duplicate checking: methods based on string matching, methods based on document fingerprints, and methods based on semantic knowledge.
The string-matching approach is statistical. It first uses a string-matching algorithm to count the strings in the document under test that match documents in the database, and then applies a similarity formula to obtain the result. This approach is very sensitive to how the strings are chosen, and string-matching algorithms have relatively high time complexity, so it requires considerable resources and long computation time.
The document-fingerprint approach takes text that represents the semantics of a document as its "fingerprint" and detects plagiarism by comparing fingerprints. The choice of fingerprint may be affected by the hierarchical structure of the article, which can lead to missed detections.
The semantic-knowledge approach detects plagiarism by analyzing the natural-language semantic similarity between the article under test and the articles in the database. It depends on computing natural-language similarity, and because of the complexity of the Chinese language the correctness of its judgments is hard to guarantee.
With current duplicate-checking techniques, if the author of a paper draws on as many documents as possible within a single paragraph, excerpting a few sentences from each reference, the duplication cannot be detected quickly by a paper duplicate-checking system.
Summary of the invention
Problem to be solved by the invention: if the author of a paper selects many documents and excerpts a few sentences from each reference, current paper duplicate-checking systems cannot detect this quickly.
To solve this problem, the invention adopts the following scheme.
A method for paper duplicate checking, comprising the following steps:
S1: split each original text in the text library into sentences, and compute the fingerprint of each sentence of the original text;
S2: split the article to be checked into sentences, and compute the fingerprint of each sentence of the article to be checked;
S3: compare the fingerprint of each sentence of the article to be checked with the fingerprint of each sentence of the original text, and identify the sentences whose fingerprints are identical together with their positions, obtaining the repeated sentences and the positions of the repeated sentences in the original text;
S4: from the positions of the repeated sentences in the original text, judge whether the interval between repeated sentences in the original text is less than M; if it is, the article to be checked duplicates content from the original text; M is a predefined constant.
Further, the method according to the invention also comprises a step of building a sentence fingerprint library: each original text in the text library is split into sentences, and the fingerprint of every sentence of every original text is computed, yielding the sentence fingerprint library. The sentence fingerprint library stores, for each original text in the text library, a mapping table of sentence fingerprints and sentence positions.
The technical effects of the invention are as follows:
1. Because the invention works by comparing fingerprints, computational cost is low, duplicate judgment is fast, and response is fast.
2. Because the judgment is made at sentence granularity, plagiarism can be identified more accurately.
3. The plagiarized paragraphs and sentences can be recovered precisely, providing strong evidence for paper duplicate checking.
4. The source papers can be identified even when passages are excerpted at multiple places from multiple original papers.
Description of the drawings
Fig. 1 is a flowchart of the paper duplicate-checking method of the invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawing.
The invention compares the sentence fingerprints of the article to be checked with those of the articles in the text library to obtain the repeated sentences and their positions in the original text, and then judges whether the interval between repeated sentences in the original text is less than M; if it is, the article to be checked duplicates content in the text library. As shown in Fig. 1, the method comprises the steps:
S1: compute the fingerprint of each sentence of the original texts in the text library;
S2: compute the fingerprint of each sentence of the article to be checked;
S3: find the repeated sentences and their positions in the original text;
S4: judge whether the interval between repeated sentences in the original text is less than M.
Here, an original text is a document text in the text library. In steps S1 and S2, computing the fingerprints actually comprises two sub-steps: splitting the text into sentences, and computing the fingerprint of each sentence. Splitting means dividing the text into multiple sentences at delimiters. A delimiter may be a full stop, an exclamation mark, a question mark, a semicolon, a paragraph mark, and so on. Each piece obtained by splitting the text is called a sentence. Concatenating all the sentences of a text in order reconstitutes the original text. Computing a sentence fingerprint means applying a hash function to the sentence. The hash function here is a one-way hash function, such as MD5, SHA-1, SHA-2 or SHA-3. The hash value obtained by applying the hash function to a sentence serves as the fingerprint of that sentence.
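The two sub-steps of S1 and S2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `split_sentences` and `fingerprint` are hypothetical, SHA-256 is chosen here as one member of the SHA-2 family, and the delimiter set (ASCII and full-width punctuation plus line breaks as paragraph marks) is an assumption. Unlike the description, this sketch discards the delimiters, so concatenating the pieces does not reproduce the text exactly, and a bare "." also splits decimals and abbreviations.

```python
import hashlib
import re

# Assumed delimiter set: full stop, exclamation mark, question mark and
# semicolon in ASCII and full-width forms, plus line breaks as paragraph marks.
_DELIMITERS = re.compile(r"[.!?;。！？；\n]+")


def split_sentences(text: str) -> list[str]:
    """Split text at the delimiters and drop empty pieces."""
    return [part.strip() for part in _DELIMITERS.split(text) if part.strip()]


def fingerprint(sentence: str) -> str:
    """Return the SHA-256 hex digest of the sentence as its fingerprint."""
    return hashlib.sha256(sentence.encode("utf-8")).hexdigest()
```

Because the fingerprint is an exact hash, two sentences match only when they are character-for-character identical after splitting; any normalization (whitespace, punctuation width) would have to be applied before hashing.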
The overall process in Fig. 1 is one embodiment of the invention. More commonly, step S1 belongs to an initialization step, which may also be called the step of building the sentence fingerprint library. In this step, each original text in the text library is split into sentences, and the fingerprint of every sentence of every original text is computed, yielding the sentence fingerprint library. The sentence fingerprint library stores, for each original text in the text library, a mapping table of sentence fingerprints and sentence positions. Once the library has been built during initialization, checking a given article requires only steps S2, S3 and S4. The sentence fingerprint library may be stored in a database or held in memory. When it is stored in a database, it may be kept in a separate database, or the sentence fingerprint information of each original text may be saved in the text library as an attribute of that text.
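An in-memory sentence fingerprint library could look like the sketch below. It is an illustration under stated assumptions: the name `build_fingerprint_library` and the dictionary layout (fingerprint mapped to a list of `(original_id, position)` pairs) are hypothetical, the originals are assumed to be already split into sentences, positions are 1-based, and SHA-256 stands in for whichever one-way hash is used.

```python
import hashlib
from collections import defaultdict


def build_fingerprint_library(
    originals: dict[str, list[str]],
) -> dict[str, list[tuple[str, int]]]:
    """Map each sentence fingerprint to the (original_id, position) pairs
    where that sentence occurs. Positions are 1-based sentence indices."""
    library: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for original_id, sentences in originals.items():
        for position, sentence in enumerate(sentences, start=1):
            digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
            library[digest].append((original_id, position))
    return dict(library)
```

Keying the table by fingerprint rather than by original text makes the later lookup of a query sentence a single dictionary access, which fits the case where the library is held in memory or in a separate database.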
In step S3, the fingerprint of each sentence of the article to be checked is compared with the fingerprint of each sentence of the original text, the sentences with identical fingerprints and their positions are identified, and the repeated sentences and their positions in the original text are obtained. In step S4, the positions of the repeated sentences in the original text are used to judge whether the interval between repeated sentences in the original text is less than M; if it is, the article to be checked duplicates content from the original text. M is a predefined constant and may be, for example, 2, 3 or 5. Steps S3 and S4 form a continuous process: the output of step S3 serves directly as the input on which step S4 bases its judgment.
Steps S3 and S4 can be implemented in two ways. In the first, the sentence fingerprints of the article to be checked are compared with each original text in the text library one at a time; this is the embodiment shown in Fig. 1, in which the judgment for the next original text begins once the judgment for the current one has finished. In the second, step S3 first finds all sentences in the text library whose fingerprints match those of the article to be checked, and step S4 then finds, in a single pass, every original text that satisfies the condition "the interval between repeated sentences in the original text is less than M". The first embodiment suits the aforementioned cases in which "the sentence fingerprint information of each original text is saved in the text library as an attribute of the text" or in which no sentence fingerprint library is built; the second suits the aforementioned cases in which "the sentence fingerprint library is stored in a separate database" or "the sentence fingerprint library is held in memory". The invention prefers the second embodiment. Note that the method may find more than one original text that shares content with the article to be checked.
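The second embodiment of steps S3 and S4 can be sketched as below. The names `find_repeats` and `flag_originals` are hypothetical, and the library is assumed to be a dictionary from fingerprint to a list of `(original_id, position)` pairs. Two interpretive assumptions are made: the "interval" is taken as the difference between adjacent matched positions within one original text, and several query sentences matching the same original sentence are counted once.

```python
from collections import defaultdict


def find_repeats(
    query_fingerprints: list[str],
    library: dict[str, list[tuple[str, int]]],
) -> list[tuple[int, str, int]]:
    """S3: return (query_index, original_id, position) for every match."""
    repeats = []
    for index, fp in enumerate(query_fingerprints, start=1):
        for original_id, position in library.get(fp, []):
            repeats.append((index, original_id, position))
    return repeats


def flag_originals(repeats: list[tuple[int, str, int]], m: int) -> list[str]:
    """S4: return the originals in which two matched sentences lie
    less than m positions apart."""
    positions: dict[str, set[int]] = defaultdict(set)
    for _, original_id, position in repeats:
        positions[original_id].add(position)  # set: count each position once
    flagged = []
    for original_id, matched in positions.items():
        ordered = sorted(matched)
        if any(b - a < m for a, b in zip(ordered, ordered[1:])):
            flagged.append(original_id)
    return sorted(flagged)
```

Under this reading an original text with a single matched sentence is never flagged, since there is no pair of positions to measure an interval between.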
The process of the invention is illustrated below with concrete data. Let the texts in the text library be p_1, p_2, p_3, ..., p_n, and let the text of the article to be checked be r. Splitting each text in the text library into sentences gives:
p_1 = { p_{1,1}, p_{1,2}, p_{1,3}, ..., p_{1,m_1} };
p_2 = { p_{2,1}, p_{2,2}, p_{2,3}, ..., p_{2,m_2} };
p_3 = { p_{3,1}, p_{3,2}, p_{3,3}, ..., p_{3,m_3} };
...
p_n = { p_{n,1}, p_{n,2}, p_{n,3}, ..., p_{n,m_n} }.
Here m_1, m_2, m_3, ..., m_n are the numbers of sentences in texts p_1, p_2, p_3, ..., p_n respectively. Computing the fingerprints gives the following for each text:
p_1: { h_{1,1}, h_{1,2}, h_{1,3}, ..., h_{1,m_1} };
p_2: { h_{2,1}, h_{2,2}, h_{2,3}, ..., h_{2,m_2} };
p_3: { h_{3,1}, h_{3,2}, h_{3,3}, ..., h_{3,m_3} };
...
p_n: { h_{n,1}, h_{n,2}, h_{n,3}, ..., h_{n,m_n} }.
The sentence fingerprint library, that is, the mapping table of sentence fingerprints and sentence positions for each original text in the text library, is as follows:
{ p_1, h_{1,1}, 1 },
{ p_1, h_{1,2}, 2 },
{ p_1, h_{1,3}, 3 },
...,
{ p_1, h_{1,m_1}, m_1 },
{ p_2, h_{2,1}, 1 },
...,
{ p_n, h_{n,m_n}, m_n }.
Splitting the text of the article to be checked into sentences gives r = { s_1, s_2, s_3, ..., s_r }, and the fingerprints of its sentences are k_1, k_2, k_3, ..., k_r. Step S3 yields the following sequence of repeated sentences: {s_2, p_1, 3}, {s_3, p_1, 4}, {s_4, p_2, 6}, {s_8, p_2, 8}, {s_9, p_1, 7}. In each { } triple, the first item is the sentence index in the article to be checked, the second is the ID of the original text in the text library, and the third is the index of the sentence within the original text. Among these repeated sentences, s_2 and s_3 are at interval 1 in original text p_1 (positions 3 and 4), s_3 and s_9 are at interval 3 in p_1 (positions 4 and 7), and s_4 and s_8 are at interval 2 in original text p_2 (positions 6 and 8). If M is 2, only p_1 contains an interval less than M, so original text p_1 shares content with text r. If M is 3, both p_1 and p_2 share content with text r.
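The worked example can be reproduced with a few lines of Python. The triples are the ones given in the text; the function names `intervals_by_original` and `originals_with_duplication` are hypothetical, and the interval is assumed to be the difference between adjacent matched positions within one original text.

```python
from collections import defaultdict

# (sentence index in r, original text ID, sentence index in the original)
REPEATS = [(2, "p1", 3), (3, "p1", 4), (4, "p2", 6), (8, "p2", 8), (9, "p1", 7)]


def intervals_by_original(repeats):
    """Return, per original text, the gaps between adjacent matched positions."""
    positions = defaultdict(list)
    for _, original_id, position in repeats:
        positions[original_id].append(position)
    gaps = {}
    for original_id, matched in positions.items():
        ordered = sorted(matched)
        gaps[original_id] = [b - a for a, b in zip(ordered, ordered[1:])]
    return gaps


def originals_with_duplication(repeats, m):
    """Return the originals that have at least one interval smaller than m."""
    gaps = intervals_by_original(repeats)
    return sorted(oid for oid, g in gaps.items() if any(x < m for x in g))


print(intervals_by_original(REPEATS))          # {'p1': [1, 3], 'p2': [2]}
print(originals_with_duplication(REPEATS, 2))  # ['p1']
print(originals_with_duplication(REPEATS, 3))  # ['p1', 'p2']
```

The output matches the conclusion in the text: with M = 2 only p_1 qualifies, and with M = 3 both p_1 and p_2 do.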

Claims (2)

1. A method for paper duplicate checking, characterized in that it comprises the following steps:
S1: split each original text in the text library into sentences, and compute the fingerprint of each sentence of the original text;
S2: split the article to be checked into sentences, and compute the fingerprint of each sentence of the article to be checked;
S3: compare the fingerprint of each sentence of the article to be checked with the fingerprint of each sentence of the original text, and identify the sentences whose fingerprints are identical together with their positions, obtaining the repeated sentences and the positions of the repeated sentences in the original text;
S4: from the positions of the repeated sentences in the original text, judge whether the interval between repeated sentences in the original text is less than M; if it is, the article to be checked duplicates content from the original text; M is a predefined constant.
2. The method for paper duplicate checking according to claim 1, characterized in that it further comprises a step of building a sentence fingerprint library, in which each original text in the text library is split into sentences and the fingerprint of every sentence of every original text is computed, yielding the sentence fingerprint library; the sentence fingerprint library stores, for each original text in the text library, a mapping table of sentence fingerprints and sentence positions.
CN201410319183.3A 2014-07-07 2014-07-07 Method for paper duplicate checking Pending CN104050299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410319183.3A CN104050299A (en) 2014-07-07 2014-07-07 Method for paper duplicate checking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410319183.3A CN104050299A (en) 2014-07-07 2014-07-07 Method for paper duplicate checking

Publications (1)

Publication Number Publication Date
CN104050299A true CN104050299A (en) 2014-09-17

Family

ID=51503131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410319183.3A Pending CN104050299A (en) 2014-07-07 2014-07-07 Method for paper duplicate checking

Country Status (1)

Country Link
CN (1) CN104050299A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040039933A1 (en) * 2002-08-26 2004-02-26 Cricket Technologies Document data profiler apparatus, system, method, and electronically stored computer program product
CN101076800A (en) * 2004-08-23 2007-11-21 汤姆森环球资源公司 Repetitive file detecting and displaying function

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
俞昊旻: "文档部分重复检测研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699785A (en) * 2015-03-10 2015-06-10 中国石油大学(华东) Paper similarity detection method
CN106776880A (en) * 2016-11-22 2017-05-31 广东技术师范学院 A kind of paper based on picture and text identification reviews system and method
CN107038216A (en) * 2017-03-09 2017-08-11 百度在线网络技术(北京)有限公司 Paper duplicate checking method, device, equipment and storage medium
CN107038216B (en) * 2017-03-09 2021-10-26 百度在线网络技术(北京)有限公司 Thesis duplicate checking method, device, equipment and storage medium
CN107169065A (en) * 2017-05-05 2017-09-15 腾讯科技(深圳)有限公司 The minimizing technology and device of a kind of certain content
CN107169065B (en) * 2017-05-05 2022-06-14 腾讯科技(深圳)有限公司 Method and device for removing specific content
CN107871002B (en) * 2017-11-10 2021-03-30 哈尔滨工程大学 Fingerprint fusion-based cross-language plagiarism detection method
CN107871002A (en) * 2017-11-10 2018-04-03 哈尔滨工程大学 A kind of across language plagiarism detection method based on fingerprint fusion
CN110019674A (en) * 2017-11-21 2019-07-16 盛霆信息技术(上海)有限公司 A kind of text plagiarizes detection method and system
CN108984493A (en) * 2018-07-19 2018-12-11 中国联合网络通信集团有限公司 A kind of Chinese articles duplicate checking method and system
CN108984493B (en) * 2018-07-19 2022-04-29 中国联合网络通信集团有限公司 Chinese article duplicate checking method and system
CN109471921A (en) * 2018-11-23 2019-03-15 深圳市元征科技股份有限公司 A kind of text duplicate checking method, device and equipment
CN110162752A (en) * 2019-05-13 2019-08-23 百度在线网络技术(北京)有限公司 Article sentences weight processing method, device and electronic equipment
CN114357977A (en) * 2022-03-18 2022-04-15 北京创新乐知网络技术有限公司 Method, system, equipment and storage medium for realizing anti-plagiarism
CN114357977B (en) * 2022-03-18 2022-06-14 北京创新乐知网络技术有限公司 Method, system, equipment and storage medium for realizing anti-plagiarism

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140917
