CN104050299A - Method for paper duplicate checking - Google Patents
Method for paper duplicate checking
- Publication number
- CN104050299A (application CN201410319183.3A)
- Authority
- CN
- China
- Prior art keywords
- subordinate sentence
- fingerprint
- original text
- text
- checked
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for paper duplicate checking. According to the method, fingerprint comparison is conducted on sentences of a paper to be checked and sentences in papers in a text library so that duplicated sentences and the positions of the duplicated sentences in the original papers can be obtained; then, whether gaps between the duplicated sentences in the original papers are smaller than M or not is judged, if the gaps between the duplicated sentences in the original papers are smaller than M, it is determined that the paper to be checked is duplicated from the text library. According to the method for paper duplicate checking, the duplicate judging speed and the response speed are high, comparison is conducted in a sentence level, and therefore the extracted original papers can be found from a plurality of extractions of a plurality of original papers.
Description
Technical field
The present invention relates to paper duplicate-checking technology.
Background art
At present there are three main approaches to paper duplicate checking: methods based on string matching, methods based on document fingerprints, and methods based on semantic knowledge.
The string-matching approach is based on mathematical statistics. It first uses a string-matching algorithm to count the strings in the document under test that match documents in the database, and then applies a similarity formula to obtain the result. This approach places high demands on how the strings are chosen, and the time complexity of string-matching algorithms is relatively high, so it requires considerable resources and long computation times.
The document-fingerprint approach uses text that represents the document's semantics as a "fingerprint" and identifies plagiarism by comparing fingerprints. Fingerprint selection may be affected by the hierarchical structure of the article, which can lead to missed detections.
The semantic-knowledge approach identifies plagiarism by analyzing natural-language semantics and comparing the similarity between the article under test and the articles in the database. It depends on computing natural-language similarity, and because of the complexity of the Chinese language, the correctness of judgments based on semantic knowledge is difficult to guarantee.
With current duplicate-checking techniques, if an author draws on as many documents as possible within a single paragraph, excerpting only some clauses from each reference into that paragraph, a paper duplicate-checking system cannot detect it quickly.
Summary of the invention
Problem to be solved by the invention: if an author selects many documents and excerpts only some clauses from each reference, current paper duplicate-checking systems cannot detect it quickly.
To solve the above problem, the invention adopts the following scheme:
A method for paper duplicate checking, comprising the following steps:
S1: split the original in the text library into clauses, and compute the fingerprint of each clause of the original;
S2: split the article to be checked into clauses, and compute the fingerprint of each clause of the article to be checked;
S3: compare the fingerprint of each clause of the article to be checked with the fingerprint of each clause of the original, determine the clauses whose fingerprints are identical in the original and in the article to be checked together with their positions, and obtain the repeated clauses and the positions of the repeated clauses in the original;
S4: according to the positions of the repeated clauses in the original, judge whether the interval between repeated clauses in the original is less than M; if the interval is less than M, the article to be checked repeats content from the original; M is a predefined constant.
Further, the method according to the invention also comprises a step of building a clause fingerprint library. In this step, each original in the text library is split into clauses and the fingerprint of each clause of each original is computed, yielding the clause fingerprint library. The clause fingerprint library stores the fingerprints of the clauses of each original in the text library together with a position mapping table for the clauses.
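The library-building step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the dictionary layout (fingerprint mapped to a list of original ID and position pairs), 1-based positions, and the choice of MD5 are all assumptions made for the example, and the originals are assumed to be already split into clauses.

```python
import hashlib
from collections import defaultdict

def fingerprint(clause):
    """Hash one clause (MD5 chosen here only for illustration)."""
    return hashlib.md5(clause.encode("utf-8")).hexdigest()

def build_fingerprint_library(library):
    """Build the clause fingerprint library.

    library: dict mapping an original's ID to its list of clauses.
    Returns a dict mapping fingerprint -> list of (original ID, position),
    with positions counted from 1 (an assumption of this sketch).
    """
    index = defaultdict(list)
    for doc_id, clauses in library.items():
        for position, clause in enumerate(clauses, start=1):
            index[fingerprint(clause)].append((doc_id, position))
    return dict(index)

# Hypothetical two-document text library.
library = {"p1": ["alpha", "beta"], "p2": ["beta", "gamma"]}
index = build_fingerprint_library(library)
```

Keying the table by fingerprint means each clause of an article to be checked can be looked up directly, without scanning every original.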
The technical effects of the invention are as follows:
1. The invention works by fingerprint comparison, so computational overhead is low, duplicate determination is fast, and response is fast.
2. Discrimination at clause-level precision identifies plagiarism more accurately.
3. The plagiarized paragraphs and clauses can be recovered precisely, providing strong evidence for paper duplicate checking.
4. Where a paper contains multiple excerpts from multiple original papers, the excerpted original papers can be found.
Brief description of the drawings
Fig. 1 is a flowchart of the paper duplicate-checking method of the invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawing.
The invention compares the clause fingerprints of the article to be checked with those of the articles in the text library to obtain the repeated clauses and their positions in the original, and then judges whether the interval between repeated clauses in the original is less than M. If it is, the article to be checked repeats content from the text library. As shown in Fig. 1, the method comprises the steps:
S1: compute the fingerprint of each clause of the originals in the text library;
S2: compute the fingerprint of each clause of the article to be checked;
S3: find the repeated clauses and the positions of the repeated clauses in the original;
S4: judge whether the interval between repeated clauses in the original is less than M.
The original here refers to a document text in the text library. In steps S1 and S2, computing the fingerprints actually comprises two steps: splitting the text into clauses, and computing the clause fingerprints. Splitting the text into clauses means dividing the text into multiple sentences at delimiters. The delimiters may be the full stop, exclamation mark, question mark, semicolon, paragraph mark, and so on. The sentences obtained after splitting are called clauses. All clauses of a text, joined in order, reproduce the original text. Computing a clause fingerprint means applying a hash function to the clause. The hash function here is a one-way hash function such as MD5, SHA-1, SHA-2 or SHA-3. Applying the hash function to a clause yields the hash value of the clause, and this hash value serves as the fingerprint of the clause.
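The two sub-steps just described, splitting and hashing, might look like the sketch below. The delimiter set, the whitespace stripping, and the function names are assumptions made for illustration. The text names MD5 as one permissible hash, and `hashlib.md5` is used here for that reason.

```python
import hashlib
import re

# Delimiters named in the description (full stop, exclamation mark,
# question mark, semicolon, paragraph break), in Chinese and ASCII forms.
DELIMITERS = r"[。！？；.!?;\n]+"

def split_clauses(text):
    """Split text into clauses at the delimiters, dropping empty pieces."""
    return [c.strip() for c in re.split(DELIMITERS, text) if c.strip()]

def fingerprint(clause):
    """Return the MD5 hex digest of a clause as its fingerprint."""
    return hashlib.md5(clause.encode("utf-8")).hexdigest()

clauses = split_clauses("First clause. Second clause! Third clause?")
prints = [fingerprint(c) for c in clauses]
```

This sketch discards the delimiters and surrounding whitespace, so the clauses do not rejoin into the exact original string. It also treats every ASCII full stop as a delimiter, which would split decimal numbers. A real splitter would need rules for such cases.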
The overall flow in Fig. 1 is one embodiment of the invention. More commonly, step S1 belongs to an initialization step, which may also be called the step of building the clause fingerprint library. In this step, each original in the text library is split into clauses and the fingerprint of each clause of each original is computed, yielding the clause fingerprint library. The clause fingerprint library stores the fingerprints of the clauses of each original in the text library together with a position mapping table for the clauses. Once this initialization step has been carried out, checking a given article requires only steps S2, S3 and S4. The clause fingerprint library may be kept in a database or in memory. When it is kept in a database, it may be stored in a separate database, or the clause fingerprint information of each original may be saved in the text library as an attribute of the text.
Step S3 compares the fingerprint of each clause of the article to be checked with the fingerprint of each clause of the original, determines the clauses whose fingerprints are identical in both together with their positions, and thereby obtains the repeated clauses and their positions in the original. Step S4 uses those positions to judge whether the interval between repeated clauses in the original is less than M; if it is, the article to be checked repeats content from the original. M is a predefined constant and may be 2, 3 or 5. Steps S3 and S4 form a continuous process: the output of step S3 serves directly as the input for the duplicate judgment in step S4. Steps S3 and S4 can be implemented in two ways. In the first, the clause fingerprints of the article to be checked are compared against each original in the text library one at a time; this is the implementation shown in Fig. 1, in which the judgment for the next original begins once the judgment for the current original is finished. In the second, step S3 first finds all clauses in the text library whose fingerprints match those of the article to be checked, and step S4 then finds, in a single pass, every original that satisfies the condition "the interval between repeated clauses in the original is less than M". The first implementation suits the cases described above in which the clause fingerprint information of each original is saved in the text library as an attribute of the text, or in which no clause fingerprint library is built. The second suits the cases in which the clause fingerprint library is stored in a separate database or kept in memory. The invention prefers the second implementation. Note that the method may find more than one original having content identical to the article to be checked.
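A rough sketch of the second implementation follows: one lookup pass for step S3 and one grouping pass for step S4. The names, the in-memory dictionary standing in for the clause fingerprint library, and the placeholder fingerprints `h0` to `h3` are hypothetical. The sketch also assumes that "interval" means the difference between adjacent matched positions in an original, and that one such pair below M is enough to flag that original.

```python
from collections import defaultdict

def find_repeated(query_prints, index):
    """S3: return (query position, original ID, position in original)
    for every clause fingerprint of the article found in the library."""
    matches = []
    for q_pos, fp in enumerate(query_prints, start=1):
        for doc_id, o_pos in index.get(fp, []):
            matches.append((q_pos, doc_id, o_pos))
    return matches

def flag_originals(matches, m):
    """S4: return the originals in which two matched clauses lie
    less than m positions apart."""
    positions = defaultdict(set)
    for _, doc_id, o_pos in matches:
        positions[doc_id].add(o_pos)
    flagged = []
    for doc_id, pos in positions.items():
        ordered = sorted(pos)
        if any(b - a < m for a, b in zip(ordered, ordered[1:])):
            flagged.append(doc_id)
    return sorted(flagged)

# Hypothetical library: fingerprint -> [(original ID, position)].
index = {"h1": [("p1", 3)], "h2": [("p1", 4)], "h3": [("p2", 6)]}
matches = find_repeated(["h0", "h1", "h2", "h3"], index)
flagged = flag_originals(matches, m=2)
```

The first implementation would instead loop over the originals and run the same interval check once per original, which avoids the global table at the cost of one comparison pass per original.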
The process of the invention is illustrated below with concrete data. Let the texts in the text library be $p_1, p_2, p_3, \ldots, p_n$, and let the text of the article to be checked be $r$. After each text in the text library is split into clauses:
$P_1 = \{P_{1,1}, P_{1,2}, P_{1,3}, \ldots, P_{1,m_1}\}$;
$P_2 = \{P_{2,1}, P_{2,2}, P_{2,3}, \ldots, P_{2,m_2}\}$;
$P_3 = \{P_{3,1}, P_{3,2}, P_{3,3}, \ldots, P_{3,m_3}\}$;
$P_n = \{P_{n,1}, P_{n,2}, P_{n,3}, \ldots, P_{n,m_n}\}$.
Here $m_1, m_2, m_3, \ldots, m_n$ are the numbers of clauses in texts $p_1, p_2, p_3, \ldots, p_n$ respectively. After fingerprint computation, the fingerprints of each text are as follows:
$P_1 = \{h_{1,1}, h_{1,2}, h_{1,3}, \ldots, h_{1,m_1}\}$;
$P_2 = \{h_{2,1}, h_{2,2}, h_{2,3}, \ldots, h_{2,m_2}\}$;
$P_3 = \{h_{3,1}, h_{3,2}, h_{3,3}, \ldots, h_{3,m_3}\}$;
$P_n = \{h_{n,1}, h_{n,2}, h_{n,3}, \ldots, h_{n,m_n}\}$.
The clause fingerprint library, that is, the position mapping table of the clause fingerprints of each original in the text library, is as follows:
$\{P_1, h_{1,1}, 1\}$, $\{P_1, h_{1,2}, 2\}$, $\{P_1, h_{1,3}, 3\}$, $\ldots$, $\{P_1, h_{1,m_1}, m_1\}$, $\{P_2, h_{2,1}, 1\}$, $\ldots$, $\{P_n, h_{n,m_n}, m_n\}$.
The clauses of the text $r$ of the article to be checked are: $r = s_1, s_2, s_3, \ldots, s_r$. The fingerprints computed for the clauses of the article to be checked are: $k_1, k_2, k_3, \ldots, k_r$. Step S3 yields the following sequence of repeated clauses: $\{s_2, p_1, 3\}$, $\{s_3, p_1, 4\}$, $\{s_4, p_2, 6\}$, $\{s_8, p_2, 8\}$, $\{s_9, p_1, 7\}$. In each $\{\,\}$ triple of this sequence, the first item is the clause number in the article to be checked, the second is the ID of the original in the text library, and the third is the clause number within the original. Among these repeated clauses, $s_2$ and $s_3$ are at an interval of 1 in original $p_1$, $s_3$ and $s_9$ are at an interval of 3 in original $p_1$, and $s_4$ and $s_8$ are at an interval of 2 in original $p_2$. If M is 2, original $p_1$ has content identical to text $r$. If M is 3, originals $p_1$ and $p_2$ both have content identical to text $r$.
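The numbers in this example can be checked with a few lines of code. The sketch below takes the five triples as given and treats the interval as the difference between adjacent matched positions within one original, an interpretation that is consistent with the intervals 1, 3 and 2 stated above. The function names are illustrative only.

```python
from collections import defaultdict

# (clause number in r, original ID, clause number in the original)
matches = [(2, "p1", 3), (3, "p1", 4), (4, "p2", 6), (8, "p2", 8), (9, "p1", 7)]

def adjacent_gaps(matches):
    """Per original, the differences between adjacent matched positions."""
    by_doc = defaultdict(list)
    for _, doc_id, pos in matches:
        by_doc[doc_id].append(pos)
    gaps = {}
    for doc_id, pos in by_doc.items():
        ordered = sorted(pos)
        gaps[doc_id] = [b - a for a, b in zip(ordered, ordered[1:])]
    return gaps

def originals_with_repetition(matches, m):
    """Originals having at least one interval strictly less than m."""
    gaps = adjacent_gaps(matches)
    return sorted(d for d, g in gaps.items() if any(x < m for x in g))

gaps = adjacent_gaps(matches)
with_m2 = originals_with_repetition(matches, 2)
with_m3 = originals_with_repetition(matches, 3)
```

With M = 2 only `p1` qualifies, because the interval of 2 in `p2` is not strictly less than 2. With M = 3 both originals qualify, matching the conclusion in the text.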
Claims (2)
1. A method for paper duplicate checking, characterized in that it comprises the following steps:
S1: split the original in the text library into clauses, and compute the fingerprint of each clause of the original;
S2: split the article to be checked into clauses, and compute the fingerprint of each clause of the article to be checked;
S3: compare the fingerprint of each clause of the article to be checked with the fingerprint of each clause of the original, determine the clauses whose fingerprints are identical in the original and in the article to be checked together with their positions, and obtain the repeated clauses and the positions of the repeated clauses in the original;
S4: according to the positions of the repeated clauses in the original, judge whether the interval between repeated clauses in the original is less than M; if the interval is less than M, the article to be checked repeats content from the original; M is a predefined constant.
2. The method for paper duplicate checking as claimed in claim 1, characterized in that it further comprises a step of building a clause fingerprint library; in this step, each original in the text library is split into clauses and the fingerprint of each clause of each original is computed, yielding the clause fingerprint library; the clause fingerprint library stores the fingerprints of the clauses of each original in the text library together with a position mapping table for the clauses.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410319183.3A CN104050299A (en) | 2014-07-07 | 2014-07-07 | Method for paper duplicate checking |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410319183.3A CN104050299A (en) | 2014-07-07 | 2014-07-07 | Method for paper duplicate checking |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104050299A true CN104050299A (en) | 2014-09-17 |
Family
ID=51503131
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410319183.3A Pending CN104050299A (en) | 2014-07-07 | 2014-07-07 | Method for paper duplicate checking |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104050299A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104699785A (en) * | 2015-03-10 | 2015-06-10 | 中国石油大学(华东) | Paper similarity detection method |
CN106776880A (en) * | 2016-11-22 | 2017-05-31 | 广东技术师范学院 | A kind of paper based on picture and text identification reviews system and method |
CN107038216A (en) * | 2017-03-09 | 2017-08-11 | 百度在线网络技术(北京)有限公司 | Paper duplicate checking method, device, equipment and storage medium |
CN107169065A (en) * | 2017-05-05 | 2017-09-15 | 腾讯科技(深圳)有限公司 | The minimizing technology and device of a kind of certain content |
CN107871002A (en) * | 2017-11-10 | 2018-04-03 | 哈尔滨工程大学 | A kind of across language plagiarism detection method based on fingerprint fusion |
CN108984493A (en) * | 2018-07-19 | 2018-12-11 | 中国联合网络通信集团有限公司 | A kind of Chinese articles duplicate checking method and system |
CN109471921A (en) * | 2018-11-23 | 2019-03-15 | 深圳市元征科技股份有限公司 | A kind of text duplicate checking method, device and equipment |
CN110019674A (en) * | 2017-11-21 | 2019-07-16 | 盛霆信息技术(上海)有限公司 | A kind of text plagiarizes detection method and system |
CN110162752A (en) * | 2019-05-13 | 2019-08-23 | 百度在线网络技术(北京)有限公司 | Article sentences weight processing method, device and electronic equipment |
CN114357977A (en) * | 2022-03-18 | 2022-04-15 | 北京创新乐知网络技术有限公司 | Method, system, equipment and storage medium for realizing anti-plagiarism |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040039933A1 (en) * | 2002-08-26 | 2004-02-26 | Cricket Technologies | Document data profiler apparatus, system, method, and electronically stored computer program product |
CN101076800A (en) * | 2004-08-23 | 2007-11-21 | 汤姆森环球资源公司 | Repetitive file detecting and displaying function |
Non-Patent Citations (1)
Title |
---|
俞昊旻: "文档部分重复检测研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104699785A (en) * | 2015-03-10 | 2015-06-10 | 中国石油大学(华东) | Paper similarity detection method |
CN106776880A (en) * | 2016-11-22 | 2017-05-31 | 广东技术师范学院 | A kind of paper based on picture and text identification reviews system and method |
CN107038216A (en) * | 2017-03-09 | 2017-08-11 | 百度在线网络技术(北京)有限公司 | Paper duplicate checking method, device, equipment and storage medium |
CN107038216B (en) * | 2017-03-09 | 2021-10-26 | 百度在线网络技术(北京)有限公司 | Thesis duplicate checking method, device, equipment and storage medium |
CN107169065A (en) * | 2017-05-05 | 2017-09-15 | 腾讯科技(深圳)有限公司 | The minimizing technology and device of a kind of certain content |
CN107169065B (en) * | 2017-05-05 | 2022-06-14 | 腾讯科技(深圳)有限公司 | Method and device for removing specific content |
CN107871002B (en) * | 2017-11-10 | 2021-03-30 | 哈尔滨工程大学 | Fingerprint fusion-based cross-language plagiarism detection method |
CN107871002A (en) * | 2017-11-10 | 2018-04-03 | 哈尔滨工程大学 | A kind of across language plagiarism detection method based on fingerprint fusion |
CN110019674A (en) * | 2017-11-21 | 2019-07-16 | 盛霆信息技术(上海)有限公司 | A kind of text plagiarizes detection method and system |
CN108984493A (en) * | 2018-07-19 | 2018-12-11 | 中国联合网络通信集团有限公司 | A kind of Chinese articles duplicate checking method and system |
CN108984493B (en) * | 2018-07-19 | 2022-04-29 | 中国联合网络通信集团有限公司 | Chinese article duplicate checking method and system |
CN109471921A (en) * | 2018-11-23 | 2019-03-15 | 深圳市元征科技股份有限公司 | A kind of text duplicate checking method, device and equipment |
CN110162752A (en) * | 2019-05-13 | 2019-08-23 | 百度在线网络技术(北京)有限公司 | Article sentences weight processing method, device and electronic equipment |
CN114357977A (en) * | 2022-03-18 | 2022-04-15 | 北京创新乐知网络技术有限公司 | Method, system, equipment and storage medium for realizing anti-plagiarism |
CN114357977B (en) * | 2022-03-18 | 2022-06-14 | 北京创新乐知网络技术有限公司 | Method, system, equipment and storage medium for realizing anti-plagiarism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104050299A (en) | Method for paper duplicate checking | |
CN106294350B (en) | A kind of text polymerization and device | |
CN103970722B (en) | A kind of method of content of text duplicate removal | |
Tolias et al. | Visual query expansion with or without geometry: refining local descriptors by feature aggregation | |
CN106873964A (en) | A kind of improved SimHash detection method of code similarities | |
CN103617157A (en) | Text similarity calculation method based on semantics | |
Usbeck et al. | AGDISTIS–agnostic disambiguation of named entities using linked open data | |
CN104239570B (en) | The searching method and device of paper | |
WO2012169128A1 (en) | Orthographical variant detection device and orthographical variant detection program | |
CN104636319A (en) | Text duplicate removal method and device | |
CN107085568A (en) | A kind of text similarity method of discrimination and device | |
US9633009B2 (en) | Knowledge-rich automatic term disambiguation | |
Bhowmik et al. | A novel three-level voting model for detecting misleading information on covid-19 | |
CN107895053B (en) | Emerging hot topic detection system and method based on topic cluster momentum model | |
Hakak et al. | Diacritical digital Quran authentication model | |
CN112534507B (en) | System and method for grouping and folding of sequencing reads | |
CN103049434B (en) | A kind of alternative word identification system and identification method | |
CN108509414A (en) | Plagiarism based on sequence detects text matching technique | |
Crocetti | Textual spatial cosine similarity | |
CN109542766A (en) | Extensive program similitude based on code mapping and morphological analysis quickly detects and evidence generation method | |
CN103793398B (en) | The method and apparatus for detecting junk data | |
KR101113787B1 (en) | Apparatus and method for indexing text | |
CN108021951A (en) | A kind of method of document detection, server and computer-readable recording medium | |
CN104392002B (en) | A kind of the approximate of extensive collections of web pages repeats lookup method | |
Abu Hawas et al. | Rule-based approach for Arabic root extraction: new rules to directly extract roots of Arabic words |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20140917 |