CN101957809A - Anti-plagiarism method - Google Patents

Info

Publication number
CN101957809A
Authority
CN
China
Prior art keywords
file
key word
detected
content
plagiarism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201010506555
Other languages
Chinese (zh)
Inventor
江潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TRANSN (BEIJING) INFORMATION TECHNOLOGY Co Ltd
Original Assignee
TRANSN (BEIJING) INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TRANSN (BEIJING) INFORMATION TECHNOLOGY Co Ltd filed Critical TRANSN (BEIJING) INFORMATION TECHNOLOGY Co Ltd
Priority to CN 201010506555 priority Critical patent/CN101957809A/en
Publication of CN101957809A publication Critical patent/CN101957809A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses an anti-plagiarism method comprising the following steps: inputting a file to be detected and extracting key words from it; calling a search engine to search for the key words, retrieving the source file of the search result page, and obtaining the matching result for the key words in order to derive their match rate; and, when the match rate meets a preset match rate, marking the content in the file to be detected that corresponds to the key words according to a set marking mode. The method solves the problem of identifying plagiarism in article content: only the article content needs to be input, and content that already exists on the network can be distinguished from the author's own content in a short time.

Description

Anti-plagiarism method
Technical field
The present invention relates to file content recognition technology and, specifically, to an anti-plagiarism method.
Background
With the spread of the Internet, people can look things up online the moment they need to solve a problem. To counter the plagiarism this makes possible, many websites have tried all sorts of measures. A common one is to disable the right mouse button with a page script: without the right-click menu, the "Copy" command cannot be found. This measure is easy to defeat, however: for example, by using the keyboard shortcut Ctrl+C; by pressing the left and right mouse buttons at the same time and then dismissing the dialog box with the left button; or by viewing the web page's source file directly.
The same problem also arises in the publishing industry, where plagiarism is currently the biggest problem. To save effort when writing a book, many authors directly use content they find online. This behavior often infringes the original author's copyright, yet the prior art has no way to identify and prevent such plagiarism.
Summary of the invention
The technical problem solved by the present invention is to provide an anti-plagiarism method that solves the problem of identifying plagiarism in article content.
The technical solution is as follows:
An anti-plagiarism method, comprising:
inputting a file to be detected, and extracting key words from the file to be detected;
calling a search engine to search for the key words, retrieving the source file of the search result page, obtaining the matching result for the key words, and deriving the matching rate of the key words;
when the matching rate meets a preset matching rate, marking the content in the file to be detected that corresponds to the key words according to a set marking mode.
Further: the format of the file to be detected is converted to text format, the text-format file is split into sentences or paragraphs, and the content after sentence or paragraph splitting is used as the key words.
Further: the set marking mode is chosen from adjusting the font size, making the font bold, underlining the font, or changing the font color.
Further: when the matching rate is greater than 50%, content in the file to be detected that is identical to the key word is marked in a first preset color; when the matching rate is between 30% and 50%, content in the file to be detected that is identical to the key word is marked in a second preset color.
Further: the file to be detected is a complete Word document or a passage of text in a Word document.
Further: after the matching rate of the search results for the content is obtained, the key words are marked in the Word document according to the preset matching rate and font color settings; at this point, the interface provided by the Word software is called to mark the key words in the Word document.
Further: it is judged whether the content of the file currently being detected has all been searched; when the file to be detected has not been fully searched, the search engine is called to detect the remaining content of the file using the key words; when detection of the file is finished, the file is saved and closed.
Further: the matching rate equals (number of characters in the matching result of the key word × 100 / number of characters in the key word) %.
The technical effects brought by the technical solution of the present invention include:
The invention solves the problem of identifying plagiarism in article content: only the article content needs to be input, and within a short time it can be determined which content already exists online and which is the author's own.
Description of drawings
Fig. 1 is the main flow chart of the present invention;
Fig. 2 is a schematic diagram of the results of a search using the Baidu search engine in the present invention;
Fig. 3 is a schematic diagram of retrieving the source file of Baidu search results in the present invention;
Fig. 4 is a schematic diagram of the document after marking in the present invention.
Embodiment
The present invention obtains the source file of the search engine's result page, obtains the matching result for the search key word from that source file, and then uses the matching rate to identify whether the text content is plagiarized. A source file is a collection of source code, and source code is a set of meaningful characters (program development code) that implements a specific function.
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and a preferred embodiment.
Fig. 1 shows the main flow chart of the present invention.
Step S101: input the file to be detected; the Word format is used as the file format;
The file to be detected can be a complete Word document or a passage of text in a Word document.
Step S102: file format conversion: convert the Word file to text format (a .txt file);
The purpose of this conversion is mainly to make the text easier to process. Processing Word content directly, sentence by sentence, would be very inefficient; once it has been converted to plain text, the subsequent sentence splitting is convenient.
Step S103: split the .txt file into sentences or paragraphs; the content after sentence or paragraph splitting serves as the search key words;
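The sentence or paragraph splitting described above can be sketched roughly as follows. This is only an illustration, not the patent's implementation: the function name `split_sentences` and the set of sentence-ending punctuation marks are assumptions, and line breaks are treated as paragraph boundaries.

```python
import re

# Assumed set of sentence-ending marks (Chinese and Western); a newline also
# ends a unit, so paragraphs are split as well.
_SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]?")


def split_sentences(text: str) -> list[str]:
    """Split plain text into sentences; each one later serves as a search key word."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("今天天气很好。我们去公园吧！\n第二段"):
        print(sentence)
```

Keeping the trailing punctuation on each sentence is a design choice of this sketch; a real implementation might strip it before searching.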
Step S104: using a search engine, search with the content of the .txt file after sentence or paragraph splitting as the key word; if there are search results, go to step S105; otherwise end the analysis;
The technical solution of the present invention is applicable to any search engine; for example, the content of the .txt file can be put into Baidu (http://www.baidu.com) and searched.
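A minimal sketch of issuing such a search from a program is given below. The patent does not say how the search engine is called, so everything here is an assumption: the helper names `build_search_url` and `fetch_result_page`, the `/s` path, and the `wd` query parameter are illustrative, and a real search engine may reject or rate-limit automated requests.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(key_word: str,
                     base: str = "http://www.baidu.com/s",
                     param: str = "wd") -> str:
    """Build a search URL for one key word (path and parameter name are assumptions)."""
    return f"{base}?{urlencode({param: key_word})}"


def fetch_result_page(key_word: str, timeout: float = 10.0) -> str:
    """Download the HTML source of the result page for one key word."""
    with urlopen(build_search_url(key_word), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(build_search_url("anti plagiarism"))
```

Because `base` and `param` are arguments, the same sketch could be pointed at another search engine, in line with the statement that the method is not tied to one engine.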
Fig. 2 shows the results of a search using the Baidu search engine. When searching with Baidu, results with a high matching rate are ranked first; at the same time, the key words in the content are highlighted, with the matched key words marked in red.
Step S105: parse the search results; if the content found contains marked key words, go to step S106 for further analysis; otherwise go to step S108;
Step S106: calculate the matching rate;
Fig. 3 shows the retrieved source file of the Baidu search results, in which the searched key words can be seen.
For the .txt file after sentence splitting, each sentence is queried on the network. Key words are then extracted from the query results: the regular expression <em>(.*?)</em> retrieves the matching result for the search key word, and the matching rate is calculated at this point.
The HTML source file of the search result page is retrieved. Analyzing the HTML source of the result page shows that the parts marked with <em></em> are the key words; on this basis, the regular expression <em>(.*?)</em> retrieves the matching result for the search key word.
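The extraction step can be illustrated with Python's standard `re` module. The pattern `<em>(.*?)</em>` is the one given in the text; the function name `extract_matches` and the sample HTML are made up for illustration, and real result pages may use different markup.

```python
import re
from html import unescape

# Non-greedy capture of everything between <em> and </em>, as described in the text.
EM_PATTERN = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def extract_matches(html_source: str) -> list[str]:
    """Return the fragments that the result page highlights with <em>...</em>."""
    return [unescape(fragment) for fragment in EM_PATTERN.findall(html_source)]


if __name__ == "__main__":
    sample = "<a><em>anti</em>-<em>plagiarism</em> method</a>"
    print(extract_matches(sample))
```

The non-greedy `.*?` matters here: a greedy `.*` would swallow everything from the first `<em>` to the last `</em>` on the page.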
Calculate the matching rate of the key word:
matching rate = (number of characters in the matching result of the key word × 100 / number of characters in the key word) %.
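A small sketch of this formula follows. The text does not specify how the characters of the matching result are counted when the same fragment is highlighted several times, so this sketch assumes that each character of the key word is counted at most once, namely when it is covered by at least one matched fragment. The function name `match_rate` is hypothetical.

```python
def match_rate(key_word: str, matches: list[str]) -> float:
    """Percentage of the key word's characters covered by matched fragments.

    Assumption: a character counts once, however many fragments cover it.
    """
    if not key_word:
        return 0.0
    covered = [False] * len(key_word)
    for fragment in set(matches):
        if not fragment:
            continue
        start = key_word.find(fragment)
        while start != -1:
            for i in range(start, start + len(fragment)):
                covered[i] = True
            start = key_word.find(fragment, start + 1)
    return sum(covered) * 100 / len(key_word)


if __name__ == "__main__":
    print(match_rate("abcdefghij", ["abc", "abc", "hij"]))
```

Under this assumption the rate can never exceed 100%, which keeps the thresholds applied in the following steps meaningful.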
Step S107: judge the matching rate of the key word; when the matching rate is greater than 50%, content in the Word document that is identical to the key word is marked in a first preset color;
Step S108: in the Word document, content that is identical to the key word and whose matching rate is between 30% and 50% is marked in a second preset color;
After the matching rate of the search results for the content has been calculated, the key words need to be marked in the Word document according to the preset matching rate and font color settings; at this point, the interface provided by the Word software is called to mark the key words in the Word document. The matching rates and colors can be customized. The Word document can also be marked in other ways, for example by adjusting the font size, applying bold, or underlining.
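The threshold logic of these two steps might be sketched as below. The thresholds of 50% and 30% come from the text; the color names, the function name `pick_mark_color`, and the choice to treat exactly 50% and exactly 30% as belonging to the second band are assumptions, since the text leaves the boundaries open. Applying the color through the Word interface is not shown.

```python
from typing import Optional


def pick_mark_color(rate: float,
                    high: float = 50.0,
                    low: float = 30.0,
                    first_color: str = "red",
                    second_color: str = "orange") -> Optional[str]:
    """Map a matching rate to a marking color, or None if the text stays unmarked."""
    if rate > high:
        return first_color    # above 50%: first preset color
    if rate >= low:
        return second_color   # 30% to 50%: second preset color
    return None               # below 30%: left unmarked


if __name__ == "__main__":
    for rate in (80.0, 40.0, 10.0):
        print(rate, pick_mark_color(rate))
```

Passing the thresholds and colors as parameters mirrors the statement that the matching rates and colors can be customized.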
Fig. 4 shows the document after marking. As can be seen from the figure, the relevant key words have been marked.
Step S109: judge whether the content of the Word document currently being detected has all been searched; if there is still content to be searched, i.e. the Word document currently being detected has not been fully searched, go to step S104; if the content of the Word document currently being detected has all been searched, go to step S110;
Step S110: save and close the Word document;
When the search is complete, marking of the full text of the Word document is finished, i.e. the full text of the Word document has been marked according to the preset matching rates and colors.
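The loop over the document's content can be pictured with the following sketch. It is not the patent's code: `scan_document`, `search`, and `score` are hypothetical names, the search and scoring functions are injected so the loop can be run without network access, and a sentence with no search results is assumed to get a rate of 0 rather than ending the whole run.

```python
from typing import Callable, Iterable, Optional


def scan_document(sentences: Iterable[str],
                  search: Callable[[str], Optional[str]],
                  score: Callable[[str, str], float]) -> list[tuple[str, float]]:
    """Search every sentence in turn and record its matching rate."""
    rates = []
    for sentence in sentences:        # repeat until all content has been searched
        page = search(sentence)       # result-page source, or None if nothing was found
        if page is None:
            rates.append((sentence, 0.0))
            continue
        rates.append((sentence, score(sentence, page)))
    return rates


if __name__ == "__main__":
    # Stub search and scoring functions, for demonstration only.
    fake_search = lambda s: "<em>x</em>" if "x" in s else None
    fake_score = lambda s, page: 100.0
    print(scan_document(["x1", "y"], fake_search, fake_score))
```

Saving and closing the document would follow once the loop returns.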
Step S111: generate a report; processing is complete.
The report is generated from the marked text. It shows at a glance which content in the Word document is plagiarized, which content may be plagiarized, and which content is the author's own.
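One possible shape for such a summary is sketched below, assuming the same 50% and 30% thresholds as in the marking steps. The category labels, the function name `build_report`, and the input format (a list of sentence and matching-rate pairs) are illustrative assumptions; the text does not describe the layout of the generated output.

```python
def build_report(rates: list[tuple[str, float]],
                 high: float = 50.0,
                 low: float = 30.0) -> dict[str, int]:
    """Count sentences per category: plagiarized, possibly plagiarized, original."""
    report = {"plagiarized": 0, "possibly plagiarized": 0, "original": 0}
    for _sentence, rate in rates:
        if rate > high:
            report["plagiarized"] += 1
        elif rate >= low:
            report["possibly plagiarized"] += 1
        else:
            report["original"] += 1
    return report


if __name__ == "__main__":
    sample = [("a", 80.0), ("b", 40.0), ("c", 5.0), ("d", 0.0)]
    print(build_report(sample))
```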

Claims (8)

1. An anti-plagiarism method, comprising:
inputting a file to be detected, and extracting key words from the file to be detected;
calling a search engine to search for the key words, retrieving the source file of the search result page, obtaining the matching result for the key words, and deriving the matching rate of the key words;
when the matching rate meets a preset matching rate, marking the content in the file to be detected that corresponds to the key words according to a set marking mode.
2. The anti-plagiarism method of claim 1, characterized in that: the format of the file to be detected is converted to text format, the text-format file to be detected is split into sentences or paragraphs, and the content after sentence or paragraph splitting is used as the key words.
3. The anti-plagiarism method of claim 1 or 2, characterized in that: the set marking mode is chosen from adjusting the font size, making the font bold, underlining the font, or changing the font color.
4. The anti-plagiarism method of claim 1 or 2, characterized in that: when the matching rate is greater than 50%, content in the file to be detected that is identical to the key word is marked in a first preset color; when the matching rate is between 30% and 50%, content in the file to be detected that is identical to the key word is marked in a second preset color.
5. The anti-plagiarism method of claim 1 or 2, characterized in that: the file to be detected is a complete Word document or a passage of text in a Word document.
6. The anti-plagiarism method of claim 5, characterized in that: after the matching rate of the search results for the content is obtained, the key words are marked in the Word document according to the preset matching rate and font color settings; at this point, the interface provided by the Word software is called to mark the key words in the Word document.
7. The anti-plagiarism method of claim 1 or 2, characterized in that: it is judged whether the content of the file currently being detected has all been searched; when the file to be detected has not been fully searched, the search engine is called to detect the remaining content of the file using the key words; when detection of the file is finished, the file is saved and closed.
8. The anti-plagiarism method of claim 1, characterized in that: the matching rate equals (number of characters in the matching result of the key word × 100 / number of characters in the key word) %.
CN 201010506555 2010-10-14 2010-10-14 Anti-plagiarism method Pending CN101957809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010506555 CN101957809A (en) 2010-10-14 2010-10-14 Anti-plagiarism method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010506555 CN101957809A (en) 2010-10-14 2010-10-14 Anti-plagiarism method

Publications (1)

Publication Number Publication Date
CN101957809A true CN101957809A (en) 2011-01-26

Family

ID=43485142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010506555 Pending CN101957809A (en) 2010-10-14 2010-10-14 Anti-plagiarism method

Country Status (1)

Country Link
CN (1) CN101957809A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030037253A1 (en) * 2001-04-27 2003-02-20 Arthur Blank Digital rights management system
CN1909522A (en) * 2006-08-18 2007-02-07 北京金山软件有限公司 Method for acquiring front-page keyword and its application system
CN101042692A (en) * 2006-03-24 2007-09-26 富士通株式会社 translation obtaining method and apparatus based on semantic forecast
CN101334789A (en) * 2008-08-04 2008-12-31 福建师范大学 Device for identifying document plagiarism by search engine

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049467A (en) * 2011-10-12 2013-04-17 杨纯青 Chinese digital anti-plagiarism detection and comparison system and method
CN103412904A (en) * 2013-07-31 2013-11-27 广联达软件股份有限公司 PDF (portable document format) file comparison method and PDF file comparison system
CN103544326A (en) * 2013-11-14 2014-01-29 上海交通大学 Chinese and English cross-language plagiarism recognition method based on characteristics and content of translations
CN103699571A (en) * 2013-11-25 2014-04-02 小米科技有限责任公司 Method and device for file synchronization and electronic equipment
CN108804624A (en) * 2013-12-18 2018-11-13 国网江苏省电力有限公司常州供电分公司 The method of text gear typing and comparison
CN108984593A (en) * 2013-12-18 2018-12-11 国网江苏省电力有限公司常州供电分公司 The method that multi-format text keeps off typing and compares
CN106649871A (en) * 2017-01-03 2017-05-10 广州爱九游信息技术有限公司 Detection method, apparatus and computing equipment for repetition degree of articles

Similar Documents

Publication Publication Date Title
US8874604B2 (en) Method and system for searching an electronic map
CN101957809A (en) Anti-plagiarism method
CN106294396A (en) Keyword expansion method and keyword expansion system
CN103810251B (en) Method and device for extracting text
CN102880647A (en) Method and device for acquiring another name of organization
CN107301195A (en) Generate disaggregated model method, device and the data handling system for searching for content
CN111984845B (en) Website wrongly written word recognition method and system
CN103886094A (en) Method for error correction and expansion of electronic commerce search engine
CN103389970A (en) Real-time learning-based auxiliary word writing system and method
CN109634436A (en) Association method, device, equipment and the readable storage medium storing program for executing of input method
CN103376990B (en) The sound control method of a kind of web page operation and system
TWI682286B (en) System for document searching using results of text analysis and natural language input
EP4080381A1 (en) Method and apparatus for generating patent summary information, and electronic device and medium
CN101782924A (en) Information processing method, information processing apparatus, and program
Luthfi et al. Building an Indonesian named entity recognizer using Wikipedia and DBPedia
CN102982029B (en) A kind of search need recognition methods and device
CN115794225A (en) Method for processing business flow based on natural language
CN107491440B (en) Natural language word segmentation construction method and system and natural language classification method and system
CN105808566A (en) Method and device for extracting abstracts from webpages on basis of search words
CN105808562A (en) Method and device for extracting webpage abstract based on weight
CN102880606B (en) A kind of computer implemented method and apparatus for optimizing marking language text
King et al. Utilising the Crowd to Unlock the Data on Herbarium Specimens at the Royal Botanic Garden Edinburgh.
Wang et al. Research on Web Character Information Extraction Based on Semantic Similarity
Oksanen et al. Semantic Finlex
KR101390300B1 (en) Apparatus and method for extracting sentence on thesis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110126