CN102156689A - Method and device for detecting document - Google Patents


Info

Publication number
CN102156689A
Authority
CN
China
Prior art keywords
document
paragraph
existing
pirate
characteristic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201110080838
Other languages
Chinese (zh)
Other versions
CN102156689B (en)
Inventor
周纾
李彦宏
徐兴军
张雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2011100808382A
Publication of CN102156689A
Application granted
Publication of CN102156689B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a method for detecting a document. The method comprises the following steps: acquiring paragraph characteristic information corresponding to the document; comparing the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document; and judging, according to the comparison result, whether an existing document similar to the document exists. Because the document is detected by means of paragraph characteristic information, the similarity between documents can be compared fairly accurately, and cheating by splitting a document into sections is avoided; moreover, query efficiency is higher and the processing load on the server is lower. The document detection method can also be used to improve copyright property detection for online documents: a document can be detected at the time it is uploaded, which avoids the unnecessary load that later copyright property detection would place on the server, and the copyright property of existing documents can be checked in batches, so efficiency is higher.

Description

Document detection method and device
Technical field
The present invention relates to a document detection method and device, and in particular to a document detection method and device for comparing the similarity of long documents.
Background art
Document detection methods for document similarity usually rely on the title, author, and word information of a document. This approach has the following defects. First, queries by title, author, and word information easily miss matches: for example, if the title or author information of a document is modified or deleted, or the document is cut into several parts, other documents can no longer be queried or compared accurately by word information. Second, if the document to be checked is long, such as a full-length novel, querying by word information is inefficient and places a heavy processing load on the server or computer, which affects its normal working efficiency.
Summary of the invention
An object of the present invention is to provide an improved document detection method.
Another object of the present invention is to provide an improved document detection device that applies the improved document detection method.
Correspondingly, the document detection method of one embodiment of the present invention comprises:
S1, acquiring the paragraph characteristic information corresponding to a document;
S2, comparing the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document;
S3, judging, according to the comparison result, whether there is an existing document similar to the document.
As a further improvement of the present invention, the paragraph characteristic information is a paragraph signature with a preset number of feature bits.
As a further improvement of the present invention, the paragraph signature with the preset number of feature bits is obtained by a hash algorithm.
As a further improvement of the present invention, "obtaining the paragraph signature with the preset number of feature bits by a hash algorithm" specifically comprises the following steps:
S100, performing word segmentation on each paragraph in the document to obtain a list of (word, word frequency) two-tuples for the paragraph;
S101, initializing a weight vector for the two-tuples in the list;
S102, computing the two-tuples with a hash algorithm to obtain a hash string with the preset number of feature bits;
S103, mapping the hash string onto the weight vector;
S104, calculating the value at each position of the weight vector to obtain the paragraph signature with the preset number of feature bits.
As a further improvement of the present invention, step S103 specifically comprises:
judging whether each bit of the hash string is 0 or 1; if it is 0, the weight at the corresponding position of the weight vector is decreased when mapping; if it is 1, the weight at the corresponding position of the weight vector is increased when mapping.
As a further improvement of the present invention, step S104 specifically comprises:
judging whether the value at each position of the weight vector is greater than 0; if it is greater than 0, the value at that position is set to 1; if it is less than or equal to 0, the value at that position is set to 0.
As a further improvement of the present invention, when the ratio of the number of similar paragraphs between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, the document is judged to be similar to the existing document; when the ratio is less than the set second threshold, the document is judged to be dissimilar to the existing document.
As a further improvement of the present invention, when the ratio of the number of similar paragraphs between the document and the existing document to the total number of paragraphs of the document is greater than or equal to the set second threshold, and the result calculated from the overall paragraph signature of the document and the overall paragraph signature of the existing document is less than a first threshold, the document is judged to be similar to the existing document; when the ratio is less than the set second threshold, the document is judged to be dissimilar to the existing document.
As a further improvement of the present invention, the number of similar paragraphs is obtained by the following step:
performing a calculation, by an algorithm, on the paragraph signature with the preset number of feature bits obtained for the document and the paragraph signature with the preset number of feature bits of the existing document; if the result is greater than a predetermined first threshold, the paragraphs are dissimilar; if the result is less than or equal to the predetermined first threshold, the paragraphs are similar.
As a further improvement of the present invention, the calculation on the paragraph signature of the document and the paragraph signature of the existing document consists of calculating the Hamming distance between the two signatures.
As a further improvement of the present invention, the copyright property of the document that is similar to an existing document is defined as pirated document.
As a further improvement of the present invention, the copyright property of the document that is similar to an existing document is defined as suspected pirated document.
As a further improvement of the present invention, the suspected pirated document is reviewed; if the review confirms that it is a pirated document, feedback information is sent; if the review confirms that it is a non-pirated document, the non-pirated document is published online.
As a further improvement of the present invention, the copyright property of the one or more existing documents similar to the document is defined as pirated document.
As a further improvement of the present invention, the property of the one or more existing documents similar to the document is defined as suspected pirated document.
As a further improvement of the present invention, the suspected pirated document is reviewed; if the review confirms that it is a pirated document, the pirated document is deleted; if the review confirms that it is a non-pirated document, the non-pirated document is kept.
As a further improvement of the present invention, the copyright property of the non-pirated document is marked as verified, and/or the non-pirated document is copied or moved to a verified-copyright database.
As a further improvement of the present invention, the steps of claim 17 are repeated until the screening of all existing documents is complete.
As a further improvement of the present invention, the copyright property of the document is obtained according to the judgment result.
As a further improvement of the present invention, before step S1, the method further comprises a step of building the paragraph characteristic information of the existing documents:
acquiring digital files that have been verified as legitimate;
extracting the paragraph characteristic information of the digital files and building an index.
As a further improvement of the present invention, the step of "building the paragraph characteristic information of the existing documents" further comprises:
identifying whether the digital file is a document;
if so, extracting the paragraph characteristic information of the document and building an index; if not, converting the digital file into a document by an algorithm, and then extracting the paragraph characteristic information of the document and building an index.
As a further improvement of the present invention, after the step of "building the paragraph characteristic information of the existing documents", the method further comprises:
receiving an uploaded digital file whose copyright property has not been verified.
As a further improvement of the present invention, after the step of "receiving an uploaded digital file whose copyright property has not been verified", the method further comprises:
judging whether the digital file is a document;
if so, performing step S1; if not, converting the digital file into a document by an algorithm and then performing step S1.
As a further improvement of the present invention, before step S1, the method further comprises storing the document.
As a further improvement of the present invention, the copyright property of the existing document is obtained according to the judgment result.
As a further improvement of the present invention, before step S1, the method further comprises a step of building the paragraph characteristic information of the existing documents:
acquiring existing digital files whose copyright property has not been verified;
extracting the paragraph characteristic information of the digital files and building an index.
As a further improvement of the present invention, the step of "building the paragraph characteristic information of the existing documents" further comprises:
identifying whether the digital file is a document;
if so, extracting the paragraph characteristic information of the document and building an index; if not, converting the digital file into a document by an algorithm, and then extracting the paragraph characteristic information of the document and building an index.
As a further improvement of the present invention, after the step of "building the paragraph characteristic information of the existing documents", the method further comprises:
receiving a digital file that has been verified as legitimate.
As a further improvement of the present invention, after the step of "receiving a digital file that has been verified as legitimate", the method further comprises:
judging whether the digital file is a document;
if so, performing step S1; if not, converting the digital file into a document by an algorithm and then performing step S1.
Correspondingly, the document detection device of one embodiment of the present invention comprises:
an acquiring unit, configured to acquire the paragraph characteristic information corresponding to a document;
a comparing unit, configured to compare the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document;
a judging unit, configured to judge, according to the comparison result, whether there is an existing document similar to the document.
As a further improvement of the present invention, the paragraph characteristic information is a paragraph signature with a preset number of feature bits.
As a further improvement of the present invention, the paragraph signature with the preset number of feature bits is obtained by a hash algorithm.
As a further improvement of the present invention, the acquiring unit is configured to:
perform word segmentation on each paragraph in the document to obtain a list of (word, word frequency) two-tuples for the paragraph;
initialize a weight vector for the two-tuples in the list;
compute the two-tuples with a hash algorithm to obtain a hash string with the preset number of feature bits;
map the hash string onto the weight vector;
calculate the value at each position of the weight vector to obtain the paragraph signature with the preset number of feature bits.
As a further improvement of the present invention, the acquiring unit is configured to: judge whether each bit of the hash string is 0 or 1; if it is 0, decrease the weight at the corresponding position of the weight vector when mapping; if it is 1, increase the weight at the corresponding position of the weight vector when mapping.
As a further improvement of the present invention, the acquiring unit is configured to: judge whether the value at each position of the weight vector is greater than 0; if it is greater than 0, set the value at that position to 1; if it is less than or equal to 0, set the value at that position to 0.
As a further improvement of the present invention, the judging unit is configured to: judge that the document is similar to the existing document when the ratio of the number of similar paragraphs between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold; and judge that the document is dissimilar to the existing document when the ratio is less than the set second threshold.
As a further improvement of the present invention, the judging unit is configured to: judge that the document is similar to the existing document when the ratio of the number of similar paragraphs between the document and the existing document to the total number of paragraphs of the document is greater than or equal to the set second threshold, and the result calculated from the overall paragraph signature of the document and the overall paragraph signature of the existing document is less than a first threshold; and judge that the document is dissimilar to the existing document when the ratio is less than the set second threshold.
As a further improvement of the present invention, the comparing unit is configured to perform a calculation, by an algorithm, on the paragraph signature with the preset number of feature bits obtained for the document and the paragraph signature with the preset number of feature bits of the existing document; if the result is greater than a predetermined first threshold, the paragraphs are dissimilar; if the result is less than or equal to the predetermined first threshold, the paragraphs are similar.
As a further improvement of the present invention, the distance between the paragraph signature of the document and the paragraph signature of the existing document is calculated as a Hamming distance.
As a further improvement of the present invention, the judging unit is configured to define the copyright property of the document that is similar to an existing document as pirated document.
As a further improvement of the present invention, the judging unit is configured to define the copyright property of the document that is similar to an existing document as suspected pirated document.
As a further improvement of the present invention, the document detection device further comprises a unit for sending feedback information after review confirms that the document is a pirated document.
As a further improvement of the present invention, the document detection device further comprises a unit for publishing the non-pirated document online after review confirms that the document is a non-pirated document.
As a further improvement of the present invention, the judging unit is configured to define the property of the one or more existing documents similar to the document as pirated document.
As a further improvement of the present invention, the judging unit is configured to define the property of the one or more existing documents similar to the document as suspected pirated document.
As a further improvement of the present invention, the document detection device further comprises a processing unit for deleting the pirated document after review confirms that the document is a pirated document.
As a further improvement of the present invention, the processing unit is further configured to, after review confirms that the document is a non-pirated document, mark the copyright property of the non-pirated document as verified, and/or copy or move the non-pirated document to a verified-copyright database.
As a further improvement of the present invention, the judging unit is further configured to obtain the copyright property of the document according to the judgment result.
As a further improvement of the present invention, the document detection device further comprises:
a unit for storing digital files that have been verified as legitimate; and
a unit for extracting the paragraph characteristic information of the digital files and building an index.
As a further improvement of the present invention, the document detection device further comprises:
a unit for receiving an uploaded digital file whose copyright property has not been verified.
As a further improvement of the present invention, the judging unit is further configured to obtain the copyright property of the existing document according to the judgment result.
As a further improvement of the present invention, the document detection device further comprises:
a unit for storing existing digital files whose copyright property has not been verified; and
a unit for extracting the paragraph characteristic information of the digital files and building an index.
As a further improvement of the present invention, the document detection device further comprises:
a unit for receiving a digital file that has been verified as legitimate.
As a further improvement of the present invention, the document detection device further comprises:
a unit for identifying whether the digital file is a document;
a unit for converting the digital file into a document by an algorithm.
As a further improvement of the present invention, the document detection device further comprises a unit for storing the document.
The beneficial effects of the present invention are as follows. The present invention detects documents by means of paragraph characteristic information, so similarity comparison between documents can be carried out fairly accurately while cheating is avoided; with this detection mode, query efficiency is higher and the processing load on the server or computer is lower. In addition, the present invention applies the document detection method to improve copyright property detection for online documents: a document can be detected as soon as it is uploaded, which avoids the unnecessary load that later copyright property detection would place on the server. The present invention can also check the copyright property of existing documents stored on the server in batches, so efficiency is higher.
Description of drawings
Fig. 1 is a flowchart of the document detection method in an embodiment of the present invention;
Fig. 2 is a flowchart of the method for obtaining a paragraph signature in an embodiment of the present invention;
Fig. 3 is a flowchart of using the document detection method to screen the copyright property of a document while it is being uploaded, in an embodiment of the present invention;
Fig. 4 is a flowchart of using the document detection method to screen the copyright property of existing documents, in an embodiment of the present invention;
Fig. 5 is a flowchart of building the legitimate database in an embodiment of the present invention;
Fig. 6 is a flowchart of building the unverified-copyright database in an embodiment of the present invention;
Fig. 7 is a block diagram of the document detection device in an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described below with reference to the embodiments shown in the drawings. These embodiments do not limit the present invention; changes in structure, method, or function made by those of ordinary skill in the art on the basis of these embodiments all fall within the protection scope of the present invention.
As shown in Fig. 1, in an embodiment of the present invention, the document detection method comprises:
S1, acquiring the paragraph characteristic information corresponding to a document. Here, the document refers to a text-based electronic file; preferably, in this embodiment, the document is an electronic file whose text can be edited, for example a txt file or a doc file. By identifying the line breaks in the electronic file, the paragraph information of the document can be obtained and the document divided into one or more paragraphs. In the preferred embodiment of the present invention, after the one or more paragraphs are obtained, their paragraph characteristic information can be calculated by a hash algorithm. Preferably, the paragraph characteristic information is a paragraph signature with a preset number of feature bits. To improve the efficiency of the comparison with the paragraph characteristic information of existing documents in step S2 while preserving its accuracy, the preset number of feature bits in the preferred embodiment is 64, for example 110101000100...011 (64 bits in total, each taking the value 0 or 1 and no other value). Of course, in other embodiments of the present invention the preset number of feature bits may also be 128, 256, and so on. As shown in Fig. 2, in an embodiment of the present invention, the method for obtaining the paragraph signature with the preset number of feature bits comprises the following steps:
S100, performing word segmentation on each paragraph in the document to obtain a list of (word, word frequency) two-tuples for the paragraph. Word segmentation methods are well known to those of ordinary skill in the art from the prior art and are not described further here.
S101, initializing a weight vector for the two-tuples in the list. Each word and its word frequency has a weight vector; in the preferred embodiment, if the preset number of feature bits is 64, the weight vector has 64 dimensions, each dimension representing one of the 64 feature bits.
S102, computing the two-tuples with a hash algorithm to obtain a hash string with the preset number of feature bits. In the preferred embodiment, the words and word frequencies of the paragraph are processed by the hash algorithm to obtain a 64-bit hash string.
S103, mapping the hash string onto the weight vector. Specifically: first judge whether each bit of the hash string is 0 or 1. If it is 0, the weight at the corresponding position of the weight vector is decreased when mapping, that is, (-log(f+0.1)) is applied; if it is 1, the weight at the corresponding position is increased when mapping, that is, (+log(f+0.1)) is applied. Note that f is the frequency of occurrence of the word: if the unit of calculation is the paragraph, f is the total number of times the word occurs in the paragraph; if the unit of calculation is the full text, f is the total number of times the word occurs in the full text.
S104, calculating the value at each position of the weight vector to obtain the paragraph signature with the preset number of feature bits. Specifically: judge whether the value at each position of the weight vector is greater than 0; if it is greater than 0, the value at that position is set to 1; if it is less than or equal to 0, the value at that position is set to 0. In this way the paragraph signature with the preset number of feature bits is obtained.
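Steps S100 to S104 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: whitespace splitting stands in for a real word segmenter, only the word (not the word frequency) is hashed, and the hash is taken as the leading bytes of MD5, which limits this sketch to at most 128 feature bits. The names `split_paragraphs`, `hash_bits`, and `paragraph_signature` are hypothetical.

```python
import hashlib
import math
from collections import Counter

BITS = 64  # preset number of feature bits in the preferred embodiment


def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs at line breaks, skipping empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def hash_bits(word: str, bits: int = BITS) -> int:
    """Assumed hash choice: the first bits/8 bytes of MD5, read as an integer."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def paragraph_signature(paragraph: str, bits: int = BITS) -> int:
    # S100: (word, frequency) pairs; whitespace splitting is a stand-in
    # for a proper word segmenter.
    word_freq = Counter(paragraph.split())
    # S101: one weight per feature bit, initialised to zero.
    weights = [0.0] * bits
    for word, f in word_freq.items():
        h = hash_bits(word, bits)      # S102: hash with the preset bit width
        w = math.log(f + 0.1)          # weight derived from the frequency f
        for i in range(bits):          # S103: bit 1 adds w, bit 0 subtracts w
            if (h >> i) & 1:
                weights[i] += w
            else:
                weights[i] -= w
    # S104: a positive weight yields bit 1, anything else yields bit 0.
    signature = 0
    for i, value in enumerate(weights):
        if value > 0:
            signature |= 1 << i
    return signature
```

For a paragraph containing a single distinct word, every weight has the same sign as the corresponding hash bit, so the signature equals the hash of that word.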
S2, comparing the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document. In the preferred embodiment of the present invention, a calculation is performed, by a certain algorithm, on the paragraph signature with the preset number of feature bits obtained for the document and the paragraph signature with the preset number of feature bits of the existing document; preferably, the Hamming distance between the paragraph signature of the document and the paragraph signature of the existing document is calculated. If this distance is greater than a predetermined first threshold, the two paragraphs are considered dissimilar; if this distance is less than or equal to the predetermined first threshold, the two paragraphs are considered similar. In the preferred embodiment, the first threshold is 6. Of course, the comparison may also include a query against an index library of existing documents, that is, comparing the paragraph characteristic information of the document with the paragraph characteristic information of multiple existing documents; how the index library is built is described below with reference to Fig. 5 and Fig. 6.
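The paragraph comparison in step S2 can be illustrated with a short sketch. It assumes that signatures are held as Python integers; the function names are hypothetical, and the default threshold of 6 is the value given for the preferred embodiment.

```python
FIRST_THRESHOLD = 6  # first threshold of the preferred embodiment


def hamming_distance(sig_a: int, sig_b: int) -> int:
    """Count the bit positions at which two signatures differ."""
    return bin(sig_a ^ sig_b).count("1")


def paragraphs_similar(sig_a: int, sig_b: int,
                       threshold: int = FIRST_THRESHOLD) -> bool:
    """Two paragraphs are similar when the distance is at most the threshold."""
    return hamming_distance(sig_a, sig_b) <= threshold
```

With 64-bit signatures, a threshold of 6 means two paragraphs still count as similar when fewer than a tenth of their signature bits differ.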
S3, judging, according to the comparison result, whether there is an existing document similar to the document. From the comparison result, when the ratio of the number of similar paragraphs between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, the document is judged to be similar to the existing document; when the ratio is less than the set second threshold, the document is judged to be dissimilar to the existing document. The second threshold is a value for the ratio (number of similar paragraphs / total number of paragraphs of the document) and can be set as circumstances require: for example, if a more precise comparison is needed, the second threshold can be set higher; if missed matches are a concern, the second threshold can be set lower. In the preferred embodiment, the second threshold is set in the interval 0.5 to 1. Preferably, in the preferred embodiment of the present invention, it is required not only that the ratio of similar paragraphs between the document and the existing document is greater than or equal to the set second threshold, but also that the Hamming distance between the overall paragraph signature of the document and the overall paragraph signature of the existing document is less than or equal to the first threshold. The overall paragraph signature can be obtained as disclosed in step S1, that is, by treating the whole article as a single paragraph and calculating the paragraph signature of that paragraph by the hash algorithm.
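The decision in step S3 can be sketched as below. One assumption is made explicit here because the text does not spell it out: a paragraph of the document is counted as similar when it is within the first threshold of any paragraph of the existing document. The function names and the optional `whole_doc_sigs` parameter (a pair of overall signatures for the stricter preferred variant) are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def hamming_distance(sig_a: int, sig_b: int) -> int:
    return bin(sig_a ^ sig_b).count("1")


def similar_paragraph_ratio(doc_sigs: Sequence[int],
                            existing_sigs: Sequence[int],
                            first_threshold: int = 6) -> float:
    """Similar paragraphs of the document / total paragraphs of the document."""
    if not doc_sigs:
        return 0.0
    similar = sum(
        1 for s in doc_sigs
        if any(hamming_distance(s, e) <= first_threshold for e in existing_sigs)
    )
    return similar / len(doc_sigs)


def documents_similar(doc_sigs: Sequence[int],
                      existing_sigs: Sequence[int],
                      second_threshold: float = 0.5,
                      first_threshold: int = 6,
                      whole_doc_sigs: Optional[Tuple[int, int]] = None) -> bool:
    ratio = similar_paragraph_ratio(doc_sigs, existing_sigs, first_threshold)
    if ratio < second_threshold:
        return False
    # Stricter variant: the overall signatures must also be close.
    if whole_doc_sigs is not None:
        return hamming_distance(*whole_doc_sigs) <= first_threshold
    return True
```

Raising `second_threshold` toward 1 makes the check stricter, while lowering it toward 0.5 reduces the chance of missing a match, mirroring the trade-off described above.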
By above-mentioned flow process, can carry out the similarity comparison between the document comparatively exactly, avoid cheating, and search efficiency is higher, the server/computer processing pressure is less.
Along with popularizing and development of internet, online reading has become a kind of main reading method, and simultaneously, shared, the popularization of online document have also become a kind of important information issue means.For example, is there Google books (http://books.google.com/bkshp more famous abroad online reading website? tab=yp), there is Baidu library (http://wenku.baidu.com/) or the like domestic more famous online reading website.
This open online document sharing, popularization, reading method, though made things convenient for user search greatly, read, share document, easily, the gratis obtains relevant information and knowledge.But this mode also exists big drawback: promptly be because of the sharing, promote of user or service provider, and invaded the literary property of original text author, distribution society, publisher, make the latter suffer enormous economic loss.
To address the above problem, documents that are to be shared or promoted need to be examined in order to determine their copyright property. In one embodiment of the present invention, the document detection method described above can be used to determine the copyright property of documents in two main ways: in the first, the judgment result of the above steps is used to obtain the copyright property of a document as it is uploaded to the server; in the second, the judgment result is used to obtain the copyright property of existing documents already stored on the server. The specific flows of the two ways are described below with reference to the accompanying drawings.
As shown in Fig. 3, in one embodiment of the present invention, the copyright property of a document being uploaded can be obtained from the judgment result of the above document detection method, and a corresponding operation can be performed on the document according to that copyright property. The steps include:
S10, receive an uploaded digital document whose copyright property has not been verified; in this step, a user typically logs in to an online document website through a browser or client software and uploads a local digital document to the website's server, that is, the server of the online document website receives the uploaded digital document. The copyright property of this digital document has usually not been verified, and it may have been obtained through various channels, for example downloading or scanning. The digital document can take many forms, such as text, e-book, picture, or PDF. In the application scenario of this embodiment, the unverified digital document is generally uploaded from a client by a user or a service provider.
S11, identify whether the digital document is a document; if so, go to step S13; if not, go to step S12 first and then to step S13; in this step, whether the digital document is a document can be judged from its file suffix. For example, if the digital document is text and its suffix is a document suffix such as txt or doc, it is judged to be a document; if the digital document is a picture, PDF, or the like, with a non-document suffix such as jpg, bmp, or pdf, it is judged to be a non-document. Of course, the server of the online document website may instead be configured to accept only document files, which excludes non-document files; in that case this step is not required.
S12, convert the digital document to a document by an algorithm; preferably, in this embodiment, the algorithm can be an OCR recognition algorithm of the kind widely used in the industry. Those of ordinary skill in the art can implement this with existing techniques, so it is not described further here. Of course, the server of the online document website may instead be configured to accept only document files, which excludes non-document files; in that case this step is not required.
S13, store the document; the stored documents include both digital documents identified as documents and documents obtained by conversion. This step is optional: the document can instead be held in memory (RAM) and deleted from memory once its copyright property has been examined.
S14, obtain the paragraph characteristic information corresponding to the document; this step uses the specific steps described for Fig. 1 and Fig. 2, which are not repeated here.
S15, compare the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document that has been verified as legitimate; the comparison is carried out as in step S2 of Fig. 1 and is not repeated here. In this embodiment, the existing documents are those stored in a legitimate database established in advance, and their paragraph characteristic information is index information built from that legitimate database. The copyright property of the document is confirmed by comparing its paragraph characteristic information with the paragraph characteristic information of the existing documents in this index. The steps for building the index information are described in detail below with reference to Fig. 5.
S16, judge whether the document is similar to one or more existing documents; in this embodiment, this step can be carried out as in step S3 of Fig. 1 and is not repeated here. When the document is judged to be similar to an existing document, go to step S17; when the document is judged to be dissimilar to the existing documents, go directly to step S19b;
S17, define the copyright property of the document that is similar to an existing document as suspected pirated document; preferably, one or more suspected pirated documents can be aggregated into a suspect list, and the suspect list and/or the suspected pirated documents can be stored at a specified path on the server. A reviewer can access the suspect list and/or the suspected pirated documents through this path in order to carry out step S18 below. Alternatively, the suspect list and/or the suspected pirated documents can be actively pushed to a designated reviewer, for example by e-mail, so that the reviewer can handle the review promptly. In another embodiment of the present invention, the copyright property of the document that is similar to an existing document can be defined directly as pirated document, and the flow goes to step S19a to send feedback information to the user;
S18, review the suspected pirated document to confirm whether it is a pirated document; if it is confirmed as a pirated document, go to step S19a; if it is confirmed as a non-pirated document, go to step S19b;
S19a, send feedback information; preferably, in this embodiment, the feedback information is sent to the party that uploaded the unverified digital document. In general it can be sent as a prompt box, for example a prompt box popped up in the browser or in the client. A new page can also be returned to the browser to inform the uploader that the uploaded digital document is pirated or did not pass copyright verification.
S19b, publish the non-pirated document online. In one embodiment, the non-pirated document is added to the online document database. Preferably, in a specific embodiment of the present invention, the online document database is the legitimate database itself; adding the non-pirated document then expands the set of legitimate documents in the legitimate database, so that digital documents uploaded in the future can be screened more effectively.
Through the above flow, a document can be checked at the moment it is uploaded, which avoids the unnecessary load on the server that checking its copyright property later would cause.
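The branching in steps S11 and S16 to S19 can be sketched as below. This is only an illustration of the control flow: the function names, the returned action strings, and the `skip_review` flag (standing for the alternative embodiment that treats a similar document as pirated without review) are hypothetical, and the suffix set lists only the two document suffixes named in the text.

```python
from pathlib import Path
from typing import Optional

# Suffixes the description names as document suffixes (assumed non-exhaustive).
DOCUMENT_SUFFIXES = {".txt", ".doc"}


def needs_conversion(filename: str) -> bool:
    """S11: a file without a document suffix goes through S12 (conversion) first."""
    return Path(filename).suffix.lower() not in DOCUMENT_SUFFIXES


def route_upload(similar_to_legitimate: bool,
                 reviewer_confirms_pirated: Optional[bool] = None,
                 skip_review: bool = False) -> str:
    """S16-S19: decide what happens to an uploaded document."""
    if not similar_to_legitimate:
        return "publish"               # S19b
    if skip_review:
        return "send_feedback"         # treated as pirated directly, then S19a
    if reviewer_confirms_pirated is None:
        return "queue_for_review"      # S17: waits on the suspect list for S18
    return "send_feedback" if reviewer_confirms_pirated else "publish"
```

For example, `route_upload(True)` leaves the document queued for a reviewer, while `route_upload(True, reviewer_confirms_pirated=False)` publishes it after review clears it.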
As shown in Fig. 4, in one embodiment of the present invention, the copyright property of existing documents (stored on the server) whose copyright property has not been verified can be obtained from the judgment result of the above document detection method, and a corresponding operation can be performed on those existing documents according to that copyright property. The steps include:
S20, receive a digital document that has been verified as legitimate; in this step, the legitimate digital document can be obtained from an authorized third party or by reading the legitimate database. The third party can be any legitimate source, such as a partner website, an author, a writers' association, a university, or a publishing house, and can upload the legitimate digital document to the server; that is, the server receives an uploaded digital document that has been verified as legitimate. The digital document can take many forms, such as text, e-book, picture, or PDF. Preferably, the legitimate digital document can be stored in the legitimate database described above.
S21, identify whether the digital document is a document; if so, go to step S23; if not, go to step S22 first and then to step S23; in this step, whether the digital document is a document can be judged from its file suffix. For example, if the digital document is text and its suffix is a document suffix such as txt or doc, it is judged to be a document; if the digital document is a picture, PDF, or the like, with a non-document suffix such as jpg, bmp, or pdf, it is judged to be a non-document. Of course, the server of the online document website may instead be configured to accept only document files, which excludes non-document files; in that case this step is not required.
S22, convert the digital document to a document by an algorithm; preferably, in this embodiment, the algorithm can be an OCR recognition algorithm of the kind widely used in the industry. Those of ordinary skill in the art can implement this with existing techniques, so it is not described further here. Of course, the server of the online document website may instead be configured to accept only document files, which excludes non-document files; in that case this step is not required.
S23, obtain the paragraph characteristic information corresponding to the document; this step uses the specific steps described for Fig. 1 and Fig. 2, which are not repeated here.
S24, compare the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document whose copyright property has not been verified; the comparison is carried out as in step S2 of Fig. 1 and is not repeated here. In this embodiment, the existing documents are those stored in an unverified-copyright database established in advance, and their paragraph characteristic information is index information built from that unverified-copyright database. The copyright property of the existing documents is confirmed by comparing the paragraph characteristic information of the document with the paragraph characteristic information of the existing documents in this index. The steps for building the index information are described in detail below with reference to Fig. 6.
S25, judge whether the document is similar to one or more existing documents; in this embodiment, this step can be carried out as in step S3 of Fig. 1 and is not repeated here. It is worth noting that in this step the relationship between the document and the existing documents is generally one-to-many: one legitimate document may well correspond to several similar pirated or suspected pirated documents, so the pirated or suspected pirated documents corresponding to a legitimate document can be compared and processed in batches in this way. When the document is judged to be similar to an existing document, go to step S26; when the document is judged to be dissimilar to the existing documents, go directly to step S29;
S26, define the copyright property of the one or more existing documents that are similar to the document as suspected pirated document; preferably, one or more suspected pirated documents can be aggregated into a suspect list, and the suspect list and/or the suspected pirated documents can be stored at a specified path on the server. A reviewer can access the suspect list and/or the suspected pirated documents through this path in order to carry out step S27 below. Alternatively, the suspect list and/or the suspected pirated documents can be actively pushed to a designated reviewer, for example by e-mail, so that the reviewer can handle the review promptly. In another embodiment of the present invention, the copyright property of the one or more existing documents that are similar to the document can be defined directly as pirated document, and the flow goes to step S28 to delete the pirated documents directly;
S27, review the suspected pirated document to confirm whether it is a pirated document; if it is confirmed as a pirated document, go to step S28; if it is confirmed as a non-pirated document, go to step S29;
S28, delete the pirated document; in this embodiment, this means deleting the pirated document from the unverified-copyright database.
S29, retain the non-pirated document. Preferably, in this embodiment, the copyright property of the non-pirated document is also marked as verified, and/or the non-pirated document is copied or moved to a verified-copyright database. As a special case of this embodiment, the verified-copyright database is the legitimate database described above.
Preferably, in this embodiment, the above steps can be repeated until all existing documents stored in the unverified-copyright database have been screened, that is, until the pirated documents have been deleted.
Through the above flow, the copyright property of the existing documents stored on the server can be checked in batches, which is more efficient.
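The one-to-many screening of steps S24 to S29 amounts to partitioning the unverified documents by whether they resemble the legitimate document. The sketch below assumes a hypothetical `screen_existing` function and leaves the similarity judgment itself (steps S2 and S3) to a caller-supplied `is_similar` callable; documents are represented only by lists of integer paragraph signatures, which is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def screen_existing(legit_sigs: List[int],
                    unverified: Dict[str, List[int]],
                    is_similar: Callable[[List[int], List[int]], bool]
                    ) -> Tuple[List[str], List[str]]:
    """Split unverified documents into (suspected pirated, retained).

    `unverified` maps a document id to that document's paragraph signatures.
    """
    suspected: List[str] = []   # go to review (S26/S27) or deletion (S28)
    retained: List[str] = []    # kept as non-pirated (S29)
    for doc_id, sigs in unverified.items():
        if is_similar(legit_sigs, sigs):
            suspected.append(doc_id)
        else:
            retained.append(doc_id)
    return suspected, retained
```

One call handles every unverified document against a single legitimate document, which mirrors the batch processing described above; repeating the call for each legitimate document screens the whole unverified-copyright database.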
As shown in Fig. 5, in one embodiment of the present invention, the legitimate database and the index information generated from it are used to obtain the copyright property of the document from the judgment result of the document detection method. The method of building this index information comprises the following steps:
S30, obtain a digital document that has been verified as legitimate; it can be obtained from an authorized third party, which can be any legitimate source such as a partner website, an author, a writers' association, a university, or a publishing house.
S31, identify whether the digital document is a document; if so, go to step S33; if not, carry out step S32 to convert the digital document and then go to step S33; in this step, whether the digital document is a document can be judged from its file suffix. For example, if the digital document is text and its suffix is a document suffix such as txt or doc, it is judged to be a document; if the digital document is a picture, PDF, or the like, with a non-document suffix such as jpg, bmp, or pdf, it is judged to be a non-document. Of course, the server of the online document website may instead be configured to accept only document files, which excludes non-document files; in that case this step is not required.
S32, convert the digital document to a document by an algorithm; in this embodiment, the algorithm can be an OCR recognition algorithm of the kind widely used in the industry. Those of ordinary skill in the art can implement this with existing techniques, so it is not described further here. Of course, the server of the online document website may instead be configured to accept only document files, which excludes non-document files; in that case this step is not required.
S33, extract the paragraph characteristic information of the digital document and build an index. The paragraph characteristic information can be extracted by the method disclosed for Fig. 1 and Fig. 2, which is not repeated here. Preferably, the index can be stored in the first indexing unit for later queries. In another embodiment, in addition to indexing the paragraph characteristic information, corresponding indexes are also built for information such as the document's title, author, word count, length, and words, in order to support different types of query.
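One simple way to picture the index of step S33 is an inverted map from paragraph signature to the documents that contain it. The description does not specify the index structure, so the sketch below is an assumption throughout: `build_signature_index` and `candidate_documents` are hypothetical names, signatures are plain integers, and the lookup is a linear scan chosen for clarity rather than speed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_signature_index(documents: Dict[str, Iterable[int]]) -> Dict[int, Set[str]]:
    """Map each paragraph signature to the ids of the documents containing it."""
    index: Dict[int, Set[str]] = defaultdict(set)
    for doc_id, signatures in documents.items():
        for sig in signatures:
            index[sig].add(doc_id)
    return index


def candidate_documents(index: Dict[int, Set[str]],
                        signature: int,
                        first_threshold: int = 6) -> Set[str]:
    """Return ids of documents with a paragraph within the Hamming threshold."""
    found: Set[str] = set()
    for sig, doc_ids in index.items():
        # XOR leaves a 1 bit wherever the two signatures differ.
        if bin(sig ^ signature).count("1") <= first_threshold:
            found |= doc_ids
    return found
```

A production index would likely avoid scanning every stored signature, but the scan keeps the relationship between index entries and the Hamming threshold easy to follow.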
As shown in Fig. 6, in one embodiment of the present invention, the unverified-copyright database and the index information generated from it are used to obtain the copyright property of the existing documents from the judgment result of the document detection method. The method of building this index information comprises the following steps:
S40, obtain existing digital documents whose copyright property has not been verified; preferably, in this embodiment, these are digital documents that have been uploaded to the online document database. They may be unverified digital documents that are already published online, or unverified digital documents that have not yet been published. Most of them were uploaded by users or service providers, and their copyright property has not been verified through a regular procedure.
S41, identify whether the digital document is a document; if so, go to step S43; if not, carry out step S42 to convert the digital document and then go to step S43; in this step, whether the digital document is a document can be judged from its file suffix. For example, if the digital document is text and its suffix is a document suffix such as txt or doc, it is judged to be a document; if the digital document is a picture, PDF, or the like, with a non-document suffix such as jpg, bmp, or pdf, it is judged to be a non-document. Of course, the server of the online document website may instead be configured to accept only document files, which excludes non-document files; in that case this step is not required.
S42, convert the digital document to a document by an algorithm; in this embodiment, the algorithm can be an OCR recognition algorithm of the kind widely used in the industry. Those of ordinary skill in the art can implement this with existing techniques, so it is not described further here. Of course, the server of the online document website may instead be configured to accept only document files, which excludes non-document files; in that case this step is not required.
S43, extract the paragraph characteristic information of the digital document and build an index. The paragraph characteristic information can be extracted by the method disclosed for Fig. 1 and Fig. 2, which is not repeated here. Preferably, the index can be stored in the second indexing unit for later queries. In another embodiment, in addition to indexing the paragraph characteristic information, corresponding indexes are also built for information such as the document's title, author, word count, length, and words, in order to support different types of query.
Fig. 7 is a block diagram of the document detection device in one embodiment of the present invention. The document detection device comprises an acquiring unit 10, a comparing unit 11, a judging unit 12, a first receiving unit 13, a recognition unit 14, a converting unit 15, a storage unit 16, a second receiving unit 17, a legitimate database 18, a first indexing unit 19, an unverified-copyright database 20, a second indexing unit 21, a feedback unit 22, a publishing unit 23, and a processing unit 24.
In one embodiment of the present invention, the acquiring unit is used to obtain the paragraph characteristic information corresponding to a document. Here, a document is a text-based electronic file; preferably, in this embodiment, it may also be an electronic file whose text can be edited, such as a txt file or a doc file. By identifying the line breaks in the electronic file, the paragraph information of the document can be obtained and the document divided into one or more paragraphs. In the preferred embodiment of the present invention, after the one or more paragraphs are obtained, their paragraph characteristic information can be computed with a hash algorithm. Preferably, the paragraph characteristic information is a paragraph signature with a preset number of feature bits. To improve the efficiency of comparison with the paragraph characteristic information of existing documents in the comparing unit while preserving accuracy, the preset number of feature bits in the preferred embodiment is 64, for example 110101000100...011 (64 bits in total, each bit being 0 or 1 and nothing else); in other embodiments of the present invention the preset number of feature bits can also be 128, 256, and so on. When obtaining the paragraph signature with the preset number of feature bits, the acquiring unit can be used to: segment each paragraph of the document into words, obtaining a list of two-tuples of word and word frequency for the paragraph; initialize a weight vector for the two-tuples in the list; compute, with the hash algorithm, a hash string of the preset number of feature bits for each two-tuple; and map the hash string onto the weight vector. Specifically, each bit of the hash string is checked to see whether it is 0 or 1. If it is 0, the corresponding position of the weight vector is decreased when mapping, that is, by (-log(f+0.1)); if it is 1, the corresponding position of the weight vector is increased when mapping, that is, by (+log(f+0.1)). Note that f is the frequency of occurrence of the word: if the unit of computation is a paragraph, f is the total number of times the word occurs in the paragraph; if the unit is the full text, f is the total number of times the word occurs in the full text. Finally, the value at each position of the weight vector is evaluated to obtain the paragraph signature with the preset number of feature bits. Specifically, it is judged whether the value at each position of the weight vector is greater than 0; if it is greater than 0, the value at that position is set to 1; if it is less than or equal to 0, the value at that position is set to 0. In this way the paragraph signature with the preset number of feature bits is obtained.
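The signature computation described for the acquiring unit resembles what is commonly called a simhash, and can be sketched as follows. Several details are assumptions, since the description leaves them open: the function name `paragraph_signature` is hypothetical, SHA-256 truncated to the preset number of bits stands in for the unspecified hash algorithm, the hash is taken over the word alone rather than over the (word, frequency) two-tuple, and the caller is expected to supply already segmented words.

```python
import hashlib
import math
from collections import Counter
from typing import Iterable


def paragraph_signature(words: Iterable[str], bits: int = 64) -> int:
    """Compute a simhash-style paragraph signature of `bits` bits (bits <= 256)."""
    weights = [0.0] * bits
    for word, freq in Counter(words).items():
        # Assumption: SHA-256, truncated to `bits` bits, as the hash algorithm.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        word_hash = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        weight = math.log(freq + 0.1)  # weighting term given in the description
        for i in range(bits):
            if (word_hash >> i) & 1:
                weights[i] += weight   # hash bit is 1: add weight
            else:
                weights[i] -= weight   # hash bit is 0: subtract weight
    signature = 0
    for i, value in enumerate(weights):
        if value > 0:                  # positive position becomes bit 1, else 0
            signature |= 1 << i
    return signature
```

For instance, `paragraph_signature("the cat sat on the mat".split())` yields a 64-bit integer; splitting on whitespace is only a stand-in for the word segmentation step, which for Chinese text would need a real segmenter.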
The comparing unit is used to compare the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document. In the preferred embodiment of the present invention, this means computing, according to a given algorithm, on the paragraph signatures with the preset number of feature bits obtained for the document and for the existing document. Preferably, the Hamming distance between a paragraph signature of the document and a paragraph signature of the existing document is computed: if the distance is greater than a predetermined first threshold, the two paragraphs are considered dissimilar; if the distance is less than or equal to the predetermined first threshold, the two paragraphs are considered similar. In the preferred embodiment of the present invention, the first threshold is 6. The comparison can also include a query against an index database of existing documents, that is, the paragraph characteristic information of the document is compared with the paragraph characteristic information of multiple existing documents. How this index database is built is described with reference to Fig. 5 and Fig. 6.
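The paragraph-level comparison can be sketched as below, assuming signatures are held as Python integers. The function names are hypothetical; the default threshold of 6 is the first threshold stated above.

```python
from typing import Iterable, List


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def count_similar_paragraphs(doc_sigs: Iterable[int],
                             existing_sigs: List[int],
                             first_threshold: int = 6) -> int:
    """Count paragraphs of the document that match any paragraph of the existing document."""
    return sum(
        1
        for sig in doc_sigs
        if any(hamming_distance(sig, other) <= first_threshold
               for other in existing_sigs)
    )
```

The returned count is the "number of similar paragraphs" that the judging unit divides by the document's total paragraph count.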
The judging unit is used to judge, according to the comparison result, whether there is an existing document similar to the document. When the ratio of the number of paragraphs of the document that are similar to the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, the document is judged to be similar to the existing document; when that ratio is less than the set second threshold, the document is judged to be dissimilar to the existing document. The second threshold is a value of the ratio (number of similar paragraphs)/(total number of paragraphs of the document) and can be set as the situation requires: for example, if a more precise comparison is needed, the second threshold can be set larger; if there is a concern that the comparison may miss matches, the second threshold can be set smaller. In the preferred embodiment of the present invention, the second threshold is set within the interval 0.5 to 1. Preferably, in the preferred embodiment, it is required not only that the proportion of similar paragraphs between the document and the existing document be greater than or equal to the set second threshold, but also that the Hamming distance between the whole-document paragraph signature of the document and that of the existing document be less than or equal to the first threshold. The whole-document paragraph signature can be obtained as disclosed for the acquiring unit, that is, by treating the entire document as one paragraph and computing the paragraph signature of that paragraph with the hash algorithm.
With the above units, the similarity between documents can be compared fairly accurately, cheating is avoided, lookup efficiency is higher, and the processing load on the server/computer is lower.
In one embodiment of the present invention, the document detection device can also be used to determine the copyright property of documents. The judging unit is further used to obtain the copyright property of the document according to the above judgment result, and to obtain the copyright property of the existing documents according to the above judgment result. Preferably, depending on the application scenario, the judging unit is further used to define the copyright property of a document that is similar to an existing document as pirated document or suspected pirated document, or to define the copyright property of one or more existing documents that are similar to the document as pirated document or suspected pirated document.
In this embodiment, the document detection device further comprises:
The first receiving unit 13 is used to receive an uploaded digital document whose copyright property has not been verified. A user typically logs in to an online document website through a browser or client software and uploads a local digital document to the website's server, that is, the server of the online document website receives the uploaded digital document. The copyright property of this digital document has usually not been verified, and it may have been obtained through various channels, for example downloading or scanning. The digital document can take many forms, such as text, e-book, picture, or PDF. In the application scenario of this embodiment, the unverified digital document is generally uploaded from a client by a user or a service provider.
The recognition unit 14 is used to identify whether the digital document is a document. In this unit, whether the digital document is a document can be judged from its file suffix. For example, if the digital document is text and its suffix is a document suffix such as txt or doc, it is judged to be a document; if the digital document is a picture, PDF, or the like, with a non-document suffix such as jpg, bmp, or pdf, it is judged to be a non-document. Of course, the server of the online document website may instead be configured to accept only document files, which excludes non-document files; in that case this unit is not required.
The converting unit 15 is used to convert the digital document to a document by an algorithm. Preferably, in this embodiment, the algorithm can be an OCR recognition algorithm of the kind widely used in the industry. Those of ordinary skill in the art can implement this with existing techniques, so it is not described further here. Of course, the server of the online document website may instead be configured to accept only document files, which excludes non-document files; in that case this unit is not required.
The storage unit 16 is used to store the document. The stored documents include both digital documents identified as documents and documents obtained by conversion. This unit is optional: the document can instead be held in memory (RAM) and deleted from memory once its copyright property has been examined.
The legitimate database 18 is used to store digital documents that have been verified as legitimate. A legitimate digital document can be obtained from an authorized third party, which can be any legitimate source such as a partner website, an author, a writers' association, a university, or a publishing house, and which can upload the legitimate digital document to the server; that is, the server receives the uploaded digital document that has been verified as legitimate and stores it in the legitimate database 18. The digital document can take many forms, such as text, e-book, picture, or PDF.
The first indexing unit 19 is used to extract the paragraph characteristic information of the legitimate digital documents and build an index. Preferably, the first indexing unit 19 works together with the acquiring unit 10 to extract the paragraph characteristic information of a digital document; the extraction can follow the method disclosed for Fig. 1 and Fig. 2, which is not repeated here. Preferably, the index can be stored in the first indexing unit for later queries. In another embodiment, in addition to indexing the paragraph characteristic information, corresponding indexes are also built for information such as the document's title, author, word count, length, and words, in order to support different types of query.
The second receiving unit 17 is used to receive a digital document that has been verified as legitimate. In this unit, the legitimate digital document can be obtained from an authorized third party or from the legitimate database described above. The third party can be any legitimate source, such as a partner website, an author, a writers' association, a university, or a publishing house, and can upload the legitimate digital document to the server; that is, the server receives an uploaded digital document that has been verified as legitimate. The digital document can take many forms, such as text, e-book, picture, or PDF.
Unverified-copyright database 20 stores existing digital documents whose copyright property has not been verified. Preferably, in the present embodiment, these are digital documents that have been uploaded to an online document database; they may be unverified digital documents that have already been published online, or unverified digital documents that have not yet been published. Most such documents are uploaded by users or service providers, and their copyright property has not been verified through regular channels.
Second indexing unit 21 extracts the paragraph characteristic information of the digital documents whose copyright property has not been verified and builds an index. Preferably, the second indexing unit 21 works together with the acquiring unit 10 to extract the paragraph characteristic information of a digital document; the extraction method is the one disclosed with reference to Fig. 1 and Fig. 2 and is not repeated here. In another embodiment, in addition to the index on the paragraph characteristic information, corresponding indexes are also built on information such as the document's title, author, word count, length, and words, so as to support different types of queries.
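The indexing described for the two indexing units can be pictured with a small in-memory structure. The sketch below is only an assumption about how such an index might be organized: the class name `ParagraphIndex`, its fields, and the choice of an author index as the example of a metadata index are hypothetical, and the paragraph signatures are taken as precomputed integers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class ParagraphIndex:
    """Hypothetical in-memory index from paragraph signature to its occurrences."""

    def __init__(self) -> None:
        # signature -> list of (document id, paragraph number)
        self.by_signature: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
        # example of an additional metadata index: author -> document ids
        self.by_author: Dict[str, List[str]] = defaultdict(list)

    def add_document(self, doc_id: str, author: str,
                     signatures: Iterable[int]) -> None:
        """Index every paragraph signature of one document."""
        for paragraph_no, sig in enumerate(signatures):
            self.by_signature[sig].append((doc_id, paragraph_no))
        self.by_author[author].append(doc_id)

    def lookup(self, sig: int) -> List[Tuple[str, int]]:
        """Exact-match lookup; near matches would need a distance-based search."""
        return list(self.by_signature.get(sig, []))
```

An exact-match lookup like this only finds identical signatures; finding paragraphs within a small Hamming distance would require scanning candidates or a more elaborate index, which this sketch leaves out.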
Feedback unit 22 sends feedback information. Preferably, it sends the feedback information after review confirms that the document is a pirated document. In the present embodiment, the feedback information can be sent to the party that uploaded the digital document whose copyright property was not verified. In general, it can be sent in the form of a prompt box, for example a prompt box popped up in a browser or in a client. Alternatively, a new page can be returned to the browser to inform the uploader that the content of the uploaded digital document is pirated or did not pass copyright verification.
Release unit 23 publishes the non-pirated document online. Preferably, it publishes the document online after review confirms that the document is a non-pirated document. In one embodiment, the non-pirated document is added to the online document database. Preferably, in a particular embodiment of the present invention, the online document database is the legal database; adding the non-pirated document effectively expands the set of legal documents in the legal database, so that digital documents uploaded in the future can be screened more effectively against the legal database.
Processing unit 24 deletes the pirated document. Preferably, it deletes the pirated document after review confirms that the document is a pirated document.
In the present embodiment, the processing unit 24 is also used, after review confirms that the document is a non-pirated document, to mark the copyright property of the non-pirated document as verified, and/or to copy or move the non-pirated document to a verified-copyright database. As a special case of the present embodiment, the verified-copyright database is the legal database described above.
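The division of work among the feedback unit, the release unit, and the processing unit after review can be sketched as follows. This is a simplified illustration under assumptions: the `Store` class, the `Verdict` enum, and `handle_audit` are hypothetical names, the databases are modeled as plain dictionaries, and the special case in which the verified-copyright database is the legal database is assumed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    PIRATED = "pirated"
    NOT_PIRATED = "not_pirated"


@dataclass
class Store:
    # doc_id -> content; stands in for the unverified-copyright database
    unverified: Dict[str, str] = field(default_factory=dict)
    # stands in for the legal (verified-copyright) database
    legal: Dict[str, str] = field(default_factory=dict)
    # messages that would be shown to the uploader
    feedback: List[str] = field(default_factory=list)


def handle_audit(store: Store, doc_id: str, verdict: Verdict) -> None:
    """Apply the outcome of a manual review to one suspected document."""
    content = store.unverified.pop(doc_id)  # leaves the unverified pool either way
    if verdict is Verdict.PIRATED:
        # processing unit: the pirated document is deleted (not stored anywhere);
        # feedback unit: the uploader is told why
        store.feedback.append(f"{doc_id}: rejected, content appears to be pirated")
    else:
        # release/processing unit: the document moves to the verified database
        store.legal[doc_id] = content
```

Moving confirmed non-pirated documents into the legal database means later uploads are also compared against them, which matches the expansion of the legal database described for the release unit.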
With the above units, a document can be checked at the time it is uploaded, which avoids the unnecessary load on the server that later copyright checking of the document would cause; in addition, copyright checking of the existing documents stored on the server can be processed in batches, which is more efficient.
For convenience of description, the above device is described as being divided by function into various units. Of course, when implementing the present application, the functions of the units can be realized in one or more pieces of software and/or hardware.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present application, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, magnetic disk, or optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments can be appropriately combined to form other embodiments that those skilled in the art can understand.
The detailed descriptions listed above are only specific descriptions of feasible embodiments of the present invention and are not intended to limit its scope of protection; all equivalent embodiments or changes that do not depart from the technical spirit of the present invention shall fall within the scope of protection of the present invention.

Claims (55)

1. A document detection method, characterized in that the document detection method comprises the following steps:
S1, obtaining paragraph characteristic information corresponding to a document;
S2, comparing the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document;
S3, judging, according to the comparison result, whether there is an existing document similar to the document.
2. The document detection method according to claim 1, characterized in that the paragraph characteristic information is a paragraph signature with a preset number of feature bits.
3. The document detection method according to claim 2, characterized in that the paragraph signature with the preset number of feature bits is obtained by a hash algorithm.
4. The document detection method according to claim 3, characterized in that said "obtaining the paragraph signature with the preset number of feature bits by a hash algorithm" specifically comprises the following steps:
S100, performing word segmentation on each paragraph in the document to obtain a list of (word, word frequency) two-tuples for the paragraph;
S101, computing an initial weight vector for the two-tuples in the list;
S102, computing a hash of the two-tuples by the hash algorithm to obtain a hash string with the preset number of feature bits;
S103, mapping the hash string onto the weight vector;
S104, computing the value of each corresponding position of the weight vector to obtain the paragraph signature with the preset number of feature bits.
5. The document detection method according to claim 4, characterized in that step S103 specifically comprises:
judging whether each bit of the hash string is 0 or 1; if it is 0, the weight is subtracted at the corresponding position when mapping onto the weight vector; if it is 1, the weight is added at the corresponding position when mapping onto the weight vector.
6. The document detection method according to claim 4 or 5, characterized in that step S104 specifically comprises:
judging whether the value at each corresponding position of the weight vector is greater than 0; if it is greater than 0, the value at that position is set to 1; if it is less than or equal to 0, the value at that position is set to 0.
7. The document detection method according to claim 3, characterized in that, when the ratio of the number of paragraphs of the document that are similar to the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, the document is judged to be similar to the existing document; when the ratio is less than the set second threshold, the document is judged to be dissimilar to the existing document.
8. The document detection method according to claim 3, characterized in that, when the ratio of the number of paragraphs of the document that are similar to the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, and the result calculated from the total paragraph signature of the document and the total paragraph signature of the existing document is less than a first threshold, the document is judged to be similar to the existing document; when the ratio is less than the set second threshold, the document is judged to be dissimilar to the existing document.
9. The document detection method according to claim 7 or 8, characterized in that the similar paragraphs are obtained by the following steps:
performing a calculation, by an algorithm, on the paragraph signature with the preset number of feature bits obtained for the document and the paragraph signature with the preset number of feature bits of the existing document; if the calculation result is greater than a predetermined first threshold, the paragraphs are dissimilar; if the calculation result is less than or equal to the predetermined first threshold, the paragraphs are similar.
10. The document detection method according to claim 9, characterized in that said calculation on the paragraph signature of the document and the paragraph signature of the existing document is a calculation of the Hamming distance between the paragraph signature of the document and the paragraph signature of the existing document.
11. The document detection method according to claim 7 or 8, characterized in that the copyright property of the document that is similar to an existing document is defined as pirated document.
12. The document detection method according to claim 7 or 8, characterized in that the copyright property of the document that is similar to an existing document is defined as suspected pirated document.
13. The document detection method according to claim 12, characterized in that the suspected pirated document is reviewed; if the review confirms that the suspected pirated document is a pirated document, feedback information is sent; if the review confirms that the suspected pirated document is a non-pirated document, the non-pirated document is published online.
14. The document detection method according to claim 7 or 8, characterized in that the copyright property of the one or more existing documents that are similar to the document is defined as pirated document.
15. The document detection method according to claim 7 or 8, characterized in that the property of the one or more existing documents that are similar to the document is defined as suspected pirated document.
16. The document detection method according to claim 15, characterized in that the suspected pirated document is reviewed; if the review confirms that the suspected pirated document is a pirated document, the pirated document is deleted; if the review confirms that the suspected pirated document is a non-pirated document, the non-pirated document is retained.
17. The document detection method according to claim 16, characterized in that the copyright property of the non-pirated document is marked as verified, and/or the non-pirated document is copied or moved to a verified-copyright database.
18. The document detection method according to claim 17, characterized in that the steps of claim 17 are repeated until all existing documents have been screened.
19. The document detection method according to claim 1, characterized in that the copyright property of the document is obtained according to the judgment result.
20. The document detection method according to claim 19, characterized in that, before step S1, the method further comprises a step of building the paragraph characteristic information of the existing document:
obtaining a digital document that has been verified as legal;
extracting the paragraph characteristic information of the digital document and building an index.
21. The document detection method according to claim 20, characterized in that the step of "building the paragraph characteristic information of the existing document" further comprises:
identifying whether the digital document is a document;
if so, extracting the paragraph characteristic information of the document and building an index; if not, converting the digital document into a document by an algorithm, and then extracting the paragraph characteristic information of the document and building an index.
22. The document detection method according to claim 20, characterized in that, after the step of "building the paragraph characteristic information of the existing document", the method further comprises:
receiving an uploaded digital document whose copyright property has not been verified.
23. The document detection method according to claim 22, characterized in that, after the step of "receiving an uploaded digital document whose copyright property has not been verified", the method further comprises:
judging whether the digital document is a document;
if so, performing step S1; if not, converting the digital document into a document by an algorithm, and then performing step S1.
24. The document detection method according to claim 23, characterized in that, before step S1, the method further comprises storing the document.
25. The document detection method according to claim 1, characterized in that the copyright property of the existing document is obtained according to the judgment result.
26. The document detection method according to claim 25, characterized in that, before step S1, the method further comprises a step of building the paragraph characteristic information of the existing document:
obtaining an existing digital document whose copyright property has not been verified;
extracting the paragraph characteristic information of the digital document and building an index.
27. The document detection method according to claim 26, characterized in that the step of "building the paragraph characteristic information of the existing document" further comprises:
identifying whether the digital document is a document;
if so, extracting the paragraph characteristic information of the document and building an index; if not, converting the digital document into a document by an algorithm, and then extracting the paragraph characteristic information of the document and building an index.
28. The document detection method according to claim 25, characterized in that, after the step of "building the paragraph characteristic information of the existing document", the method further comprises:
receiving a digital document that has been verified as legal.
29. The document detection method according to claim 28, characterized in that, after the step of "receiving a digital document that has been verified as legal", the method further comprises:
judging whether the digital document is a document;
if so, performing step S1; if not, converting the digital document into a document by an algorithm, and then performing step S1.
30. A document detection device, characterized in that the document detection device comprises:
an acquiring unit, used to obtain paragraph characteristic information corresponding to a document;
a comparing unit, used to compare the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document;
a judging unit, used to judge, according to the comparison result, whether there is an existing document similar to the document.
31. The document detection device according to claim 30, characterized in that the paragraph characteristic information is a paragraph signature with a preset number of feature bits.
32. The document detection device according to claim 31, characterized in that the paragraph signature with the preset number of feature bits is obtained by a hash algorithm.
33. The document detection device according to claim 32, characterized in that the acquiring unit is used to:
perform word segmentation on each paragraph in the document to obtain a list of (word, word frequency) two-tuples for the paragraph;
compute an initial weight vector for the two-tuples in the list;
compute a hash of the two-tuples by the hash algorithm to obtain a hash string with the preset number of feature bits;
map the hash string onto the weight vector;
compute the value of each corresponding position of the weight vector to obtain the paragraph signature with the preset number of feature bits.
34. The document detection device according to claim 33, characterized in that the acquiring unit is used to: judge whether each bit of the hash string is 0 or 1; if it is 0, subtract the weight at the corresponding position when mapping onto the weight vector; if it is 1, add the weight at the corresponding position when mapping onto the weight vector.
35. The document detection device according to claim 33 or 34, characterized in that the acquiring unit is used to: judge whether the value at each corresponding position of the weight vector is greater than 0; if it is greater than 0, set the value at that position to 1; if it is less than or equal to 0, set the value at that position to 0.
36. The document detection device according to claim 32, characterized in that the judging unit is used to: judge that the document is similar to the existing document when the ratio of the number of paragraphs of the document that are similar to the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold; and judge that the document is dissimilar to the existing document when the ratio is less than the set second threshold.
37. The document detection device according to claim 32, characterized in that the judging unit is used to: judge that the document is similar to the existing document when the ratio of the number of paragraphs of the document that are similar to the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, and the result calculated from the total paragraph signature of the document and the total paragraph signature of the existing document is less than a first threshold; and judge that the document is dissimilar to the existing document when the ratio is less than the set second threshold.
38. The document detection device according to claim 36 or 37, characterized in that the comparing unit is used to perform a calculation, by an algorithm, on the paragraph signature with the preset number of feature bits obtained for the document and the paragraph signature with the preset number of feature bits of the existing document; if the calculation result is greater than a predetermined first threshold, the paragraphs are dissimilar; if the calculation result is less than or equal to the predetermined first threshold, the paragraphs are similar.
39. The document detection device according to claim 38, characterized in that the distance between the paragraph signature of the document and the paragraph signature of the existing document is calculated as a Hamming distance.
40. The document detection device according to claim 36 or 37, characterized in that the judging unit is used to define the copyright property of the document that is similar to an existing document as pirated document.
41. The document detection device according to claim 36 or 37, characterized in that the judging unit is used to define the copyright property of the document that is similar to an existing document as suspected pirated document.
42. The document detection device according to claim 41, characterized in that the document detection device further comprises a unit for sending feedback information after review confirms that the document is a pirated document.
43. The document detection device according to claim 41, characterized in that the document detection device further comprises a unit for publishing the non-pirated document online after review confirms that the document is a non-pirated document.
44. The document detection device according to claim 36 or 37, characterized in that the judging unit is used to define the property of the one or more existing documents that are similar to the document as pirated document.
45. The document detection device according to claim 36 or 37, characterized in that the judging unit is used to define the property of the one or more existing documents that are similar to the document as suspected pirated document.
46. The document detection device according to claim 45, characterized in that the document detection device further comprises a processing unit for deleting the pirated document after review confirms that the document is a pirated document.
47. The document detection device according to claim 46, characterized in that the processing unit is also used, after review confirms that the document is a non-pirated document, to mark the copyright property of the non-pirated document as verified, and/or to copy or move the non-pirated document to a verified-copyright database.
48. The document detection device according to claim 30, characterized in that the judging unit is also used to obtain the copyright property of the document according to the judgment result.
49. The document detection device according to claim 48, characterized in that the document detection device further comprises:
a unit for storing digital documents that have been verified as legal; and
a unit for extracting the paragraph characteristic information of the digital documents and building an index.
50. The document detection device according to claim 48, characterized in that the document detection device further comprises:
a unit for receiving an uploaded digital document whose copyright property has not been verified.
51. The document detection device according to claim 30, characterized in that the judging unit is also used to obtain the copyright property of the existing document according to the judgment result.
52. The document detection device according to claim 51, characterized in that the document detection device further comprises:
a unit for storing existing digital documents whose copyright property has not been verified; and
a unit for extracting the paragraph characteristic information of the digital documents and building an index.
53. The document detection device according to claim 51, characterized in that the document detection device further comprises:
a unit for receiving a digital document that has been verified as legal.
54. The document detection device according to any one of claims 48 to 53, characterized in that the document detection device further comprises:
a unit for identifying whether the digital document is a document;
a unit for converting the digital document into a document by an algorithm.
55. The document detection device according to any one of claims 48 to 53, characterized in that the document detection device further comprises a unit for storing the document.
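The two-level comparison recited in claims 7, 9, and 10 (paragraph signatures compared by Hamming distance against a first threshold, then the share of similar paragraphs compared against a second threshold) can be sketched as follows. This is a minimal illustration under assumptions: signatures are modeled as Python integers, the default threshold values 3 and 0.5 are placeholders rather than values given in the patent, and the function names are hypothetical. The additional total-signature check of claim 8 is not shown.

```python
from typing import Sequence


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def similar_paragraph_ratio(doc_sigs: Sequence[int],
                            existing_sigs: Sequence[int],
                            first_threshold: int) -> float:
    """Share of the document's paragraphs that match some existing paragraph."""
    if not doc_sigs:
        return 0.0
    similar = sum(
        1 for s in doc_sigs
        # a paragraph is similar when the distance is <= the first threshold
        if any(hamming_distance(s, e) <= first_threshold for e in existing_sigs)
    )
    return similar / len(doc_sigs)


def is_similar(doc_sigs: Sequence[int],
               existing_sigs: Sequence[int],
               first_threshold: int = 3,      # assumed placeholder value
               second_threshold: float = 0.5  # assumed placeholder value
               ) -> bool:
    """Document-level decision: ratio >= second threshold means similar."""
    ratio = similar_paragraph_ratio(doc_sigs, existing_sigs, first_threshold)
    return ratio >= second_threshold
```

For example, with a first threshold of 1, the signatures `0b1111` and `0b1110` count as similar paragraphs (distance 1), while `0b0000` and `0b1110` do not (distance 3). Because the decision is made per paragraph, reordering or splitting a copied document into sections does not by itself hide the overlap.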
CN2011100808382A 2011-03-31 2011-03-31 Method and device for detecting document Active CN102156689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100808382A CN102156689B (en) 2011-03-31 2011-03-31 Method and device for detecting document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011100808382A CN102156689B (en) 2011-03-31 2011-03-31 Method and device for detecting document

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201210340026.1A Division CN102915295B (en) 2011-03-31 2011-03-31 Document detecting method and document detecting device

Publications (2)

Publication Number Publication Date
CN102156689A true CN102156689A (en) 2011-08-17
CN102156689B CN102156689B (en) 2012-11-28

Family

ID=44438191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100808382A Active CN102156689B (en) 2011-03-31 2011-03-31 Method and device for detecting document

Country Status (1)

Country Link
CN (1) CN102156689B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360372A (en) * 2011-10-09 2012-02-22 北京航空航天大学 Cross-language document similarity detection method
CN102955763A (en) * 2011-08-22 2013-03-06 联想(北京)有限公司 Display method and display device
CN102968610A (en) * 2011-08-31 2013-03-13 富士通株式会社 Method and device for processing receipt images
CN103095824A (en) * 2013-01-09 2013-05-08 广东一一五科技有限公司 File uploading control method and system
CN103179216A (en) * 2013-04-16 2013-06-26 上海同岩土木工程科技有限公司 File scanning and automatic unloading method based on Twain protocol
CN103970722A (en) * 2014-05-07 2014-08-06 江苏金智教育信息技术有限公司 Text content duplicate removal method
CN104270474A (en) * 2014-11-02 2015-01-07 佛山美阳瓴电子科技有限公司 Device and method used for sharing information in network
CN105183835A (en) * 2015-08-31 2015-12-23 小米科技有限责任公司 Method and apparatus for information marking in social software
CN105183809A (en) * 2015-08-26 2015-12-23 成都布林特信息技术有限公司 Cloud platform data query method
CN105205104A (en) * 2015-08-26 2015-12-30 成都布林特信息技术有限公司 Cloud platform data acquisition method
CN106649257A (en) * 2016-09-21 2017-05-10 联动优势科技有限公司 Semantic section conversion method and device
CN106844314A (en) * 2017-02-21 2017-06-13 北京焦点新干线信息技术有限公司 A kind of duplicate checking method and device of article
CN107368472A (en) * 2017-07-26 2017-11-21 成都科来软件有限公司 Storage method of document analysis result capable of being iteratively optimized
CN107798637A (en) * 2016-08-30 2018-03-13 北京国双科技有限公司 The different acquisition methods and device for sentencing document of accomplice
CN108491458A (en) * 2018-03-02 2018-09-04 深圳市联软科技股份有限公司 A kind of sensitive document detection method, medium and equipment
CN108614827A (en) * 2016-12-12 2018-10-02 阿里巴巴集团控股有限公司 Data segmentation method, judging method and electronic equipment
CN112183052A (en) * 2020-09-29 2021-01-05 百度(中国)有限公司 Document repetition degree detection method, device, equipment and medium
TWI726356B (en) * 2019-07-16 2021-05-01 宏碁股份有限公司 Electronic device and file content management method
CN113138964A (en) * 2021-05-20 2021-07-20 掌阅科技股份有限公司 Electronic book information display method, user terminal and computer storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915295B (en) * 2011-03-31 2015-03-25 百度在线网络技术(北京)有限公司 Document detecting method and document detecting device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833579A (en) * 2010-05-11 2010-09-15 同方知网(北京)技术有限公司 Method and system for automatically detecting academic misconduct literature

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
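Journal of Dalian University of Technology 2005-03-31 Jin Bo et al. "Text similarity algorithm based on semantic understanding" pp. 291-297 vol. 45, no. 2 *
Computer Technology and Development 2009-04-30 Zhao Junjie et al. "A paper plagiarism determination algorithm based on paragraph word-frequency statistics" pp. 231-233, 238 (relevant to claims 1-3, 7, 19, 25, 30-32, 36, 48, 51) vol. 19, no. 4 *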

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955763A (en) * 2011-08-22 2013-03-06 联想(北京)有限公司 Display method and display device
CN102955763B (en) * 2011-08-22 2016-07-06 联想(北京)有限公司 Display method and display device
CN102968610A (en) * 2011-08-31 2013-03-13 富士通株式会社 Method and device for processing receipt images
CN102360372A (en) * 2011-10-09 2012-02-22 北京航空航天大学 Cross-language document similarity detection method
CN103095824B (en) * 2013-01-09 2016-01-20 广东一一五科技有限公司 File uploading control method and system
CN103095824A (en) * 2013-01-09 2013-05-08 广东一一五科技有限公司 File uploading control method and system
CN103179216A (en) * 2013-04-16 2013-06-26 上海同岩土木工程科技有限公司 File scanning and automatic unloading method based on Twain protocol
CN103970722B (en) * 2014-05-07 2017-04-05 江苏金智教育信息技术有限公司 Text content duplicate removal method
CN103970722A (en) * 2014-05-07 2014-08-06 江苏金智教育信息技术有限公司 Text content duplicate removal method
CN104270474A (en) * 2014-11-02 2015-01-07 佛山美阳瓴电子科技有限公司 Device and method used for sharing information in network
CN105183809A (en) * 2015-08-26 2015-12-23 成都布林特信息技术有限公司 Cloud platform data query method
CN105205104A (en) * 2015-08-26 2015-12-30 成都布林特信息技术有限公司 Cloud platform data acquisition method
CN105183835A (en) * 2015-08-31 2015-12-23 小米科技有限责任公司 Method and apparatus for information marking in social software
CN105183835B (en) * 2015-08-31 2018-09-04 小米科技有限责任公司 Method and apparatus for information marking in social software
CN107798637A (en) * 2016-08-30 2018-03-13 北京国双科技有限公司 The different acquisition methods and device for sentencing document of accomplice
CN106649257A (en) * 2016-09-21 2017-05-10 联动优势科技有限公司 Semantic section conversion method and device
CN106649257B (en) * 2016-09-21 2019-06-18 联动优势科技有限公司 Semantic section conversion method and device
CN108614827A (en) * 2016-12-12 2018-10-02 阿里巴巴集团控股有限公司 Data segmentation method, judging method and electronic equipment
CN106844314A (en) * 2017-02-21 2017-06-13 北京焦点新干线信息技术有限公司 Article duplicate checking method and device
CN106844314B (en) * 2017-02-21 2019-10-18 北京焦点新干线信息技术有限公司 Article duplicate checking method and device
CN107368472A (en) * 2017-07-26 2017-11-21 成都科来软件有限公司 Storage method of document analysis result capable of being iteratively optimized
CN107368472B (en) * 2017-07-26 2021-01-05 成都科来软件有限公司 Storage method of document analysis result capable of being iteratively optimized
CN108491458A (en) * 2018-03-02 2018-09-04 深圳市联软科技股份有限公司 Sensitive document detection method, medium and equipment
TWI726356B (en) * 2019-07-16 2021-05-01 宏碁股份有限公司 Electronic device and file content management method
CN112183052A (en) * 2020-09-29 2021-01-05 百度(中国)有限公司 Document repetition degree detection method, device, equipment and medium
CN112183052B (en) * 2020-09-29 2024-03-05 百度(中国)有限公司 Document repetition degree detection method, device, equipment and medium
CN113138964A (en) * 2021-05-20 2021-07-20 掌阅科技股份有限公司 Electronic book information display method, user terminal and computer storage medium
CN113138964B (en) * 2021-05-20 2021-11-19 掌阅科技股份有限公司 Electronic book information display method, user terminal and computer storage medium

Also Published As

Publication number Publication date
CN102156689B (en) 2012-11-28

Similar Documents

Publication Publication Date Title
CN102156689B (en) Method and device for detecting document
CN102915295B (en) Document detecting method and document detecting device
US20180075138A1 (en) Electronic document management using classification taxonomy
US10216848B2 (en) Method and system for recommending cloud websites based on terminal access statistics
JP5542859B2 (en) Log management apparatus, log storage method, log search method, and program
KR101407060B1 (en) Method for analysis and validation of online data for digital forensics and system using the same
US20110208715A1 (en) Automatically mining intents of a group of queries
US20090204617A1 (en) Content acquisition system and method of implementation
US9069771B2 (en) Music recognition method and system based on socialized music server
CN110705235B (en) Information input method and device for business handling, storage medium and electronic equipment
US10540325B2 (en) Method and device for identifying junk picture files
CN104866985A (en) Express bill number identification method, device and system
US9218419B2 (en) Snapshot generation for search results page preview
US20180330206A1 (en) Machine-based learning systems, methods, and apparatus for interactively mapping raw data objects to recognized data objects
US10691877B1 (en) Homogenous insertion of interactions into documents
CN115935042B (en) Mortgage asset intelligent duplicate checking method and system based on fusion model
US20090182759A1 (en) Extracting entities from a web page
KR102532216B1 (en) Method for establishing ESG database with structured ESG data using ESG auxiliary tool and ESG service providing system performing the same
WO2019028249A1 (en) Automated reporting system
CN111459936B (en) Data management method, data management device and server
US20130194636A1 (en) Document certificates
US11113520B2 (en) Information processing apparatus and non-transitory computer readable medium
CN108959646B (en) Method, system, device and storage medium for automatically verifying communication number
JP5718630B2 (en) Information processing apparatus, information asset management system, information asset management method, and program
US9251253B2 (en) Expeditious citation indexing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant