CN102156689B - Method and device for detecting document - Google Patents

Method and device for detecting document

Info

Publication number
CN102156689B
Authority
CN
China
Prior art keywords
document
paragraph
existing
pirate
characteristic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2011100808382A
Other languages
Chinese (zh)
Other versions
CN102156689A (en
Inventor
周纾
李彦宏
徐兴军
张雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2011100808382A priority Critical patent/CN102156689B/en
Publication of CN102156689A publication Critical patent/CN102156689A/en
Application granted granted Critical
Publication of CN102156689B publication Critical patent/CN102156689B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

The invention provides a document detection method comprising: acquiring paragraph characteristic information corresponding to a document; comparing the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document; and judging, according to the comparison result, whether an existing document similar to the document exists. Because detection is based on paragraph characteristic information, document similarity can be compared fairly accurately, and cheating by splitting a document into sections is prevented; in addition, query efficiency is higher and the processing load on the server is lower. The method can also be used to improve copyright property detection for online documents: a document can be checked when it is uploaded, which avoids the unnecessary server load of checking its copyright property later, and the copyright property of existing documents can be checked in batches, which is more efficient.

Description

Document detection method and device
Technical field
The present invention relates to a document detection method and device, and in particular to a document detection method and device for comparing the similarity of long documents.
Background art
Conventionally, document detection methods for assessing document similarity rely on the title, author, and text information of a document. Such approaches have the following drawbacks. First, queries based on title, author, and text information easily miss matches; for example, if the title or author information of the document has been modified or deleted, or the document has been cut into several parts, other documents cannot be queried or compared accurately through text information. Second, if the document to be checked is long, such as a full-length novel, querying through text information is inefficient and places a heavy processing load on the server or computer, which affects its normal operation.
Summary of the invention
An object of the present invention is to provide an improved document detection method.
Another object of the present invention is to provide an improved document detection device that uses the improved document detection method.
Correspondingly, the document detection method of one embodiment of the present invention comprises:
S1, obtaining paragraph characteristic information corresponding to a document;
S2, comparing the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document;
S3, judging, according to the comparison result, whether an existing document similar to the document exists: when the ratio of the number of similar paragraphs between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, and the result calculated from the overall paragraph characteristic information of the document and the overall paragraph characteristic information of the existing document is less than a first threshold, the document is judged to be similar to the existing document; when the ratio is less than the set second threshold, the document is judged to be dissimilar to the existing document.
As a further improvement of the present invention, the paragraph characteristic information is a paragraph signature with a preset number of feature bits.
As a further improvement of the present invention, the paragraph signature with the preset number of feature bits is obtained through a hash algorithm.
As a further improvement of the present invention, obtaining the paragraph signature with the preset number of feature bits through a hash algorithm specifically comprises the following steps:
S100, performing word segmentation on each paragraph in the document to obtain a list of (word, word frequency) pairs for the paragraph;
S101, computing an initial weight vector for the pairs in the list;
S102, applying a hash algorithm to the pairs to obtain a hash string with the preset number of feature bits;
S103, mapping the hash string onto the weight vector;
S104, computing the value at each position of the weight vector to obtain the paragraph signature with the preset number of feature bits.
As a further improvement of the present invention, step S103 specifically comprises:
determining whether each bit of the hash string is 0 or 1; if it is 0, the weight at the corresponding position of the weight vector is reduced when mapping; if it is 1, the weight at the corresponding position of the weight vector is increased when mapping.
As a further improvement of the present invention, step S104 specifically comprises:
determining whether the value at each position of the weight vector is greater than 0; if it is greater than 0, the value at that position is set to 1; if it is less than or equal to 0, the value at that position is set to 0.
As a further improvement of the present invention, when the ratio of the number of similar paragraphs between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, the document is judged to be similar to the existing document; when the ratio is less than the set second threshold, the document is judged to be dissimilar to the existing document.
As a further improvement of the present invention, the number of similar paragraphs is obtained through the following steps:
The paragraph signature with the preset number of feature bits obtained for the document and the paragraph signature with the preset number of feature bits of the existing document are processed by an algorithm; if the result is greater than a predetermined first threshold, the paragraphs are dissimilar; if the result is less than or equal to the predetermined first threshold, the paragraphs are similar.
As a further improvement of the present invention, this calculation consists of computing the distance between the paragraph signature of the document and the paragraph signature of the existing document as a Hamming distance.
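The paragraph-level comparison can be illustrated with a minimal Python sketch. It assumes signatures are stored as Python integers; the function names are hypothetical, and the default threshold of 6 is the value the preferred embodiment later gives for the first threshold.

```python
FIRST_THRESHOLD = 6  # value quoted for the preferred embodiment; adjustable


def hamming_distance(sig_a: int, sig_b: int) -> int:
    """Count the bit positions at which two integer-encoded signatures differ."""
    return bin(sig_a ^ sig_b).count("1")


def paragraphs_similar(sig_a: int, sig_b: int,
                       first_threshold: int = FIRST_THRESHOLD) -> bool:
    """Paragraphs are similar when the distance does not exceed the threshold."""
    return hamming_distance(sig_a, sig_b) <= first_threshold
```

Because the comparison is a single XOR followed by a bit count, its cost does not depend on the length of the paragraphs being compared.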
As a further improvement of the present invention, the copyright property of a document that is similar to an existing document is defined as pirated document.
As a further improvement of the present invention, the copyright property of a document that is similar to an existing document is defined as suspected pirated document.
As a further improvement of the present invention, the suspected pirated document is audited; if the audit confirms that it is a pirated document, feedback information is sent; if the audit confirms that it is a non-pirated document, the non-pirated document is published online.
As a further improvement of the present invention, the copyright property of one or more existing documents that are similar to the document is defined as pirated document.
As a further improvement of the present invention, the attribute of one or more existing documents that are similar to the document is defined as suspected pirated document.
As a further improvement of the present invention, the suspected pirated document is audited; if the audit confirms that it is a pirated document, the pirated document is deleted; if the audit confirms that it is a non-pirated document, the non-pirated document is retained.
As a further improvement of the present invention, the copyright property of the non-pirated document is marked as verified, and/or the non-pirated document is copied or moved to a verified copyright database.
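The handling of an audited existing document can be sketched as follows. This is only an illustration under assumptions: the in-memory dictionaries standing in for the document store and the verified copyright database, and the names `CopyrightStatus` and `resolve_audit`, are hypothetical and do not come from the patent.

```python
from enum import Enum


class CopyrightStatus(Enum):
    SUSPECTED_PIRATED = "suspected pirated"
    PIRATED = "pirated"
    VERIFIED = "verified"


def resolve_audit(library: dict, verified_db: dict, doc_id: str,
                  confirmed_pirated: bool) -> CopyrightStatus:
    """Apply the audit outcome for one suspected existing document."""
    if confirmed_pirated:
        # Audit confirmed piracy: delete the document from the store.
        del library[doc_id]
        return CopyrightStatus.PIRATED
    # Audit cleared the document: retain it and copy it to the verified database.
    verified_db[doc_id] = library[doc_id]
    return CopyrightStatus.VERIFIED
```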
As a further improvement of the present invention, the steps of claim 17 are repeated until all existing documents have been screened.
As a further improvement of the present invention, the copyright property of the document is obtained according to the judgement result.
As a further improvement of the present invention, before step S1 the method further comprises a step of building the paragraph characteristic information of existing documents:
obtaining digital files that have been verified as legal;
extracting the paragraph characteristic information of the digital files and building an index.
As a further improvement of the present invention, the step of building the paragraph characteristic information of existing documents further comprises:
identifying whether the digital file is a document;
if so, extracting the paragraph characteristic information of the document and building an index; if not, converting the digital file into a document through an algorithm, then extracting the paragraph characteristic information of the document and building an index.
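A minimal Python sketch of the index-building step is given below. It assumes paragraphs are separated by line breaks, leaves out the conversion of non-document files, and represents the index as a plain dictionary from a document identifier to its list of paragraph signatures. The names `split_paragraphs` and `build_index`, and the `sign` callable that turns a paragraph into a signature, are hypothetical.

```python
from typing import Callable, Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split a text document into non-empty paragraphs at line breaks."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_index(documents: Dict[str, str],
                sign: Callable[[str], int]) -> Dict[str, List[int]]:
    """Map each document id to the signatures of its paragraphs."""
    return {
        doc_id: [sign(paragraph) for paragraph in split_paragraphs(text)]
        for doc_id, text in documents.items()
    }
```

A production index would more likely be keyed on the signatures themselves so that near matches can be looked up without scanning every document; the dictionary here only shows which data is stored.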
As a further improvement of the present invention, after the step of building the paragraph characteristic information of existing documents, the method further comprises:
receiving an uploaded digital file whose copyright property has not been verified.
As a further improvement of the present invention, after the step of receiving the uploaded digital file whose copyright property has not been verified, the method further comprises:
judging whether the digital file is a document;
if so, performing step S1; if not, converting the digital file into a document through an algorithm and then performing step S1.
As a further improvement of the present invention, before step S1 the method further comprises storing the document.
As a further improvement of the present invention, the copyright property of the existing document is obtained according to the judgement result.
As a further improvement of the present invention, before step S1 the method further comprises a step of building the paragraph characteristic information of existing documents:
obtaining existing digital files whose copyright property has not been verified;
extracting the paragraph characteristic information of the digital files and building an index.
As a further improvement of the present invention, the step of building the paragraph characteristic information of existing documents further comprises:
identifying whether the digital file is a document;
if so, extracting the paragraph characteristic information of the document and building an index; if not, converting the digital file into a document through an algorithm, then extracting the paragraph characteristic information of the document and building an index.
As a further improvement of the present invention, after the step of building the paragraph characteristic information of existing documents, the method further comprises:
receiving a digital file that has been verified as legal.
As a further improvement of the present invention, after the step of receiving the digital file that has been verified as legal, the method further comprises:
judging whether the digital file is a document;
if so, performing step S1; if not, converting the digital file into a document through an algorithm and then performing step S1.
Correspondingly, the document detection device of one embodiment of the present invention comprises:
an acquiring unit, configured to obtain paragraph characteristic information corresponding to a document;
a comparing unit, configured to compare the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document;
a judging unit, configured to judge, according to the comparison result, whether an existing document similar to the document exists: when the ratio of the number of similar paragraphs between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, and the result calculated from the overall paragraph characteristic information of the document and the overall paragraph characteristic information of the existing document is less than a first threshold, the document is judged to be similar to the existing document; when the ratio is less than the set second threshold, the document is judged to be dissimilar to the existing document.
As a further improvement of the present invention, the paragraph characteristic information is a paragraph signature with a preset number of feature bits.
As a further improvement of the present invention, the paragraph signature with the preset number of feature bits is obtained through a hash algorithm.
As a further improvement of the present invention, the acquiring unit is configured to:
perform word segmentation on each paragraph in the document to obtain a list of (word, word frequency) pairs for the paragraph;
compute an initial weight vector for the pairs in the list;
apply a hash algorithm to the pairs to obtain a hash string with the preset number of feature bits;
map the hash string onto the weight vector;
compute the value at each position of the weight vector to obtain the paragraph signature with the preset number of feature bits.
As a further improvement of the present invention, the acquiring unit is configured to: determine whether each bit of the hash string is 0 or 1; if it is 0, reduce the weight at the corresponding position of the weight vector when mapping; if it is 1, increase the weight at the corresponding position of the weight vector when mapping.
As a further improvement of the present invention, the acquiring unit is configured to: determine whether the value at each position of the weight vector is greater than 0; if it is greater than 0, set the value at that position to 1; if it is less than or equal to 0, set the value at that position to 0.
As a further improvement of the present invention, the judging unit is configured to: judge that the document is similar to the existing document when the ratio of the number of similar paragraphs between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold; and judge that the document is dissimilar to the existing document when the ratio is less than the set second threshold.
As a further improvement of the present invention, the comparing unit is configured to process, through an algorithm, the paragraph signature with the preset number of feature bits obtained for the document and the paragraph signature with the preset number of feature bits of the existing document; if the result is greater than a predetermined first threshold, the paragraphs are dissimilar; if the result is less than or equal to the predetermined first threshold, the paragraphs are similar.
As a further improvement of the present invention, the distance between the paragraph signature of the document and the paragraph signature of the existing document is computed as a Hamming distance.
As a further improvement of the present invention, the judging unit is configured to define the copyright property of a document that is similar to an existing document as pirated document.
As a further improvement of the present invention, the judging unit is configured to define the copyright property of a document that is similar to an existing document as suspected pirated document.
As a further improvement of the present invention, the document detection device further comprises a unit for sending feedback information after an audit confirms that the document is a pirated document.
As a further improvement of the present invention, the document detection device further comprises a unit for publishing the non-pirated document online after an audit confirms that the document is a non-pirated document.
As a further improvement of the present invention, the judging unit is configured to define the attribute of one or more existing documents that are similar to the document as pirated document.
As a further improvement of the present invention, the judging unit is configured to define the attribute of one or more existing documents that are similar to the document as suspected pirated document.
As a further improvement of the present invention, the document detection device further comprises a processing unit for deleting the pirated document after an audit confirms that the document is a pirated document.
As a further improvement of the present invention, the processing unit is further configured to, after an audit confirms that the document is a non-pirated document, mark the copyright property of the non-pirated document as verified, and/or copy or move the non-pirated document to a verified copyright database.
As a further improvement of the present invention, the judging unit is further configured to obtain the copyright property of the document according to the judgement result.
As a further improvement of the present invention, the document detection device further comprises:
a unit for storing digital files that have been verified as legal; and
a unit for extracting the paragraph characteristic information of the digital files and building an index.
As a further improvement of the present invention, the document detection device further comprises:
a unit for receiving uploaded digital files whose copyright property has not been verified.
As a further improvement of the present invention, the judging unit is further configured to obtain the copyright property of the existing document according to the judgement result.
As a further improvement of the present invention, the document detection device further comprises:
a unit for storing existing digital files whose copyright property has not been verified; and
a unit for extracting the paragraph characteristic information of the digital files and building an index.
As a further improvement of the present invention, the document detection device further comprises:
a unit for receiving digital files that have been verified as legal.
As a further improvement of the present invention, the document detection device further comprises:
a unit for identifying whether the digital file is a document;
a unit for converting the digital file into a document through an algorithm.
As a further improvement of the present invention, the document detection device further comprises a unit for storing the document.
The beneficial effects of the present invention are as follows. The invention detects documents through paragraph characteristic information, so similarity between documents can be compared fairly accurately while cheating is prevented; this detection approach also makes queries more efficient and reduces the processing load on the server or computer. In addition, the invention applies the document detection method to improve copyright property detection for online documents: a document can be checked as soon as it is uploaded, which avoids the unnecessary load placed on the server by checking its copyright property later. The invention can also check the copyright property of the existing documents stored on the server in batches, which is more efficient.
Description of drawings
Fig. 1 is a flowchart of the document detection method in an embodiment of the present invention;
Fig. 2 is a flowchart of the method for obtaining a paragraph signature in an embodiment of the present invention;
Fig. 3 is a flowchart of using the document detection method to screen the copyright property of a document when it is uploaded, in an embodiment of the present invention;
Fig. 4 is a flowchart of using the document detection method to screen the copyright property of existing documents, in an embodiment of the present invention;
Fig. 5 is a flowchart of building the legal database in an embodiment of the present invention;
Fig. 6 is a flowchart of building the unverified copyright database in an embodiment of the present invention;
Fig. 7 is a module diagram of the document detection device in an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described below with reference to the embodiments shown in the drawings. These embodiments do not limit the present invention; structural, methodological, or functional changes made by those of ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of the present invention.
As shown in Fig. 1, in an embodiment of the present invention, the document detection method comprises:
S1, obtaining paragraph characteristic information corresponding to a document. Here, the document refers to an electronic file whose main body is text; preferably, in this embodiment, the document is an electronic file whose text can be edited, such as a txt file or a doc file. By identifying the line breaks in the electronic file, the paragraph information of the document can be obtained and the document divided into one or more paragraphs. In the preferred embodiment of the invention, after the one or more paragraphs are obtained, their paragraph characteristic information can be calculated through a hash algorithm. Preferably, the paragraph characteristic information is a paragraph signature with a preset number of feature bits. To improve the efficiency of the comparison with the paragraph characteristic information of existing documents in step S2 while preserving the accuracy of the comparison, the preset number of feature bits in the preferred embodiment is 64, for example: 110101000100 ... 011 (64 bits in total, each bit being either 0 or 1, with no other values). Of course, in other embodiments of the invention the preset number of feature bits may also be 128, 256, and so on.
As shown in Fig. 2, in an embodiment of the present invention, the method for obtaining the paragraph signature with the preset number of feature bits comprises the following steps:
S100, performing word segmentation on each paragraph in the document to obtain a list of (word, word frequency) pairs for the paragraph. Word segmentation methods are well known to those of ordinary skill in the art from the prior art and are not repeated here.
S101, computing an initial weight vector for the pairs in the list. Each word and word frequency has a weight vector; in the preferred embodiment, if the preset number of feature bits is 64, the weight vector has 64 dimensions, each dimension representing one of the 64 feature bits.
S102, applying a hash algorithm to the pairs to obtain a hash string with the preset number of feature bits. In the preferred embodiment, the words and word frequencies of the paragraph are processed by the hash algorithm to obtain a 64-bit hash string.
S103, mapping the hash string onto the weight vector. Specifically: first determine whether each bit of the hash string is 0 or 1. If it is 0, the weight at the corresponding position of the weight vector is reduced when mapping, that is, (-log(f+0.1)) is applied; if it is 1, the weight at the corresponding position is increased when mapping, that is, (+log(f+0.1)) is applied. Note that f is the frequency of occurrence of the word: if the unit of calculation is a paragraph, f is the total number of times the word occurs in the paragraph; if the unit is the full text, f is the total number of times the word occurs in the full text.
S104, computing the value at each position of the weight vector to obtain the paragraph signature with the preset number of feature bits. Specifically: determine whether the value at each position of the weight vector is greater than 0; if it is greater than 0, the value at that position is set to 1; if it is less than or equal to 0, the value at that position is set to 0. This yields the paragraph signature with the preset number of feature bits.
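Steps S100 to S104 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the patented implementation: whitespace splitting stands in for a real word segmenter, the hash is taken from SHA-256 of the word alone (the text says the word and word frequency pair is hashed but does not name a hash function), and the per-word contributions are summed into a single weight vector. The names `word_hash` and `paragraph_signature` are hypothetical.

```python
import hashlib
import math
from collections import Counter

BITS = 64  # preset number of feature bits in the preferred embodiment


def word_hash(word: str, bits: int = BITS) -> int:
    """Hypothetical helper: take the leading bits of SHA-256 as the hash value."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def paragraph_signature(paragraph: str, bits: int = BITS) -> int:
    # S100: segment the paragraph and count word frequencies
    # (whitespace splitting is an assumption standing in for a segmenter).
    pairs = Counter(paragraph.split())
    # S101: initial weight vector with one dimension per feature bit.
    weights = [0.0] * bits
    for word, freq in pairs.items():
        hashed = word_hash(word, bits)   # S102: hash with the preset bit width
        weight = math.log(freq + 0.1)    # weight term log(f + 0.1) from S103
        for i in range(bits):            # S103: add for a 1 bit, subtract for a 0 bit
            if (hashed >> i) & 1:
                weights[i] += weight
            else:
                weights[i] -= weight
    # S104: a position becomes 1 if its accumulated weight is positive, else 0.
    signature = 0
    for i, value in enumerate(weights):
        if value > 0:
            signature |= 1 << i
    return signature
```

Storing the signature as an integer keeps the later comparison cheap: two signatures can be compared with a single XOR and a bit count.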
S2, comparing the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document. In the preferred embodiment of the invention, the paragraph signature with the preset number of feature bits obtained for the document and the paragraph signature with the preset number of feature bits of the existing document are processed by a certain algorithm; preferably, the distance between the paragraph signature of the document and the existing paragraph signature is computed as a Hamming distance. If this distance is greater than a predetermined first threshold, the two paragraphs are considered dissimilar; if this distance is less than or equal to the predetermined first threshold, the two paragraphs are considered similar. In the preferred embodiment, the first threshold is 6. Of course, the comparison may also include querying an index library of existing documents, that is, comparing the paragraph characteristic information of the document against the paragraph characteristic information of multiple existing documents; the way this index library is built is described below with reference to Fig. 5 and Fig. 6.
S3, judging, according to the comparison result, whether an existing document similar to the document exists. From the comparison result, when the ratio of the number of similar paragraphs between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, the document is judged to be similar to the existing document; when the ratio is less than the set second threshold, the document is judged to be dissimilar to the existing document. The second threshold is a value of the ratio (number of similar paragraphs / total number of paragraphs of the document) and can be set as circumstances require: for example, if a more precise comparison is needed, the second threshold can be set larger; if there is concern that the comparison may miss matches, the second threshold can be set smaller. In the preferred embodiment, the second threshold is set within the interval 0.5 ~ 1. Preferably, in the preferred embodiment of the invention, it is not only required that the ratio of similar paragraphs between the document and the existing document be greater than or equal to the set second threshold, but also that the Hamming distance between the whole-document paragraph signature of the document and the whole-document paragraph signature of the existing document be less than or equal to the first threshold. The whole-document paragraph signature can be obtained as disclosed in step S1, that is, by treating the entire text as one paragraph and computing the paragraph signature of that paragraph through the hash algorithm.
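The document-level judgement of step S3 can be sketched in Python as follows. The sketch takes precomputed integer signatures as input and makes one assumption the text does not spell out: a paragraph of the document counts as similar when at least one paragraph of the existing document lies within the first threshold. The function names are hypothetical; the defaults 6 and 0.5 are the first threshold and the lower end of the second-threshold interval quoted for the preferred embodiment.

```python
from typing import Sequence


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded signatures."""
    return bin(a ^ b).count("1")


def similar_paragraph_count(doc_sigs: Sequence[int], existing_sigs: Sequence[int],
                            first_threshold: int = 6) -> int:
    # Assumption: one close match in the existing document is enough
    # for a paragraph of the document to count as similar.
    return sum(
        1 for sig in doc_sigs
        if any(hamming_distance(sig, other) <= first_threshold
               for other in existing_sigs)
    )


def documents_similar(doc_sigs: Sequence[int], existing_sigs: Sequence[int],
                      doc_whole_sig: int, existing_whole_sig: int,
                      first_threshold: int = 6,
                      second_threshold: float = 0.5) -> bool:
    """Apply both conditions: paragraph ratio and whole-document distance."""
    if not doc_sigs:
        return False
    ratio = similar_paragraph_count(doc_sigs, existing_sigs,
                                    first_threshold) / len(doc_sigs)
    whole_close = hamming_distance(doc_whole_sig,
                                   existing_whole_sig) <= first_threshold
    return ratio >= second_threshold and whole_close
```

Raising `second_threshold` toward 1 makes the judgement stricter, while lowering it toward 0.5 reduces the chance of missing a match, in line with the trade-off described for step S3.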
Through the above flow, similarity between documents can be compared fairly accurately, cheating is prevented, query efficiency is higher, and the processing load on the server or computer is lower.
With the spread and development of the Internet, online reading has become a major way of reading, and the sharing and promotion of online documents has become an important means of publishing information. For example, a well-known online reading website abroad is Google books (http://books.google.com/bkshp tab=yp), and a well-known domestic online reading website is Baidu library (http://wenku.baidu.com/).
This open way of sharing, promoting, and reading online documents makes it much easier for users to search for, read, and share documents, and to obtain information and knowledge conveniently and free of charge. However, it also has a major drawback: sharing and promotion by users or service providers infringes the copyright of the original authors, distributors, and publishers, causing them enormous economic losses.
To solve this problem, the documents to be shared and promoted need to be identified so as to screen their copyright property. In one embodiment of the invention, the document detection method above can be used to screen the copyright property of documents in two main ways: the first uses the judgement result of the above steps to obtain the copyright property of a document being uploaded to the server; the second uses the judgement result of the above steps to obtain the copyright property of existing documents already stored on the server. The specific flows of these two ways are described below with reference to the drawings.
As shown in Figure 3, in an embodiment of the present invention, the copyright property of the document that can uploaded through the judged result of above-mentioned document detection method, and can make corresponding operating to said document according to said copyright property, its step comprises:
S10: Receive an uploaded digital file whose copyright property has not been verified. In this step, a user typically logs in to the online document website through a browser or client software and uploads a local digital file to the server of the online document website; that is, the server of the online document website receives the uploaded digital file. The copyright property of this digital file has usually not been verified, and the file may have been obtained through various channels, for example by downloading or scanning. The digital file may take various forms, such as text, e-book, picture, or PDF. In the application scenario of this embodiment, the unverified digital file is generally uploaded from a client by a user or a service provider.
S11: Identify whether the digital file is a document. If it is, proceed to step S13; if not, first perform step S12 and then proceed to step S13. In this step, the file name suffix (extension) of the digital file can be examined to judge whether the file is a document. For example, if the digital file is a text file whose suffix is a document suffix such as txt or doc, it is judged to be a document; if the digital file is a picture, a PDF, or the like, whose suffix is a non-document suffix such as jpg, bmp, or pdf, it is judged to be a non-document. Of course, the server of the online document website may instead stipulate that only document files can be uploaded, which excludes non-document files; in that case this step is not required.
S12: Convert the digital file into a document by means of an algorithm. Preferably, in this embodiment, the algorithm may be an OCR recognition algorithm of the kind commonly used in the industry. Those of ordinary skill in the art can learn it from the prior art, so it is not described further here. Of course, the server of the online document website may instead stipulate that only document files can be uploaded, which excludes non-document files; in that case this step is not required.
S13: Store the document. The stored documents include both digital files identified as documents and documents obtained by conversion. Of course, this step is not required: the document may instead be held in memory (RAM) and deleted from memory once its copyright property has been determined.
S14: Obtain the paragraph characteristic information corresponding to the document. This step uses the specific steps described with reference to Fig. 1 and Fig. 2, which are not repeated here.
S15: Compare the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document that has been verified as legitimate. The comparison is carried out as in step S2 of Fig. 1 and is not repeated here. In this embodiment, the existing documents are those stored in a legal database set up in advance, and their paragraph characteristic information is index information built from the legal database. The copyright property of the document is determined by comparing its paragraph characteristic information with the paragraph characteristic information of the existing documents in this index. The steps for building the index information are described in detail below with reference to Fig. 5.
S16: Judge whether the document is similar to one or more existing documents. In this embodiment, this step can be carried out as in step S3 of Fig. 1 and is not repeated here. If the document is judged to be similar to an existing document, proceed to step S17; if it is judged not to be similar to any existing document, proceed directly to step S19b.
S17: Define the copyright property of the document that is similar to an existing document as a suspected pirated document. Preferably, one or more suspected pirated documents can be aggregated into a suspect list, and the suspect list and/or the suspected pirated documents can be stored at a specified path on the server; a reviewer can access the suspect list and/or the suspected pirated documents through this path in order to perform step S18 below. Alternatively, the suspect list and/or the suspected pirated documents can be actively pushed to a designated reviewer, for example by e-mail, so that the reviewer can handle them promptly. In another embodiment of the invention, the copyright property of the document that is similar to an existing document can be defined directly as a pirated document, in which case the flow proceeds to step S19a and feedback information is sent to the user.
S18: Review the suspected pirated document to confirm whether it is a pirated document. If it is confirmed to be a pirated document, proceed to step S19a; if it is confirmed to be a non-pirated document, proceed to step S19b.
S19a: Send feedback information. Preferably, in this embodiment, the feedback information is sent to the party that uploaded the unverified digital file. In general, the feedback can take the form of a prompt box, for example a prompt box popped up in the browser or in the client. Alternatively, a new page can be returned to the browser to inform the uploader that the content of the digital file is pirated or has not passed copyright verification.
S19b: Publish the non-pirated document online. In one embodiment, the non-pirated document is added to the online document database. Preferably, in a specific embodiment of the invention, the online document database is the legal database itself: adding the non-pirated document to the legal database effectively expands the set of legitimate documents it holds, so that digital files uploaded in the future can be screened more effectively.
With the above procedure, a document can be checked at the moment it is uploaded, which avoids the unnecessary load that later copyright detection of the document would place on the server.
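The branching in steps S11 to S19b can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `handle_upload`, `Outcome`, and `DOCUMENT_SUFFIXES`, and the injected callables `read_text`, `ocr`, `find_similar`, and `review`, are hypothetical and do not come from the patent; they stand in for the storage, OCR, index lookup, and manual review described above.

```python
from enum import Enum
from pathlib import Path

# Document suffixes named as examples in step S11 (assumed, not exhaustive).
DOCUMENT_SUFFIXES = {".txt", ".doc"}


class Outcome(Enum):
    PUBLISHED = "published"  # step S19b: publish the non-pirated document
    REJECTED = "rejected"    # step S19a: send feedback to the uploader


def is_document(filename):
    """Step S11: judge by file name suffix whether the file is a document."""
    return Path(filename).suffix.lower() in DOCUMENT_SUFFIXES


def handle_upload(filename, read_text, ocr, find_similar, review=None):
    """Walk one uploaded file through steps S11 to S19b.

    read_text, ocr, find_similar and review are hypothetical callables
    supplied by the caller; review=None models the embodiment in which a
    similar document is defined directly as pirated (no manual review).
    """
    # S11/S12: documents are read directly, other files are converted first.
    text = read_text(filename) if is_document(filename) else ocr(filename)
    # S14 to S16: look for similar existing documents in the legal database.
    matches = find_similar(text)
    if not matches:
        return Outcome.PUBLISHED
    # S17/S18: suspected pirated document, optionally confirmed by a reviewer.
    pirated = True if review is None else review(text, matches)
    return Outcome.REJECTED if pirated else Outcome.PUBLISHED
```

Passing the lookup and review steps in as callables keeps the sketch independent of any particular storage or index implementation.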
As shown in Figure 4, in one embodiment of the invention, the copyright property of existing documents (stored on the server) whose copyright property has not been verified can be obtained from the judgment result of the above document detection method, and a corresponding operation can be performed on those existing documents according to that copyright property. The steps are as follows:
S20: Receive a digital file that has been verified as legitimate. In this step, the legitimate digital file can be obtained from an authorized third party, or by reading it from the legal database. The third party may be any legitimate source, such as a partner website, a writer, a writers' association, a university, or a publishing house, and the legitimate digital file can be uploaded to the server; that is, the server receives an uploaded digital file that has been verified as legitimate. The digital file may take various forms, such as text, e-book, picture, or PDF. Preferably, the legitimate digital file can be stored in the legal database described above.
S21: Identify whether the digital file is a document. If it is, proceed to step S23; if not, first perform step S22 and then proceed to step S23. In this step, the file name suffix (extension) of the digital file can be examined to judge whether the file is a document. For example, if the digital file is a text file whose suffix is a document suffix such as txt or doc, it is judged to be a document; if the digital file is a picture, a PDF, or the like, whose suffix is a non-document suffix such as jpg, bmp, or pdf, it is judged to be a non-document. Of course, the server of the online document website may instead stipulate that only document files can be uploaded, which excludes non-document files; in that case this step is not required.
S22: Convert the digital file into a document by means of an algorithm. Preferably, in this embodiment, the algorithm may be an OCR recognition algorithm of the kind commonly used in the industry. Those of ordinary skill in the art can learn it from the prior art, so it is not described further here. Of course, the server of the online document website may instead stipulate that only document files can be uploaded, which excludes non-document files; in that case this step is not required.
S23: Obtain the paragraph characteristic information corresponding to the document. This step uses the specific steps described with reference to Fig. 1 and Fig. 2, which are not repeated here.
S24: Compare the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document whose copyright property has not been verified. The comparison is carried out as in step S2 of Fig. 1 and is not repeated here. In this embodiment, the existing documents are those stored in an unverified-copyright database set up in advance, and their paragraph characteristic information is index information built from the unverified-copyright database. The copyright property of the existing documents is determined by comparing the paragraph characteristic information of the document with the paragraph characteristic information of the existing documents in this index. The steps for building this index information are described in detail below with reference to Fig. 6.
S25: Judge whether the document is similar to one or more existing documents. In this embodiment, this step can be carried out as in step S3 of Fig. 1 and is not repeated here. It is worth noting that in this step the relation between the document and the existing documents is generally one-to-many: a single legitimate document is likely to correspond to several similar pirated or suspected pirated documents. Comparing in this way therefore allows the pirated or suspected pirated documents corresponding to a legitimate document to be handled in batches. If the document is judged to be similar to an existing document, proceed to step S26; if it is judged not to be similar to any existing document, proceed directly to step S29.
S26: Define the copyright property of the one or more existing documents that are similar to the document as suspected pirated documents. Preferably, one or more suspected pirated documents can be aggregated into a suspect list, and the suspect list and/or the suspected pirated documents can be stored at a specified path on the server; a reviewer can access the suspect list and/or the suspected pirated documents through this path in order to perform step S27 below. Alternatively, the suspect list and/or the suspected pirated documents can be actively pushed to a designated reviewer, for example by e-mail, so that the reviewer can handle them promptly. In another embodiment of the invention, the copyright property of the one or more existing documents that are similar to the document can be defined directly as pirated documents, in which case the flow proceeds to step S28 and the pirated documents are deleted directly.
S27: Review the suspected pirated document to confirm whether it is a pirated document. If it is confirmed to be a pirated document, proceed to step S28; if it is confirmed to be a non-pirated document, proceed to step S29.
S28: Delete the pirated document. In this embodiment, this means deleting the pirated document from the unverified-copyright database.
S29: Retain the non-pirated document. Preferably, in this embodiment, the copyright property of the non-pirated document is also marked as verified, and/or the non-pirated document is copied or moved to a verified-copyright database. As a special case of this embodiment, the verified-copyright database is the legal database described above.
Preferably, in this embodiment, the above steps can be repeated until all existing documents stored in the unverified-copyright database have been screened, that is, until the pirated documents have been deleted.
With the above procedure, copyright detection of the existing documents stored on the server can be carried out in batches, which is more efficient.
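The one-to-many screening of Fig. 4 can be illustrated with a short Python sketch. It assumes that paragraph signatures are already available as integers and that similarity is judged, as in step S3, by the share of the legitimate document's paragraphs that have a similar paragraph in the existing document. The function name `screen_existing`, the dictionary layout, and the default thresholds (6 and 0.5, taken from the preferred values mentioned elsewhere in this description) are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two integer signatures."""
    return bin(a ^ b).count("1")


def screen_existing(legit_sigs, unverified, first_threshold=6, second_threshold=0.5):
    """Compare one legitimate document against many unverified documents.

    legit_sigs: paragraph signatures of the legitimate document.
    unverified: dict mapping a document id to its paragraph signatures
                (a hypothetical in-memory stand-in for the database).
    Returns (suspected, retained) lists of document ids, i.e. the inputs
    to steps S26 and S29 respectively.
    """
    suspected, retained = [], []
    for doc_id, sigs in unverified.items():
        # Count legitimate paragraphs that have a similar paragraph here.
        matched = sum(
            1 for l in legit_sigs
            if any(hamming(l, s) <= first_threshold for s in sigs)
        )
        ratio = matched / len(legit_sigs) if legit_sigs else 0.0
        if ratio >= second_threshold:
            suspected.append(doc_id)
        else:
            retained.append(doc_id)
    return suspected, retained
```

A single call partitions every unverified document against one legitimate document, which mirrors the batch handling described in step S25.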
As shown in Figure 5, in one embodiment of the invention, the legal database and the index information generated from it are used when the copyright property of the document is obtained from the judgment result of the document detection method. The method for building this index information comprises the following steps:
S30: Obtain a digital file that has been verified as legitimate. It can be obtained from an authorized third party, which may be any legitimate source such as a partner website, a writer, a writers' association, a university, or a publishing house.
S31: Identify whether the digital file is a document. If it is, proceed to step S33; if not, perform step S32 to convert the digital file and then proceed to step S33. In this step, the file name suffix (extension) of the digital file can be examined to judge whether the file is a document. For example, if the digital file is a text file whose suffix is a document suffix such as txt or doc, it is judged to be a document; if the digital file is a picture, a PDF, or the like, whose suffix is a non-document suffix such as jpg, bmp, or pdf, it is judged to be a non-document. Of course, the server of the online document website may instead stipulate that only document files can be uploaded, which excludes non-document files; in that case this step is not required.
S32: Convert the digital file into a document by means of an algorithm. In this embodiment, the algorithm may be an OCR recognition algorithm of the kind commonly used in the industry. Those of ordinary skill in the art can learn it from the prior art, so it is not described further here. Of course, the server of the online document website may instead stipulate that only document files can be uploaded, which excludes non-document files; in that case this step is not required.
S33: Extract the paragraph characteristic information of the digital file and build an index. The extraction method can follow the method disclosed with reference to Fig. 1 and Fig. 2 and is not repeated here. Preferably, the index can be stored in a first indexing unit so that it is available for candidate queries. In another embodiment, in addition to indexing the paragraph characteristic information, information such as the title, author, word count, length, and words of the document is also indexed, so that different types of queries can be served.
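The patent does not specify how the index is organized. One possible layout, shown here purely as an assumption, splits each 64-bit paragraph signature into eight 8-bit blocks and stores the signature under every block. Two signatures that differ in at most 6 bits must agree on at least two whole blocks, so a candidate query only has to inspect signatures that share a block and then confirm the distance. The class name `SignatureIndex` and its methods are hypothetical.

```python
from collections import defaultdict

BLOCKS = 8        # assumed split of a 64-bit signature into 8 blocks
BLOCK_BITS = 8    # each block is one byte wide


def blocks(sig):
    """Return (block position, block value) keys for a 64-bit signature."""
    return [(i, (sig >> (i * BLOCK_BITS)) & 0xFF) for i in range(BLOCKS)]


class SignatureIndex:
    """In-memory sketch of an index from paragraph signatures to documents."""

    def __init__(self):
        self._buckets = defaultdict(set)

    def add(self, doc_id, signatures):
        """Index every paragraph signature of one document (step S33)."""
        for sig in signatures:
            for key in blocks(sig):
                self._buckets[key].add((doc_id, sig))

    def candidates(self, sig, threshold=6):
        """Return ids of documents holding a paragraph within the threshold.

        The block lookup is only guaranteed complete while threshold is
        smaller than the number of blocks.
        """
        found = set()
        for key in blocks(sig):
            for doc_id, other in self._buckets.get(key, ()):
                if bin(sig ^ other).count("1") <= threshold:
                    found.add(doc_id)
        return found
```

For example, after `index.add("book-1", [0x0123456789ABCDEF])`, querying a signature that differs from it in three bits returns `{"book-1"}`.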
As shown in Figure 6, in one embodiment of the invention, the unverified-copyright database and the index information generated from it are used when the copyright property of the existing documents is obtained from the judgment result of the document detection method. The method for building this index information comprises the following steps:
S40: Obtain existing digital files whose copyright property has not been verified. Preferably, in this embodiment, these are digital files that have been uploaded to the online document database; they may be unverified digital files that are already published online or unverified digital files that have not yet been published. Most of them were uploaded by users or service providers, and their copyright property has not been verified through a formal channel.
S41: Identify whether the digital file is a document. If it is, proceed to step S43; if not, perform step S42 to convert the digital file and then proceed to step S43. In this step, the file name suffix (extension) of the digital file can be examined to judge whether the file is a document. For example, if the digital file is a text file whose suffix is a document suffix such as txt or doc, it is judged to be a document; if the digital file is a picture, a PDF, or the like, whose suffix is a non-document suffix such as jpg, bmp, or pdf, it is judged to be a non-document. Of course, the server of the online document website may instead stipulate that only document files can be uploaded, which excludes non-document files; in that case this step is not required.
S42: Convert the digital file into a document by means of an algorithm. In this embodiment, the algorithm may be an OCR recognition algorithm of the kind commonly used in the industry. Those of ordinary skill in the art can learn it from the prior art, so it is not described further here. Of course, the server of the online document website may instead stipulate that only document files can be uploaded, which excludes non-document files; in that case this step is not required.
S43: Extract the paragraph characteristic information of the digital file and build an index. The extraction method can follow the method disclosed with reference to Fig. 1 and Fig. 2 and is not repeated here. Preferably, the index can be stored in a second indexing unit so that it is available for candidate queries. In another embodiment, in addition to indexing the paragraph characteristic information, information such as the title, author, word count, length, and words of the document is also indexed, so that different types of queries can be served.
As shown in Figure 7, be the module map of document pick-up unit in an embodiment of the present invention.Said document detection device has comprised acquiring unit 10, comparing unit 11, judging unit 12, first receiving element 13, recognition unit 14, converting unit 15, storage unit 16, second receiving element 17, legal database 18, first indexing units 19, has been checking copyright data storehouse 20, second indexing units 21, feedback unit 22, release unit 23, and processing unit 24.
In one embodiment of the invention, the acquiring unit 10 is used to obtain the paragraph characteristic information corresponding to a document. Here, a document is an electronic file whose main body is text; preferably, in this embodiment, it is an electronic file that can be text-edited, such as a txt file or a doc file. By recognizing the line breaks in the electronic file, the paragraph information of the document can be obtained and the document divided into one or more paragraphs. In the preferred embodiment, after the paragraphs have been obtained, the paragraph characteristic information of each paragraph is computed with a hash algorithm. Preferably, the paragraph characteristic information is a paragraph signature with a preset number of bits. To make the comparison with the paragraph characteristic information of existing documents in the comparing unit efficient while retaining accuracy, the preset number of bits in the preferred embodiment is 64, for example 110101000100...011 (64 bits in total, each bit taking the value 0 or 1 and no other value). In other embodiments of the invention, the preset number of bits may also be 128, 256, and so on.
To obtain a paragraph signature with the preset number of bits, the acquiring unit can be used to: segment each paragraph of the document into words, obtaining a list of (word, word frequency) pairs for the paragraph; compute an initial weight vector for the pairs in the list; hash each pair to obtain a hash string with the preset number of bits; and map the hash string onto the weight vector. Specifically, each bit of the hash string is checked: if the bit is 0, the corresponding position of the weight vector is decreased, that is, -log(f + 0.1) is applied; if the bit is 1, the corresponding position of the weight vector is increased, that is, +log(f + 0.1) is applied. Here f is the frequency of occurrence of the word: if the unit of calculation is the paragraph, f is the total number of times the word occurs in the paragraph; if the unit is the full text, f is the total number of times the word occurs in the full text. Finally, the value at each position of the weight vector is evaluated to obtain the paragraph signature with the preset number of bits: if the value at a position is greater than 0, the corresponding bit is set to 1; if it is less than or equal to 0, the corresponding bit is set to 0.
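The signature computation described above can be sketched in Python. Several details are assumptions made only for illustration: the patent does not name the hash function, so the first 64 bits of an MD5 digest are used here; word segmentation is replaced by a simple regular-expression tokenizer; and the hash is taken over the word alone rather than the (word, frequency) pair. The function names are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter

BITS = 64  # signature width of the preferred embodiment


def split_paragraphs(text):
    """Divide a document into non-empty paragraphs at line breaks."""
    return [p.strip() for p in text.splitlines() if p.strip()]


def word_hash(word):
    """Assumed hash choice: the first 64 bits of the word's MD5 digest."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[: BITS // 8], "big")


def paragraph_signature(paragraph):
    """Return the 64-bit paragraph signature as an integer."""
    # Stand-in for real word segmentation: lower-cased runs of word characters.
    freqs = Counter(re.findall(r"\w+", paragraph.lower()))
    weights = [0.0] * BITS  # initial weight vector
    for word, f in freqs.items():
        h = word_hash(word)
        w = math.log(f + 0.1)  # f is the word's frequency in the paragraph
        for i in range(BITS):
            if (h >> i) & 1:
                weights[i] += w  # hash bit 1: increase the weight
            else:
                weights[i] -= w  # hash bit 0: decrease the weight
    signature = 0
    for i, value in enumerate(weights):
        if value > 0:  # positive weight gives bit 1, otherwise bit 0
            signature |= 1 << i
    return signature
```

Calling `paragraph_signature` on each element of `split_paragraphs(text)` yields one signature per paragraph; applying it to the whole text treats the entire article as a single paragraph.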
The comparing unit 11 is used to compare the paragraph characteristic information of the document with the paragraph characteristic information of at least one existing document. In the preferred embodiment, the paragraph signature with the preset number of bits obtained for the document and the paragraph signature with the preset number of bits of the existing document are evaluated with a given algorithm; preferably, the Hamming distance between the paragraph signature of the document and the existing paragraph signature is computed. If this distance is greater than a predetermined first threshold, the two paragraphs are considered dissimilar; if it is less than or equal to the first threshold, the two paragraphs are considered similar. In the preferred embodiment, the first threshold is 6. The comparison may also include querying an index library of existing documents, that is, comparing the paragraph characteristic information of the document with that of multiple existing documents; how this index library is built is described in connection with Fig. 5 and Fig. 6.
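As a minimal illustration of the paragraph comparison, assuming signatures are held as Python integers, the Hamming distance is the number of 1 bits in the XOR of the two signatures. The function names are hypothetical; the default threshold of 6 is the first threshold of the preferred embodiment.

```python
FIRST_THRESHOLD = 6  # first threshold of the preferred embodiment


def hamming_distance(sig_a, sig_b):
    """Count the bit positions at which two signatures differ."""
    return bin(sig_a ^ sig_b).count("1")


def paragraphs_similar(sig_a, sig_b, threshold=FIRST_THRESHOLD):
    """Two paragraphs are similar when the distance does not exceed the threshold."""
    return hamming_distance(sig_a, sig_b) <= threshold
```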
The judging unit 12 is used to judge, according to the comparison result, whether there is an existing document similar to the document. When the ratio of the number of paragraphs of the document that are similar to the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, the document is judged to be similar to the existing document; when this ratio is less than the second threshold, the document is judged to be dissimilar to the existing document. The second threshold is a value of the ratio (number of similar paragraphs / total number of paragraphs of the document) and can be set as circumstances require: if a more precise comparison is needed, the second threshold can be set higher; if there is concern that the comparison may miss cases, it can be set lower. In the preferred embodiment, the second threshold is set within the interval 0.5 to 1. Preferably, in the preferred embodiment, it is required not only that the ratio of similar paragraphs be greater than or equal to the second threshold, but also that the Hamming distance between the whole-document paragraph signature of the document and the whole-document paragraph signature of the existing document be less than or equal to the first threshold. The whole-document paragraph signature can be obtained as disclosed for the acquiring unit, that is, by treating the entire article as a single paragraph and computing the paragraph signature of that paragraph with the hash algorithm.
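The document-level judgment can be sketched as follows, again with integer signatures. The function name `documents_similar` and its parameter names are hypothetical; the optional whole-document check corresponds to the additional condition of the preferred embodiment and is skipped when no whole-document signatures are supplied.

```python
def hamming(a, b):
    """Number of differing bits between two integer signatures."""
    return bin(a ^ b).count("1")


def documents_similar(doc_sigs, existing_sigs, first_threshold=6,
                      second_threshold=0.5, doc_whole=None, existing_whole=None):
    """Judge whether a document is similar to one existing document.

    doc_sigs / existing_sigs: paragraph signatures of the two documents.
    doc_whole / existing_whole: optional whole-document signatures.
    """
    if not doc_sigs:
        return False
    # Paragraphs of the document that have a similar paragraph in the other.
    similar = sum(
        1 for s in doc_sigs
        if any(hamming(s, e) <= first_threshold for e in existing_sigs)
    )
    if similar / len(doc_sigs) < second_threshold:
        return False
    if doc_whole is not None and existing_whole is not None:
        # Preferred embodiment: whole-document signatures must also be close.
        return hamming(doc_whole, existing_whole) <= first_threshold
    return True
```

Raising `second_threshold` toward 1 makes the judgment stricter, while lowering it toward 0.5 reduces the chance of missing a partially copied document.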
With the above units, the similarity between documents can be compared fairly accurately and cheating can be prevented; in addition, lookup is efficient and the processing load on the server or computer is low.
In one embodiment of the invention, the document detection device can also be used to determine the copyright property of documents. The judging unit is further used to obtain the copyright property of the document according to the above judgment result, and to obtain the copyright property of the existing documents according to the above judgment result. Preferably, depending on the application scenario, the judging unit is further used to define the copyright property of a document that is similar to an existing document as a pirated document or a suspected pirated document, or to define the copyright property of one or more existing documents that are similar to the document as pirated documents or suspected pirated documents.
In this embodiment, the document detection device further comprises:
The first receiving unit 13 is used to receive an uploaded digital file whose copyright property has not been verified. A user typically logs in to the online document website through a browser or client software and uploads a local digital file to the server of the online document website; that is, the server of the online document website receives the uploaded digital file. The copyright property of this digital file has usually not been verified, and the file may have been obtained through various channels, for example by downloading or scanning. The digital file may take various forms, such as text, e-book, picture, or PDF. In the application scenario of this embodiment, the unverified digital file is generally uploaded from a client by a user or a service provider.
The recognition unit 14 is used to identify whether the digital file is a document. In this unit, the file name suffix (extension) of the digital file can be examined to judge whether the file is a document. For example, if the digital file is a text file whose suffix is a document suffix such as txt or doc, it is judged to be a document; if the digital file is a picture, a PDF, or the like, whose suffix is a non-document suffix such as jpg, bmp, or pdf, it is judged to be a non-document. Of course, the server of the online document website may instead stipulate that only document files can be uploaded, which excludes non-document files; in that case this unit is not required.
The converting unit 15 is used to convert the digital file into a document by means of an algorithm. Preferably, in this embodiment, the algorithm may be an OCR recognition algorithm of the kind commonly used in the industry. Those of ordinary skill in the art can learn it from the prior art, so it is not described further here. Of course, the server of the online document website may instead stipulate that only document files can be uploaded, which excludes non-document files; in that case this unit is not required.
The storage unit 16 is used to store the document. The stored documents include both digital files identified as documents and documents obtained by conversion. Of course, this unit is not required: the document may instead be held in memory (RAM) and deleted from memory once its copyright property has been determined.
The legal database 18 is used to store digital files that have been verified as legitimate. A legitimate digital file can be obtained from an authorized third party, which may be any legitimate source such as a partner website, a writer, a writers' association, a university, or a publishing house, and the legitimate digital file can be uploaded to the server; that is, the server receives the uploaded digital file that has been verified as legitimate and stores it in the legal database 18. The digital file may take various forms, such as text, e-book, picture, or PDF.
The first indexing unit 19 is used to extract the paragraph characteristic information of the legitimate digital files and build an index. Preferably, the first indexing unit 19 works together with the acquiring unit 10 to extract the paragraph characteristic information of a digital file; the extraction method can follow the method disclosed with reference to Fig. 1 and Fig. 2 and is not repeated here. Preferably, the index can be stored in the first indexing unit so that it is available for candidate queries. In another embodiment, in addition to indexing the paragraph characteristic information, information such as the title, author, word count, length, and words of the document is also indexed, so that different types of queries can be served.
The second receiving unit 17 is used to receive a digital file that has been verified as legitimate. In this unit, the legitimate digital file can be obtained from an authorized third party, or from the legal database described above. The third party may be any legitimate source such as a partner website, a writer, a writers' association, a university, or a publishing house, and the legitimate digital file can be uploaded to the server; that is, the server receives an uploaded digital file that has been verified as legitimate. The digital file may take various forms, such as text, e-book, picture, or PDF.
The unverified-copyright database 20 is used to store existing digital files whose copyright property has not been verified. Preferably, in this embodiment, these are digital files that have been uploaded to the online document database; they may be unverified digital files that are already published online or unverified digital files that have not yet been published. Most of them were uploaded by users or service providers, and their copyright property has not been verified through a formal channel.
The second indexing unit 21 is used to extract the paragraph characteristic information of the digital files whose copyright property has not been verified and build an index. Preferably, the second indexing unit 21 works together with the acquiring unit 10 to extract the paragraph characteristic information of a digital file; the extraction method can follow the method disclosed with reference to Fig. 1 and Fig. 2 and is not repeated here. In another embodiment, in addition to indexing the paragraph characteristic information, information such as the title, author, word count, length, and words of the document is also indexed, so that different types of queries can be served.
The feedback unit 22 is used to send feedback information. Preferably, it is a unit that sends feedback information after an audit confirms that the document is a pirated document. In this embodiment, feedback information can be sent to the party that uploaded the digital document whose copyright property has not been verified. In general, the feedback information can be sent in the form of a prompt box, for example a prompt box popped up in the browser or in the client. Of course, a new page may also be returned to the browser to inform the uploader that the content of the digital document is pirated or has not passed copyright verification.
The publishing unit 23 is used to publish the non-pirated document online. Preferably, it is used to publish the document online after an audit confirms that the document is a non-pirated document. In one embodiment, the non-pirated document is added to the online document database. Preferably, in a specific embodiment of the present invention, the online document database is the legal database; adding the non-pirated document to the legal database effectively expands the set of legitimate documents in it, so that digital documents uploaded in the future can be screened more effectively.
The processing unit 24 is used to delete the pirated document. Preferably, it is used to delete the pirated document after an audit confirms that the document is a pirated document.
In this embodiment, the processing unit 24 is also used, after an audit confirms that the document is a non-pirated document, to mark the copyright property of the non-pirated document as verified, and/or to copy or move the non-pirated document to a verified-copyright database. As a special case of this embodiment, the verified-copyright database is the above-mentioned legal database.
Through the above units, a document can be detected as soon as it is uploaded, which avoids the unnecessary load placed on the server when the document's copyright property is detected later; in addition, copyright detection of the existing documents stored on the server can be processed in batches, which is more efficient.
For convenience of description, the above device is described as being divided by function into various units. Of course, when implementing the present application, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application or in certain parts of the embodiments.
The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may also be appropriately combined to form other embodiments that those skilled in the art can understand.
The detailed descriptions listed above are only specific descriptions of feasible embodiments of the present invention. They are not intended to limit the protection scope of the present invention; all equivalent embodiments or changes that do not depart from the technical spirit of the present invention shall fall within the protection scope of the present invention.

Claims (53)

1. A document detection method, characterized in that the document detection method comprises the following steps:
S1: obtaining paragraph feature information corresponding to a document;
S2: comparing the paragraph feature information of the document with the paragraph feature information of at least one existing document;
S3: judging, according to the comparison result, whether there is an existing document similar to the document; when the ratio of the number of paragraphs that are similar between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, and the result calculated from the total paragraph feature information of the document and the total paragraph feature information of the existing document is less than a first threshold, the document is judged to be similar to the existing document; when the ratio of the number of paragraphs that are similar between the document and the existing document to the total number of paragraphs of the document is less than the set second threshold, the document is judged to be dissimilar to the existing document.
2. The document detection method according to claim 1, characterized in that the paragraph feature information is a paragraph signature of a preset number of feature bits.
3. The document detection method according to claim 2, characterized in that the paragraph signature of the preset number of feature bits is obtained through a hash algorithm.
4. The document detection method according to claim 3, characterized in that "obtaining the paragraph signature of the preset number of feature bits through a hash algorithm" specifically comprises the following steps:
S100: segmenting each paragraph in the document into words to obtain a list of (word, word frequency) pairs for the paragraph;
S101: computing an initial weight vector for the pairs in the list;
S102: computing each pair with the hash algorithm to obtain a hash string of the preset number of feature bits;
S103: mapping the hash string onto the weight vector;
S104: computing the value at each position of the weight vector to obtain the paragraph signature of the preset number of feature bits.
5. The document detection method according to claim 4, characterized in that step S103 specifically comprises:
determining whether each bit of the hash string is 0 or 1; if it is 0, the weight at the corresponding position of the weight vector is decreased during mapping; if it is 1, the weight at the corresponding position of the weight vector is increased during mapping.
6. The document detection method according to claim 4 or 5, characterized in that step S104 specifically comprises:
determining whether the value at each position of the weight vector is greater than 0; if it is greater than 0, the value at that position of the weight vector is set to 1; if it is less than or equal to 0, the value at that position of the weight vector is set to 0.
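Steps S100 to S104, together with the bit rules of claims 5 and 6, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the signature length of 64 bits, whitespace splitting in place of real word segmentation, MD5 as the hash, and the use of word frequency as the weight are all choices made here for the example, and the function names are hypothetical.

```python
import hashlib
from collections import Counter

SIGNATURE_BITS = 64  # assumed value for the "preset number of feature bits"


def tokenize(paragraph: str) -> list[str]:
    # Assumption: whitespace splitting stands in for word segmentation.
    return paragraph.lower().split()


def hash_bits(word: str, bits: int = SIGNATURE_BITS) -> int:
    # Assumption: MD5 truncated to `bits` bits; the claims name no hash.
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def paragraph_signature(paragraph: str, bits: int = SIGNATURE_BITS) -> int:
    # S100: (word, word frequency) pairs for the paragraph.
    pairs = Counter(tokenize(paragraph))
    # S101: initial weight vector, one entry per feature bit.
    weights = [0] * bits
    for word, freq in pairs.items():
        # S102: hash of the preset bit length for this pair.
        h = hash_bits(word, bits)
        # S103: a 1 bit adds the weight, a 0 bit subtracts it (claim 5).
        for i in range(bits):
            if (h >> i) & 1:
                weights[i] += freq
            else:
                weights[i] -= freq
    # S104: positive positions become 1, all others 0 (claim 6).
    signature = 0
    for i, w in enumerate(weights):
        if w > 0:
            signature |= 1 << i
    return signature
```

Because the signature depends only on the word counts, reordering the words of a paragraph leaves it unchanged, and a paragraph made of a single word has that word's hash as its signature.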
7. The document detection method according to claim 3, characterized in that when the ratio of the number of paragraphs that are similar between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, the document is judged to be similar to the existing document; when the ratio of the number of paragraphs that are similar between the document and the existing document to the total number of paragraphs of the document is less than the set second threshold, the document is judged to be dissimilar to the existing document.
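The ratio test of claim 7 can be sketched in Python as follows. This is an illustration only: paragraph signatures are assumed to be integers, a paragraph is counted as similar when its signature is within a bit-difference threshold of any signature of the existing document, and the default threshold values and function names are hypothetical rather than taken from the text.

```python
def hamming(a: int, b: int) -> int:
    # Number of bit positions in which two signatures differ.
    return bin(a ^ b).count("1")


def similar_paragraph_count(doc_sigs, existing_sigs, first_threshold):
    # Count paragraphs of the document that are close to at least one
    # paragraph of the existing document.
    return sum(
        1
        for s in doc_sigs
        if any(hamming(s, e) <= first_threshold for e in existing_sigs)
    )


def documents_similar(doc_sigs, existing_sigs,
                      first_threshold=3, second_threshold=0.5):
    # Assumed defaults; the claims only say the thresholds are set in advance.
    if not doc_sigs:
        return False  # no paragraphs, so nothing can be judged similar
    ratio = similar_paragraph_count(
        doc_sigs, existing_sigs, first_threshold
    ) / len(doc_sigs)
    # Similar when the ratio reaches the second threshold.
    return ratio >= second_threshold
```

For example, with document signatures `[0b0000, 0b1111]`, a single existing signature `0b0001`, and a first threshold of 1, one of the two paragraphs is similar, so the ratio is 0.5 and the document is judged similar at a second threshold of 0.5 but dissimilar at 0.75.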
8. The document detection method according to claim 7, characterized in that the similar paragraphs are obtained through the following steps:
performing a calculation, by an algorithm, on the paragraph signature of the preset number of feature bits obtained for the document and the paragraph signature of the preset number of feature bits of the existing document; if the result is greater than a predetermined first threshold, the paragraphs are dissimilar; if the result is less than or equal to the predetermined first threshold, the paragraphs are similar.
9. The document detection method according to claim 8, characterized in that "performing a calculation, by an algorithm, on the paragraph signature of the preset number of feature bits obtained for the document and the paragraph signature of the preset number of feature bits of the existing document" means calculating, as a Hamming distance, the distance between the paragraph signature of the document and the paragraph signature of the existing document.
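The paragraph comparison of claims 8 and 9 amounts to counting the bits in which two signatures differ and comparing that count with the first threshold. A minimal Python sketch, assuming the signatures are held as integers and using a hypothetical default threshold of 3:

```python
def hamming_distance(sig_a: int, sig_b: int) -> int:
    # XOR leaves a 1 in every position where the signatures differ.
    return bin(sig_a ^ sig_b).count("1")


def paragraphs_similar(sig_a: int, sig_b: int, first_threshold: int = 3) -> bool:
    # Similar when the distance is less than or equal to the first threshold,
    # dissimilar when it is greater.
    return hamming_distance(sig_a, sig_b) <= first_threshold
```

For example, `0b1011` and `0b0010` differ in two positions, so they count as similar with a threshold of 2 and as dissimilar with a threshold of 1.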
10. The document detection method according to claim 7, characterized in that the copyright property of the document that is similar to an existing document is defined as pirated document.
11. The document detection method according to claim 7, characterized in that the copyright property of the document that is similar to an existing document is defined as suspected pirated document.
12. The document detection method according to claim 11, characterized in that the suspected pirated document is audited; if the audit confirms that the suspected pirated document is a pirated document, feedback information is sent; if the audit confirms that the suspected pirated document is a non-pirated document, the non-pirated document is published online.
13. The document detection method according to claim 7, characterized in that the copyright property of one or more existing documents that are similar to the document is defined as pirated document.
14. The document detection method according to claim 7, characterized in that the attribute of one or more existing documents that are similar to the document is defined as suspected pirated document.
15. The document detection method according to claim 14, characterized in that the suspected pirated document is audited; if the audit confirms that the suspected pirated document is a pirated document, the pirated document is deleted; if the audit confirms that the suspected pirated document is a non-pirated document, the non-pirated document is retained.
16. The document detection method according to claim 15, characterized in that the copyright property of the non-pirated document is marked as verified, and/or the non-pirated document is copied or moved to a verified-copyright database.
17. The document detection method according to claim 16, characterized in that the steps of claim 17 are repeated until the screening of all existing documents is completed.
18. The document detection method according to claim 1, characterized in that the copyright property of the document is obtained according to the judgment result.
19. The document detection method according to claim 18, characterized in that, before step S1, the method further comprises a step of building the paragraph feature information of existing documents:
obtaining digital documents that have been verified as legitimate;
extracting the paragraph feature information of the digital documents and building an index.
20. The document detection method according to claim 19, characterized in that the step of "building the paragraph feature information of existing documents" further comprises:
identifying whether the digital document is a document;
if so, extracting the paragraph feature information of the document and building an index; if not, converting the digital document into a document through an algorithm, and then extracting the paragraph feature information of the document and building an index.
21. The document detection method according to claim 19, characterized in that, after the step of "building the paragraph feature information of existing documents", the method further comprises:
receiving an uploaded digital document whose copyright property has not been verified.
22. The document detection method according to claim 21, characterized in that, after the step of "receiving an uploaded digital document whose copyright property has not been verified", the method further comprises:
judging whether the digital document is a document;
if so, performing step S1; if not, converting the digital document into a document through an algorithm, and then performing step S1.
23. The document detection method according to claim 22, characterized in that, before step S1, the method further comprises storing the document.
24. The document detection method according to claim 1, characterized in that the copyright property of the existing document is obtained according to the judgment result.
25. The document detection method according to claim 24, characterized in that, before step S1, the method further comprises a step of building the paragraph feature information of existing documents:
obtaining existing digital documents whose copyright property has not been verified;
extracting the paragraph feature information of the digital documents and building an index.
26. The document detection method according to claim 25, characterized in that the step of "building the paragraph feature information of existing documents" further comprises:
identifying whether the digital document is a document;
if so, extracting the paragraph feature information of the document and building an index; if not, converting the digital document into a document through an algorithm, and then extracting the paragraph feature information of the document and building an index.
27. The document detection method according to claim 24, characterized in that, after the step of "building the paragraph feature information of existing documents", the method further comprises:
receiving a digital document that has been verified as legitimate.
28. The document detection method according to claim 27, characterized in that, after the step of "receiving a digital document that has been verified as legitimate", the method further comprises:
judging whether the digital document is a document;
if so, performing step S1; if not, converting the digital document into a document through an algorithm, and then performing step S1.
29. A document detection device, characterized in that the document detection device comprises:
an acquiring unit, used to obtain paragraph feature information corresponding to a document;
a comparing unit, used to compare the paragraph feature information of the document with the paragraph feature information of at least one existing document;
a judging unit, used to judge, according to the comparison result, whether there is an existing document similar to the document; when the ratio of the number of paragraphs that are similar between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold, and the result calculated from the total paragraph feature information of the document and the total paragraph feature information of the existing document is less than a first threshold, the document is judged to be similar to the existing document; when the ratio of the number of paragraphs that are similar between the document and the existing document to the total number of paragraphs of the document is less than the set second threshold, the document is judged to be dissimilar to the existing document.
30. The document detection device according to claim 29, characterized in that the paragraph feature information is a paragraph signature of a preset number of feature bits.
31. The document detection device according to claim 30, characterized in that the paragraph signature of the preset number of feature bits is obtained through a hash algorithm.
32. The document detection device according to claim 31, characterized in that the acquiring unit is used to:
segment each paragraph in the document into words to obtain a list of (word, word frequency) pairs for the paragraph;
compute an initial weight vector for the pairs in the list;
compute each pair with the hash algorithm to obtain a hash string of the preset number of feature bits;
map the hash string onto the weight vector;
compute the value at each position of the weight vector to obtain the paragraph signature of the preset number of feature bits.
33. The document detection device according to claim 32, characterized in that the acquiring unit is used to: determine whether each bit of the hash string is 0 or 1; if it is 0, decrease the weight at the corresponding position of the weight vector during mapping; if it is 1, increase the weight at the corresponding position of the weight vector during mapping.
34. The document detection device according to claim 32 or 33, characterized in that the acquiring unit is used to: determine whether the value at each position of the weight vector is greater than 0; if it is greater than 0, set the value at that position of the weight vector to 1; if it is less than or equal to 0, set the value at that position of the weight vector to 0.
35. The document detection device according to claim 31, characterized in that the judging unit is used to: judge that the document is similar to the existing document when the ratio of the number of paragraphs that are similar between the document and the existing document to the total number of paragraphs of the document is greater than or equal to a set second threshold; and judge that the document is dissimilar to the existing document when the ratio of the number of paragraphs that are similar between the document and the existing document to the total number of paragraphs of the document is less than the set second threshold.
36. The document detection device according to claim 35, characterized in that the comparing unit is used to perform a calculation, by an algorithm, on the paragraph signature of the preset number of feature bits obtained for the document and the paragraph signature of the preset number of feature bits of the existing document; if the result is greater than a predetermined first threshold, the paragraphs are dissimilar; if the result is less than or equal to the predetermined first threshold, the paragraphs are similar.
37. The document detection device according to claim 36, characterized in that the distance between the paragraph signature of the document and the paragraph signature of the existing document is calculated as a Hamming distance.
38. The document detection device according to claim 35, characterized in that the judging unit is used to define the copyright property of the document that is similar to an existing document as pirated document.
39. The document detection device according to claim 35, characterized in that the judging unit is used to define the copyright property of the document that is similar to an existing document as suspected pirated document.
40. The document detection device according to claim 39, characterized in that the document detection device further comprises a unit used to send feedback information after an audit confirms that the document is a pirated document.
41. The document detection device according to claim 39, characterized in that the document detection device further comprises a unit used to publish the non-pirated document online after an audit confirms that the document is a non-pirated document.
42. The document detection device according to claim 35, characterized in that the judging unit is used to define the attribute of one or more existing documents that are similar to the document as pirated document.
43. The document detection device according to claim 35, characterized in that the judging unit is used to define the attribute of one or more existing documents that are similar to the document as suspected pirated document.
44. The document detection device according to claim 43, characterized in that the document detection device further comprises a processing unit used to delete the pirated document after an audit confirms that the document is a pirated document.
45. The document detection device according to claim 44, characterized in that the processing unit is also used, after an audit confirms that the document is a non-pirated document, to mark the copyright property of the non-pirated document as verified, and/or to copy or move the non-pirated document to a verified-copyright database.
46. The document detection device according to claim 29, characterized in that the judging unit is also used to obtain the copyright property of the document according to the judgment result.
47. The document detection device according to claim 46, characterized in that the document detection device further comprises:
a unit used to store digital documents that have been verified as legitimate; and
a unit used to extract the paragraph feature information of the digital documents and build an index.
48. The document detection device according to claim 46, characterized in that the document detection device further comprises:
a unit used to receive an uploaded digital document whose copyright property has not been verified.
49. The document detection device according to claim 29, characterized in that the judging unit is also used to obtain the copyright property of the existing document according to the judgment result.
50. The document detection device according to claim 49, characterized in that the document detection device further comprises:
a unit used to store existing digital documents whose copyright property has not been verified; and
a unit used to extract the paragraph feature information of the digital documents and build an index.
51. The document detection device according to claim 49, characterized in that the document detection device further comprises:
a unit used to receive a digital document that has been verified as legitimate.
52. The document detection device according to any one of claims 46 to 51, characterized in that the document detection device further comprises:
a unit used to identify whether the digital document is a document;
a unit used to convert the digital document into a document through an algorithm.
53. The document detection device according to any one of claims 46 to 51, characterized in that the document detection device further comprises a unit used to store the document.
CN2011100808382A 2011-03-31 2011-03-31 Method and device for detecting document Active CN102156689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100808382A CN102156689B (en) 2011-03-31 2011-03-31 Method and device for detecting document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011100808382A CN102156689B (en) 2011-03-31 2011-03-31 Method and device for detecting document

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201210340026.1A Division CN102915295B (en) 2011-03-31 2011-03-31 Document detecting method and document detecting device

Publications (2)

Publication Number Publication Date
CN102156689A CN102156689A (en) 2011-08-17
CN102156689B true CN102156689B (en) 2012-11-28

Family

ID=44438191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100808382A Active CN102156689B (en) 2011-03-31 2011-03-31 Method and device for detecting document

Country Status (1)

Country Link
CN (1) CN102156689B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915295A (en) * 2011-03-31 2013-02-06 百度在线网络技术(北京)有限公司 Document detecting method and document detecting device

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955763B (en) * 2011-08-22 2016-07-06 联想(北京)有限公司 Display packing and display device
CN102968610B (en) * 2011-08-31 2016-03-30 富士通株式会社 Receipt image processing method and equipment
CN102360372B (en) * 2011-10-09 2013-01-30 北京航空航天大学 Cross-language document similarity detection method
CN103095824B (en) * 2013-01-09 2016-01-20 广东一一五科技有限公司 Files passe control method and system
CN103179216A (en) * 2013-04-16 2013-06-26 上海同岩土木工程科技有限公司 File scanning and automatic unloading method based on Twain protocol
CN103970722B (en) * 2014-05-07 2017-04-05 江苏金智教育信息技术有限公司 A kind of method of content of text duplicate removal
CN104270474A (en) * 2014-11-02 2015-01-07 佛山美阳瓴电子科技有限公司 Device and method used for sharing information in network
CN105183809A (en) * 2015-08-26 2015-12-23 成都布林特信息技术有限公司 Cloud platform data query method
CN105205104A (en) * 2015-08-26 2015-12-30 成都布林特信息技术有限公司 Cloud platform data acquisition method
CN105183835B (en) * 2015-08-31 2018-09-04 小米科技有限责任公司 The method and device of information flag in social software
CN107798637A (en) * 2016-08-30 2018-03-13 北京国双科技有限公司 The different acquisition methods and device for sentencing document of accomplice
CN106649257B (en) * 2016-09-21 2019-06-18 联动优势科技有限公司 A kind of conversion method and device of semanteme section
CN108614827A (en) * 2016-12-12 2018-10-02 阿里巴巴集团控股有限公司 Data segmentation method, judging method and electronic equipment
CN106844314B (en) * 2017-02-21 2019-10-18 北京焦点新干线信息技术有限公司 A kind of duplicate checking method and device of article
CN107368472B (en) * 2017-07-26 2021-01-05 成都科来软件有限公司 Storage method of document analysis result capable of being iteratively optimized
CN108491458A (en) * 2018-03-02 2018-09-04 深圳市联软科技股份有限公司 A kind of sensitive document detection method, medium and equipment
TWI726356B (en) * 2019-07-16 2021-05-01 宏碁股份有限公司 Electronic device and file content management method
CN112183052B (en) * 2020-09-29 2024-03-05 百度(中国)有限公司 Document repetition degree detection method, device, equipment and medium
CN113138964B (en) * 2021-05-20 2021-11-19 掌阅科技股份有限公司 Electronic book information display method, user terminal and computer storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833579A (en) * 2010-05-11 2010-09-15 同方知网(北京)技术有限公司 Method and system for automatically detecting academic misconduct literature

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
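Zhao Junjie et al. A paper plagiarism determination algorithm based on paragraph word-frequency statistics. Computer Technology and Development. 2009, Vol. 19 (No. 4), pp. 231-233, 238. *
Jin Bo et al. A text similarity algorithm based on semantic understanding. Journal of Dalian University of Technology. 2005, Vol. 45 (No. 2), pp. 291-297. *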

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant