CN103294671A

CN103294671A - Document detection method and system

Info

Publication number: CN103294671A
Application number: CN2012100406942A
Authority: CN
Inventors: 王炫聪; 孙甲慧; 陈锡彬; 李翔; 黄斌强
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2012-02-22
Filing date: 2012-02-22
Publication date: 2013-09-11
Anticipated expiration: 2032-02-22
Also published as: CN103294671B

Abstract

An embodiment of the invention provides a document detection method and system, relates to the field of data processing technology, and addresses the problem that conventional near-duplicate document detection methods cannot meet high requirements for precision and recall. The embodiment combines multi-feature fingerprint lookup with document similarity comparison. Multiple feature fingerprints accurately reflect the features that distinguish the web document under test from other web documents, and matching records can be retrieved quickly from the correspondence between feature fingerprints and near-duplicate documents stored in an existing database, which improves the accuracy and efficiency of near-duplicate document detection. The document similarity comparison covers the case in which the web document under test is in fact a near-duplicate document but the multi-feature fingerprint lookup misses it because the database is incomplete, which improves the recall of near-duplicate document detection.

Description

Document detection method and system

Technical field

The present invention relates to the field of Internet data processing technology, and in particular to a document detection method and system.

Background technology

Near-duplicate documents are generally web documents on the Internet that differ from one another only slightly, for example in a counter, a timestamp, a few words, or a few sentences. In addition, web documents with the same topic, that is, documents reporting the same event, also count as near-duplicate documents: even if their wording, sentences, paragraphs, and length differ considerably, they express the same meaning from the user's point of view.

According to statistics, duplicated web documents account for 30% to 45% of all documents on the Internet, most of them produced by mirroring and reposting. Detecting near-duplicate documents among this very large number of web documents is decisive for improving the efficiency of web search.

Near-duplicate document detection methods currently in use mainly include: directly comparing the similarity of two documents by feature-vector distance, web document deduplication based on feature strings, the simhash algorithm, and the shingling algorithm. In practice, however, none of these methods meets high requirements for both precision and recall.

Summary of the invention

Embodiments of the invention provide a document detection method and system that can significantly improve the precision and recall of near-duplicate document detection.

To achieve the above object, embodiments of the invention adopt the following technical solutions:

A document detection method comprises: extracting characters from a web document under test to obtain at least one document feature; performing a hash calculation on each document feature to obtain a corresponding feature fingerprint; if no document cluster similar to the web document under test is found in the fingerprint mapping database according to the feature fingerprints, comparing the web document under test for similarity with the web documents from within a specified number of days; and if a web document from within the specified number of days whose similarity value with the web document under test is greater than a similarity threshold is contained in a document cluster of the document mapping database, taking that document cluster as the target document cluster, the web document under test being a near-duplicate document of the target document cluster.

A document detection system comprises: a feature extraction device, configured to extract characters from a web document under test to obtain at least one document feature, and to perform a hash calculation on each document feature to obtain at least one corresponding feature fingerprint; and a document comparison device, configured to compare the web document under test for similarity with the web documents from within a specified number of days when no document cluster similar to the web document under test is found in the fingerprint mapping database according to the feature fingerprints, and, when it is determined that a web document from within the specified number of days whose similarity value with the web document under test is greater than a similarity threshold is contained in a document cluster of the document mapping database, to take that document cluster as the target document cluster and the web document under test as a near-duplicate document of the target document cluster.

The document detection method and system provided by the embodiments of the invention combine multi-feature fingerprint lookup with document similarity comparison. A preliminary check uses the multiple feature fingerprints obtained from the web document under test to query the existing fingerprint mapping database for a document cluster similar to that document, and thereby determines whether it is a near-duplicate document. Because multiple feature fingerprints accurately reflect the features that distinguish the web document under test from other web documents, and because matching records can be retrieved quickly from the correspondence between feature fingerprints and near-duplicate documents in the existing database, the accuracy and efficiency of near-duplicate document detection are improved. If the preliminary check finds no document cluster similar to the web document under test, a conventional document similarity detection method is applied as well, so that a web document under test that really is a near-duplicate document is not missed because the database is incomplete. This improves the recall of near-duplicate document detection.

Description of drawings

To describe the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the document detection method provided by Embodiment 1 of the invention;

Fig. 2 is a block diagram of the document detection system provided by Embodiment 1 of the invention;

Fig. 3 is a flowchart of the document detection method provided by Embodiment 2 of the invention;

Fig. 4 is a block diagram of the document detection system provided by Embodiment 3 of the invention.

Embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

Embodiment 1

This embodiment provides a document detection method. As shown in Fig. 1, the method comprises the following steps.

101. Extract characters from the web document under test to obtain at least one document feature.

Specifically, the document features stand in for the web document under test and are compared with the document features of other web documents to determine whether the web document under test and those documents are near-duplicate documents.

102. Perform a hash calculation on each document feature to obtain a corresponding feature fingerprint.

103. If no document cluster similar to the web document under test is found in the fingerprint mapping database according to the feature fingerprints, compare the web document under test for similarity with the web documents from within the specified number of days.

Specifically, the fingerprint mapping database stores (K, V) pairs that map feature fingerprints to document cluster information. K (key) is a feature fingerprint; V (value) is the information of the document cluster to which the feature fingerprint belongs, which may include the cluster identifier, a weight (the weight with which the feature fingerprint belongs to the document cluster), the creation time of the document cluster, and so on. A document cluster is a set of documents made up of multiple near-duplicate documents.

Given the N feature fingerprints obtained from the web document under test, the record corresponding to each of them can be looked up in the fingerprint mapping database. Not every feature fingerprint necessarily has a record in the fingerprint mapping database, so the number of records found is M ≤ N. The web document under test may be determined to be similar to a document cluster when X of the M records (1 ≤ X ≤ M) contain the same document cluster information. Alternatively, the document cluster in the record with the largest weight among the M records may be designated as the similar document cluster, or any other manner known to those skilled in the art may be used to determine the document cluster in the fingerprint mapping database that is similar to the web document under test.

If none of the manners described above finds a document cluster similar to the web document under test in the fingerprint mapping database, the web document under test is compared for similarity with the web documents from within the specified number of days.
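The lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the in-memory dictionary standing in for the fingerprint mapping database, the record field names `cluster_id` and `weight`, and the function name `find_similar_cluster` are all assumptions introduced here.

```python
from collections import Counter

def find_similar_cluster(fingerprints, fingerprint_db, min_matches):
    """Return the cluster shared by at least `min_matches` (X) found records, else None.

    `fingerprint_db` is a hypothetical dict: fingerprint -> {"cluster_id", "weight"}.
    """
    # Only fingerprints that already have a record contribute (M <= N).
    records = [fingerprint_db[fp] for fp in dict.fromkeys(fingerprints)
               if fp in fingerprint_db]
    if not records:
        return None
    # Count how many of the M records point at each document cluster.
    counts = Counter(record["cluster_id"] for record in records)
    cluster_id, hits = counts.most_common(1)[0]
    return cluster_id if hits >= min_matches else None

# Example data (made up for illustration).
db = {
    101: {"cluster_id": "c1", "weight": 2},
    102: {"cluster_id": "c1", "weight": 1},
    103: {"cluster_id": "c2", "weight": 5},
}
print(find_similar_cluster([101, 102, 103, 104], db, min_matches=2))
```

A return value of `None` corresponds to the case in which the method falls back to the similarity comparison. The largest-weight alternative mentioned in the text would instead pick the record with the maximum `weight` among the M records.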

104. If a web document from within the specified number of days whose similarity value with the web document under test is greater than the similarity threshold is contained in a document cluster of the document mapping database, that document cluster is the target document cluster, and the web document under test is a near-duplicate document of the target document cluster.

Specifically, there are many algorithms for comparing the similarity of two documents; commonly used ones include the longest common string (LCS) method and the cosine similarity (COS) method. When the similarity value computed for two documents is greater than the specified similarity threshold, the two documents are considered similar.
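As an illustration of the cosine option, the sketch below compares two documents by the cosine of their term-frequency vectors. The whitespace tokenization, the function names, and the threshold value in the example are assumptions made for this sketch; the source does not specify how the vectors are built.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

def is_similar(text_a: str, text_b: str, threshold: float) -> bool:
    """Two documents are similar when the similarity value exceeds the threshold."""
    return cosine_similarity(text_a, text_b) > threshold

print(is_similar("flood hits city centre", "flood hits the city centre", 0.8))
```

For Chinese text a word segmenter would replace `split()`; that choice is outside what the source describes.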

The document mapping database stores (K, V) pairs that map a document to the information of the document cluster it belongs to. K (key) is the document identifier; V (value) is the content of the document and the information of the document cluster to which the document belongs, which may include the cluster identifier.

In this step, restricting the comparison to web documents from within the specified number of days further improves the efficiency of document comparison. This is particularly suitable for news web documents, which are time-sensitive: 99% of their near-duplicate documents are web documents from the most recent 3 to 5 days.

If a web document from within the specified number of days whose similarity value with the web document under test is greater than the similarity threshold is contained in a document cluster of the document mapping database, that is, if a record containing that web document can be found in the document mapping database, then the document cluster in that record is similar to the web document under test. The document cluster in that record is designated as the target document cluster, and the web document under test is a near-duplicate document of the target document cluster.

The document detection method provided by this embodiment combines multi-feature fingerprint lookup with document similarity comparison. A preliminary check uses the multiple feature fingerprints obtained from the web document under test to query the existing fingerprint mapping database for a document cluster similar to that document, and thereby determines whether it is a near-duplicate document. Because multiple feature fingerprints accurately reflect the features that distinguish the web document under test from other web documents, and because matching records can be retrieved quickly from the correspondence between feature fingerprints and near-duplicate documents in the existing database, the accuracy and efficiency of near-duplicate document detection are improved. If the preliminary check finds no document cluster similar to the web document under test, a conventional document similarity detection method is applied as well, so that a web document under test that really is a near-duplicate document is not missed because the fingerprint mapping database is incomplete. This improves the recall of near-duplicate document detection.

This embodiment also provides a document detection system. As shown in Fig. 2, the system comprises: a feature extraction device 21, configured to extract characters from a web document under test to obtain at least one document feature, and to perform a hash calculation on each document feature to obtain a corresponding feature fingerprint; and a document comparison device 22, configured to compare the web document under test for similarity with the web documents from within a specified number of days when no document cluster similar to the web document under test is found in the fingerprint mapping database according to the feature fingerprints, and, when it is determined that a web document from within the specified number of days whose similarity value with the web document under test is greater than a similarity threshold is contained in a document cluster of the document mapping database, to take that document cluster as the target document cluster and the web document under test as a near-duplicate document of the target document cluster.

The methods performed by the devices of the above document detection system have been described in detail in this embodiment and are not repeated here.

The document detection system provided by this embodiment combines multi-feature fingerprint lookup with document similarity comparison. A preliminary check uses the multiple feature fingerprints obtained from the web document under test to query the existing fingerprint mapping database for a document cluster similar to that document, and thereby determines whether it is a near-duplicate document. Because multiple feature fingerprints accurately reflect the features that distinguish the web document under test from other web documents, and because matching records can be retrieved quickly from the correspondence between feature fingerprints and near-duplicate documents in the existing database, the accuracy and efficiency of near-duplicate document detection are improved. If the preliminary check finds no document cluster similar to the web document under test, a conventional document similarity detection method is applied as well, so that a web document under test that really is a near-duplicate document is not missed because the fingerprint mapping database is incomplete. This improves the recall of near-duplicate document detection.

Embodiment 2

This embodiment provides a document detection method. As shown in Fig. 3, the method comprises the following steps.

301. Extract characters from the web document under test to obtain at least one document feature.

Specifically, the document features stand in for the web document under test and are compared with the document features of other web documents to determine whether the web document under test and those documents are near-duplicate documents. The at least one document feature can be obtained by the following method.

The text of the web document under test is divided into at least one paragraph according to paragraph marks; the N paragraphs containing the most characters are selected from the at least one paragraph; each of the N paragraphs is divided into at least one sentence according to punctuation marks; and the sentence containing the most characters in each of those paragraphs, together with the title of the web document under test, is taken as the document features.
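The extraction procedure above can be sketched as follows. The sketch assumes that paragraphs are separated by newlines and that sentences end at the listed Western or Chinese punctuation marks; the function name and the default N are hypothetical.

```python
import re

# Assumed sentence terminators; the source only says "punctuation marks".
SENTENCE_END = re.compile(r"[.!?。！？]")

def extract_features(title: str, body: str, n_paragraphs: int = 3) -> list:
    """Return the title plus the longest sentence of each of the N longest paragraphs."""
    paragraphs = [p.strip() for p in body.split("\n") if p.strip()]
    # Keep the N paragraphs that contain the most characters.
    longest = sorted(paragraphs, key=len, reverse=True)[:n_paragraphs]
    features = [title]
    for paragraph in longest:
        sentences = [s.strip() for s in SENTENCE_END.split(paragraph) if s.strip()]
        if sentences:
            # Keep the sentence with the most characters in this paragraph.
            features.append(max(sentences, key=len))
    return features

body = "Short one.\nThis is a much longer sentence here. Tiny.\nMid size text. Ok."
print(extract_features("Some title", body, n_paragraphs=2))
```

Each returned string would then be hashed into a feature fingerprint in step 302.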

In the above method of obtaining at least one document feature, before the text of the web document under test is divided into at least one paragraph, the method may further comprise: denoising the web document under test.

"Denoising" means cleaning the data in the web document to remove factors that would interfere with the accuracy and recall of near-duplicate document detection. It mainly comprises a conversion operation and a cleaning operation:

Conversion operation: convert full-width characters (mainly punctuation, digits, and letters) to half-width characters.

Cleaning operation: remove words, sentences, and stop words that are meaningless to the system but would cause interference, for example "according to ... reports", a phrase that carries little meaning in news web documents.
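A minimal sketch of the two denoising operations follows. The full-width to half-width mapping relies on the Unicode layout (U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts, and U+3000 is the ideographic space). The noise pattern is a hypothetical English stand-in for the "according to ... reports" example; a real system would maintain its own list.

```python
import re

def to_half_width(text: str) -> str:
    """Conversion operation: map full-width characters to half-width ones."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:    # full-width ASCII variants
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical noise phrases; not taken from the source.
NOISE_PATTERNS = [re.compile(r"according to [^,]* reports?,?\s*", re.IGNORECASE)]

def denoise(text: str) -> str:
    """Apply the conversion operation, then the cleaning operation."""
    text = to_half_width(text)
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    return text

print(denoise("According to Xinhua reports, prices rose １２％."))
```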

Of course, the method of obtaining document features is not limited to the above. Other options include: using the full text as the document feature, using paragraphs as document features, or using fixed-length word sequences to the left and right of full stops as document features.

In addition, after the at least one document feature is obtained, the method may further comprise denoising the at least one document feature, for example by removing punctuation marks, to further avoid the adverse effect of meaningless characters on accuracy and recall.

302. Perform a hash calculation on each document feature to obtain a corresponding feature fingerprint.
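The source does not name a hash function. As one possible sketch, the example below derives a 64-bit fingerprint from the SHA-256 digest of the UTF-8 encoded feature; both the hash choice and the 64-bit width are assumptions made for illustration.

```python
import hashlib

def feature_fingerprint(feature: str) -> int:
    """Map a document feature to a 64-bit integer fingerprint (assumed scheme)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    # Keep the first 8 bytes of the digest as an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")

features = ["Some title", "This is a much longer sentence here"]
fingerprints = [feature_fingerprint(f) for f in features]
print(fingerprints)
```

Identical features always yield identical fingerprints, which is what lets the fingerprint serve as the key K in the fingerprint mapping database.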

303. Look up, according to the at least one feature fingerprint, whether the fingerprint mapping database contains a document cluster similar to the web document under test.

This step specifically comprises: looking up the record to which each of the at least one feature fingerprint belongs in the fingerprint mapping database; if, among the records found, the number of records containing the same document cluster is not less than a specified number, that document cluster is the document cluster similar to the web document under test, it is designated as the target document cluster, the web document under test is a near-duplicate document of the target document cluster, and step 307 is then performed. Otherwise, step 304 is performed.

Specifically, the fingerprint mapping database may store (K, V) pairs that map feature fingerprints to document cluster information. K (key) is a feature fingerprint; V (value) is the information of the document cluster to which the feature fingerprint belongs, which may include the cluster identifier, a weight (the weight with which the feature fingerprint belongs to the document cluster), the creation time of the document cluster, and so on. A document cluster is a set of documents made up of multiple near-duplicate documents.

Given the N feature fingerprints obtained from the web document under test, the record corresponding to each of them, that is, the record to which each feature fingerprint belongs in the mapping database, can be looked up in the fingerprint mapping database. Not every feature fingerprint necessarily has a record in the fingerprint mapping database, so the number of records found is M ≤ N. When X of the M records (1 ≤ X ≤ M) contain the same document cluster information, the web document under test is determined to be similar to that document cluster. Here X is the specified number; the larger the value of X, the higher the accuracy of near-duplicate document detection, as analyzed below.

Suppose the accuracy at X = 1 (a single feature fingerprint) is $P(f)$, so that the error is $1 - P(f)$. The error with X feature fingerprints is then $(1 - P(f))^X$, and the accuracy is $1 - (1 - P(f))^X$. Therefore, in this step, if the accuracy of a single feature fingerprint (X = 1) is 90%, the accuracy at X = 2 is 99%.
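The estimate can be written out as a short worked calculation. It rests on an assumption that the source leaves implicit: the X matching fingerprints err independently, so their error probabilities multiply. The X = 3 row is added here only to show the trend.

```latex
\begin{aligned}
\mathrm{error}(X)    &= \bigl(1 - P(f)\bigr)^{X}, \\
\mathrm{accuracy}(X) &= 1 - \bigl(1 - P(f)\bigr)^{X}, \\
P(f) = 0.9,\ X = 2:\quad \mathrm{accuracy} &= 1 - 0.1^{2} = 0.99, \\
P(f) = 0.9,\ X = 3:\quad \mathrm{accuracy} &= 1 - 0.1^{3} = 0.999.
\end{aligned}
```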

As described in Embodiment 1, the web document under test may also be matched to a document cluster by designating the document cluster in the record with the largest weight among the M records as the similar document cluster, or by any other manner known to those skilled in the art for determining the document cluster in the fingerprint mapping database that is similar to the web document under test.

304. Compare the web document under test for similarity with the web documents from within the specified number of days.

In this step, restricting the comparison to web documents from within the specified number of days improves the efficiency of document comparison. This is particularly suitable for news web documents, which are time-sensitive: 99% of their near-duplicate documents are web documents from the most recent 3 to 5 days.

In some cases the number of web documents within the specified number of days is itself very large, and the computation needed for document similarity comparison is correspondingly heavy, which reduces the efficiency of near-duplicate document detection. To further reduce the computation, the following steps may be added before this step:

Step 1. Segment the title of the web document under test into words to obtain at least one keyword.

Step 2. Using the at least one keyword, look up the term-document inverted list built from the web documents from within the specified number of days and their segmented titles, and take the documents whose hit count is not less than a predetermined number as the documents to be compared.

Specifically, the inverted list is built so that the keywords in a title can be used to find the corresponding documents. Take the news item "Red Cross Society of China denies soliciting donations through notices issued by civil affairs departments": the words obtained by segmenting the title include "Red Cross", "donation", and so on. If the document consisting of this title and its body is denoted A, the inverted list built is:

Red Cross --> A

donation --> A

Looking up the documents whose hit count is not less than the predetermined number in the term-document inverted list, and taking them as the documents to be compared, can be illustrated by the following example.

For example, suppose the title of a web document under test is segmented into three words, A, B, and C, and looking these words up in the term-document inverted list built from the web documents from within the specified number of days and their segmented titles gives:

A finds three documents: e, f, g;

B finds three documents: d, e, g;

C finds two documents: e, h.

The hit count of each document found is then:

d: 1 (hit by B only);

e: 3 (hit by A, B, and C);

f: 1 (hit by A only);

g: 2 (hit by A and B);

h: 1 (hit by C only).

If the predetermined number is 2, the documents whose hit count is not less than the predetermined number are e and g, which become the documents to be compared. The step of comparing the web document under test with the web documents from within the specified number of days then becomes: comparing the web document under test with the documents to be compared. The number of documents that must be compared with the web document under test thus drops from 5 (d, e, f, g, h) to 2 (e, g).
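The candidate filtering in this example can be sketched as below. The dictionary-of-sets inverted list and the function names are assumptions for illustration; the data reproduces the A, B, C example from the text.

```python
from collections import Counter, defaultdict

def build_inverted_index(title_keywords):
    """Build keyword -> set of document ids from each document's segmented title."""
    index = defaultdict(set)
    for doc_id, keywords in title_keywords.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return index

def candidate_documents(query_keywords, index, min_hits):
    """Return documents hit by at least `min_hits` of the query keywords."""
    hits = Counter()
    for keyword in set(query_keywords):
        for doc_id in index.get(keyword, ()):
            hits[doc_id] += 1
    return {doc_id for doc_id, count in hits.items() if count >= min_hits}

# Segmented titles of the recent documents d..h, matching the example above.
recent_titles = {
    "d": ["B"],
    "e": ["A", "B", "C"],
    "f": ["A"],
    "g": ["A", "B"],
    "h": ["C"],
}
index = build_inverted_index(recent_titles)
print(sorted(candidate_documents(["A", "B", "C"], index, min_hits=2)))  # ['e', 'g']
```

Only the returned candidates are then passed to the more expensive similarity comparison.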

305. Determine whether a web document from within the specified number of days whose similarity value with the web document under test is greater than the similarity threshold is contained in a document cluster of the document mapping database.

Specifically, the document mapping database may store (K, V) pairs that map a document to the information of the document cluster it belongs to. K (key) is the document identifier; V (value) is the content of the document and the information of the document cluster to which the document belongs, which may include the cluster identifier.

If a web document from within the specified number of days whose similarity value with the web document under test is greater than the similarity threshold is contained in a document cluster of the document mapping database, that is, if a record containing that web document can be found in the document mapping database, then the document cluster in that record is similar to the web document under test. The document cluster in that record is designated as the target document cluster, and the web document under test is a near-duplicate document of the target document cluster. Step 307 is performed next.

If the web documents from within the specified number of days contain no web document whose similarity value with the web document under test is greater than the similarity threshold (that is, none of them is similar to the web document under test), or if such a web document exists but is not contained in any document cluster of the document mapping database (that is, the similar web document has no corresponding record in the document mapping database), then the web document under test is not similar to any document cluster in the fingerprint mapping database or the document mapping database and is not a near-duplicate document. Step 306 is therefore performed.
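The decision made in steps 304 and 305 can be sketched as follows. The dictionaries standing in for the recent documents and the document mapping database, the field name `cluster_id`, and the toy word-overlap similarity are all assumptions; any similarity measure, such as LCS or cosine, could be passed in instead.

```python
def word_overlap(a: str, b: str) -> float:
    """Toy similarity (Jaccard overlap of word sets), used only for this sketch."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def find_target_cluster(doc_text, recent_docs, document_db, threshold,
                        similarity=word_overlap):
    """Return the cluster of a sufficiently similar recent document, else None.

    recent_docs: doc id -> text of documents within the specified number of days.
    document_db: doc id -> {"content": ..., "cluster_id": ...} (hypothetical layout).
    """
    for doc_id, text in recent_docs.items():
        if similarity(doc_text, text) > threshold:
            record = document_db.get(doc_id)
            if record is not None:        # the similar document belongs to a cluster
                return record["cluster_id"]
    return None                           # caller creates a new cluster (step 306)

recent = {"d1": "a b c d", "d2": "x y z"}
doc_db = {"d1": {"content": "a b c d", "cluster_id": "c7"}}
print(find_target_cluster("a b c d e", recent, doc_db, threshold=0.7))
```

A `None` result covers both branches described above: no recent document is similar enough, or the similar document has no record in the document mapping database.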

306. Generate a new document cluster according to the web document under test and the at least one feature fingerprint.

Specifically, the cluster identifier of the new document cluster may be produced by a specified election algorithm. The cluster identifier may be the feature fingerprint of the title of the web document under test, or any feature fingerprint other than that of the title.

307. Update the databases.

The databases comprise the fingerprint mapping database and the document mapping database. The database update is triggered in the following three cases:

Case 1: after step 303, if a document cluster similar to the web document under test is found in the fingerprint mapping database according to the at least one feature fingerprint, that document cluster is designated as the target document cluster, the web document under test is a near-duplicate document of the target document cluster, and step 307 is then performed;

Case 2: after step 305, if a web document from within the specified number of days whose similarity value with the web document under test is greater than the similarity threshold is contained in a document cluster of the document mapping database, that document cluster is designated as the target document cluster, the web document under test is a near-duplicate document of the target document cluster, and step 307 is then performed;

Case 3: after step 306, step 307 is performed.

For Case 1 and Case 2, the database update specifically comprises: updating the fingerprint mapping database and the document mapping database according to the web document under test, the at least one feature fingerprint, and the target document cluster.

As described above, the fingerprint mapping database may store (K, V) pairs that map feature fingerprints to document cluster information. K (key) is a feature fingerprint; V (value) is the information of the document cluster to which the feature fingerprint belongs, which may include the cluster identifier, a weight (the weight with which the feature fingerprint belongs to the document cluster), the creation time of the document cluster, and so on.

The update operation for the fingerprint mapping database may include: if one of the at least one feature fingerprint is contained in the fingerprint mapping database and is contained in a record under the target document cluster, increasing the feature weight in the record under the target document cluster; if one of the at least one feature fingerprint is contained in the fingerprint mapping database but is not contained in a record under the target document cluster, decreasing the feature weight in the record under the target document cluster; if one of the at least one feature fingerprint is not contained in the fingerprint mapping database, generating a new record that contains that feature fingerprint and the target document cluster, the weight in the new record being an initial value.

Specifically, if a feature fingerprint has not appeared in the fingerprint mapping database, a new record is generated in the fingerprint mapping database. The record contains the feature fingerprint, the target document cluster, a weight and so on, and the weight is set to an initial value a0, for example a0 = 0.

If a feature fingerprint has appeared in the fingerprint mapping database but the document cluster in its record is not the target document cluster, the weight is decreased, that is, a designated value is subtracted from the current weight value, for example 1.

If a feature fingerprint has appeared in the fingerprint mapping database and the document cluster in its record is the target document cluster, the weight is increased, that is, a designated value is added to the current weight value, for example 1.
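The three update rules can be sketched as follows. The sketch assumes that each fingerprint maps to a single record and that a fingerprint whose stored cluster differs from the target document cluster is the one whose weight is decreased. The function name, the dictionary layout and the constants `INITIAL_WEIGHT` and `STEP` are illustrative choices, using the example values a0 = 0 and a step of 1.

```python
INITIAL_WEIGHT = 0  # a0 in the text (example value)
STEP = 1            # designated value added or subtracted (example value)


def update_fingerprint_db(db, fingerprints, target_cluster):
    """db maps fingerprint -> {"cluster": cluster id, "weight": int}."""
    for fp in fingerprints:
        record = db.get(fp)
        if record is None:
            # Fingerprint never seen: create a record with the initial weight.
            db[fp] = {"cluster": target_cluster, "weight": INITIAL_WEIGHT}
        elif record["cluster"] == target_cluster:
            # Seen under the target cluster: increase the weight.
            record["weight"] += STEP
        else:
            # Seen under a different cluster: decrease the weight.
            record["weight"] -= STEP
    return db


db = {
    "fp1": {"cluster": "A", "weight": 2},
    "fp2": {"cluster": "B", "weight": 0},
}
update_fingerprint_db(db, ["fp1", "fp2", "fp3"], "A")
print(db)
```

In this run `fp1` is reinforced, `fp2` (stored under another cluster) loses weight and becomes negative, and `fp3` is created with the initial weight.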

After the weight in a record of the fingerprint mapping database has been decreased, the weight may become negative. A step can therefore be added: if the feature weight value in a record under the target document cluster is less than a weight threshold, the record under the target document cluster is deleted from the feature mapping database.

To prevent a record of the fingerprint mapping database from occupying storage space while not being accessed for a long time, storage space can be released as follows: if the interval between the current time and the time at which the weight-decrease or weight-increase operation was performed on the feature weight in a record under the target document cluster exceeds a time threshold, the record under the target document cluster is deleted from the feature mapping database.
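A minimal sketch of the two deletion conditions, under the assumption that each record also stores the time of its most recent weight change in a hypothetical `last_update` field (seconds). The threshold values in the example are arbitrary.

```python
def prune_records(db, now, weight_threshold, time_threshold):
    """Delete records whose weight is too low or that have not been updated recently.

    db maps fingerprint -> {"weight": int, "last_update": float}.
    Returns the list of deleted fingerprints.
    """
    stale = [
        fp
        for fp, rec in db.items()
        if rec["weight"] < weight_threshold
        or now - rec["last_update"] > time_threshold
    ]
    for fp in stale:
        del db[fp]
    return stale


records = {
    "fp1": {"weight": 3, "last_update": 1000.0},
    "fp2": {"weight": -1, "last_update": 1000.0},  # weight below threshold
    "fp3": {"weight": 1, "last_update": 0.0},      # not updated for too long
}
removed = prune_records(records, now=1100.0, weight_threshold=0, time_threshold=500.0)
print(removed, sorted(records))
```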

Of course, the update of the fingerprint mapping database may also only add new (K, V) pairs that do not exist in the original database, without updating weights. If the content of the fingerprint mapping database is not limited to the above (K, V) pairs, weights, creation times and other information, the corresponding update operation differs accordingly. The update operation mainly serves to allow the approximate duplicate document detection process to be carried out continuously.

The update operation for the document mapping database may include: adding to the document mapping database a record that contains the web document to be detected and the target document cluster.

Of course, if the content of the document mapping database is not limited to (K, V) pairs that map a document to the information on the document cluster to which it belongs, the corresponding update operation differs accordingly. The update operation mainly serves to make the web document to be detected available as a reference for the similarity judgment of other web documents to be detected, and to allow the approximate duplicate document detection process to be carried out continuously.

For case 3, because the web document to be detected is not similar to any existing document cluster in the database, the new document cluster needs to be added to the database by the update step, so that the web document to be detected can serve as a reference for the similarity judgment of other web documents to be detected.

When the word-document inverted list is searched, as described above, to reduce the number of documents compared for similarity with the web document to be detected, the method may further include, after the database update step is completed: saving the word-document inverted list constructed from the at least one keyword and the web document to be detected. The at least one keyword is obtained by performing word segmentation on the title of the web document to be detected. Through this step, the inverted list of the web document to be detected can serve as a reference for the similarity judgment of other web documents to be detected.
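Saving the inverted list can be pictured as adding the document identifier under each title keyword. In the sketch below, whitespace splitting stands in for a real word segmenter (an assumption; Chinese titles would need a proper segmentation tool), and the function name, document identifiers and titles are hypothetical.

```python
from collections import defaultdict


def add_to_inverted_list(inverted, doc_id, title):
    """Record doc_id under every keyword obtained from the title."""
    for keyword in set(title.lower().split()):  # stand-in for word segmentation
        inverted[keyword].add(doc_id)


inverted = defaultdict(set)  # keyword -> set of document identifiers
add_to_inverted_list(inverted, "doc1", "Tencent releases new browser")
add_to_inverted_list(inverted, "doc2", "New browser released today")

print(sorted(inverted["browser"]))
```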

In the document detection method provided by this embodiment, multi-feature fingerprint query is combined with document similarity comparison. A document cluster similar to the web document to be detected can be queried in the existing feature mapping database according to a plurality of feature fingerprints obtained from the web document to be detected, so as to preliminarily detect whether the web document to be detected is an approximate duplicate document. Because the plurality of feature fingerprints can accurately reflect the features that distinguish the web document to be detected from other web documents, and qualifying records can be queried quickly according to the correspondence between feature fingerprints and approximate duplicate documents in the existing database, the accuracy and efficiency of approximate duplicate document detection can be improved. If no document cluster similar to the web document to be detected is found by the preliminary detection, a conventional document similarity detection method is further applied. This prevents the situation in which a web document to be detected does belong to the approximate duplicate documents but cannot be detected because the database is incomplete, thereby improving the recall ratio of approximate duplicate document detection.

Embodiment 3

This embodiment provides a document detection system. As shown in Figure 4, the system comprises: a feature extraction device 41, configured to extract characters from a web document to be detected to obtain at least one document feature, and to perform a hash calculation on each document feature to obtain a corresponding feature fingerprint; and a document comparison device 43, configured to, when no document cluster similar to the web document to be detected is found in the fingerprint mapping database according to each feature fingerprint, compare the web document to be detected for similarity with web documents within a specified number of days, and, when it is determined that, among the web documents within the specified number of days, a web document whose similarity value with the web document to be detected is greater than a similarity threshold is contained in a document cluster of the document mapping database, take that document cluster of the document mapping database as the target document cluster and take the web document to be detected as an approximate duplicate document of the target document cluster.

The above document detection system may further comprise: a cluster voting device 44, configured to generate a new document cluster according to the web document to be detected and the at least one feature fingerprint when the web documents within the specified number of days do not include a web document whose similarity value with the web document to be detected is greater than the similarity threshold, or when, among the web documents within the specified number of days, the web document whose similarity value with the web document to be detected is greater than the similarity threshold is not contained in a document cluster of the document mapping database.

The above document comparison device 43 may also be configured to: when a document cluster similar to the web document to be detected is found in the fingerprint mapping database according to the at least one feature fingerprint, take the document cluster in the fingerprint mapping database that is similar to the web document to be detected as the target document cluster, the web document to be detected being an approximate duplicate document of the target document cluster.

The above document detection system may further comprise: a cluster information updating device 45, configured to update the fingerprint mapping database and the document mapping database according to the web document to be detected, the at least one feature fingerprint and the target document cluster.

The above cluster information updating device 45 may also be configured to generate, in the fingerprint mapping database and the document mapping database, a new record that contains the new document cluster.

The above feature extraction device 41 may comprise: a paragraph division unit 411, configured to divide the text of the web document to be detected into at least one paragraph according to paragraph identifiers; a paragraph selection unit 412, configured to select, from the at least one paragraph, the N paragraphs that contain the most characters; a sentence division unit 413, configured to divide each of the N paragraphs into at least one sentence according to punctuation marks; and a feature selection unit 414, configured to select, as the document features, the sentence that contains the most characters in each of the paragraphs and the title of the web document to be detected.
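The work of units 411 to 414, followed by the hash calculation, can be sketched roughly as below. Several details are assumptions made only for illustration: newlines are treated as the paragraph identifiers, a small set of sentence-ending punctuation marks is used for sentence division, and SHA-256 stands in for the unspecified hash function. The function names and the sample text are hypothetical.

```python
import hashlib
import re


def extract_features(title, text, n=3):
    """Return the title plus the longest sentence of each of the n longest paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    longest_paragraphs = sorted(paragraphs, key=len, reverse=True)[:n]
    features = [title]
    for paragraph in longest_paragraphs:
        sentences = [s.strip() for s in re.split(r"[.!?。！？]", paragraph) if s.strip()]
        if sentences:
            features.append(max(sentences, key=len))
    return features


def fingerprint(feature):
    """Hash one document feature into a feature fingerprint (SHA-256 is an assumption)."""
    return hashlib.sha256(feature.encode("utf-8")).hexdigest()


sample_text = (
    "Short one.\n"
    "This is the first sentence. A much longer second sentence appears here!\n"
    "Tiny."
)
features = extract_features("My Title", sample_text, n=2)
fingerprints = [fingerprint(f) for f in features]
print(features)
```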

The above feature extraction device 41 may further comprise a document denoising unit 415, configured to denoise the web document to be detected before the text of the web document to be detected is divided into at least one paragraph according to paragraph identifiers. The feature extraction device may further comprise a feature denoising unit 416, configured to denoise the at least one document feature.

The above system may further comprise a cluster selection device 42, configured to search the fingerprint mapping database for a document cluster similar to the web document to be detected according to the at least one feature fingerprint. The cluster selection device 42 may specifically comprise: a record search unit 421, configured to search the fingerprint mapping database for the record to which each of the at least one feature fingerprint belongs; and a first determination unit 422, configured to, when the number of records among the at least one record that contain the same document cluster is not less than a specified number, take that same document cluster as the document cluster similar to the web document to be detected.
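The record search and the count check can be sketched as a simple tally of clusters over the records that were found. The function name, the dictionary layout and the tie-breaking behaviour (the most frequent cluster is taken) are assumptions for illustration only.

```python
from collections import Counter


def find_similar_cluster(db, fingerprints, min_records):
    """Return the cluster shared by at least min_records of the found records, else None.

    db maps fingerprint -> {"cluster": cluster id}.
    """
    counts = Counter(db[fp]["cluster"] for fp in fingerprints if fp in db)
    if not counts:
        return None
    cluster, hits = counts.most_common(1)[0]
    return cluster if hits >= min_records else None


mapping = {
    "fp1": {"cluster": "A"},
    "fp2": {"cluster": "A"},
    "fp3": {"cluster": "B"},
}
print(find_similar_cluster(mapping, ["fp1", "fp2", "fp3", "fp4"], min_records=2))
```

When no cluster reaches the specified number of records, the function returns `None`, which corresponds to falling back to the document similarity comparison.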

The similarity algorithm used by the above document comparison device 43 to compare the web document to be detected with the web documents within the specified number of days may be the longest common string (LCS) method or the cosine similarity comparison (COS) method.
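As an illustration of the cosine option only (the LCS option is not shown), the sketch below compares two texts by the cosine of their term-count vectors and checks the result against a threshold. Whitespace tokenization and the threshold value 0.8 are assumptions, not values taken from the text.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[term] * vb[term] for term in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


SIMILARITY_THRESHOLD = 0.8  # hypothetical value

score = cosine_similarity("new browser released today", "new browser released now")
print(score > SIMILARITY_THRESHOLD)
```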

The above document detection system may further comprise: a to-be-compared document determination device 46, comprising: a word segmentation unit 461, configured to perform word segmentation on the title of the web document to be detected to obtain at least one keyword; and a second determination unit 462, configured to use the at least one keyword to search a saved word-document inverted list, constructed from the web documents within the specified number of days and the segmented words of their titles, for documents whose hit count is not less than a predetermined number, as the documents to be compared. The document comparison device 43 is also configured to compare the web document to be detected for similarity with the documents to be compared.
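The hit-count filter of the second determination unit 462 can be sketched as follows. Whitespace splitting again stands in for word segmentation, and the function name, the inverted-list contents and the hit thresholds are hypothetical.

```python
from collections import Counter


def documents_to_compare(inverted, title, min_hits):
    """Return documents hit by at least min_hits of the title keywords."""
    hits = Counter()
    for keyword in set(title.lower().split()):  # stand-in for word segmentation
        for doc_id in inverted.get(keyword, ()):
            hits[doc_id] += 1
    return sorted(doc_id for doc_id, count in hits.items() if count >= min_hits)


saved_inverted_list = {
    "tencent": {"doc1"},
    "new": {"doc1", "doc2"},
    "browser": {"doc1", "doc2"},
    "today": {"doc2"},
    "weather": {"doc3"},
}
print(documents_to_compare(saved_inverted_list, "Tencent new browser", min_hits=2))
```

Only the documents returned by this filter would then be passed to the similarity comparison, which is how the inverted list reduces the number of comparisons.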

The above document detection system may further comprise: an inverted list updating device 47, configured to save the word-document inverted list constructed from the at least one keyword and the web document to be detected.

The above cluster information updating device 45 may comprise: a weight decrease unit 451, configured to decrease the feature weight in the record under the target document cluster when one of the at least one feature fingerprint is contained in the fingerprint mapping database but is not contained in a record under the target document cluster; a weight increase unit 452, configured to increase the feature weight in the record under the target document cluster when one of the at least one feature fingerprint is contained in the fingerprint mapping database and is contained in a record under the target document cluster; and a first record generation unit 453, configured to generate, when one of the at least one feature fingerprint is not contained in the fingerprint mapping database, a new record that contains that feature fingerprint and the target document cluster, the weight in the new record being an initial value.

The above cluster information updating device 45 may further comprise: a record deletion unit 454, configured to delete the record under the target document cluster from the feature mapping database when the feature weight value in the record under the target document cluster is less than a weight threshold, or when the interval between the current time and the time at which the weight-decrease or weight-increase operation was performed on the feature weight in the record under the target document cluster exceeds a time threshold.

The above cluster information updating device 45 may further comprise: a second record generation unit 455, configured to add to the document mapping database a record that contains the web document to be detected and the target document cluster.

The methods performed by the above devices and units are described in detail in Embodiment 1 and Embodiment 2 and are not repeated here.

In the document detection system provided by this embodiment, multi-feature fingerprint query is combined with document similarity comparison. A document cluster similar to the web document to be detected can be queried in the existing feature mapping database according to a plurality of feature fingerprints obtained from the web document to be detected, so as to preliminarily detect whether the web document to be detected is an approximate duplicate document. Because the plurality of feature fingerprints can accurately reflect the features that distinguish the web document to be detected from other web documents, and qualifying records can be queried quickly according to the correspondence between feature fingerprints and approximate duplicate documents in the existing database, the accuracy and efficiency of approximate duplicate document detection can be improved. If no document cluster similar to the web document to be detected is found by the preliminary detection, a conventional document similarity detection method is further applied. This prevents the situation in which a web document to be detected does belong to the approximate duplicate documents but cannot be detected because the database is incomplete, thereby improving the recall ratio of approximate duplicate document detection.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course also by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that is essential, or that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, hard disk or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.

The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A document detection method, characterized by comprising:

Extracting characters from a web document to be detected to obtain at least one document feature;

Performing a hash calculation on each document feature to obtain a corresponding feature fingerprint;

If no document cluster similar to the web document to be detected is found in a fingerprint mapping database according to each feature fingerprint, comparing the web document to be detected for similarity with web documents within a specified number of days;

If, among the web documents within the specified number of days, a web document whose similarity value with the web document to be detected is greater than a similarity threshold is contained in a document cluster of a document mapping database, the document cluster of the document mapping database is a target document cluster, and the web document to be detected is an approximate duplicate document of the target document cluster.

2. The method according to claim 1, characterized by further comprising:

If the web documents within the specified number of days do not include a web document whose similarity value with the web document to be detected is greater than the similarity threshold, or if, among the web documents within the specified number of days, the web document whose similarity value with the web document to be detected is greater than the similarity threshold is not contained in a document cluster of the document mapping database, generating a new document cluster according to the web document to be detected and the at least one feature fingerprint.

3. The method according to claim 1, characterized by further comprising:

If a document cluster similar to the web document to be detected is found in the fingerprint mapping database according to the at least one feature fingerprint, the document cluster in the fingerprint mapping database that is similar to the web document to be detected is the target document cluster, and the web document to be detected is an approximate duplicate document of the target document cluster.

4. The method according to claim 1 or 3, characterized by further comprising:

Updating the fingerprint mapping database and the document mapping database according to the web document to be detected, the at least one feature fingerprint and the target document cluster.

5. The method according to claim 2, characterized by further comprising:

Generating, in the fingerprint mapping database and the document mapping database, a new record that contains the new document cluster.

6. The method according to any one of claims 1 to 3, characterized in that extracting characters from the web document to be detected to obtain at least one document feature comprises:

Dividing the text of the web document to be detected into at least one paragraph according to paragraph identifiers;

Selecting, from the at least one paragraph, the N paragraphs that contain the most characters;

Dividing each of the N paragraphs into at least one sentence according to punctuation marks;

Selecting, as the document features, the sentence that contains the most characters in each of the paragraphs and the title of the web document to be detected.

7. The method according to claim 6, characterized in that, before dividing the text of the web document to be detected into at least one paragraph according to paragraph identifiers, the method further comprises:

Denoising the web document to be detected.

8. The method according to claim 6, characterized by further comprising:

Denoising the at least one document feature.

9. The method according to claim 1, characterized in that, after obtaining the corresponding feature fingerprint, the method further comprises searching, according to the at least one feature fingerprint, whether a document cluster similar to the web document to be detected exists in the fingerprint mapping database, which specifically comprises:

Searching the fingerprint mapping database for the record to which each of the at least one feature fingerprint belongs;

If, among the at least one record, the number of records that contain the same document cluster is not less than a specified number, the same document cluster is the document cluster similar to the web document to be detected.

10. The method according to claim 1, characterized in that, before comparing the web document to be detected for similarity with the web documents within the specified number of days, the method further comprises:

Performing word segmentation on the title of the web document to be detected to obtain at least one keyword;

Using the at least one keyword to search a saved word-document inverted list, constructed from the web documents within the specified number of days and the segmented words of their titles, for documents whose hit count is not less than a predetermined number, as documents to be compared;

Comparing the web document to be detected for similarity with the web documents within the specified number of days comprises: comparing the web document to be detected for similarity with the documents to be compared.

11. The method according to claim 10, characterized by further comprising: saving the word-document inverted list constructed from the at least one keyword and the web document to be detected.

12. The method according to claim 4, characterized in that updating the fingerprint mapping database according to the web document to be detected, the at least one feature fingerprint and the target document cluster comprises:

If one of the at least one feature fingerprint is contained in the fingerprint mapping database and is contained in a record under the target document cluster, increasing the feature weight in the record under the target document cluster;

If one of the at least one feature fingerprint is contained in the fingerprint mapping database but is not contained in a record under the target document cluster, decreasing the feature weight in the record under the target document cluster;

If one of the at least one feature fingerprint is not contained in the fingerprint mapping database, generating a new record that contains that feature fingerprint and the target document cluster, the weight in the new record being an initial value.

13. The method according to claim 12, characterized by further comprising:

If the feature weight value in the record under the target document cluster is less than a weight threshold, or if the interval between the current time and the time at which the weight-decrease or weight-increase operation was performed on the feature weight in the record under the target document cluster exceeds a time threshold, deleting the record under the target document cluster from the feature mapping database.

14. The method according to claim 4, characterized in that updating the document mapping database according to the web document to be detected, the at least one feature fingerprint and the target document cluster comprises:

Adding to the document mapping database a record that contains the web document to be detected and the target document cluster.

15. A document detection system, characterized by comprising:

A feature extraction device, configured to extract characters from a web document to be detected to obtain at least one document feature, and to perform a hash calculation on each document feature to obtain a corresponding feature fingerprint;

A document comparison device, configured to, when no document cluster similar to the web document to be detected is found in a fingerprint mapping database according to each feature fingerprint, compare the web document to be detected for similarity with web documents within a specified number of days, and, when it is determined that, among the web documents within the specified number of days, a web document whose similarity value with the web document to be detected is greater than a similarity threshold is contained in a document cluster of a document mapping database, take the document cluster of the document mapping database as a target document cluster and take the web document to be detected as an approximate duplicate document of the target document cluster.

16. The system according to claim 15, characterized by further comprising:

A cluster voting device, configured to generate a new document cluster according to the web document to be detected and the at least one feature fingerprint when the web documents within the specified number of days do not include a web document whose similarity value with the web document to be detected is greater than the similarity threshold, or when, among the web documents within the specified number of days, the web document whose similarity value with the web document to be detected is greater than the similarity threshold is not contained in a document cluster of the document mapping database.

17. The system according to claim 15, characterized in that the document comparison device is also configured to: when a document cluster similar to the web document to be detected is found in the fingerprint mapping database according to the at least one feature fingerprint, take the document cluster in the fingerprint mapping database that is similar to the web document to be detected as the target document cluster, the web document to be detected being an approximate duplicate document of the target document cluster.

18. The system according to any one of claims 15 to 17, characterized by further comprising a cluster information updating device, configured to update the fingerprint mapping database and the document mapping database according to the web document to be detected, the at least one feature fingerprint and the target document cluster.

19. The system according to claim 18, characterized in that the cluster information updating device is also configured to generate, in the fingerprint mapping database and the document mapping database, a new record that contains the new document cluster.

20. The system according to any one of claims 15 to 17, characterized in that the feature extraction device comprises:

A paragraph division unit, configured to divide the text of the web document to be detected into at least one paragraph according to paragraph identifiers;

A paragraph selection unit, configured to select, from the at least one paragraph, the N paragraphs that contain the most characters;

A sentence division unit, configured to divide each of the N paragraphs into at least one sentence according to punctuation marks;

A feature selection unit, configured to select, as the document features, the sentence that contains the most characters in each of the paragraphs and the title of the web document to be detected.

21. The system according to claim 15, characterized by further comprising: a cluster selection device, configured to search the fingerprint mapping database for a document cluster similar to the web document to be detected according to the at least one feature fingerprint, the cluster selection device specifically comprising:

A record search unit, configured to search the fingerprint mapping database for the record to which each of the at least one feature fingerprint belongs;

A first determination unit, configured to, when the number of records among the at least one record that contain the same document cluster is not less than a specified number, take the same document cluster as the document cluster similar to the web document to be detected.

22. The system according to claim 18, characterized in that the cluster information updating device comprises:

A weight decrease unit, configured to decrease the feature weight in the record under the target document cluster when one of the at least one feature fingerprint is contained in the fingerprint mapping database but is not contained in a record under the target document cluster;

A weight increase unit, configured to increase the feature weight in the record under the target document cluster when one of the at least one feature fingerprint is contained in the fingerprint mapping database and is contained in a record under the target document cluster;

A first record generation unit, configured to generate, when one of the at least one feature fingerprint is not contained in the fingerprint mapping database, a new record that contains that feature fingerprint and the target document cluster, the weight in the new record being an initial value.

23. The system according to claim 18, characterized in that the cluster information updating device further comprises:

A second record generation unit, configured to add to the document mapping database a record that contains the web document to be detected and the target document cluster.