CN102779188B

CN102779188B - Duplicated text removal system and method

Info

Publication number: CN102779188B
Application number: CN201210227111.7A
Authority: CN
Inventors: 卢宏林
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Hongxiang Technical Service Co Ltd
Priority date: 2012-06-29
Filing date: 2012-06-29
Publication date: 2015-11-25
Anticipated expiration: 2032-06-29
Also published as: CN102779188A

Abstract

The invention provides a text deduplication system comprising: a segmentation module, adapted to divide a target text and a text to be compared into segments according to separator symbols, and to group the segments of the target text and of the text to be compared into sequences in the same manner; a hash value calculation module, adapted to select a target sequence in the target text and to calculate the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared; and a deduplication module, adapted to compare the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared and, if an identical hash value is found, to perform a deduplication operation.

Description

Duplicated text removal system and method

Technical field

The present invention relates to text deduplication in search engines, and in particular to a text deduplication system and method.

Background art

Text deduplication is one of the basic requirements of a search engine. Commonly used deduplication methods at present include keyword-based deduplication and deduplication by hashing the full text or part of the text.

The most common approach at present is keyword-based deduplication: the keywords of a text are extracted and then used for deduplication. Even if part of the text has been modified, it is still identified as duplicate content as long as the keywords have not changed.

However, keyword-based deduplication has obvious shortcomings. It easily judges unrelated articles to be duplicates: for articles in certain specialized categories, keyword-based deduplication often has a high error rate. Take news about the same team in a sports competition: because the team name, the names of the coach and players, the club, the home city, the style of play and similar content are relatively fixed, the extracted keywords tend to be similar however different the matches are, even across one or several years. Keyword-based deduplication is therefore likely to judge articles about two entirely unrelated matches to be duplicates.

In addition, the above method easily judges identical articles to be non-duplicates. For example, when different websites reprint the same article, some websites put the whole article on one page, while others cut it into several parts and give each part its own page. When keywords are extracted, the pages of the two websites differ in length, so the extracted keywords also differ, and articles that are in fact identical are judged to be non-duplicate pages.

The above technical solution is also language-dependent: extracting keywords first requires word segmentation, and word segmentation is language-specific. Chinese and English are segmented differently, so a keyword-based deduplication method that works for Chinese cannot be used on foreign-language text.

Summary of the invention

The technical problem to be solved by the present invention is the low deduplication accuracy and poor fault tolerance of the prior art.

A text deduplication system, the system comprising:

a segmentation module, adapted to divide a target text and a text to be compared into segments according to separator symbols, and to group the segments of the target text and of the text to be compared into sequences in the same manner;

a hash value calculation module, adapted to select a target sequence in the target text and to calculate the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared;

a deduplication module, adapted to compare the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared and, if an identical hash value is found, to perform a deduplication operation.

Wherein the separator symbols comprise symbols in the ASCII code and/or Chinese full-width and half-width punctuation marks.

Wherein there are one or more separator symbols.

Wherein the deduplication module further comprises:

a text marking unit, adapted to mark the duplicate text with a duplicate flag; and/or

a text deletion unit, adapted to delete the duplicate text.

Wherein each sequence consists of one segment or of two or more consecutive segments.

Wherein, if hash values are calculated for only some of the sequences of the text to be compared, the position selected for the target sequence in the target text corresponds to the position of those sequences in the text to be compared.

Wherein the segmentation module further comprises a segment threshold setting module, adapted to obtain from statistics the number of segments that make up a sequence.

Wherein the deduplication module further comprises a multi-target sequence comparison unit, adapted to compare, when there is more than one target sequence, the hash value of each target sequence in turn with the sequences in the text to be compared.

A text deduplication method, the method comprising:

dividing a target text and a text to be compared into segments according to separator symbols, and grouping the segments of the target text and of the text to be compared into sequences in the same manner;

selecting a target sequence in the target text, and calculating the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared;

comparing the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared and, if an identical hash value is found, performing a deduplication operation.

Wherein there are one or more separator symbols.

Wherein performing the deduplication operation further comprises:

marking the duplicate text with a duplicate flag; and/or

deleting the duplicate text.

Wherein the number of segments that make up a sequence is obtained from statistics.

Wherein comparing the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared further comprises: when there is more than one target sequence, comparing the hash value of each target sequence in turn with the sequences in the text to be compared.

The deduplication method provided by the present invention, which divides text into sentence sequences at separator symbols, ensures deduplication accuracy and solves the problem of keyword-based deduplication judging different articles to be duplicates, while also ensuring relatively high fault tolerance. Moreover, splitting at symbols depends only on the symbols selected and not on the specific language, so the method is applicable to different languages. After this scheme was deployed in an information system, both the accuracy and the recall of deduplication improved noticeably.

Brief description of the drawings

Fig. 1 is a structural diagram of the text deduplication system according to one embodiment of the present invention;

Fig. 2 is a structural diagram of the deduplication module according to one embodiment of the present invention;

Fig. 3 is a structural diagram of the multi-target comparison unit according to one embodiment of the present invention;

Fig. 4 is a flowchart of the text deduplication method according to one embodiment of the present invention;

Fig. 5 is a flowchart of the deduplication operation according to one embodiment of the present invention;

Fig. 6 is a flowchart of the multi-target comparison process according to one embodiment of the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention and are not intended to limit its scope.

Fig. 1 shows a text deduplication system according to one embodiment of the present invention. The system comprises a segmentation module 100, a hash value calculation module 102, and a deduplication module 104.

Segmentation module 100 is adapted to divide the target text and the text to be compared into segments according to separator symbols, and to group the segments of the target text and of the text to be compared into sequences in the same manner.

For example, the punctuation marks to split on are selected first, such as the ASCII comma and period, as well as the Chinese comma and the Chinese full stop. The content of the article is then divided at these punctuation marks into a sequence of sentences.
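The splitting step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_segments` and the particular separator set are assumptions, since the text only requires that one or more ASCII or Chinese punctuation marks be selected.

```python
import re

# Assumed separator set: ASCII comma and period plus their Chinese counterparts.
SEPARATORS = ",.，。"

def split_segments(text, separators=SEPARATORS):
    """Cut the text at every separator symbol and drop empty pieces."""
    pattern = "[" + re.escape(separators) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]

segments = split_segments("Sentence one, sentence two. 第三句，第四句。")
```

Because the split depends only on the chosen symbols, the same function handles English and Chinese text without any word segmentation.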

After being split at punctuation marks, the sentence sequence of the first article is: "sentence 1 of article 1, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, sentence 12 of article 1".

When reprinting the first article, the second article modified the first and last sentences of the original. After being split at punctuation marks by segmentation module 100, its sentence sequence is: "modified sentence 1 of article 2, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, modified sentence 12 of article 2".

Segmentation module 100 further comprises a segment threshold setting module, adapted to obtain from statistics the number of segments that make up a sequence. The number of segments can also be set from practical experience. In practice, when two articles share 5 consecutive identical sentences, both the accuracy and the recall of deduplication are very high. The five consecutive sentences mentioned here are a count that has proved relatively reliable for news deduplication; it can be adjusted as appropriate when processing other types of articles.

Thus, every 5 consecutive sentences form one complete sequence. The sequences of the first article are:

"sentence 1 of article 1, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1",

"sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1",

"sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1",

"sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1",

……

"sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, sentence 12 of article 1".

The sequences of the second article are:

"modified sentence 1 of article 2, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1",

……

"sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, modified sentence 12 of article 2".

Hash value calculation module 102 is adapted to select a target sequence in the target text and to calculate the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared.

Therefore, once the above sequences have been formed, hash value calculation module 102 selects the target sequence in the first article and calculates hash values for the sequences of both articles.
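The sequence construction and hashing for the two example articles can be sketched as below. The names `build_sequences` and `sequence_hash` are hypothetical, and SHA-256 is an assumption: the text does not say which hash function is used.

```python
import hashlib

def build_sequences(segments, window=5):
    """Slide a window of `window` consecutive segments over the segment list."""
    return [tuple(segments[i:i + window])
            for i in range(len(segments) - window + 1)]

def sequence_hash(sequence):
    """Hash one sequence; the newline keeps segment boundaries distinct."""
    return hashlib.sha256("\n".join(sequence).encode("utf-8")).hexdigest()

# Article 1 has 12 sentences; article 2 reprints it with the first and last changed.
article_1 = [f"sentence {i} of article 1" for i in range(1, 13)]
article_2 = ["modified sentence 1"] + article_1[1:11] + ["modified sentence 12"]

hashes_1 = [sequence_hash(s) for s in build_sequences(article_1)]
hashes_2 = [sequence_hash(s) for s in build_sequences(article_2)]

# 1-based positions of the sequences whose hashes agree in both articles.
shared = [i + 1 for i, (a, b) in enumerate(zip(hashes_1, hashes_2)) if a == b]
```

Twelve sentences with a window of five give eight sequences per article. Only the first and last sequences contain a modified sentence, so the six in between hash identically.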

Because content in the middle of an article is less likely to be altered, a sentence sequence from the middle is chosen as the target sequence against which subsequent articles are checked. For example, if the first article is not found to duplicate any other article, its 4th sentence sequence is selected as the target sequence of that article. That is, the hash value of the sequence "sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1" is saved as the target hash value and used to judge whether subsequent articles are duplicates.
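Choosing the middle sequence can be sketched as follows. The index formula `(n - 1) // 2` is an assumption chosen so that eight sequences yield the 4th one, as in the example; the function names and the use of SHA-256 are likewise hypothetical.

```python
import hashlib

def sequence_hash(sequence):
    """Hash one sequence (SHA-256 is an assumption, not stated in the text)."""
    return hashlib.sha256("\n".join(sequence).encode("utf-8")).hexdigest()

def middle_target(sequences):
    """Return the 1-based position and hash of the middle sequence.

    Assumes `sequences` is non-empty.
    """
    index = (len(sequences) - 1) // 2
    return index + 1, sequence_hash(sequences[index])

sentences = [f"sentence {i} of article 1" for i in range(1, 13)]
sequences = [tuple(sentences[i:i + 5]) for i in range(8)]
position, target_hash = middle_target(sequences)
```

Only `target_hash` needs to be stored for the article; later articles are checked against it without keeping the original text.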

To improve fault tolerance, several sentence sequences can also be selected as target sequences, for example one each from the middle, a relatively forward position, and a relatively rearward position of the article.
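One way to pick a forward, a middle, and a rearward target is sketched below. The quarter-point positions are an assumption made for illustration; the text only says that the positions are relatively forward, middle, and relatively rearward. The function names and SHA-256 are also hypothetical.

```python
import hashlib

def sequence_hash(sequence):
    """Hash one sequence (SHA-256 is an assumption)."""
    return hashlib.sha256("\n".join(sequence).encode("utf-8")).hexdigest()

def pick_target_indices(count):
    """Indices roughly 1/4, 1/2 and 3/4 of the way through `count` sequences.

    Duplicates collapse for short texts, so fewer than three may be returned.
    """
    if count <= 0:
        return []
    return sorted({(count - 1) * k // 4 for k in (1, 2, 3)})

def target_hashes(sequences):
    """Hash the sequences at the chosen forward, middle and rearward positions."""
    return [sequence_hash(sequences[i]) for i in pick_target_indices(len(sequences))]
```

For the eight sequences of the example article this selects the 2nd, 4th, and 6th sequences, so an edit confined to one region still leaves at least one target intact.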

For long texts, calculating hash values for every sentence sequence is computationally expensive. In that case only a certain number of sentence sequences, taken from the front or from the end, need be calculated. This keeps deduplication accuracy high while reducing the computational load.
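Limiting the work for long texts can be sketched as a simple slice. The function name `limit_sequences` and its parameters are hypothetical.

```python
def limit_sequences(sequences, limit, from_end=False):
    """Keep at most `limit` sequences, taken from the front or from the end."""
    if limit <= 0:
        return []
    return sequences[-limit:] if from_end else sequences[:limit]

front = limit_sequences(list(range(10)), 3)
back = limit_sequences(list(range(10)), 3, from_end=True)
```

As stated in the summary, when only some sequences of the text to be compared are hashed, the target sequence should be selected from the corresponding position in the target text; otherwise the stored target can never be matched.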

Deduplication module 104 is adapted to compare the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared and, if an identical hash value is found, to perform a deduplication operation.

Thus, when the second article is compared, its first three hash values are not found to duplicate any other article, but its 4th hash value is found to duplicate the first article.

Text marking unit 200 then marks the duplicate text with a duplicate flag; and/or text deletion unit 202 deletes the duplicate text.
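The comparison and the two deduplication actions can be sketched as follows. The dictionary-based store, the `duplicate_of` field, and all function names are assumptions made for illustration, as is the use of SHA-256.

```python
import hashlib

def sequence_hash(sequence):
    """Hash one sequence (SHA-256 is an assumption)."""
    return hashlib.sha256("\n".join(sequence).encode("utf-8")).hexdigest()

def find_match(candidate_sequences, target_hashes):
    """Compare candidate hashes in order against the stored target hashes.

    `target_hashes` maps a target hash to the id of the article it came from.
    Returns (1-based position, owner id) for the first match, else None.
    """
    for position, sequence in enumerate(candidate_sequences, start=1):
        owner = target_hashes.get(sequence_hash(sequence))
        if owner is not None:
            return position, owner
    return None

def deduplicate(store, article_id, owner, delete=False):
    """Either flag the article as a duplicate or delete it from the store."""
    if delete:
        store.pop(article_id, None)
    else:
        store[article_id]["duplicate_of"] = owner
```

With the example articles, the stored target is the 4th sequence of the first article, so the scan over the second article misses three times and matches on the 4th sequence.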

When multiple target sequences have been selected, multi-target sequence comparison unit 300 compares the hash value of each target sequence in turn with the hash values of the sequences in the text to be compared.
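A multi-target comparison can be sketched as below, under the assumption that a single matching target is enough to treat the text as a duplicate. The function names and SHA-256 are hypothetical.

```python
import hashlib

def sequence_hash(sequence):
    """Hash one sequence (SHA-256 is an assumption)."""
    return hashlib.sha256("\n".join(sequence).encode("utf-8")).hexdigest()

def first_matching_target(target_hashes, candidate_sequences):
    """Check each target hash in turn against the candidate's sequence hashes.

    Returns the index of the first target found in the candidate, else None.
    """
    candidate_hashes = {sequence_hash(s) for s in candidate_sequences}
    for index, target in enumerate(target_hashes):
        if target in candidate_hashes:
            return index
    return None
```

If a reprint edits a sentence in the middle of the article, the forward and middle targets may both be broken while the rearward target still matches, which is the fault-tolerance benefit of storing several targets.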

If hash values are calculated for only some of the sequences of the text to be compared, those sequences are preferably extracted from the middle of the text to be compared, or from its beginning, end, and middle respectively.

Fig. 4 shows a text deduplication method according to one embodiment of the present invention.

First, in step S400, the target text and the text to be compared are divided into segments according to separator symbols, and the segments of the target text and of the text to be compared are grouped into sequences in the same manner.

The number of segments that make up a sequence is obtained from statistics. The number of segments can also be set from practical experience. In practice, when two articles share 5 consecutive identical sentences, both the accuracy and the recall of deduplication are very high. The five consecutive sentences mentioned here are a count that has proved relatively reliable for news deduplication; it can be adjusted as appropriate when processing other types of articles.

……

The sequences of the second article are:

……

Next, in step S402, a target sequence is selected in the target text, and the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared are calculated.

Once the above sequences have been formed, the target sequence is selected in the first article, and hash values are calculated for the sequences of both articles.

Finally, in step S404, the hash value of the target sequence is compared in turn with the hash values of the sequences in the text to be compared; if an identical hash value is found, the deduplication operation is performed.

The deduplication operation consists of step S500, marking the duplicate text with a duplicate flag, and/or step S502, deleting the duplicate text; see Fig. 5.

When multiple target sequences have been selected, step S600 is performed: the hash value of each target sequence is compared in turn with the hash values of the sequences in the text to be compared; see Fig. 6.
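Steps S400 to S404 can be put together in a short end-to-end sketch. Everything named here is hypothetical: the separator set, the window of five, the middle-sequence rule, and SHA-256 are assumptions chosen to mirror the worked example, not details confirmed by the text.

```python
import hashlib
import re

SEPARATORS = ",.，。"  # assumed separator set
WINDOW = 5             # five consecutive segments, the value reported for news

def split_segments(text):
    """S400, part 1: cut the text at separator symbols, dropping empty pieces."""
    pieces = re.split("[" + re.escape(SEPARATORS) + "]", text)
    return [p.strip() for p in pieces if p.strip()]

def build_sequences(segments, window=WINDOW):
    """S400, part 2: group every `window` consecutive segments into a sequence."""
    return [tuple(segments[i:i + window])
            for i in range(len(segments) - window + 1)]

def sequence_hash(sequence):
    """S402: hash one sequence (SHA-256 is an assumption)."""
    return hashlib.sha256("\n".join(sequence).encode("utf-8")).hexdigest()

def is_duplicate(target_text, candidate_text):
    """S402 and S404: hash the middle target sequence, then scan the candidate."""
    target_sequences = build_sequences(split_segments(target_text))
    if not target_sequences:
        return False  # target too short to form even one sequence
    middle = target_sequences[(len(target_sequences) - 1) // 2]
    target_hash = sequence_hash(middle)
    candidate_sequences = build_sequences(split_segments(candidate_text))
    return any(sequence_hash(s) == target_hash for s in candidate_sequences)

sentences = [f"Sentence {i} of article one" for i in range(1, 13)]
original = ". ".join(sentences) + "."
reprint = ". ".join(["Edited opening"] + sentences[1:11] + ["Edited closing"]) + "."
unrelated = ". ".join(f"Other sentence {i}" for i in range(1, 13)) + "."
```

Here `is_duplicate(original, reprint)` is true even though the reprint changed its first and last sentences, while `is_duplicate(original, unrelated)` is false. A caller would then mark or delete the duplicate, as in steps S500 and S502.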

It should be noted that the components of the controller of the present invention have been divided logically according to the functions they are to perform. The present invention is not limited to this division, however: the components can be re-divided or combined as required. For example, several components can be combined into a single component, or a component can be further decomposed into more sub-components.

The component embodiments of the present invention may be implemented in hardware, as software modules running on one or more processors, or as a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the controller according to embodiments of the present invention. The present invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the method described herein. Such programs implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not denote any order; these words may be interpreted as names.

Claims

1. A text deduplication system, characterized in that the system comprises:

a hash value calculation module, adapted to select a plurality of consecutive segments at each of a middle position, a relatively forward position, and/or a relatively rearward position of the target text as selected target sequences, to calculate the hash values of the target sequences and the hash values of all or some of the sequences in the text to be compared, and to save the hash values of the target sequences as target hash values;

a deduplication module, adapted to compare the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared and, if an identical hash value is found, to perform a deduplication operation;

2. The system according to claim 1, characterized in that the separator symbols comprise symbols in the ASCII code and/or Chinese full-width and half-width punctuation marks.

3. The system according to claim 1, characterized in that there are one or more separator symbols.

4. The system according to claim 1, characterized in that the deduplication module further comprises:

a text marking unit, adapted to mark the duplicate text with a duplicate flag; and/or

5. The system according to claim 1, characterized in that each sequence consists of one segment or of two or more consecutive segments.

6. The system according to claim 1, characterized in that the segmentation module further comprises a segment threshold setting module, adapted to obtain from statistics the number of segments that make up a sequence.

7. The system according to claim 1 or 6, characterized in that the deduplication module further comprises a multi-target sequence comparison unit, adapted to compare, when there is more than one target sequence, the hash value of each target sequence in turn with the sequences in the text to be compared.

8. A text deduplication method, characterized in that the method comprises:

selecting a plurality of consecutive segments at each of a middle position, a relatively forward position, and/or a relatively rearward position of the target text as selected target sequences, calculating the hash values of the target sequences and the hash values of all or some of the sequences in the text to be compared, and saving the hash values of the target sequences as target hash values;

comparing the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared and, if an identical hash value is found, performing a deduplication operation;

9. The method according to claim 8, characterized in that the separator symbols comprise symbols in the ASCII code and/or Chinese full-width and half-width punctuation marks.

10. The method according to claim 8, characterized in that there are one or more separator symbols.

11. The method according to claim 8, characterized in that performing the deduplication operation further comprises:

marking the duplicate text with a duplicate flag; and/or

deleting the duplicate text.

12. The method according to claim 8, characterized in that each sequence consists of one segment or of two or more consecutive segments.

13. The method according to claim 8, characterized in that the number of segments that make up a sequence is obtained from statistics.

14. The method according to claim 8 or 13, characterized in that comparing the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared further comprises: when there is more than one target sequence, comparing the hash value of each target sequence in turn with the sequences in the text to be compared.