Summary of the invention
The technical problem to be solved by the present invention is that prior-art deduplication has low accuracy and poor fault tolerance.
A duplicate text removal system, the system comprising:
A segmentation module, adapted to divide a target text and a text to be compared into segments according to segmentation symbols, and to form the segments of the target text and of the text to be compared into sequences in the same manner;
A hash value calculation module, adapted to select a target sequence in the target text, and to calculate the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared;
A deduplication module, adapted to compare the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared and, if an identical hash value exists, to perform a deduplication operation.
Wherein the segmentation symbols comprise symbols in the ASCII code and/or Chinese full-width and half-width punctuation marks.
Wherein there are one or more segmentation symbols.
Wherein the deduplication module further comprises:
A text marking unit, adapted to mark duplicate text with a duplicate flag; and/or
A text deletion unit, adapted to delete duplicate text.
Wherein each sequence consists of one segment or of two or more consecutive segments.
Wherein, if hash values are calculated for only some of the sequences in the text to be compared, the position selected for the target sequence in the target text corresponds to the position of those sequences in the text to be compared.
Wherein the segmentation module further comprises a segment threshold setting module, adapted to obtain from statistics the number of segments that make up a sequence.
Wherein the deduplication module further comprises a multi-target-sequence comparison unit, adapted to compare, when the number of target sequences is greater than 1, the hash value of each target sequence in turn with the sequences in the text to be compared.
A duplicate text removal method, the method comprising:
Dividing a target text and a text to be compared into segments according to segmentation symbols, and forming the segments of the target text and of the text to be compared into sequences in the same manner;
Selecting a target sequence in the target text, and calculating the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared;
Comparing the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared and, if an identical hash value exists, performing a deduplication operation.
Wherein the segmentation symbols comprise symbols in the ASCII code and/or Chinese full-width and half-width punctuation marks.
Wherein there are one or more segmentation symbols.
Wherein performing the deduplication operation further comprises:
Marking duplicate text with a duplicate flag; and/or
Deleting duplicate text.
Wherein each sequence consists of one segment or of two or more consecutive segments.
Wherein, if hash values are calculated for only some of the sequences in the text to be compared, the position selected for the target sequence in the target text corresponds to the position of those sequences in the text to be compared.
Wherein the number of segments that make up a sequence is obtained from statistics.
Wherein comparing the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared further comprises: when the number of target sequences is greater than 1, comparing the hash value of each target sequence in turn with the sequences in the text to be compared.
The deduplication method provided by the present invention, which divides text into statement sequences at selected symbols, can ensure deduplication accuracy, solving the problem that keyword-based deduplication judges different articles to be duplicates, while also ensuring relatively high fault tolerance. In addition, splitting by symbols depends only on the selected symbols and not on any specific language, so the method is applicable to different languages. After this scheme was adopted in an information system, deduplication accuracy and recall improved markedly.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present invention but do not limit its scope.
Fig. 1 shows a duplicate text removal system according to one embodiment of the present invention; the system comprises a segmentation module 100, a hash value calculation module 102 and a deduplication module 104.
The segmentation module 100 is adapted to divide a target text and a text to be compared into segments according to segmentation symbols, and to form the segments of the target text and of the text to be compared into sequences in the same manner.
For example, the punctuation marks to be used for splitting are selected first, such as the ASCII comma and period, as well as the Chinese comma and the Chinese full stop. The content of the article is then divided at these punctuation marks into a group of statements.
The statement sequence of the first article after splitting by punctuation is: "sentence 1 of article 1, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, sentence 12 of article 1".
When reprinting the first article, the second article modified the 1st and the last sentence of the original. Its statement sequence after splitting by punctuation in the segmentation module 100 is: "modified sentence 1 of article 2, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, modified sentence 12 of article 2".
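The splitting step described above can be sketched as follows. This is a minimal illustration rather than the implementation of the segmentation module 100: the function name `split_segments` and the particular delimiter set (ASCII comma and period, Chinese comma and full stop) are assumptions chosen to mirror the example, and empty segments are simply dropped.

```python
import re

# Assumed delimiter set: ASCII comma and period, Chinese comma and full stop.
DELIMITERS = ",.，。"


def split_segments(text, delimiters=DELIMITERS):
    """Split text at any delimiter character and drop empty segments."""
    pattern = "[" + re.escape(delimiters) + "]"
    parts = re.split(pattern, text)
    return [part.strip() for part in parts if part.strip()]


english = split_segments("first statement, second statement. third statement")
mixed = split_segments("第一句，第二句。third statement, fourth statement.")
```

Because the split depends only on the chosen symbols, the same function handles Chinese and English text without any language-specific processing.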
The segmentation module 100 further comprises a segment threshold setting module, adapted to obtain from statistics the number of segments that make up a sequence. The number of segments can also be set from human experience. Practice has shown that when 5 consecutive identical statements appear in two articles, both the accuracy and the recall of deduplication are very high. The five consecutive statements mentioned here are a statement count that has proved relatively reliable for news deduplication; it can be adjusted as appropriate when processing other types of articles.
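Forming sequences from a fixed number of consecutive statements, advancing one statement at a time, can be sketched as below. The function name `build_sequences` is hypothetical, and the handling of texts shorter than one sequence (returning a single short sequence) is an assumption, since that case is not described here.

```python
def build_sequences(segments, size=5):
    """Return every run of `size` consecutive segments, stepping by one segment."""
    if len(segments) < size:
        # Assumption: a text shorter than one window yields one short sequence.
        return [tuple(segments)] if segments else []
    return [tuple(segments[i:i + size]) for i in range(len(segments) - size + 1)]


# The two example articles: article 2 changes only the first and last statements.
article_1 = [f"sentence {n} of article 1" for n in range(1, 13)]
article_2 = list(article_1)
article_2[0] = "modified sentence 1 of article 2"
article_2[-1] = "modified sentence 12 of article 2"

seqs_1 = build_sequences(article_1)
seqs_2 = build_sequences(article_2)

# 1-based positions at which both articles have the same sequence.
shared = [i + 1 for i, (a, b) in enumerate(zip(seqs_1, seqs_2)) if a == b]
```

With 12 statements and a sequence length of 5, each article yields 8 sequences, and only the first and the last differ between the two articles.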
In this way, every 5 consecutive statements form one complete sequence. The sequences of article 1 are:
"sentence 1 of article 1, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1",
"sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1",
"sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1",
"sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1",
……
"sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, sentence 12 of article 1".
The sequences of article 2 are:
"modified sentence 1 of article 2, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1",
"sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1",
"sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1",
"sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1",
……
"sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, modified sentence 12 of article 2".
The hash value calculation module 102 is adapted to select a target sequence in the target text, and to calculate the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared.
Accordingly, once the above sequences have been formed, the hash value calculation module 102 selects a target sequence in article 1, and hash values are calculated for the sequences of both articles.
Because the middle of an article is less likely to be modified, a statement sequence from the middle is chosen as the target sequence against which subsequent articles are checked. For example, if the first article is not found to duplicate any other article, its 4th statement sequence is selected as the target sequence of that article; that is, the hash value of the statement sequence "sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1" is saved as the target hash value and used to judge whether subsequent articles are duplicates.
Meanwhile, to improve fault tolerance, several statement sequences can be selected as target sequences, for example one each from the middle of the article, from a relatively early position and from a relatively late position.
For long texts, calculating hash values for all statement sequences is computationally expensive. In that case only a certain number of statement sequences, taken from the front or the end, may be selected for calculation. This maintains relatively high deduplication accuracy while reducing the computational load.
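Selecting the middle sequence as the target and storing its hash value can be sketched as follows. The hash function is not specified here, so SHA-256 from the Python standard library is used purely as an assumption; the names `sequence_hash` and `middle_index`, and the use of a separator character between statements, are likewise illustrative.

```python
import hashlib


def sequence_hash(sequence):
    """Hash one sequence of statements. SHA-256 is an assumed choice of hash."""
    # A separator that does not occur in ordinary text keeps boundaries distinct.
    joined = "\x1f".join(sequence)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def middle_index(count):
    """0-based index of the middle sequence; for 8 sequences this is the 4th."""
    return (count - 1) // 2


statements = [f"sentence {n} of article 1" for n in range(1, 13)]
sequences = [tuple(statements[i:i + 5]) for i in range(len(statements) - 4)]

target = sequences[middle_index(len(sequences))]
target_hash = sequence_hash(target)  # saved and compared against later articles
```

For the 8 sequences of article 1, `middle_index` selects the 4th sequence, which runs from sentence 4 to sentence 8, matching the example above.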
The deduplication module 104 is adapted to compare the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared and, if an identical hash value exists, to perform a deduplication operation.
Thus, when the second article is compared, its first three hash values are not found to duplicate other articles, but when its 4th hash value is compared, a duplicate of the first article is found.
The text marking unit 200 then marks the duplicate text with a duplicate flag; and/or the text deletion unit 202 deletes the duplicate text.
Meanwhile, when multiple target sequences are selected, the multi-target-sequence comparison unit 300 compares the hash value of each target sequence in turn with the hash values of the sequences in the text to be compared.
If hash values are calculated for only some of the sequences extracted from the text to be compared, those sequences are preferably extracted from the beginning, the middle and the end of the text to be compared.
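The comparison and the subsequent marking or deletion can be sketched as below, using the two example articles. This is an illustration under stated assumptions: SHA-256 stands in for the unspecified hash function, and `first_match`, `handle_duplicate`, `windows` and the `duplicate` field are hypothetical names that do not correspond to units 200, 202 or 300.

```python
import hashlib


def sequence_hash(sequence):
    """Hash a tuple of statements; SHA-256 is an assumption."""
    return hashlib.sha256("\x1f".join(sequence).encode("utf-8")).hexdigest()


def windows(statements, size=5):
    """All runs of `size` consecutive statements."""
    return [tuple(statements[i:i + size]) for i in range(len(statements) - size + 1)]


def first_match(target_hashes, sequences):
    """1-based position of the first sequence whose hash is a stored target hash."""
    for position, sequence in enumerate(sequences, start=1):
        if sequence_hash(sequence) in target_hashes:
            return position
    return None


def handle_duplicate(article, delete=False):
    """Return a copy flagged as duplicate, or None when the article is deleted."""
    if delete:
        return None
    marked = dict(article)
    marked["duplicate"] = True
    return marked


article_1 = [f"sentence {n} of article 1" for n in range(1, 13)]
article_2 = (["modified sentence 1 of article 2"] + article_1[1:11]
             + ["modified sentence 12 of article 2"])

# Target: the 4th sequence of article 1, as in the example above.
target_hashes = {sequence_hash(windows(article_1)[3])}
position = first_match(target_hashes, windows(article_2))

record = {"id": 2, "statements": article_2}
result = handle_duplicate(record) if position is not None else record
```

Scanning article 2 in order, the first three sequences do not match and the 4th does, so `position` is 4 and the record is flagged. Passing a set of several target hashes covers the case where multiple target sequences are selected.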
Fig. 4 shows a duplicate text removal method according to one embodiment of the present invention.
First, in step S400, a target text and a text to be compared are divided into segments according to segmentation symbols, and the segments of the target text and of the text to be compared are formed into sequences in the same manner.
For example, the punctuation marks to be used for splitting are selected first, such as the ASCII comma and period, as well as the Chinese comma and the Chinese full stop. The content of the article is then divided at these punctuation marks into a group of statements.
The statement sequence of the first article after splitting by punctuation is: "sentence 1 of article 1, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, sentence 12 of article 1".
When reprinting the first article, the second article modified the 1st and the last sentence of the original. Its statement sequence after splitting by punctuation in the segmentation module 100 is: "modified sentence 1 of article 2, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, modified sentence 12 of article 2".
The number of segments that make up a sequence is obtained from statistics. The number of segments can also be set from human experience. Practice has shown that when 5 consecutive identical statements appear in two articles, both the accuracy and the recall of deduplication are very high. The five consecutive statements mentioned here are a statement count that has proved relatively reliable for news deduplication; it can be adjusted as appropriate when processing other types of articles.
In this way, every 5 consecutive statements form one complete sequence. The sequences of article 1 are:
"sentence 1 of article 1, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1",
"sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1",
"sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1",
"sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1",
……
"sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, sentence 12 of article 1".
The sequences of article 2 are:
"modified sentence 1 of article 2, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1",
"sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1",
"sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1",
"sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1",
……
"sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, modified sentence 12 of article 2".
Next, in step S402, a target sequence is selected in the target text, and the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared are calculated.
Once the above sequences have been formed, a target sequence is selected in article 1, and hash values are calculated for the sequences of both articles.
Because the middle of an article is less likely to be modified, a statement sequence from the middle is chosen as the target sequence against which subsequent articles are checked. For example, if the first article is not found to duplicate any other article, its 4th statement sequence is selected as the target sequence of that article; that is, the hash value of the statement sequence "sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1" is saved as the target hash value and used to judge whether subsequent articles are duplicates.
Meanwhile, to improve fault tolerance, several statement sequences can be selected as target sequences, for example one each from the middle of the article, from a relatively early position and from a relatively late position.
For long texts, calculating hash values for all statement sequences is computationally expensive. In that case only a certain number of statement sequences, taken from the front or the end, may be selected for calculation. This maintains relatively high deduplication accuracy while reducing the computational load.
Finally, in step S404, the hash value of the target sequence is compared in turn with the hash values of the sequences in the text to be compared and, if an identical hash value exists, a deduplication operation is performed.
Thus, when the second article is compared, its first three hash values are not found to duplicate other articles, but when its 4th hash value is compared, a duplicate of the first article is found.
The deduplication operation is then performed: in S500 the duplicate text is marked with a duplicate flag; and/or in S502 the duplicate text is deleted; see Fig. 5.
Meanwhile, when multiple target sequences are selected, step S600 is performed, in which the hash value of each target sequence is compared in turn with the hash values of the sequences in the text to be compared; see Fig. 6.
If hash values are calculated for only some of the sequences extracted from the text to be compared, those sequences are preferably extracted from the beginning, the middle and the end of the text to be compared.
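Steps S400 to S404 can be combined into one pass over a stream of articles, as sketched below. Several details are assumptions made only for illustration: SHA-256 as the hash function, the delimiter set, the function names, the rule for choosing three target positions (roughly one quarter, one half and three quarters of the way through the sequences), and the choice to compare all sequences of each incoming article rather than a subset.

```python
import hashlib
import re

SPLIT_PATTERN = re.compile("[,.，。]")  # assumed delimiter set


def sequences_of(text, size=5):
    """S400: split at delimiters, then form runs of `size` consecutive statements."""
    statements = [s.strip() for s in SPLIT_PATTERN.split(text) if s.strip()]
    if len(statements) <= size:
        return [tuple(statements)] if statements else []
    return [tuple(statements[i:i + size]) for i in range(len(statements) - size + 1)]


def hash_of(sequence):
    """S402: hash one sequence; SHA-256 is an assumed choice."""
    return hashlib.sha256("\x1f".join(sequence).encode("utf-8")).hexdigest()


def dedupe_stream(texts, size=5):
    """S404: for each text, return the index of the earlier text it duplicates, or None."""
    stored = {}   # target hash -> index of the article it came from
    results = []
    for index, text in enumerate(texts):
        hashes = [hash_of(seq) for seq in sequences_of(text, size)]
        match = next((stored[h] for h in hashes if h in stored), None)
        results.append(match)
        if match is None and hashes:
            count = len(hashes)
            # Assumed positions: relatively early, middle, relatively late.
            for pos in {count // 4, (count - 1) // 2, (3 * count) // 4}:
                stored[hashes[pos]] = index
    return results


original = [f"sentence {n} of article 1" for n in range(1, 13)]
reprint = (["modified sentence 1 of article 2"] + original[1:11]
           + ["modified sentence 12 of article 2"])
unrelated = [f"unrelated sentence {n}" for n in range(1, 13)]

texts = [", ".join(statements) + "." for statements in (original, reprint, unrelated)]
results = dedupe_stream(texts)
```

Here the reprint is reported as a duplicate of the first article despite its modified first and last statements, while the unrelated article is kept and contributes its own target hashes.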
It should be noted that, in the components of the controller of the present invention, the components are divided logically according to the functions they implement; however, the present invention is not limited to this, and the components may be re-divided or combined as required. For example, some components may be combined into a single component, or some components may be further decomposed into more sub-components.
The component embodiments of the present invention may be implemented in hardware, or as software modules running on one or more processors, or as a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the controller according to embodiments of the present invention. The present invention may also be implemented as a device or system program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several systems, several of these systems may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.