CN102779188B - Duplicated text removal system and method - Google Patents


Publication number
CN102779188B
Authority
CN
China
Prior art keywords
text
sequence
section
cryptographic hash
article
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210227111.7A
Other languages
Chinese (zh)
Other versions
CN102779188A (en)
Inventor
卢宏林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hongxiang Technical Service Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201210227111.7A
Publication of CN102779188A
Application granted
Publication of CN102779188B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a duplicated text removal system. The system comprises: a segmentation module, adapted to divide a target text and a text to be compared into segments according to delimiter symbols, and to compose the segments of the target text and of the text to be compared into sequences in the same manner; a hash value calculation module, adapted to select a target sequence in the target text and to calculate the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared; and a deduplication module, adapted to compare the hash value of the target sequence with the hash values of the sequences in the text to be compared one by one and, if an identical hash value exists, to perform a deduplication operation.

Description

Duplicated text removal system and method
Technical field
The present invention relates to duplicated text removal (text deduplication) in search engines, and in particular to a duplicated text removal system and method.
Background art
Text deduplication is one of the basic requirements of a search engine. Commonly used deduplication methods include deduplication based on central words (keywords) and deduplication by computing a hash over the full text or part of the text.
The most common approach at present is deduplication based on central words. The central words of a text are extracted and then used for deduplication. Even if part of the content of such a text has been modified, it can still be identified as duplicate content as long as the central words have not changed.
However, deduplication based on central words has obvious shortcomings. It easily judges unrelated articles to be duplicates: for articles in certain specialized categories, central-word deduplication often has a high error rate. Take news about the same team in a sports competition: because the team name, the names of the coach and players, the club, the home city, and the competition are relatively fixed, the extracted central words are often quite similar, however different the team's matches are within a year or even across several years. Central-word deduplication is then likely to judge articles about two entirely unrelated matches to be duplicates.
In addition, the above method easily judges identical articles to be non-duplicates. For example, when different websites reprint the same article, some put the whole article on one page, while others cut it into several parts and give each part its own page. When central words are extracted, the page lengths on the two websites differ, so the extracted central words may differ as well, and articles that are in fact identical are judged to be non-duplicate pages.
The above technical solution is also language dependent: extracting central words first requires word segmentation, and word segmentation is language specific. The segmentation method for Chinese differs from that for English. Therefore, a central-word deduplication method suited to Chinese cannot be used on foreign-language text.
Summary of the invention
The technical problem to be solved by the present invention is the low deduplication accuracy and poor fault tolerance of the prior art.
A duplicated text removal system, the system comprising:
A segmentation module, adapted to divide a target text and a text to be compared into segments according to delimiter symbols, and to compose the segments of the target text and of the text to be compared into sequences in the same manner;
A hash value calculation module, adapted to select a target sequence in the target text, and to calculate the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared;
A deduplication module, adapted to compare the hash value of the target sequence with the hash values of the sequences in the text to be compared one by one and, if an identical hash value exists, to perform a deduplication operation.
Wherein the delimiter symbols comprise symbols in the ASCII code and/or Chinese full-width and half-width punctuation marks.
Wherein there are one or more delimiter symbols.
Wherein the deduplication module further comprises:
A text marking unit, adapted to mark a duplicated text with a duplicate flag; and/or
A text deletion unit, adapted to delete a duplicated text.
Wherein a sequence consists of one segment, or of two or more consecutive segments.
Wherein, if hash values are calculated for only some of the sequences of the text to be compared, the position at which the target sequence is selected in the target text corresponds to the positions of those sequences in the text to be compared.
Wherein the segmentation module further comprises a segment threshold setting module, adapted to obtain, from statistics, the number of segments that make up a sequence.
Wherein the deduplication module further comprises a multi-target-sequence comparison unit, adapted, when the number of target sequences is greater than 1, to compare the hash value of each target sequence in turn with the sequences in the text to be compared.
A duplicated text removal method, the method comprising:
Dividing a target text and a text to be compared into segments according to delimiter symbols, and composing the segments of the target text and of the text to be compared into sequences in the same manner;
Selecting a target sequence in the target text, and calculating the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared;
Comparing the hash value of the target sequence with the hash values of the sequences in the text to be compared one by one and, if an identical hash value exists, performing a deduplication operation.
Wherein the delimiter symbols comprise symbols in the ASCII code and/or Chinese full-width and half-width punctuation marks.
Wherein there are one or more delimiter symbols.
Wherein performing the deduplication operation further comprises:
Marking a duplicated text with a duplicate flag; and/or
Deleting a duplicated text.
Wherein a sequence consists of one segment, or of two or more consecutive segments.
Wherein, if hash values are calculated for only some of the sequences of the text to be compared, the position at which the target sequence is selected in the target text corresponds to the positions of those sequences in the text to be compared.
Wherein the number of segments that make up a sequence is obtained from statistics.
Wherein comparing the hash value of the target sequence with the hash values of the sequences in the text to be compared one by one further comprises, when the number of target sequences is greater than 1, comparing the hash value of each target sequence in turn with the sequences in the text to be compared.
The deduplication method provided by the present invention, which divides text into sentence sequences at delimiter symbols, ensures deduplication accuracy by avoiding the problem of central-word deduplication judging different articles to be duplicates, and also ensures relatively high fault tolerance. At the same time, splitting at symbols depends only on the chosen symbols and not on any specific language, so the method is applicable to different languages. After this scheme was adopted in an information system, deduplication accuracy and recall improved markedly.
Brief description of the drawings
Fig. 1 is a structure diagram of the duplicated text removal system of one embodiment of the present invention;
Fig. 2 is a structure diagram of the deduplication module of one embodiment of the present invention;
Fig. 3 is a structure diagram of the multi-target comparison unit of one embodiment of the present invention;
Fig. 4 is a flow chart of the duplicated text removal method of one embodiment of the present invention;
Fig. 5 is a flow chart of the deduplication operation of one embodiment of the present invention;
Fig. 6 is a flow chart of the multi-target comparison process of one embodiment of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention and are not intended to limit its scope.
Fig. 1 shows a duplicated text removal system according to one embodiment of the present invention. The system comprises a segmentation module 100, a hash value calculation module 102, and a deduplication module 104.
The segmentation module 100 is adapted to divide a target text and a text to be compared into segments according to delimiter symbols, and to compose the segments of the target text and of the text to be compared into sequences in the same manner.
For example, the punctuation marks to split at are selected first, such as the ASCII comma and period, as well as the Chinese comma and Chinese full stop. The content of the article is then divided at these punctuation marks into a series of sentences.
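The splitting step can be illustrated with a minimal Python sketch. The function name `split_segments`, the `DELIMITERS` constant, and the choice to discard empty pieces are assumptions made for illustration; the source only states that the text is split at the selected punctuation marks.

```python
import re

# Assumed delimiter set: ASCII comma and period plus the Chinese comma and full stop.
DELIMITERS = ",.，。"

def split_segments(text: str, delimiters: str = DELIMITERS) -> list[str]:
    """Split text at any delimiter and drop empty or whitespace-only pieces."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]

segments = split_segments("first clause, second clause. 第三句，第四句。")
print(segments)
```

Because the split depends only on the chosen symbols, the same function handles English and Chinese text without any word segmentation.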
The sentence sequence of the first article after splitting at punctuation is: "sentence 1 of article 1, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, sentence 12 of article 1".
When reprinting the first article, the second article modified the first and last sentences of the original. Its sentence sequence after splitting at punctuation by the segmentation module 100 is: "modified sentence 1 of article 2, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, modified sentence 12 of article 2".
The segmentation module 100 also comprises a segment threshold setting module, adapted to obtain, from statistics, the number of segments that make up a sequence. Alternatively, the number of segments can be set from human experience. In practice, treating articles that share 5 consecutive identical sentences as duplicates was found to give high deduplication precision and recall. The figure of five consecutive sentences mentioned here has proven relatively reliable for deduplicating news, and can be adjusted as appropriate when processing other types of articles.
In this way, every 5 consecutive sentences form one complete sequence. The sequences of article 1 are:
"sentence 1 of article 1, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1",
"sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1",
"sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1",
"sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1",
……
"sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, sentence 12 of article 1".
The sequences of article 2 are:
"modified sentence 1 of article 2, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1",
"sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1",
"sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1",
"sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1",
……
"sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, modified sentence 12 of article 2".
The hash value calculation module 102 is adapted to select a target sequence in the target text, and to calculate the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared.
Accordingly, once the above sequences have been formed, the hash value calculation module 102 selects a target sequence in article 1 and calculates hash values for the sequences of both articles.
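The formation of overlapping five-sentence sequences in the two-article example can be reproduced with a short Python sketch. The function name `build_sequences` and the placeholder sentence labels are hypothetical; returning no sequences for a text shorter than the window is an assumption, since the source does not say how such texts are handled.

```python
def build_sequences(segments, window=5):
    """Return every run of `window` consecutive segments, in document order.

    A text with fewer than `window` segments yields no sequences (an assumption).
    """
    return [tuple(segments[i:i + window]) for i in range(len(segments) - window + 1)]

# Article 1 has 12 sentences; article 2 reprints it with sentences 1 and 12 rewritten.
article_1 = [f"a1-s{i}" for i in range(1, 13)]
article_2 = ["a2-s1-modified"] + article_1[1:11] + ["a2-s12-modified"]

seqs_1 = build_sequences(article_1)
seqs_2 = build_sequences(article_2)

# 1-based positions at which the two articles have identical sequences.
shared = [i + 1 for i, (x, y) in enumerate(zip(seqs_1, seqs_2)) if x == y]
print(len(seqs_1), shared)
```

Twelve sentences with a window of five give eight sequences per article, and only the first and last sequences of the reprint are affected by the edits.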
Because content in the middle of an article is less likely to be modified, a sentence sequence from the middle is chosen as the target sequence against which subsequent articles are compared. For example, if the first article is not found to duplicate any other article, its 4th sentence sequence is selected as the target sequence of that article. That is, the hash value of the sentence sequence "sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1" is saved as the target hash value and used to judge whether subsequent articles are duplicates.
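A minimal sketch of selecting and hashing the middle sequence follows. The source does not name a hash function, so SHA-1 from Python's `hashlib` is an assumed stand-in, and the helper names `sequence_hash` and `pick_middle_target` are hypothetical. The index formula `(n - 1) // 2` is chosen so that it picks the 4th of 8 sequences, as in the example.

```python
import hashlib

def sequence_hash(sequence):
    """Hash one sequence of segments (SHA-1 is an assumed choice of hash)."""
    joined = "\n".join(sequence)  # separator keeps segment boundaries distinct
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

def pick_middle_target(sequences):
    """Return the sequence nearest the middle; for 8 sequences this is the 4th."""
    return sequences[(len(sequences) - 1) // 2]

sentences = [f"a1-s{i}" for i in range(1, 13)]
sequences = [tuple(sentences[i:i + 5]) for i in range(len(sentences) - 4)]

target = pick_middle_target(sequences)
target_hash = sequence_hash(target)  # saved as the target hash value
print(target)
```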
To improve fault tolerance, several sentence sequences can also be selected as target sequences, for example one each from the middle of the article, from a relatively forward position, and from a relatively rearward position.
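One possible way to pick a forward, a middle, and a rearward target sequence is sketched below. The quarter-point positions and the function name `pick_target_indices` are assumptions; the source only says that the positions are relatively forward, middle, and relatively rearward.

```python
def pick_target_indices(count):
    """Indices of sequences near the first quarter, the middle, and the last quarter.

    Duplicates are merged, so short texts may yield fewer than three indices.
    """
    if count == 0:
        return []
    last = count - 1
    return sorted({last // 4, last // 2, (3 * last) // 4})

print(pick_target_indices(8))
```

With several targets, a reprint that rewrites the middle of an article can still be caught by the forward or rearward target.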
For long texts, calculating hash values for every sentence sequence is computationally expensive. In that case, only a certain number of sentence sequences from the front or the end may be selected for calculation. This maintains high deduplication accuracy while reducing the computational load.
The deduplication module 104 is adapted to compare the hash value of the target sequence with the hash values of the sequences in the text to be compared one by one and, if an identical hash value exists, to perform a deduplication operation.
Thus, when the second article is compared, its first three hash values are not found to duplicate any other article, but when its 4th hash value is compared, it is found to duplicate the first article.
The text marking unit 200 then marks the duplicated text with a duplicate flag, and/or the text deletion unit 202 deletes the duplicated text.
When multiple target sequences have been selected, the multi-target-sequence comparison unit 300 compares the hash value of each target sequence in turn with the hash values of the sequences in the text to be compared.
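The comparison step can be sketched as a lookup of each candidate hash in a table of saved target hashes. The table layout (hash mapped to an article label), the helper names, and the use of SHA-1 are assumptions for illustration. A table holding several entries per article covers the multi-target case.

```python
import hashlib

def sequence_hash(sequence):
    """Hash one sequence of segments (SHA-1 is an assumed choice of hash)."""
    return hashlib.sha1("\n".join(sequence).encode("utf-8")).hexdigest()

def windows(sentences, size=5):
    """Every run of `size` consecutive sentences."""
    return [tuple(sentences[i:i + size]) for i in range(len(sentences) - size + 1)]

def find_duplicate(candidate_sequences, target_hashes):
    """Return (1-based position, source label) of the first match, or None."""
    for position, sequence in enumerate(candidate_sequences, start=1):
        source = target_hashes.get(sequence_hash(sequence))
        if source is not None:
            return position, source
    return None

article_1 = [f"a1-s{i}" for i in range(1, 13)]
article_2 = ["a2-s1-modified"] + article_1[1:11] + ["a2-s12-modified"]

# Article 1 contributes its 4th sequence as the saved target hash.
targets = {sequence_hash(windows(article_1)[3]): "article-1"}
match = find_duplicate(windows(article_2), targets)
print(match)
```

As in the example, the first three sequences of article 2 find no match and the 4th matches article 1. Whether the caller then flags or deletes the duplicate is left open here.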
If hash values are calculated for only some of the sequences of the text to be compared, those sequences are preferably extracted from the beginning, the middle, and the end of the text to be compared.
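A rough sketch of sampling sequences from the beginning, middle, and end of a long text follows. The function name `sample_positions`, the default of three sequences per region, and the rule of hashing everything for short texts are all assumptions; the source gives no concrete numbers.

```python
def sample_positions(count, per_region=3):
    """Indices of `per_region` sequences from the start, middle, and end.

    Texts with at most 3 * per_region sequences are hashed in full.
    """
    if count <= 3 * per_region:
        return list(range(count))
    mid_start = (count - per_region) // 2
    return (
        list(range(per_region))
        + list(range(mid_start, mid_start + per_region))
        + list(range(count - per_region, count))
    )

print(sample_positions(100))
```

For this to find duplicates, the target sequences saved from earlier articles would need to come from the corresponding positions, as the summary of the invention requires.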
Fig. 4 shows a duplicated text removal method according to one embodiment of the present invention.
First, in step S400, a target text and a text to be compared are divided into segments according to delimiter symbols, and the segments of the target text and of the text to be compared are composed into sequences in the same manner.
For example, the punctuation marks to split at are selected first, such as the ASCII comma and period, as well as the Chinese comma and Chinese full stop. The content of the article is then divided at these punctuation marks into a series of sentences.
The sentence sequence of the first article after splitting at punctuation is: "sentence 1 of article 1, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, sentence 12 of article 1".
When reprinting the first article, the second article modified the first and last sentences of the original. Its sentence sequence after splitting at punctuation by the segmentation module 100 is: "modified sentence 1 of article 2, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, modified sentence 12 of article 2".
The number of segments that make up a sequence is obtained from statistics. Alternatively, the number of segments can be set from human experience. In practice, treating articles that share 5 consecutive identical sentences as duplicates was found to give high deduplication precision and recall. The figure of five consecutive sentences mentioned here has proven relatively reliable for deduplicating news, and can be adjusted as appropriate when processing other types of articles.
In this way, every 5 consecutive sentences form one complete sequence. The sequences of article 1 are:
"sentence 1 of article 1, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1",
"sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1",
"sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1",
"sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1",
……
"sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, sentence 12 of article 1".
The sequences of article 2 are:
"modified sentence 1 of article 2, sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1",
"sentence 2 of article 1, sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1",
"sentence 3 of article 1, sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1",
"sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1",
……
"sentence 8 of article 1, sentence 9 of article 1, sentence 10 of article 1, sentence 11 of article 1, modified sentence 12 of article 2".
Next, in step S402, a target sequence is selected in the target text, and the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared are calculated.
Once the above sequences have been formed, a target sequence is selected in article 1 and hash values are calculated for the sequences of both articles.
Because content in the middle of an article is less likely to be modified, a sentence sequence from the middle is chosen as the target sequence against which subsequent articles are compared. For example, if the first article is not found to duplicate any other article, its 4th sentence sequence is selected as the target sequence of that article. That is, the hash value of the sentence sequence "sentence 4 of article 1, sentence 5 of article 1, sentence 6 of article 1, sentence 7 of article 1, sentence 8 of article 1" is saved as the target hash value and used to judge whether subsequent articles are duplicates.
To improve fault tolerance, several sentence sequences can also be selected as target sequences, for example one each from the middle of the article, from a relatively forward position, and from a relatively rearward position.
For long texts, calculating hash values for every sentence sequence is computationally expensive. In that case, only a certain number of sentence sequences from the front or the end may be selected for calculation. This maintains high deduplication accuracy while reducing the computational load.
Finally, in step S404, the hash value of the target sequence is compared with the hash values of the sequences in the text to be compared one by one and, if an identical hash value exists, a deduplication operation is performed.
Thus, when the second article is compared, its first three hash values are not found to duplicate any other article, but when its 4th hash value is compared, it is found to duplicate the first article.
The deduplication operation is then performed: in S500 the duplicated text is marked with a duplicate flag, and/or in S502 the duplicated text is deleted; see Fig. 5.
When multiple target sequences have been selected, step S600 is performed, in which the hash value of each target sequence is compared in turn with the hash values of the sequences in the text to be compared; see Fig. 6.
If hash values are calculated for only some of the sequences of the text to be compared, those sequences are preferably extracted from the beginning, the middle, and the end of the text to be compared.
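Steps S400 to S404 can be tied together in one end-to-end Python sketch. Everything named here (`sequences_of`, `digest`, `deduplicate`, the delimiter set, SHA-1, a single middle target per article, and flagging rather than deleting) is an assumption made for illustration, not the patented implementation.

```python
import hashlib
import re

def sequences_of(text, window=5):
    """S400: split at the assumed delimiters and form overlapping sequences."""
    segments = [s.strip() for s in re.split(r"[,.，。]", text) if s.strip()]
    return [tuple(segments[i:i + window]) for i in range(len(segments) - window + 1)]

def digest(sequence):
    """S402: hash one sequence (SHA-1 is an assumed choice of hash)."""
    return hashlib.sha1("\n".join(sequence).encode("utf-8")).hexdigest()

def deduplicate(articles, window=5):
    """S404: return the indices of articles flagged as duplicates."""
    target_hashes = set()
    duplicates = []
    for index, text in enumerate(articles):
        hashes = [digest(s) for s in sequences_of(text, window)]
        if any(h in target_hashes for h in hashes):
            duplicates.append(index)  # S500: mark as duplicate
        elif hashes:
            # Not a duplicate: save its middle sequence hash as a new target.
            target_hashes.add(hashes[(len(hashes) - 1) // 2])
    return duplicates

parts = [f"sentence {i} of article one" for i in range(1, 13)]
original = ", ".join(parts) + "."
reprint = ", ".join(["a rewritten opening"] + parts[1:11] + ["a rewritten ending"]) + "."
unrelated = ", ".join(f"other sentence {i}" for i in range(1, 13)) + "."

articles = [original, reprint, unrelated]
flagged = deduplicate(articles)
print(flagged)
```

The reprint is flagged even though its first and last sentences were rewritten, while the unrelated article is kept and contributes its own target hash.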
It should be noted that the components of the controller of the present invention have been divided logically according to the functions they implement. The present invention is not limited to this division, however: the components can be re-divided or combined as required. For example, several components can be combined into a single component, or some components can be further decomposed into more sub-components.
The component embodiments of the present invention can be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components in the controller according to embodiments of the present invention. The present invention can also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the method described herein. Such programs implementing the present invention can be stored on a computer-readable medium, or can take the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprises" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words can be interpreted as names.

Claims (14)

1. A duplicated text removal system, characterized in that the system comprises:
A segmentation module, adapted to divide a target text and a text to be compared into segments according to delimiter symbols, and to compose the segments of the target text and of the text to be compared into sequences in the same manner;
A hash value calculation module, adapted to select a plurality of consecutive segments at each of the middle, a relatively forward position, and/or a relatively rearward position of the target text as the selected target sequences, to calculate the hash value of the target sequence and the hash values of all or some of the sequences in the text to be compared, and to save the hash value of the target sequence as a target hash value;
A deduplication module, adapted to compare the hash value of the target sequence with the hash values of the sequences in the text to be compared one by one and, if an identical hash value exists, to perform a deduplication operation;
Wherein, if hash values are calculated for only some of the sequences of the text to be compared, the position at which the target sequence is selected in the target text corresponds to the positions of those sequences in the text to be compared.
2. The system as claimed in claim 1, characterized in that the delimiter symbols comprise symbols in the ASCII code and/or Chinese full-width and half-width punctuation marks.
3. The system as claimed in claim 1, characterized in that there are one or more delimiter symbols.
4. The system as claimed in claim 1, characterized in that the deduplication module further comprises:
A text marking unit, adapted to mark a duplicated text with a duplicate flag; and/or
A text deletion unit, adapted to delete a duplicated text.
5. The system as claimed in claim 1, characterized in that a sequence consists of one segment, or of two or more consecutive segments.
6. The system as claimed in claim 1, characterized in that the segmentation module further comprises a segment threshold setting module, adapted to obtain, from statistics, the number of segments that make up a sequence.
7. The system as claimed in claim 1 or 6, characterized in that the deduplication module further comprises a multi-target-sequence comparison unit, adapted, when the number of target sequences is greater than 1, to compare the hash value of each target sequence in turn with the sequences in the text to be compared.
8. a duplicated text removal method, is characterized in that, described method comprises:
Target text and text to be compared are divided into segmentation section according to segmentation symbol, and target text is pressed identical mode composition sequence with the segmentation section of text to be compared;
In the middle of target text, relatively forward, and/or continuous multiple segmentation section is chosen respectively as selected target sequence in rearward position relatively, calculate the cryptographic hash of all or part of sequence in the cryptographic hash of described target sequence and text to be compared, and the cryptographic hash of described target sequence is preserved as target cryptographic hash;
By the cryptographic hash of described target sequence successively compared with the cryptographic hash of sequence in text to be compared, if there is identical cryptographic hash, then perform the retry that disappears;
Wherein, if the cryptographic hash of described text to be compared only calculating section sequence, then the select location of target sequence in target text is corresponding with the position of partial sequence in text to be compared.
9. The method of claim 8, characterized in that the delimiters comprise symbols in the ASCII character set and/or Chinese full-width and half-width punctuation marks.
10. The method of claim 8, characterized in that there are one or more delimiters.
11. The method of claim 8, characterized in that performing deduplication further comprises:
Marking the repeated text with a duplicate flag; and/or
Deleting the repeated text.
12. The method of claim 8, characterized in that each sequence consists of one segment or of two or more consecutive segments.
13. The method of claim 8, characterized in that the number of segments that make up a sequence is obtained from statistics.
14. The method of claim 8 or 13, characterized in that comparing the hash value of the target sequence in turn with the hash values of the sequences in the text to be compared further comprises: when the number of target sequences is greater than 1, comparing the hash value of each target sequence in turn with the sequences in the text to be compared.
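
For illustration only (this is not part of the claims), the method of claims 8 to 14 might be sketched in Python as follows. The delimiter set, the use of SHA-256, the sequence length k=2, the choice of the first and last sequences as the "relatively forward" and "relatively rearward" target sequences, and all function names are assumptions made for this sketch; the claims do not fix any of them.

```python
import hashlib
import re

# Assumed delimiter set: some ASCII punctuation plus a few Chinese
# full-width marks. The claims allow any one or more delimiters.
DELIMITERS = r"[,.;:!?\n，。；：！？、]"


def split_segments(text):
    """Divide a text into non-empty segments at the delimiters."""
    return [part.strip() for part in re.split(DELIMITERS, text) if part.strip()]


def build_sequences(segments, k):
    """Form every run of k consecutive segments into one sequence."""
    # "\n" is itself a delimiter, so it cannot occur inside a segment
    # and is a safe separator when joining.
    return ["\n".join(segments[i:i + k]) for i in range(len(segments) - k + 1)]


def hash_value(sequence):
    """Hash one sequence (SHA-256 is an assumption, not from the claims)."""
    return hashlib.sha256(sequence.encode("utf-8")).hexdigest()


def target_hashes(target_text, k):
    """Hash one target sequence near the front and one near the back."""
    sequences = build_sequences(split_segments(target_text), k)
    if not sequences:
        return set()
    return {hash_value(sequences[0]), hash_value(sequences[-1])}


def is_duplicate(target_text, candidate_text, k=2):
    """True if any sequence of the candidate matches a target hash value."""
    wanted = target_hashes(target_text, k)
    candidates = build_sequences(split_segments(candidate_text), k)
    return any(hash_value(seq) in wanted for seq in candidates)


def deduplicate(target_text, texts, k=2, delete=False):
    """Mark duplicates with a flag, or delete them when delete=True."""
    marked = [(text, is_duplicate(target_text, text, k)) for text in texts]
    if delete:
        return [text for text, dup in marked if not dup]
    return marked
```

Both texts are segmented and grouped in the same way, so an identical hash value indicates that the same run of consecutive segments occurs in both. This sketch hashes every sequence of the text to be compared; hashing only some of them, at positions corresponding to the target sequences, would reduce the work at the cost of missing shifted copies.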
CN201210227111.7A 2012-06-29 2012-06-29 Duplicated text removal system and method Active CN102779188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210227111.7A CN102779188B (en) 2012-06-29 2012-06-29 Duplicated text removal system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210227111.7A CN102779188B (en) 2012-06-29 2012-06-29 Duplicated text removal system and method

Publications (2)

Publication Number Publication Date
CN102779188A CN102779188A (en) 2012-11-14
CN102779188B CN102779188B (en) 2015-11-25

Family

ID=47124100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210227111.7A Active CN102779188B (en) 2012-06-29 2012-06-29 Duplicated text removal system and method

Country Status (1)

Country Link
CN (1) CN102779188B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354730B (en) * 2015-07-16 2019-12-10 北京国双科技有限公司 Method and device for identifying repeated content of webpage text in webpage analysis
CN108345586B (en) * 2018-02-09 2021-04-02 重庆电信系统集成有限公司 Text duplicate removal method and system
CN110442803A (en) * 2019-08-09 2019-11-12 网易传媒科技(北京)有限公司 Data processing method, device, medium and the calculating equipment executed by calculating equipment
CN110750615B (en) * 2019-09-30 2020-07-24 贝壳找房(北京)科技有限公司 Text repeatability judgment method and device, electronic equipment and storage medium
CN110765756B (en) * 2019-10-29 2023-12-01 北京齐尔布莱特科技有限公司 Text processing method, device, computing equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315622A (en) * 2007-05-30 2008-12-03 香港中文大学 System and method for detecting file similarity
CN101404037A (en) * 2008-11-18 2009-04-08 西安交通大学 Method for detecting and positioning electronic text contents plagiary
CN101645082A (en) * 2009-04-17 2010-02-10 华中科技大学 Similar web page duplicate-removing system based on parallel programming mode

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011029212A1 (en) * 2009-09-08 2011-03-17 中国科学院计算技术研究所 Hash method and hash device based on double-counting bloom filters

Also Published As

Publication number Publication date
CN102779188A (en) 2012-11-14

Similar Documents

Publication Publication Date Title
CN102779188B (en) Duplicated text removal system and method
CN103076892B (en) A kind of method and apparatus of the input candidate item for providing corresponding to input character string
CN105528372B (en) A kind of address search method and equipment
MX2008014865A (en) Method and apparatus for multilingual spelling corrections.
WO2012099801A4 (en) Ordering document content
CN104008091A (en) Sentiment value based web text sentiment analysis method
CN111160013B (en) Text error correction method and device
CN102760142A (en) Method and device for extracting subject label in search result aiming at searching query
EP2458331A3 (en) Road estimation device and method for estimating road
CN102314492A (en) Method and equipment for acquiring candidate document sections matched with target document section
CN101887415B (en) Automatic extraction method for text document theme word meaning
CN104951429A (en) Recognition method and device for page headers and page footers of format electronic document
CN104050299A (en) Method for paper duplicate checking
CN106547924A (en) The sentiment analysis method and device of text message
JP2013257761A5 (en)
CN104281275B (en) The input method of a kind of English and device
CN104239285A (en) New article chapter detecting method and device
CN104951478A (en) Information processing method and information processing device
CN103034553A (en) Intelligent verification algorithm, method and device for report designer
CN102737017B (en) Method and apparatus for extracting page theme
CN105119910A (en) Template-based online social network rubbish information real-time detecting method
CN106919603B (en) Method and device for calculating word segmentation weight in query word mode
CN103984731A (en) Self-adaption topic tracing method and device under microblog environment
CN103984685A (en) Method, device and equipment for classifying items to be classified
Echizen-ya et al. Automatic evaluation of machine translation based on recursive acquisition of an intuitive common parts continuum

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220725

Address after: 300450 No. 9-3-401, No. 39, Gaoxin 6th Road, Binhai Science Park, Binhai New Area, Tianjin

Patentee after: 3600 Technology Group Co.,Ltd.

Address before: Room 112, block D, No. 28, Xinjiekou outer street, Xicheng District, Beijing 100088 (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20230627

Address after: 1765, floor 17, floor 15, building 3, No. 10 Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: Beijing Hongxiang Technical Service Co.,Ltd.

Address before: 300450 No. 9-3-401, No. 39, Gaoxin 6th Road, Binhai Science Park, Binhai New Area, Tianjin

Patentee before: 3600 Technology Group Co.,Ltd.