A method for clustering identical news based on title fingerprints and body-text fingerprints
Technical field
The invention belongs to the search field, and in particular relates to a method for clustering identical news based on title fingerprints and body-text fingerprints.
Background
In the field of information (news) search, there are often items whose link addresses differ but whose content is extremely close or even identical; we call these duplicates or near-duplicates. They arise because news source websites commonly cite, quote, or even directly copy one another. Because duplicate items all contain similar content, they generally all match a user's query, and because the content is similar their relevance scores are also almost identical, so they are displayed together in front of the user. The user sees a large amount of repeated data and obtains only a small amount of new information, which seriously harms the user experience; duplicate data also consumes substantial resources during indexing and searching. To solve these problems, duplicate items need to be detected so that duplicate documents can be eliminated during indexing and ranking, reducing resource consumption and providing a better user experience.
Current methods for detecting duplicate information include checksum techniques, N-gram fingerprinting, and Simhash fingerprinting.
The checksum technique computes a checksum over every byte of the content. It is simple, but it can only detect documents whose content is exactly identical. At the same time, any documents containing the same text obtain the same checksum.
N-gram fingerprinting selects a number of word strings from the content, with N as the step length, to represent the document. It takes word strings of length N from the content at random as the content fingerprint, without considering the importance of those word strings in the text.
Simhash computes a 64-bit content fingerprint for each item and then compares all data pairwise, computing the degree of difference between fingerprints, to judge whether a new item is similar to some item in the earlier set. Simhash must compute fingerprint similarity for all pairs of documents, so the amount of computation is huge and the algorithm is inefficient, making it poorly suited to news search engines with strict timeliness requirements.
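The pairwise comparison cost described above can be illustrated with a minimal sketch. It assumes fingerprints are already available as 64-bit integers; the function names and the `max_distance=3` threshold are illustrative assumptions, not values taken from this document.

```python
def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def most_similar(new_fp: int, known_fps, max_distance: int = 3):
    """Linear scan: every stored fingerprint is compared with the new one.

    Returns (fingerprint, distance) for the closest stored fingerprint
    within max_distance, or None. The threshold is an assumed value.
    """
    best = None
    for fp in known_fps:
        d = hamming_distance(new_fp, fp)
        if d <= max_distance and (best is None or d < best[1]):
            best = (fp, d)
    return best
```

Each new item costs one comparison per stored fingerprint, which is the scaling problem the background section points to.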
Summary of the invention
To address the deficiencies of the prior art, the invention provides a method for clustering identical news based on title fingerprints and body-text fingerprints. In view of the shortcomings of the checksum, N-gram, and Simhash techniques when applied to identical-news clustering, it provides a simple and effective way to detect duplicate information and cluster duplicate news.
The purpose of the invention is achieved by the following technical solution:
A method for clustering identical news based on title fingerprints and body-text fingerprints, characterized in that the method comprises:
(1) title preprocessing;
(2) title fingerprint calculation;
(3) body-text preprocessing;
(4) body-text word segmentation and word weight calculation;
(5) lookup in the fingerprint library;
(6) storing the new item in the fingerprint library;
(7) fingerprint library update processing.
Preferably, step (1) includes removing noise characters from the title and converting full-width characters in the title to half-width characters.
Preferably, step (2) includes computing a checksum from the title content and taking a 64-bit checksum as the title fingerprint.
Preferably, step (3) includes removing noise characters from the body text and then normalizing it.
Preferably, step (4) includes taking the M words with the highest weights as core words and the N words with the next-highest weights as descriptors.
Preferably, step (4) includes sorting the M core words and the N descriptors separately.
Preferably, step (5) includes:
if an item with the same title fingerprint exists, the new document is considered identical to it;
if the title fingerprints differ, judging whether the core words and descriptors are similar.
Preferably, step (6) includes: if the new item matches no item in the fingerprint library, assigning a content fingerprint to the new item.
Preferably, step (7) includes: if the new item is similar to an item in the fingerprint library but the two body texts are dissimilar, only the title fingerprint of the new item is stored in the fingerprint library, and its body text is not stored.
Compared with the prior art, the invention has the following beneficial effects:
(1) Duplicate detection based on the invention uses a simple, clear algorithm and is highly effective. In testing, the recognition rate exceeded 99% for the duplicate patterns common in information search, such as direct copying, title adjustment, serial news, and minor content edits.
(2) The algorithm first matches core words, quickly excluding articles with dissimilar topics and greatly improving recognition efficiency: a single lookup against data on the order of ten million items takes less than 1 ms. Because only the basic information of core words and descriptors is stored, storage space is saved: the historical information for ten million items occupies less than 500 MB.
Brief description of the drawings
Fig. 1 is a flowchart of the method provided by the invention for clustering identical news based on title fingerprints and body-text fingerprints.
Fig. 2 is a flowchart of the method provided by the invention for clustering identical news by body-text fingerprint.
Detailed description
Specific embodiments of the invention are described in further detail below with reference to the drawings.
An information item is characterized mainly by two parts: title and body text. The title summarizes the topic of the item; the body text describes the topic in detail. If two documents have similar content, their topics are necessarily the same. This means that if two documents give similar summaries (titles) of the same topic, the two documents are similar; and if two documents give similar detailed accounts (body texts) of the same topic, the two documents are similar.
Statistics show that titles are short and body texts are long. When considering the topic expressed by a title, it is appropriate to consider all of its words; when considering the topic expressed by a body text, only the most important words need to be considered.
The scheme of the invention is as follows:
1. Title preprocessing.
First remove noise characters from the title, including spaces, tabs, punctuation marks, the labels "figure/group figure" at the start or end of the title, and numeric indices at the end. Then convert full-width characters in the title to half-width characters.
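The title cleanup can be sketched as follows. This is a minimal illustration under stated assumptions: the marker strings in `IMAGE_MARKERS` are hypothetical stand-ins for the "figure/group figure" labels, the function names are invented for the example, and treating every Unicode punctuation character as noise is an interpretation of the step, not a confirmed detail.

```python
import re
import unicodedata

# Hypothetical marker strings standing in for the "figure/group figure" labels.
IMAGE_MARKERS = ("组图", "图")


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def preprocess_title(title: str) -> str:
    # 1. Drop whitespace (spaces, tabs) and every Unicode punctuation character.
    kept = [ch for ch in title
            if not ch.isspace() and not unicodedata.category(ch).startswith("P")]
    text = "".join(kept)
    # 2. Strip image markers at the start, then markers and numeric indices at the end.
    marker = "|".join(map(re.escape, IMAGE_MARKERS))
    text = re.sub(rf"^(?:{marker})+", "", text)
    text = re.sub(rf"(?:{marker}|\d+)+$", "", text)
    # 3. Convert the remaining full-width characters to half-width.
    return to_half_width(text)
```

A real implementation would need a more careful marker rule, since this sketch also strips a legitimate leading or trailing marker character that happens to be part of a word.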
2. Title fingerprint calculation.
Compute a checksum from the title content and take a 64-bit checksum as the title fingerprint.
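The text does not say which checksum function is used. The sketch below assumes a truncated BLAKE2b digest from Python's standard `hashlib` purely for illustration; any stable 64-bit hash of the cleaned title would serve the same role. The function name is hypothetical.

```python
import hashlib


def title_fingerprint(clean_title: str) -> int:
    """Return a 64-bit integer fingerprint of an already-preprocessed title.

    Assumption: BLAKE2b with an 8-byte digest stands in for the
    unspecified 64-bit checksum.
    """
    digest = hashlib.blake2b(clean_title.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")
```

Two titles that are identical after preprocessing always receive the same fingerprint, which is what the identical-title rule relies on.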
3. Body-text preprocessing.
First remove noise characters from the body text, keeping only Chinese characters, English letters, and digits. Then normalize: convert full-width characters to half-width characters and uppercase characters to lowercase.
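A minimal sketch of this normalization follows. It folds filtering and normalization into a single pass, and it assumes that "Chinese characters" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) and that full-width letters and digits count as letters and digits; both are assumptions, as is the function name.

```python
def preprocess_body(text: str) -> str:
    """Keep Chinese characters, ASCII letters and digits; normalize width and case."""
    out = []
    for ch in text:
        code = ord(ch)
        # Full-width ASCII variants map to half-width by a fixed offset.
        if 0xFF01 <= code <= 0xFF5E:
            ch = chr(code - 0xFEE0)
        # Lowercase ASCII letters only.
        if "A" <= ch <= "Z":
            ch = ch.lower()
        if "a" <= ch <= "z" or "0" <= ch <= "9" or "\u4e00" <= ch <= "\u9fff":
            out.append(ch)
    return "".join(out)
```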
4. Body-text word segmentation.
The forward maximum matching segmentation algorithm is used.
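Forward maximum matching scans the text from left to right and, at each position, takes the longest dictionary word that starts there, falling back to a single character. The sketch below is a generic version of that technique; the function name and the toy vocabulary in the usage note are illustrative assumptions.

```python
def fmm_segment(text: str, vocab: set, max_len: int = 0) -> list:
    """Forward maximum matching word segmentation.

    vocab is a set of known words; max_len defaults to the longest word in it.
    """
    if max_len <= 0:
        max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens
```

With the toy vocabulary `{"钓鱼岛", "钓鱼", "外交", "问题"}`, the string `"钓鱼岛外交问题"` is split into `["钓鱼岛", "外交", "问题"]`, because the three-character word is preferred over its two-character prefix.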
5. Calculate the weight of each word in the body text.
A large volume of information was analyzed in advance to extract about 120,000 common words, and the inverse document frequency (IDF) of each word was recorded. Word weights are calculated from the segmentation result of the body text, taking into account each word's term frequency (TF) in the text and its IDF value over the large corpus. Depending on the length of the body text, the M words with the highest weights are taken as core words and the N words with the next-highest weights as descriptors:
i. if the number of words in the body text is in [16, +∞), then M = 3 and N = 13;
ii. if the number of words in the body text is in [10, 16), then M = 2 and N = 8;
iii. if the number of words in the body text is in (0, 10), then M equals the number of words in the body text and N = 0.
The M core words and the N descriptors are then sorted separately.
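The selection of core words and descriptors can be sketched as below. Several details are assumptions made for the example: the weight is taken as the product TF × IDF (the text only says both values are considered), the word count is the number of distinct words, unknown words get a default IDF of 1.0, and the final sort is alphabetical. The function names are hypothetical.

```python
from collections import Counter


def pick_sizes(word_count: int) -> tuple:
    """Return (M, N) for a body text with the given number of words."""
    if word_count >= 16:
        return 3, 13
    if word_count >= 10:
        return 2, 8
    return word_count, 0


def core_and_descriptors(tokens, idf, default_idf=1.0):
    """Rank words by TF * IDF and split them into core words and descriptors."""
    tf = Counter(tokens)
    weights = {w: tf[w] * idf.get(w, default_idf) for w in tf}
    # Highest weight first; ties broken alphabetically for a stable order.
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    m, n = pick_sizes(len(ranked))
    core = tuple(sorted(ranked[:m]))
    descriptors = tuple(sorted(ranked[m:m + n]))
    return core, descriptors
```

Sorting each group makes two items with the same words compare equal regardless of the order in which the words were ranked.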
6. Lookup in the fingerprint library.
As shown in Fig. 2, if an item with the same title fingerprint already exists, the new document is considered identical to it. If no item has the same title fingerprint, check whether any item has the same core words. If so, go on to check whether the descriptors of the two are similar. The descriptor similarity rule is: two groups of descriptors are considered similar if more than 75% of their words are the same. If the core words are the same and the descriptors are similar, the new document is considered similar to that item.
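The 75% rule can be written as a small predicate. The text does not say which group size is the denominator when the two groups differ in length, nor what happens when both are empty (N = 0); the sketch assumes the larger group size and treats two empty groups as similar. The function names are hypothetical.

```python
def descriptors_similar(a, b, threshold: float = 0.75) -> bool:
    """True if more than `threshold` of the descriptors are shared."""
    a, b = set(a), set(b)
    if not a and not b:
        return True  # assumption: short texts with N = 0 rely on core words only
    return len(a & b) / max(len(a), len(b)) > threshold


def bodies_similar(core_a, desc_a, core_b, desc_b) -> bool:
    """Body texts match when core words are identical and descriptors are similar."""
    return tuple(core_a) == tuple(core_b) and descriptors_similar(desc_a, desc_b)
```

With 13 descriptors per item, 10 shared words give 10/13 ≈ 76.9%, which passes, while 9 shared words give 9/13 ≈ 69.2%, which fails.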
7. Store the new item in the fingerprint library.
If the new item is not similar to any item in the fingerprint library, assign it a content fingerprint (usually the title fingerprint of the new item is used as the content fingerprint). If the new item is similar to an item in the fingerprint library, the content fingerprint of the new item is set equal to that of the item in the library.
8. Over-propagation handling. Because similarity can be determined by either the title or the body text, content "drift" may occur if no restriction is imposed.
For example: the title of A is the same as the title of B, the content of B is similar to C, and the content of C is similar to D but differs considerably from B. A could then be considered similar to D even though their actual content differs. The invention's solution is: if the new item is similar to an item in the fingerprint library but the two body texts are dissimilar (the core words differ, or the descriptors differ considerably), only the title fingerprint of the new item is stored in the fingerprint library and its body text is not stored. That is, within a group of similar news items, only body texts that are closely similar to one another can be used further to judge other identical news.
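The storage and drift rules can be sketched together as a small in-memory library. This is one possible reading, not a confirmed implementation: a new body text is kept as a reference only if it is similar to every body text already kept for its cluster, the cluster identifier is the first title fingerprint seen, and clusters are scanned linearly rather than indexed by core words. All names are hypothetical.

```python
def descriptors_similar(a, b, threshold=0.75):
    a, b = set(a), set(b)
    if not a and not b:
        return True  # assumption for texts with no descriptors
    return len(a & b) / max(len(a), len(b)) > threshold


class FingerprintLibrary:
    def __init__(self):
        self.titles = {}  # title fingerprint -> content fingerprint (cluster id)
        self.bodies = {}  # cluster id -> list of (core words, descriptors)

    @staticmethod
    def _similar(body_a, body_b):
        return body_a[0] == body_b[0] and descriptors_similar(body_a[1], body_b[1])

    def add(self, title_fp, core, descriptors):
        """Cluster a new item and return its content fingerprint."""
        body = (tuple(core), frozenset(descriptors))
        cluster = self.titles.get(title_fp)  # identical-title rule
        if cluster is None:
            for cid, stored in self.bodies.items():  # body-text rule
                if any(self._similar(body, b) for b in stored):
                    cluster = cid
                    break
        if cluster is None:
            cluster = title_fp  # new cluster: title fingerprint doubles as content fingerprint
            self.bodies[cluster] = []
        self.titles[title_fp] = cluster  # the title is always stored
        stored = self.bodies[cluster]
        # Drift guard: keep the body only if it agrees with every stored body.
        if body not in stored and all(self._similar(body, b) for b in stored):
            stored.append(body)
        return cluster
```

Under this reading, an item whose body matches only a later member of a cluster still joins the cluster and contributes its title, but its body is not used to pull in further items.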
Example
Huanqiu.com first published the news item (http://world.huanqiu.com/exclusive/2013-07/4105939.html), titled "Japan regards the Diaoyu Islands issue as a 'diplomatic problem' obstructing Sino-Japanese relations". The 3 body-text core words are [Japanese side] [Chinese side] [sovereignty], and the 13 body-text descriptors are [fishing] [proposal] [opinion] [dispute] [territory] [recognizing] [obstruction] [Japan] [adviser] [accidental] [island] [presence] [things that are outside the scope of one's own job].
Sohu.com reprinted the Huanqiu.com news item (http://roll.sohu.com/20130709/n381077312.shtml) without changing the title or body text. By the identical-title rule, it is clustered together with the original.
Sina reprinted the Huanqiu.com news item (http://news.sina.com.cn/c/2013-07-09/073027615240.shtml) with minor changes to the title and body text. Its title is "Japan proposed last month to treat the Diaoyu Islands issue as a 'diplomatic problem'", its body-text core words are [Japanese side] [Chinese side] [sovereignty], and its body-text descriptors are [Sino-Japan] [fishing] [obstruction] [proposal] [main] [dispute] [territory] [diplomacy] [recognizing] [island] [Japan] [adviser] [accidental]. By the body-text similarity rule, its core words are the same as those of the Huanqiu.com body text and 10 descriptors match, which satisfies the rule, so it is clustered together. The title and body text of the Sina item can both continue to serve as references for further judgment.
Xinjunshi.com (Global New Military Network) rewrote the news based on the Huanqiu.com item (http://www.xinjunshi.com/Jujiao/20130710/101163.html) under the title "Japanese Prime Minister Abe abruptly changes his stance on the Diaoyu Islands, causing an uproar across Japan". Its body-text core words are [Japanese side] [Chinese side] [sovereignty], and its body-text descriptors are [adviser] [fishing] [diplomacy] [territory] [Japan] [dispute] [proposal] [recognizing] [cabinet] [opinion] [government] [opposition] [obstruction]. Its core words are the same as those of the Sina body text and 10 descriptors match, which satisfies the body-text similarity rule, so it is clustered together. Its core words are also the same as those of the Huanqiu.com body text, but only 9 descriptors match, which does not satisfy the rule. The body text of the Xinjunshi.com item therefore cannot serve as a reference for further judgment, while its title can.
Liangjian Military published a news item (http://www.liangjian.com/news/201307/45254.html) titled "Japanese Prime Minister Abe abruptly changes his stance on the Diaoyu Islands, causing an uproar across Japan", which is identical to the title of the Xinjunshi.com item, so it is clustered together.
Finally, it should be noted that the above examples merely illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the above examples, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently substituted, and any modification or equivalent substitution that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.