CN103699567B - Method for clustering identical news based on title fingerprints and body-text fingerprints - Google Patents

Method for clustering identical news based on title fingerprints and body-text fingerprints

Info

Publication number
CN103699567B
Authority
CN
China
Prior art keywords
fingerprint
title
text
information
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310538608.5A
Other languages
Chinese (zh)
Other versions
CN103699567A (en
Inventor
王放
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Cloud Business Network Technology Co ltd
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Network Technology Co ltd filed Critical Beijing Zhongsou Network Technology Co ltd
Priority to CN201310538608.5A priority Critical patent/CN103699567B/en
Publication of CN103699567A publication Critical patent/CN103699567A/en
Application granted granted Critical
Publication of CN103699567B publication Critical patent/CN103699567B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for clustering identical news based on title fingerprints and body-text fingerprints. The method comprises: (1) title preprocessing; (2) title fingerprint calculation; (3) body-text preprocessing; (4) calculating the weights of the segmented words in the body text; (5) lookup in the fingerprint database; (6) storing new items in the fingerprint database; (7) fingerprint database update handling. The duplicate-detection algorithm of the invention is simple and effective. In tests on the duplication cases common in news search, such as direct copying, title adjustment, serialized news, and minor content edits, the recognition rate exceeds 99%. The algorithm first matches core words, which quickly eliminates articles on dissimilar topics and greatly improves recognition efficiency: a single lookup against tens of millions of records takes less than 1 ms. Because only basic information about the core words and descriptor words is stored, storage space is saved: the history for tens of millions of records occupies less than 500 MB.

Description

Method for clustering identical news based on title fingerprints and body-text fingerprints
Technical field
The invention belongs to the field of search, and in particular relates to a method for clustering identical news based on title fingerprints and body-text fingerprints.
Background art
In the field of information (news) search, there are often news items whose link addresses differ but whose content is extremely similar or even identical; we call these duplicates or near-duplicates. They arise because news source websites commonly cite, quote, and even directly copy one another. Because these duplicated items contain similar content, they generally all match a user's query, and because their content is similar, their relevance scores are almost identical, so they are displayed to the user together. The user sees a large amount of repeated data and gains only a small amount of new information, which seriously harms the user experience. Duplicate data also consumes substantial resources during indexing and search. To solve these problems, duplicated items need to be detected so that duplicate documents can be eliminated during indexing and ranking, reducing resource consumption and providing a better user experience.
Current methods for detecting duplicate items include checksum techniques, N-gram fingerprinting, and Simhash fingerprinting.
Checksum techniques compute a checksum over every byte of an item's content. They are simple, but they can only detect items whose content is exactly identical. Meanwhile, any document containing the same text yields the same checksum.
N-gram fingerprinting uses N as the step length and selects a number of word strings from the content to represent the document. It takes strings of length N from the content at random as the content fingerprint, without considering the importance of each string in the text.
Simhash computes a 64-bit content fingerprint for each item and then compares all items pairwise, computing the degree of difference between fingerprints, to judge whether a new item is similar to some item in the existing collection. Simhash needs the pairwise fingerprint similarity of all documents, so the amount of computation is huge and the algorithm is inefficient, which makes it poorly suited to news search engines with strict timeliness requirements.
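The pairwise cost described above can be illustrated with a minimal Python sketch. It does not compute Simhash values; it only compares already-computed 64-bit fingerprints by Hamming distance. The function names, the sample fingerprints, and the distance threshold of 3 are assumptions made for illustration and are not taken from the patent.

```python
from itertools import combinations

MASK_64 = 0xFFFFFFFFFFFFFFFF

def hamming_distance(a, b):
    """Number of differing bits between two 64-bit fingerprints."""
    return bin((a ^ b) & MASK_64).count("1")

def near_duplicate_pairs(fingerprints, max_distance=3):
    """Compare every pair of fingerprints: n * (n - 1) / 2 comparisons.

    max_distance is an illustrative threshold, not a value from the patent.
    """
    return [
        (i, j)
        for (i, fa), (j, fb) in combinations(enumerate(fingerprints), 2)
        if hamming_distance(fa, fb) <= max_distance
    ]
```

With n stored items, this brute-force comparison grows quadratically, which is the inefficiency the text attributes to the pairwise approach.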
Summary of the invention
In view of the deficiencies of the prior art, the invention provides a method for clustering identical news based on title fingerprints and body-text fingerprints. To address the shortcomings of checksums, N-grams, and Simhash when applied to identical-news clustering, it designs a simple and effective method for detecting duplicate items and uses it to cluster duplicate news.
The object of the invention is achieved by the following technical solution:
A method for clustering identical news based on title fingerprints and body-text fingerprints, the improvement being that the method comprises:
(1) title preprocessing;
(2) title fingerprint calculation;
(3) body-text preprocessing;
(4) calculating the weights of the segmented words in the body text;
(5) lookup in the fingerprint database;
(6) storing new items in the fingerprint database;
(7) fingerprint database update handling.
Preferably, step (1) comprises removing the noise characters in the title and converting full-width characters in the title to half-width characters.
Preferably, step (2) comprises computing a checksum from the title content and taking a 64-bit checksum as the title fingerprint.
Preferably, step (3) comprises removing the noise characters in the body text and then normalizing it.
Preferably, step (4) comprises taking the M words with the highest weights as core words and the N words with the next-highest weights as descriptor words.
Preferably, step (4) comprises sorting the M core words and the N descriptor words separately.
Preferably, step (5) comprises:
if an item with the same title fingerprint exists, the new item is considered identical to it;
for items with different title fingerprints, judging whether their core words and descriptor words are similar.
Preferably, step (6) comprises: if the new item matches no item in the fingerprint database, assigning a content fingerprint to the new item.
Preferably, step (7) comprises: if the new item is similar to an item in the fingerprint database but their body texts are not similar, storing only the title fingerprint of the new item in the fingerprint database and not storing its body text.
Compared with the prior art, the beneficial effects of the invention are:
(1) The duplicate-detection algorithm of the invention is simple and effective. In tests on the duplication cases common in news search, such as direct copying, title adjustment, serialized news, and minor content edits, the recognition rate exceeds 99%.
(2) The algorithm first matches core words, which quickly eliminates articles on dissimilar topics and greatly improves recognition efficiency: a single lookup against tens of millions of records takes less than 1 ms. Because only basic information about the core words and descriptor words is stored, storage space is saved: the history for tens of millions of records occupies less than 500 MB.
Brief description of the drawings
Fig. 1 is a flow chart of the method for clustering identical news based on title fingerprints and body-text fingerprints provided by the invention.
Fig. 2 is a flow chart of the method for clustering identical news using the body-text fingerprint provided by the invention.
Detailed description
Specific embodiments of the invention are described in further detail below with reference to the drawings.
A news item is characterized mainly by two parts, its title and its body text. The title summarizes the topic of the item; the body text describes the topic in detail. If the content of two documents is similar, their topics are necessarily the same. This means that if two documents have similar summaries (titles) of the same topic, the two documents are similar; and if two documents have similar detailed expositions (body texts) of the same topic, the two documents are similar.
Statistics show that titles are short and body texts are long. When considering the topic expressed by a title, it is appropriate to consider all of its words; when considering the topic expressed by the body text, only the most important words need to be considered.
The scheme of the invention is as follows:
1. Title preprocessing.
First remove the noise characters in the title, including: spaces, tabs, punctuation marks, "figure/group figure" markers at the start or end of the title, and numeric sequence numbers at the end. Then convert the full-width characters in the title to half-width characters.
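The title cleanup above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the marker list `PICTURE_TAGS` (图 and 组图, the likely originals of "figure/group figure"), the function names, and the use of Unicode punctuation categories to decide what counts as punctuation are all hypothetical choices, not details given in the patent.

```python
import re
import unicodedata

# Hypothetical marker list for the "figure/group figure" tags.
PICTURE_TAGS = ("组图", "图")

def to_half_width(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)

def preprocess_title(title):
    # 1. Drop whitespace (spaces, tabs) and Unicode punctuation characters.
    text = "".join(
        ch for ch in title
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )
    # 2. Drop a picture marker at the start or end of the title.
    for tag in PICTURE_TAGS:
        if text.startswith(tag):
            text = text[len(tag):]
        if text.endswith(tag):
            text = text[:-len(tag)]
    # 3. Drop a numeric sequence number at the end.
    text = re.sub(r"\d+$", "", text)
    # 4. Convert the remaining full-width characters to half-width.
    return to_half_width(text)
```

A real implementation would need a more careful marker rule, since this sketch also strips a trailing 图 that is a legitimate part of the title.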
2. Title fingerprint calculation.
Compute a checksum from the title content and take a 64-bit checksum as the title fingerprint.
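The patent does not name the checksum function. As a hedged sketch, the following Python uses BLAKE2b from the standard library with an 8-byte digest to obtain a 64-bit value; the choice of hash, the UTF-8 encoding, and the function name `title_fingerprint` are assumptions for illustration only.

```python
import hashlib

def title_fingerprint(clean_title):
    """Return a 64-bit integer fingerprint of an already preprocessed title.

    BLAKE2b with digest_size=8 is an assumed stand-in for the unspecified
    64-bit checksum in the text.
    """
    digest = hashlib.blake2b(clean_title.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")
```

Two items whose cleaned titles are byte-identical receive the same fingerprint, which is what allows the title check to be a single dictionary lookup.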
3. Body-text preprocessing.
First remove the noise characters in the body text, retaining only Chinese characters, English letters, and digits. Then normalize: convert full-width characters to half-width characters and uppercase characters to lowercase characters.
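A minimal Python sketch of this step follows. The character ranges are assumptions: "Chinese characters" is approximated by the basic CJK Unified Ideographs block (U+4E00..U+9FFF), and full-width letters and digits are kept by the filter so that the later half-width conversion has something to act on. The name `preprocess_body` is hypothetical.

```python
import re

# Keep CJK ideographs, ASCII letters and digits, and their full-width variants.
_KEEP = re.compile(
    r"[\u4e00-\u9fffA-Za-z0-9\uff10-\uff19\uff21-\uff3a\uff41-\uff5a]"
)

def preprocess_body(text):
    # 1. Remove noise: keep only the characters matched by _KEEP.
    kept = "".join(_KEEP.findall(text))
    # 2. Normalize: full-width to half-width, then uppercase to lowercase.
    half = "".join(
        chr(ord(ch) - 0xFEE0) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in kept
    )
    return half.lower()
```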
4. Body-text word segmentation.
The forward maximum matching segmentation algorithm is used.
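Forward maximum matching scans the text left to right and, at each position, takes the longest dictionary word that starts there. The Python below is a generic sketch of that technique; the vocabulary, the single-character fallback for unknown text, and the function name are assumptions, since the patent does not describe its dictionary.

```python
def forward_max_match(text, vocabulary, max_len=None):
    """Segment text greedily, preferring the longest vocabulary word at each position."""
    if max_len is None:
        max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

For example, with the hypothetical vocabulary `{"日本", "首相", "钓鱼", "钓鱼岛"}`, the text `日本首相钓鱼岛` is split into `日本`, `首相`, `钓鱼岛`, because the longer match `钓鱼岛` wins over `钓鱼`.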
5. Calculating the weight of each word in the body text.
A large number of news items were analyzed in advance, from which about 120,000 common words were extracted, and the inverse document frequency (IDF) of each word was recorded. Word weights are computed from the segmentation result of the body text. The weight calculation takes into account the term frequency (TF) with which each word occurs and the word's IDF value over the large news corpus. Depending on the length of the body text, the M words with the highest weights are taken as core words, and the N words with the next-highest weights are taken as descriptor words:
i. If the number of words in the body text is in [16, +∞), then M = 3 and N = 13;
ii. If the number of words in the body text is in [10, 16), then M = 2 and N = 8;
iii. If the number of words in the body text is in (0, 10), then M is the number of words in the body text and N = 0.
The M core words and the N descriptor words are each sorted.
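The selection of core and descriptor words can be sketched as below. Several points are assumptions, because the text only says that TF and IDF are both considered: the weight is taken as the plain product TF × IDF, the "number of words" is read as the number of distinct words (which makes M + N fit the 16 and 10 boundaries exactly), unknown words get a hypothetical `default_idf`, and ties are broken alphabetically.

```python
from collections import Counter

def pick_sizes(distinct_words):
    """Return (M, N) for the three length bands given in the text."""
    if distinct_words >= 16:
        return 3, 13
    if distinct_words >= 10:
        return 2, 8
    return distinct_words, 0

def core_and_descriptors(tokens, idf, default_idf=1.0):
    """Rank words by TF * IDF and split them into sorted core and descriptor tuples."""
    tf = Counter(tokens)
    weights = {word: count * idf.get(word, default_idf) for word, count in tf.items()}
    # Highest weight first; alphabetical order breaks ties deterministically.
    ranked = sorted(weights, key=lambda word: (-weights[word], word))
    m, n = pick_sizes(len(ranked))
    return tuple(sorted(ranked[:m])), tuple(sorted(ranked[m:m + n]))
```

Sorting both groups gives each body text a canonical representation, so the core words can serve directly as a lookup key.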
6. Lookup in the fingerprint database.
As shown in Fig. 2, if an item with the same title fingerprint already exists, the new item is considered identical to it. If no item with the same title fingerprint exists, the database is checked for items with the same core words. If such an item exists, the descriptor words of the two are then compared. The rule for descriptor similarity is: if more than 75% of the words in the two groups of descriptor words are the same, the two groups are considered similar. If the core words are the same and the descriptor words are also similar, the new item is considered similar to the stored item.
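The descriptor rule reduces to a set-overlap ratio. The sketch below is one reading of it: the denominator is the larger of the two group sizes, the 75% boundary is treated as inclusive, and two empty groups (short texts with N = 0) are treated as matching. All three choices are assumptions, as is the function name.

```python
def descriptors_similar(a, b, threshold=0.75):
    """True when the share of common descriptor words reaches the threshold."""
    a, b = set(a), set(b)
    if not a or not b:
        # Assumption: with N = 0 on both sides, core words alone decide.
        return not a and not b
    return len(a & b) / max(len(a), len(b)) >= threshold
```

With 13 descriptor words per item, 10 shared words give 10/13 ≈ 76.9%, which passes, while 9 shared words give 9/13 ≈ 69.2%, which fails.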
7. Storing new items in the fingerprint database.
If the new item is not similar to any item in the fingerprint database, a content fingerprint is assigned to it (usually the title fingerprint of the new item is used as the content fingerprint). If the new item is similar to an item in the fingerprint database, the content fingerprint of the new item is set equal to the content fingerprint of that stored item.
8. Over-propagation handling. Because items can be judged similar by either title or body text, content "drift" may occur if no restriction is applied.
For example: the title of A is identical to the title of B, the content of B is similar to that of C, and the content of C is similar to that of D but differs considerably from that of B. A may then be considered similar to D even though their content actually differs. The solution of the invention is: if a new item is similar to an item in the fingerprint database but their body texts are not similar (the core words differ, or the descriptor words differ greatly), only the title fingerprint of the new item is stored in the fingerprint database and its body text is not stored. That is, within a group of similar news, only body texts that are closely similar to one another can be further used to judge other identical news.
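Steps 6 to 8 can be combined into one small in-memory store, sketched below in Python. This is an interpretation, not the patented implementation: the class and field names are hypothetical, the similarity rule uses an inclusive 75% threshold over the larger descriptor group, and the drift guard is read as "keep a new body text only if it is similar to every body text already kept for that cluster".

```python
from dataclasses import dataclass, field

THRESHOLD = 0.75

def body_similar(core_a, desc_a, core_b, desc_b):
    """Same core words, and descriptor overlap at or above THRESHOLD."""
    if core_a != core_b:
        return False
    if not desc_a or not desc_b:
        return not desc_a and not desc_b
    return len(desc_a & desc_b) / max(len(desc_a), len(desc_b)) >= THRESHOLD

@dataclass
class FingerprintStore:
    by_title: dict = field(default_factory=dict)  # title fp -> content fp
    by_core: dict = field(default_factory=dict)   # core words -> [(descriptors, content fp)]
    bodies: dict = field(default_factory=dict)    # content fp -> [(core, descriptors)]

    def add(self, title_fp, core, descriptors):
        core, desc = frozenset(core), frozenset(descriptors)
        # Lookup: title fingerprint first, then the core-word index.
        cid = self.by_title.get(title_fp)
        if cid is None:
            for stored_desc, stored_cid in self.by_core.get(core, []):
                if body_similar(core, desc, core, stored_desc):
                    cid = stored_cid
                    break
        if cid is None:
            # No match: the title fingerprint becomes the content fingerprint.
            cid = title_fp
            self.bodies[cid] = []
        kept = self.bodies[cid]
        # Drift guard: keep the body only if it agrees with every kept body.
        if (core, desc) not in kept and all(
            body_similar(core, desc, c, d) for c, d in kept
        ):
            kept.append((core, desc))
            self.by_core.setdefault(core, []).append((desc, cid))
        # The title fingerprint is always stored.
        self.by_title[title_fp] = cid
        return cid
```

Under this reading, an item that joins a cluster through one stored body text but disagrees with another contributes only its title fingerprint, so later items cannot chain through it to unrelated content.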
Embodiment
Huanqiu.com first published a news item (http://world.huanqiu.com/exclusive/2013-07/4105939.html) titled 《Japan regards the Diaoyu Islands issue as a "diplomatic issue" obstructing Sino-Japanese relations》. Its 3 body-text core words are [Japanese side] [Chinese side] [sovereignty], and its 13 body-text descriptor words are [fishing] [proposal] [opinion] [dispute] [territory] [recognizing] [hinder] [Japan] [adviser] [accidental] [island] [presence] [things that are outside the scope of one's own job].
Sohu.com reprinted the Huanqiu.com news item (http://roll.sohu.com/20130709/n381077312.shtml) without modifying the title or body text. By the identical-title rule, it is clustered with the original.
Sina reprinted the Huanqiu.com news item (http://news.sina.com.cn/c/2013-07-09/073027615240.shtml) with minor changes to the title and body text. The title is 《Japan last month proposed treating the Diaoyu Islands issue as a "diplomatic issue"》, the body-text core words are [Japanese side] [Chinese side] [sovereignty], and the body-text descriptor words are [Sino-Japan] [fishing] [obstruction] [proposal] [main] [dispute] [territory] [diplomacy] [recognizing] [island] [Japan] [adviser] [accidental]. By the body-text similarity rule, its core words are identical to those of the Huanqiu.com item and 10 descriptor words are the same, which satisfies the similarity rule, so it is clustered with them. The Sina title and body text can also continue to serve as references for later comparisons.
Global New Military Net rewrote the story based on the Huanqiu.com news (http://www.xinjunshi.com/Jujiao/20130710/101163.html) under the title 《Japanese Prime Minister Abe suddenly changes his tune on the Diaoyu Islands, all Japan in an uproar》. Its body-text core words are [Japanese side] [Chinese side] [sovereignty], and its body-text descriptor words are [adviser] [fishing] [diplomacy] [territory] [Japan] [dispute] [proposal] [recognizing] [cabinet] [opinion] [government] [opposition] [obstruction]. Its core words are identical to those of the Sina item and 10 descriptor words are the same, which satisfies the similarity rule, so it is clustered with them. Its core words are also identical to those of the Huanqiu.com item, but only 9 descriptor words are the same, which does not satisfy the similarity rule. The body text of Global New Military Net therefore cannot continue to serve as a reference for later comparisons, while its title can.
Liangjian Military published a news item (http://www.liangjian.com/news/201307/45254.html) titled 《Japanese Prime Minister Abe suddenly changes his tune on the Diaoyu Islands, all Japan in an uproar》. The title is identical to that of Global New Military Net, so it is clustered with them.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently substituted, and any modification or equivalent substitution that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims (8)

1. A method for clustering identical news based on title fingerprints and body-text fingerprints, characterized in that the method comprises:
(1) title preprocessing;
(2) title fingerprint calculation;
(3) body-text preprocessing;
(4) calculating the weights of the segmented words in the body text;
(5) lookup in the fingerprint database;
(6) storing new items in the fingerprint database;
(7) fingerprint database update handling;
wherein step (5) comprises:
if an item with the same title fingerprint exists, the new item is considered identical to it;
for items with different title fingerprints, judging whether their core words and descriptor words are similar.
2. The method for clustering identical news based on title fingerprints and body-text fingerprints as claimed in claim 1, characterized in that step (1) comprises removing the noise characters in the title and converting full-width characters in the title to half-width characters.
3. The method for clustering identical news based on title fingerprints and body-text fingerprints as claimed in claim 1, characterized in that step (2) comprises computing a checksum from the title content and taking a 64-bit checksum as the title fingerprint.
4. The method for clustering identical news based on title fingerprints and body-text fingerprints as claimed in claim 1, characterized in that step (3) comprises removing the noise characters in the body text and then normalizing it.
5. The method for clustering identical news based on title fingerprints and body-text fingerprints as claimed in claim 1, characterized in that step (4) comprises taking the M words with the highest weights as core words and the N words with the next-highest weights as descriptor words.
6. The method for clustering identical news based on title fingerprints and body-text fingerprints as claimed in claim 1, characterized in that step (4) comprises sorting the M core words and the N descriptor words separately.
7. The method for clustering identical news based on title fingerprints and body-text fingerprints as claimed in claim 1, characterized in that step (6) comprises: if the new item matches no item in the fingerprint database, assigning a content fingerprint to the new item.
8. The method for clustering identical news based on title fingerprints and body-text fingerprints as claimed in claim 1, characterized in that step (7) comprises: if the new item is similar to an item in the fingerprint database but their body texts are not similar, storing only the title fingerprint of the new item in the fingerprint database and not storing its body text.
CN201310538608.5A 2013-11-04 2013-11-04 Method for clustering identical news based on title fingerprints and body-text fingerprints Expired - Fee Related CN103699567B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310538608.5A CN103699567B (en) 2013-11-04 2013-11-04 Method for clustering identical news based on title fingerprints and body-text fingerprints

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310538608.5A CN103699567B (en) 2013-11-04 2013-11-04 Method for clustering identical news based on title fingerprints and body-text fingerprints

Publications (2)

Publication Number Publication Date
CN103699567A CN103699567A (en) 2014-04-02
CN103699567B true CN103699567B (en) 2017-03-15

Family

ID=50361095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310538608.5A Expired - Fee Related CN103699567B (en) 2013-11-04 2013-11-04 Method for clustering identical news based on title fingerprints and body-text fingerprints

Country Status (1)

Country Link
CN (1) CN103699567B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989033A (en) * 2015-02-03 2016-10-05 北京中搜网络技术股份有限公司 Information duplication eliminating method based on information fingerprints
CN106055539B (en) * 2016-05-27 2018-12-28 中国科学技术信息研究所 The method and apparatus that name disambiguates
CN107402960B (en) * 2017-06-15 2020-11-10 成都优易数据有限公司 Reverse index optimization algorithm based on semantic mood weighting
CN107515931B (en) * 2017-08-28 2023-04-25 华中科技大学 Repeated data detection method based on clustering
CN110162632B (en) * 2019-05-17 2021-04-09 北京百分点科技集团股份有限公司 Method for discovering news special events
CN110245275B (en) * 2019-06-18 2023-09-01 中电科大数据研究院有限公司 Large-scale similar news headline rapid normalization method
CN111737966B (en) * 2020-06-11 2024-03-01 北京百度网讯科技有限公司 Document repetition detection method, device, equipment and readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633868B1 (en) * 2000-07-28 2003-10-14 Shermann Loyall Min System and method for context-based document retrieval
DE102009045382A1 (en) * 2009-10-06 2011-04-07 Robert Bosch Gmbh Method for analyzing e.g. path name of sound file utilized for playing on DVD player, involves forming checksum with respect to cut part of file name and added to file name that is reduced around cut part
CN101694670B (en) * 2009-10-20 2012-07-04 北京航空航天大学 Chinese Web document online clustering method based on common substrings
CN102289524B (en) * 2011-09-26 2013-01-30 深圳市万兴软件有限公司 Data recovery method and system

Also Published As

Publication number Publication date
CN103699567A (en) 2014-04-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170428

Address after: 100086 Beijing, Haidian District, North Third Ring Road West, No. 43, building 5, floor 08-09, No. 2

Patentee after: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY Co.,Ltd.

Address before: Shou Heng Technology Building No. 51 Beijing 100191 Haidian District Xueyuan Road room 0902

Patentee before: BEIJING ZHONGSOU NETWORK TECHNOLOGY Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170315

Termination date: 20211104