CN101180624B - Link-based spam detection - Google Patents

Link-based spam detection

Info

Publication number
CN101180624B
Authority
CN
China
Prior art keywords
choice
tabulation
document
effective mass
link
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2005800372291A
Other languages
English (en)
Chinese (zh)
Other versions
CN101180624A (zh)
Inventor
Pavel Berkhin
Zoltan I. Gyongyi
Jan Pedersen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Altaba Inc
Original Assignee
Yahoo Inc until 2017
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yahoo Inc until 2017
Publication of CN101180624A
Application granted
Publication of CN101180624B
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99937 Sorting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99943 Generating database or data structure, e.g. via user interface

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
CN2005800372291A 2004-10-28 2005-10-26 Link-based spam detection Expired - Fee Related CN101180624B (zh)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US62329504P 2004-10-28 2004-10-28
US60/623,295 2004-10-28
US11/198,471 2005-08-04
US11/198,471 US7533092B2 (en) 2004-10-28 2005-08-04 Link-based spam detection
PCT/US2005/038619 WO2006049996A2 (en) 2004-10-28 2005-10-26 Link-based spam detection

Publications (2)

Publication Number Publication Date
CN101180624A (zh) 2008-05-14
CN101180624B (zh) 2012-05-09

Family

ID=35705210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2005800372291A Expired - Fee Related CN101180624B (zh) Link-based spam detection 2004-10-28 2005-10-26

Country Status (6)

Country Link
US (1) US7533092B2 (en)
EP (1) EP1817697A2 (en)
JP (1) JP4908422B2 (ja)
KR (1) KR101230687B1 (ko)
CN (1) CN101180624B (zh)
WO (1) WO2006049996A2 (en)

Families Citing this family (83)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7466663B2 (en) 2000-10-26 2008-12-16 Inrotis Technology, Limited Method and apparatus for identifying components of a network having high importance for network integrity
US7693830B2 (en) 2005-08-10 2010-04-06 Google Inc. Programmable search engine
US7743045B2 (en) * 2005-08-10 2010-06-22 Google Inc. Detecting spam related and biased contexts for programmable search engines
US20070038614A1 (en) * 2005-08-10 2007-02-15 Guha Ramanathan V Generating and presenting advertisements based on context data for programmable search engines
US7716199B2 (en) * 2005-08-10 2010-05-11 Google Inc. Aggregating context data for programmable search engines
US8125922B2 (en) * 2002-10-29 2012-02-28 Searchbolt Limited Method and apparatus for generating a ranked index of web pages
US7505964B2 (en) 2003-09-12 2009-03-17 Google Inc. Methods and systems for improving a search ranking using related queries
US7606793B2 (en) 2004-09-27 2009-10-20 Microsoft Corporation System and method for scoping searches using index keys
US20060069667A1 (en) * 2004-09-30 2006-03-30 Microsoft Corporation Content evaluation
US7533092B2 (en) * 2004-10-28 2009-05-12 Yahoo! Inc. Link-based spam detection
US20060123478A1 (en) * 2004-12-02 2006-06-08 Microsoft Corporation Phishing detection, prevention, and notification
US7634810B2 (en) * 2004-12-02 2009-12-15 Microsoft Corporation Phishing detection, prevention, and notification
US20110197114A1 (en) * 2004-12-08 2011-08-11 John Martin Electronic message response and remediation system and method
US7962510B2 (en) * 2005-02-11 2011-06-14 Microsoft Corporation Using content analysis to detect spam web pages
WO2007002820A2 (en) * 2005-06-28 2007-01-04 Yahoo! Inc. Search engine with augmented relevance ranking by community participation
US20070078939A1 (en) * 2005-09-26 2007-04-05 Technorati, Inc. Method and apparatus for identifying and classifying network documents as spam
US20090299819A1 (en) * 2006-03-04 2009-12-03 John Stannard Davis, III Behavioral Trust Rating Filtering System
US7580931B2 (en) * 2006-03-13 2009-08-25 Microsoft Corporation Topic distillation via subsite retrieval
WO2007123416A1 (en) * 2006-04-24 2007-11-01 Telenor Asa Method and device for efficiently ranking documents in a similarity graph
US7634476B2 (en) * 2006-07-25 2009-12-15 Microsoft Corporation Ranking of web sites by aggregating web page ranks
US20080033797A1 (en) * 2006-08-01 2008-02-07 Microsoft Corporation Search query monetization-based ranking and filtering
US20080126331A1 (en) * 2006-08-25 2008-05-29 Xerox Corporation System and method for ranking reference documents
US8661029B1 (en) 2006-11-02 2014-02-25 Google Inc. Modifying search result ranking based on implicit user feedback
US20080114753A1 (en) * 2006-11-15 2008-05-15 Apmath Ltd. Method and a device for ranking linked documents
US20080147669A1 (en) * 2006-12-14 2008-06-19 Microsoft Corporation Detecting web spam from changes to links of web sites
US7885952B2 (en) * 2006-12-20 2011-02-08 Microsoft Corporation Cloaking detection utilizing popularity and market value
US7693833B2 (en) * 2007-02-01 2010-04-06 John Nagle System and method for improving integrity of internet search
US8595204B2 (en) * 2007-03-05 2013-11-26 Microsoft Corporation Spam score propagation for web spam detection
US7680851B2 (en) 2007-03-07 2010-03-16 Microsoft Corporation Active spam testing system
US8938463B1 (en) 2007-03-12 2015-01-20 Google Inc. Modifying search result ranking based on implicit user feedback and a model of presentation bias
US8694374B1 (en) * 2007-03-14 2014-04-08 Google Inc. Detecting click spam
US7756987B2 (en) * 2007-04-04 2010-07-13 Microsoft Corporation Cybersquatter patrol
US20080270549A1 (en) * 2007-04-26 2008-10-30 Microsoft Corporation Extracting link spam using random walks and spam seeds
US7930303B2 (en) * 2007-04-30 2011-04-19 Microsoft Corporation Calculating global importance of documents based on global hitting times
US9092510B1 (en) 2007-04-30 2015-07-28 Google Inc. Modifying search result ranking based on a temporal element of user feedback
US7853589B2 (en) * 2007-04-30 2010-12-14 Microsoft Corporation Web spam page classification using query-dependent data
US7788254B2 (en) * 2007-05-04 2010-08-31 Microsoft Corporation Web page analysis using multiple graphs
US7941391B2 (en) 2007-05-04 2011-05-10 Microsoft Corporation Link spam detection using smooth classification function
US9430577B2 (en) * 2007-05-31 2016-08-30 Microsoft Technology Licensing, Llc Search ranger system and double-funnel model for search spam analyses and browser protection
US8667117B2 (en) * 2007-05-31 2014-03-04 Microsoft Corporation Search ranger system and double-funnel model for search spam analyses and browser protection
US7873635B2 (en) * 2007-05-31 2011-01-18 Microsoft Corporation Search ranger system and double-funnel model for search spam analyses and browser protection
US8244737B2 (en) * 2007-06-18 2012-08-14 Microsoft Corporation Ranking documents based on a series of document graphs
US8438189B2 (en) * 2007-07-23 2013-05-07 Microsoft Corporation Local computation of rank contributions
US8694511B1 (en) 2007-08-20 2014-04-08 Google Inc. Modifying search result ranking based on populations
US8041338B2 (en) * 2007-09-10 2011-10-18 Microsoft Corporation Mobile wallet and digital payment
US8909655B1 (en) 2007-10-11 2014-12-09 Google Inc. Time based ranking
US9348912B2 (en) 2007-10-18 2016-05-24 Microsoft Technology Licensing, Llc Document length as a static relevance feature for ranking search results
US20090177690A1 (en) * 2008-01-03 2009-07-09 Sinem Guven Determining an Optimal Solution Set Based on Human Selection
US8219549B2 (en) * 2008-02-06 2012-07-10 Microsoft Corporation Forum mining for suspicious link spam sites detection
US8010482B2 (en) * 2008-03-03 2011-08-30 Microsoft Corporation Locally computable spam detection features and robust pagerank
US8812493B2 (en) 2008-04-11 2014-08-19 Microsoft Corporation Search results ranking using editing distance and document information
US20090307191A1 (en) * 2008-06-10 2009-12-10 Li Hong C Techniques to establish trust of a web page to prevent malware redirects from web searches or hyperlinks
EP2169568A1 (en) 2008-09-17 2010-03-31 OGS Search Limited Method and apparatus for generating a ranked index of web pages
US7974970B2 (en) * 2008-10-09 2011-07-05 Yahoo! Inc. Detection of undesirable web pages
US8396865B1 (en) 2008-12-10 2013-03-12 Google Inc. Sharing search engine relevance data between corpora
US9009146B1 (en) 2009-04-08 2015-04-14 Google Inc. Ranking search results based on similar queries
US8447760B1 (en) 2009-07-20 2013-05-21 Google Inc. Generating a related set of documents for an initial set of documents
US8498974B1 (en) 2009-08-31 2013-07-30 Google Inc. Refining search results
US8972391B1 (en) 2009-10-02 2015-03-03 Google Inc. Recent interest based relevance scoring
US8874555B1 (en) 2009-11-20 2014-10-28 Google Inc. Modifying scoring data based on historical changes
US8615514B1 (en) 2010-02-03 2013-12-24 Google Inc. Evaluating website properties by partitioning user feedback
US8924379B1 (en) 2010-03-05 2014-12-30 Google Inc. Temporal-based score adjustments
US8959093B1 (en) 2010-03-15 2015-02-17 Google Inc. Ranking search results based on anchors
US8738635B2 (en) * 2010-06-01 2014-05-27 Microsoft Corporation Detection of junk in search result ranking
US9623119B1 (en) 2010-06-29 2017-04-18 Google Inc. Accentuating search results
US8832083B1 (en) 2010-07-23 2014-09-09 Google Inc. Combining user feedback
US8707441B1 (en) * 2010-08-17 2014-04-22 Symantec Corporation Techniques for identifying optimized malicious search engine results
US8874566B2 (en) 2010-09-09 2014-10-28 Disney Enterprises, Inc. Online content ranking system based on authenticity metric values for web elements
US9002867B1 (en) 2010-12-30 2015-04-07 Google Inc. Modifying ranking data based on document changes
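CN102214245B (zh) * 2011-07-12 2013-09-11 Xiamen University Graph-theoretic analysis method for research hotspots based on keyword co-occurrence
CN102222115B (zh) * 2011-07-12 2013-09-11 Xiamen University Edge-connectivity analysis method for research hotspots based on keyword co-occurrence
CN102571768B (zh) * 2011-12-26 2014-11-26 Peking University Method for detecting phishing websites
CN102591965B (zh) * 2011-12-30 2014-07-09 Qizhi Software (Beijing) Co., Ltd. Method and device for black-link detection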
US9495462B2 (en) 2012-01-27 2016-11-15 Microsoft Technology Licensing, Llc Re-ranking search results
US9002832B1 (en) 2012-06-04 2015-04-07 Google Inc. Classifying sites as low quality sites
US9183499B1 (en) 2013-04-19 2015-11-10 Google Inc. Evaluating quality based on neighbor features
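CN103345499A (zh) * 2013-06-28 2013-10-09 Yulong Computer Telecommunication Scientific (Shenzhen) Co., Ltd. Method and device for processing search results of a search engine
CN103412922B (zh) * 2013-08-12 2017-02-08 Dawning Information Industry Co., Ltd. Data query processing method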
US20170046376A1 (en) * 2015-04-03 2017-02-16 Yahoo! Inc. Method and system for monitoring data quality and dependency
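CN105373598B (zh) * 2015-10-27 2017-03-15 Guangzhou Shenma Mobile Information Technology Co., Ltd. Method and device for identifying cheating sites
CN108304395B (zh) * 2016-02-05 2022-09-06 Beijing Xun'ao Technology Co., Ltd. Web page cheating detection
CN108984630B (zh) * 2018-06-20 2021-08-24 Tianjin University Method for applying node importance in complex networks to spam web page detection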
US12235952B2 (en) * 2021-07-21 2025-02-25 Y.E. Hub Armenia LLC Method and system for prioritizing web-resources for malicious data assessment

Citations (3)

Publication number Priority date Publication date Assignee Title
US6728752B1 (en) * 1999-01-26 2004-04-27 Xerox Corporation System and method for information browsing using multi-modal features
US20040143600A1 (en) * 1993-06-18 2004-07-22 Musgrove Timothy Allen Content aggregation method and apparatus for on-line purchasing system
CN1536483A (zh) * 2003-04-04 2004-10-13 Chen Wenzhong Method and system for extracting and processing network information

Family Cites Families (21)

Publication number Priority date Publication date Assignee Title
US4167652A (en) 1974-10-17 1979-09-11 Telefonaktiebolaget L M Ericsson Method and apparatus for the interchanges of PCM word
US6285999B1 (en) 1997-01-10 2001-09-04 The Board Of Trustees Of The Leland Stanford Junior University Method for node ranking in a linked database
US6678681B1 (en) 1999-03-10 2004-01-13 Google Inc. Information extraction from a database
US6985431B1 (en) 1999-08-27 2006-01-10 International Business Machines Corporation Network switch and components and method of operation
US6404752B1 (en) 1999-08-27 2002-06-11 International Business Machines Corporation Network switch using network processor and methods
US6865575B1 (en) 2000-07-06 2005-03-08 Google, Inc. Methods and apparatus for using a modified index to provide search results in response to an ambiguous search query
US6529903B2 (en) 2000-07-06 2003-03-04 Google, Inc. Methods and apparatus for using a modified index to provide search results in response to an ambiguous search query
US20040193503A1 (en) 2000-10-04 2004-09-30 Eder Jeff Scott Interactive sales performance management system
US7197470B1 (en) 2000-10-11 2007-03-27 Buzzmetrics, Ltd. System and method for collection analysis of electronic discussion methods
US20040236673A1 (en) 2000-10-17 2004-11-25 Eder Jeff Scott Collaborative risk transfer system
CA2323883C (en) 2000-10-19 2016-02-16 Patrick Ryan Morin Method and device for classifying internet objects and objects stored oncomputer-readable media
AU2002312567A1 (en) 2001-06-20 2003-01-08 Arbor Networks, Inc. Detecting network misuse
US7089252B2 (en) 2002-04-25 2006-08-08 International Business Machines Corporation System and method for rapid computation of PageRank
US20040002988A1 (en) 2002-06-26 2004-01-01 Praveen Seshadri System and method for modeling subscriptions and subscribers as data
US7346839B2 (en) 2003-09-30 2008-03-18 Google Inc. Information retrieval based on historical data
US20050210008A1 (en) 2004-03-18 2005-09-22 Bao Tran Systems and methods for analyzing documents over a network
US7343374B2 (en) * 2004-03-29 2008-03-11 Yahoo! Inc. Computation of page authority weights using personalized bookmarks
US20060064411A1 (en) * 2004-09-22 2006-03-23 William Gross Search engine using user intent
US20060085391A1 (en) 2004-09-24 2006-04-20 Microsoft Corporation Automatic query suggestions
US20060218010A1 (en) 2004-10-18 2006-09-28 Bioveris Corporation Systems and methods for obtaining, storing, processing and utilizing immunologic information of individuals and populations
US7533092B2 (en) * 2004-10-28 2009-05-12 Yahoo! Inc. Link-based spam detection

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
US20040143600A1 (en) * 1993-06-18 2004-07-22 Musgrove Timothy Allen Content aggregation method and apparatus for on-line purchasing system
US6728752B1 (en) * 1999-01-26 2004-04-27 Xerox Corporation System and method for information browsing using multi-modal features
CN1536483A (zh) * 2003-04-04 2004-10-13 Chen Wenzhong Method and system for extracting and processing network information

Also Published As

Publication number Publication date
WO2006049996A2 (en) 2006-05-11
JP4908422B2 (ja) 2012-04-04
EP1817697A2 (en) 2007-08-15
KR20070085477A (ko) 2007-08-27
WO2006049996A3 (en) 2007-09-27
US20060095416A1 (en) 2006-05-04
HK1115930A1 (en) 2008-12-12
CN101180624A (zh) 2008-05-14
KR101230687B1 (ko) 2013-02-07
JP2008519328A (ja) 2008-06-05
US7533092B2 (en) 2009-05-12

Similar Documents

Publication Publication Date Title
CN101180624B (zh) Link-based spam detection
US6321220B1 (en) Method and apparatus for preventing topic drift in queries in hyperlinked environments
Yuwono et al. WISE: a world wide web resource database system
US7630973B2 (en) Method for identifying related pages in a hyperlinked database
Cho et al. Efficient crawling through URL ordering
Xue et al. Optimizing web search using web click-through data
CN102576364B (zh) Method and apparatus for intelligent event-based data mining
US8478792B2 (en) Systems and methods for presenting information based on publisher-selected labels
US9268873B2 (en) Landing page identification, tagging and host matching for a mobile application
US20050165757A1 (en) Method and apparatus for ranking web page search results
Agre et al. Keyword focused web crawler
US11361036B2 (en) Using historical information to improve search across heterogeneous indices
CN101268464A (zh) Ranking functions using document usage statistics
US20090083266A1 (en) Techniques for tokenizing urls
Singh et al. A comparative study of page ranking algorithms for information retrieval
Chakrabarti Recent results in automatic Web resource discovery
US20070094250A1 (en) Using matrix representations of search engine operations to make inferences about documents in a search engine corpus
Klein et al. Evaluating methods to rediscover missing web pages from the web infrastructure
US7490082B2 (en) System and method for searching internet domains
Kaur et al. SmartCrawler: A Three-Stage Ranking Based Web Crawler for Harvesting Hidden Web Sources.
HK1115930B (en) Link-based spam detection
Hicks et al. Extending web mining to digital forensics text mining
Sunitha et al. A comparative study over search engine optimization on precision and recall ratio
Wookey Hierarchical web structure mining
Devi A Novel Approach on Focused Crawling With Anchor Text

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1115930

Country of ref document: HK

C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: GR

Ref document number: 1115930

Country of ref document: HK

ASS Succession or assignment of patent right

Owner name: FEIYANG MANAGEMENT CO., LTD.

Free format text: FORMER OWNER: YAHOO CORP.

Effective date: 20150331

TR01 Transfer of patent right

Effective date of registration: 20150331

Address after: Tortola, British Virgin Islands

Patentee after: Yahoo! Inc.

Address before: California, USA

Patentee before: YAHOO! Inc.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120509

Termination date: 20211026
