CN100520778C - Internet topics file searching method, reptile system and search engine - Google Patents
- Publication number
- CN100520778C
- Authority
- CN
- China
- Prior art keywords
- url
- webpage
- module
- priority
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method for searching Internet topic files, comprising: parsing downloaded web pages and extracting the uniform resource locators (URLs) they contain; determining the priority of each URL; and crawling the URLs in order from highest to lowest priority, building an index, and searching for the required Internet topic files. The invention also discloses a crawler system for a search engine of Internet topic files, and a search engine. The crawler system provided by the invention comprises at least: a URL queue storage module, a web page and file download module, a web page parsing module, and a crawl control module. The invention can improve the efficiency of searching for Internet topic files.
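The crawl order described in the abstract (parse a page, extract its URLs, assign each a priority, then visit URLs from highest to lowest priority) can be illustrated with a priority queue. The following is a minimal sketch, not the patented implementation: the rule in `url_priority` (topic-file extensions outrank ordinary pages), the `PAGES` dictionary standing in for the download module, and all function names are assumptions made for illustration. Index building is omitted.

```python
import heapq
import itertools
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(urljoin(self.base_url, value))


def url_priority(url, topic_extensions=(".mp3", ".pdf")):
    """Hypothetical rule: URLs that look like topic files outrank pages."""
    return 2 if url.lower().endswith(topic_extensions) else 1


def crawl(seed, fetch, max_items=100):
    """Visit URLs from highest to lowest priority; return the visit order."""
    counter = itertools.count()  # tie-breaker: equal priorities stay FIFO
    # heapq is a min-heap, so priorities are negated to pop the highest first.
    frontier = [(-url_priority(seed), next(counter), seed)]
    seen = {seed}
    order = []
    while frontier and len(order) < max_items:
        _, _, url = heapq.heappop(frontier)
        order.append(url)
        html = fetch(url)
        if html is None:  # nothing to parse (e.g. a topic file, not a page)
            continue
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.urls:
            if link not in seen:
                seen.add(link)
                entry = (-url_priority(link), next(counter), link)
                heapq.heappush(frontier, entry)
    return order


# Toy in-memory "web" standing in for the web page and file download module.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/song.mp3">S</a>',
    "http://example.com/a.html": '<a href="/paper.pdf">P</a> <a href="/">home</a>',
}

visit_order = crawl("http://example.com/", PAGES.get)
print(visit_order)
```

In this toy run, `song.mp3` is visited before `a.html` even though it was discovered later on the same page, because the queue always yields the highest-priority URL first.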
Description
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB200610099277XA CN100520778C (en) | 2006-07-25 | 2006-07-25 | Internet topics file searching method, reptile system and search engine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB200610099277XA CN100520778C (en) | 2006-07-25 | 2006-07-25 | Internet topics file searching method, reptile system and search engine |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101114285A (en) | 2008-01-30 |
CN100520778C (en) | 2009-07-29 |
Family
ID=39022634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB200610099277XA Active CN100520778C (en) | 2006-07-25 | 2006-07-25 | Internet topics file searching method, reptile system and search engine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100520778C (en) |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101329687B (en) * | 2008-07-31 | 2010-06-23 | 清华大学 | Method for positioning news web page |
CN101355590B (en) * | 2008-09-05 | 2012-04-25 | 深圳市迅雷网络技术有限公司 | Download reminder method, system and device |
US8959091B2 (en) * | 2009-07-30 | 2015-02-17 | Alcatel Lucent | Keyword assignment to a web page |
US9582533B2 (en) | 2010-06-08 | 2017-02-28 | Sharp Kabushiki Kaisha | Content reproduction device, control method for content reproduction device, control program, and recording medium |
WO2012025040A1 (en) * | 2010-08-27 | 2012-03-01 | Huang Bin | Visualized search engine system and implementation method and application thereof |
CN102129453B (en) * | 2011-03-04 | 2013-10-23 | 北京立新盈企信息技术有限公司 | Display control device and method capable of displaying search result in mode of text completed with graphs |
CN102024035A (en) * | 2010-12-02 | 2011-04-20 | 东莞宇龙通信科技有限公司 | Resource retrieval method and device |
CN102904912B (en) * | 2011-07-26 | 2015-06-17 | 腾讯科技(深圳)有限公司 | Method and system for downloading webpage contents |
CN102254027B (en) * | 2011-07-29 | 2013-05-08 | 四川长虹电器股份有限公司 | Method for obtaining webpage contents in batch |
CN102346772A (en) * | 2011-09-23 | 2012-02-08 | 王楠 | Directional acquisition system based on OWL (ontology web language) semantic analysis |
CN103123640A (en) * | 2012-02-22 | 2013-05-29 | 深圳市谷古科技有限公司 | Method and device for searching novel |
CN103123642A (en) * | 2012-02-22 | 2013-05-29 | 深圳市谷古科技有限公司 | Searching method and device based on web language |
CN102760162A (en) * | 2012-06-11 | 2012-10-31 | 北京搜狗信息服务有限公司 | Method and device for revealing and acquiring download link |
CN103631792B (en) * | 2012-08-22 | 2017-01-25 | 北京华财会计股份有限公司 | Massive source index building system and method |
CN102999549A (en) * | 2012-09-25 | 2013-03-27 | 金博 | Method for realizing web crawler tasks |
CN103761279B (en) * | 2014-01-09 | 2017-02-08 | 北京京东尚科信息技术有限公司 | Method and system for scheduling network crawlers on basis of keyword search |
CN104679838A (en) * | 2015-02-09 | 2015-06-03 | 北京中搜网络技术股份有限公司 | Efficient information collection method |
CN106649354B (en) * | 2015-10-30 | 2020-02-28 | 北京国双科技有限公司 | Webpage crawling request processing method and device |
CN106326339A (en) * | 2016-08-03 | 2017-01-11 | 上海蔓盈信息科技有限公司 | Task allocating method and device |
CN106776934B (en) * | 2016-11-30 | 2021-03-26 | 努比亚技术有限公司 | Mobile terminal and implementation method of web crawler |
CN108228656B (en) * | 2016-12-21 | 2021-05-25 | 普天信息技术有限公司 | URL classification method and device based on CART decision tree |
CN107480264B (en) * | 2017-08-17 | 2019-11-15 | 北京知道创宇信息技术股份有限公司 | A kind of web crawlers De-weight method and calculate equipment |
CN107679072B (en) * | 2017-08-24 | 2020-08-28 | 平安普惠企业管理有限公司 | User behavior information acquisition method, terminal and storage medium |
CN107729564A (en) * | 2017-11-13 | 2018-02-23 | 北京众荟信息技术股份有限公司 | A kind of distributed focused web crawler web page crawl method and system |
CN108664646B (en) * | 2018-05-16 | 2021-11-16 | 电子科技大学 | Audio and video automatic downloading system based on keywords |
CN109871475A (en) * | 2019-02-28 | 2019-06-11 | 上海浪潮云计算服务有限公司 | A kind of method and system of in a preferential order piecemeal acquisition internet data |
CN113674769A (en) * | 2021-08-20 | 2021-11-19 | 湖北亿咖通科技有限公司 | Speech system test method, apparatus, equipment, medium and program product |
CN116132534B (en) * | 2022-07-01 | 2024-03-08 | 马上消费金融股份有限公司 | Method, device, equipment and storage medium for storing service request |
Also Published As
Publication number | Publication date |
---|---|
CN101114285A (en) | 2008-01-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN100520778C (en) | Internet topics file searching method, reptile system and search engine | |
EP1934823B1 (en) | Click distance determination | |
US7644069B2 (en) | Search ranking method for file system and related search engine | |
US8046347B2 (en) | Method and apparatus for reconstructing a search query | |
Ma et al. | Efficiently finding web services using a clustering semantic approach | |
US8812478B1 (en) | Distributed crawling of hyperlinked documents | |
CN100437585C (en) | Method for carrying out retrieval hint based on inverted list | |
US8417657B2 (en) | Methods and apparatus for computing graph similarity via sequence similarity | |
CN101477554A (en) | User interest based personalized meta search engine and search result processing method | |
CN101452463A (en) | Method and apparatus for directionally grabbing page resource | |
JP3802813B2 (en) | Web page search method, web page search device, program, and recording medium | |
CN101727447A (en) | Generation method and device of regular expression based on URL | |
CN101399818A (en) | Theme related webpage filtering method and system based on navigation route information | |
KR100359233B1 (en) | Method for extracing web information and the apparatus therefor | |
KR100671077B1 (en) | Server, method and system for providing information retrieval service using page bundle | |
CN103745006A (en) | Internet information searching system and internet information searching method | |
WO2006107141A1 (en) | Server, method and system for providing information search service by using sheaf of pages | |
US20030046276A1 (en) | System and method for modular data search with database text extenders | |
CN112597369A (en) | Webpage spider theme type search system based on improved cloud platform | |
Gurrin et al. | Dublin City University experiments in connectivity analysis for TREC-9. | |
Inkpen | Information retrieval on the internet | |
CN117851535B (en) | Information file full structure storage based on business logic and search engine-free design method and system | |
Garg et al. | Implementation of a Search Engine | |
KR100645711B1 (en) | Server, Method and System for Providing Information Search Service by Using Web Page Segmented into Several Information Blocks | |
Silva | Searching and archiving the web with tumba |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
ASS | Succession or assignment of patent right | Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY; Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.; Effective date: 20131021 |
C41 | Transfer of patent application or patent right or utility model | ||
COR | Change of bibliographic data | Free format text: CORRECT: ADDRESS; FROM: 518044 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE |
TR01 | Transfer of patent right | Effective date of registration: 20131021; Address after: 518057 Tencent Building, 16, Nanshan District hi tech park, Guangdong, Shenzhen; Patentee after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.; Address before: 2, 518044, East 410 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District; Patentee before: Tencent Technology (Shenzhen) Co., Ltd. |