CN100520778C - Internet topics file searching method, reptile system and search engine - Google Patents

Publication number
CN100520778C
Authority
CN
China
Prior art keywords
url
webpage
module
priority
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB200610099277XA
Other languages
Chinese (zh)
Other versions
CN101114285A (en)
Inventor
余祥鑫
杨卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co., Ltd.
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2006-07-25
Filing date
2006-07-25
Publication date
2009-07-29
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CNB200610099277XA
Publication of CN101114285A
Application granted
Publication of CN100520778C
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for searching Internet topic files, comprising: parsing a downloaded webpage and extracting the uniform resource locators (URLs) contained in it; determining a priority for each URL; and crawling the URLs in order of priority from high to low and building an index for searching Internet topic files. The invention also discloses a crawler system of a search engine and a search engine for Internet topic files. The crawler system provided by the invention comprises at least a URL queue storage module, a webpage and file download module, a webpage parsing module and an acquisition control module. The invention can improve the efficiency of searching for Internet topic files.

Description

An Internet topic file search method, crawler system and search engine
Technical field
The present invention relates to Internet file search, and in particular to an Internet topic file search method and a corresponding crawler system and search engine.
Background technology
The Internet has become one of the most popular technologies in the computer field. Its widespread adoption lets people overcome the limits of distance and geography and share information resources easily. The World Wide Web (WWW) is the main and most widely used information service on the Internet. It has developed rapidly since its birth and has become a huge information repository that stores a large amount of valuable information, in which people can find all kinds of content that interests them. In practice, however, the sheer volume of data on the web makes it very difficult for users to find information. Various information retrieval services have emerged in response, and full-text search is one important, widely adopted retrieval technology. Full-text search over the WWW is now increasingly widely used, and many influential large-scale full-text search tools exist. Well-known Chinese search engines include www.soso.com and www.baidu.com. These full-text retrieval systems have played a major role in querying document information on the WWW.
At present, an Internet search engine generally consists of a crawler system, an indexing system and a retrieval system. The crawler system collects webpages and various files, such as web pages and mp3 files, from different websites on the network and hands them to the indexing system, which builds an index database. The retrieval system receives a user's retrieval request, searches the index database and returns the results that match the request.
As shown in Fig. 1, a typical Internet search engine architecture comprises:
Web page server: provides the web access service of the Chinese search engine system and is the user interface through which users use the system;
Retrieval system: searches the index database using the search keywords submitted by the user, sorts and filters the matching documents according to some algorithm, and returns them to the web page server;
Indexing system: processes the documents collected by the crawler system and builds the index database;
Crawler system: collects Internet webpages and various document data.
Prior art 1: crawling all websites and webpages.
In a search engine that searches for files on a particular Internet topic, the crawler system generally collects only files of that topic, then builds an index and provides retrieval. To collect files of a particular topic, however, it must crawl webpages to find the uniform resource locator (URL) links of the topic files.
At present, crawler systems generally traverse all webpages: they crawl every webpage and file and then keep only the files of the required topic. Because very few webpages contain topic files, the efficiency of downloading them is very low. Tens of thousands of webpages may have to be downloaded to find a single topic file link, and that link may well be dead. A technique is therefore needed to raise the probability of downloading webpages that contain topic files.
Prior art 2: crawling websites and webpages of a specific topic.
Analysis of crawled webpages shows that links between webpages generally have two characteristics: topic aggregation and locality. Locality means that webpages on the same host are more likely to link to one another, and topic aggregation means that webpages on the same topic are more likely to link to one another.
The link characteristics between webpages can be illustrated with Fig. 2, in which each circle represents a webpage and a solid circle represents a webpage that contains mp3 files. Suppose mp3 files are to be collected. Fig. 2 shows the links among news-topic webpages, music-topic webpages and the mp3 files they contain: news-topic webpages link heavily to one another, music-topic webpages link heavily to one another, and there are relatively few links between music-topic and news-topic webpages. A music-topic webpage is more likely than a news-topic webpage to contain URLs of mp3 files.
Prior art 2 therefore searches only webpages of a specific topic. Taking mp3 collection as an example, the crawler system of an MP3 search engine crawls music-topic websites and webpages, and so finds and collects mp3 files relatively efficiently.
Although prior art 2 collects more efficiently, it crawls only a small number of specific websites, so the total number of topic files collected is relatively small and it cannot collect as many files on the Internet as possible.
Summary of the invention
The invention provides an Internet topic file search method to solve the problems in the prior art that searching for Internet topic files is inefficient or that collection is incomplete.
To solve this technical problem, the invention adopts the following technical solution: an Internet topic file search method, comprising:
A. parsing a downloaded webpage and extracting the uniform resource locators (URLs) contained in the webpage;
B. calculating the webpage topic score of each crawled webpage that contains the URL, and accumulating these webpage topic scores as the URL topic score of the URL;
determining the priority of the URL according to the value of its URL topic score;
C. crawling the URLs in order of priority from high to low, building an index, and searching for the required Internet topic files.
The above method of the invention further comprises:
saving a history record of crawled URLs;
in step B, judging from the history record whether each URL contained in the downloaded webpage has been crawled, and determining a priority only for URLs that have not been crawled.
The above method of the invention further comprises:
setting a URL filter condition, and determining a priority only for uncrawled URLs that do not meet the filter condition.
The webpage topic score is calculated as:
$$F(p) = a \times \mathit{numFileLink} \times \mathit{FactorLink} + b \times \mathit{numKeyWord} \times \mathit{FactorWord}$$
where F(p) is the calculated webpage topic score;
numFileLink is the number of topic file URLs the webpage contains;
FactorLink is the score factor for URL links;
numKeyWord is the number of topic keywords the webpage contains;
FactorWord is the score factor for topic keywords;
a and b are weight factors, with a + b = 1.
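As a worked illustration of the formula, suppose that a = 0.6, b = 0.4, FactorLink = 10 and FactorWord = 2, and that a webpage contains 3 topic file URLs and 5 topic keywords. These values are assumptions chosen only for the arithmetic; the patent does not specify any of them.

$$F(p) = 0.6 \times 3 \times 10 + 0.4 \times 5 \times 2 = 18 + 4 = 22$$

With these assumed factors, the topic file links contribute 18 of the 22 points, so a page that links directly to topic files outranks a page that merely mentions the topic.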
The invention also provides a crawler system of a search engine, comprising: a URL queue storage module, a webpage and file download module, a webpage parsing module and an acquisition control module;
the URL queue storage module stores the URLs to be crawled in order of priority;
the webpage and file download module downloads webpages or files one by one in order of URL priority from high to low, sends downloaded webpages to the webpage parsing module, and sends downloaded files to the indexing system of the search engine for processing;
the webpage parsing module parses webpages, extracts the URLs they contain and sends them to the acquisition control module;
the acquisition control module calculates the webpage topic score of each crawled webpage that contains the URL and accumulates these webpage topic scores as the URL topic score of the URL;
it determines the priority of the URL according to the value of the URL topic score and stores the URL, by priority, in the corresponding priority queue in the URL queue storage module.
The above crawler system provided by the invention further comprises a URL filtering module connected between the webpage parsing module and the acquisition control module;
the URL filtering module judges whether each URL parsed out by the webpage parsing module has been crawled and keeps only the URLs that have not; it further judges whether each uncrawled URL meets the set URL filter condition, and sends only the uncrawled URLs that do not meet the filter condition to the acquisition control module.
Corresponding to the crawler system, the invention also provides a search engine comprising a crawler system, an indexing system and a retrieval system, the crawler system comprising: a URL queue storage module, a webpage and file download module, a webpage parsing module and an acquisition control module;
the URL queue storage module stores the URLs to be crawled in order of priority;
the webpage and file download module downloads webpages or files one by one in order of URL priority from high to low, sends downloaded webpages to the webpage parsing module, and sends downloaded files to the indexing system of the search engine for processing;
the webpage parsing module parses webpages, extracts the URLs they contain and sends them to the acquisition control module;
the acquisition control module calculates the webpage topic score of each crawled webpage that contains the URL and accumulates these webpage topic scores as the URL topic score of the URL; it determines the priority of the URL according to the value of the URL topic score and stores the URL, by priority, in the corresponding priority queue in the URL queue storage module.
The beneficial effects of the invention are as follows:
(1) The invention parses downloaded webpages and extracts the URLs they contain, determines a priority for each URL according to a predetermined rule, and crawls higher-priority URLs first when searching for the required topic files. Because higher-priority URLs are more closely related to topic files, they are more likely to lead to relevant topic files, so the invention can improve search efficiency.
(2) The invention is not limited to searching certain specific websites. It can search every relevant webpage according to URL priority, and can therefore search across the whole Internet.
Description of drawings
Fig. 1 is an architecture diagram of a prior-art Chinese information retrieval system;
Fig. 2 is a schematic diagram of webpage links between different topics;
Fig. 3 is a structural diagram of the crawler system provided by the invention;
Fig. 4 is a flowchart of the method of the invention.
Embodiment
Fig. 3 is a structural diagram of the crawler system 1 provided by the invention. It comprises a webpage and file download module 11, a webpage parsing module 12, a URL filtering module 13, an acquisition control module 14 and a URL queue storage module 15.
The function of each module is described in detail below.
Webpage and file download module 11: downloads webpages or files using HTTP or the File Transfer Protocol (FTP), submits downloaded webpages to the webpage parsing module 12, and submits downloaded files to the indexing system of the search engine, which builds the index database;
When the crawler system 1 first starts running, a number of seed URLs, for example common navigation directory pages such as www.hao123.com, are placed in the highest-priority URL queue of the URL queue storage module 15 (their URL topic score is set to a default initial value). The webpage and file download module 11 takes the seed URLs from the URL queue, downloads the webpages and sends them to the webpage parsing module 12 for parsing.
Webpage parsing module 12: parses HTML webpages, extracts the URL links they contain and submits them to the URL filtering module 13.
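The link extraction performed by the webpage parsing module can be sketched with Python's standard library. This is only an illustrative sketch, not the patent's implementation: the names `LinkExtractor` and `extract_urls` are hypothetical, and restricting the output to `http`, `https` and `ftp` links taken from `<a href>` attributes is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Assumption: only HTTP(S) and FTP links are worth crawling.
                if absolute.startswith(("http://", "https://", "ftp://")):
                    self.urls.append(absolute)


def extract_urls(html, base_url):
    """Return the links found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.urls
```

Relative links are resolved against the URL of the page they appear on, so that every URL handed to the filtering step is absolute and can be compared against the crawl history.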
URL filtering module 13: judges whether each URL has been crawled and, if it has not, whether it meets the filter condition; a URL that has not been crawled and does not meet the filter condition is sent to the acquisition control module 14 as a URL to be crawled;
The URL filtering module 13 saves a history record of crawled URLs. It judges from this record whether each URL contained in a downloaded webpage has been crawled, and adds crawled URLs to the record in real time to keep it up to date;
The URL filtering module 13 can also store a filter condition, for example a configured URL blacklist. The module judges whether the current URL is on the blacklist; if it is, the URL is judged to meet the filter condition and is filtered out rather than sent to the acquisition control module 14. Otherwise, the URL filtering module 13 sends the URLs received from the webpage parsing module 12 that have been judged as uncrawled and as not meeting the filter condition to the acquisition control module 14 for processing.
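A minimal sketch of this filtering logic follows. The class name `UrlFilter` and its methods are hypothetical, the history record is kept as an in-memory set, and the blacklist is assumed to be a set of host names; the patent only says that a URL blacklist may serve as the filter condition and does not say how it is matched.

```python
from urllib.parse import urlsplit


class UrlFilter:
    """Forwards a URL only if it is uncrawled and not blacklisted."""

    def __init__(self, blacklisted_hosts=()):
        self.crawled = set()  # history record of crawled URLs
        # Assumption: the blacklist is matched on the host name.
        self.blacklisted_hosts = set(blacklisted_hosts)

    def mark_crawled(self, url):
        """Record a URL as crawled so that it is never forwarded again."""
        self.crawled.add(url)

    def should_forward(self, url):
        """Return True if the URL should go on to the acquisition control step."""
        if url in self.crawled:
            return False
        host = urlsplit(url).hostname or ""
        return host not in self.blacklisted_hosts
```

A production crawler would likely need a persistent or probabilistic structure for the history record, since the set of crawled URLs grows without bound; that choice is outside what the text describes.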
Acquisition control module 14: uses a predetermined algorithm to calculate the URL topic score of each URL to be crawled, determines the priority of each URL according to the value of its URL topic score, and stores each URL in the queue of the corresponding priority in the URL queue storage module 15;
The URL topic score is calculated as follows:
$$S(\mathit{url}) = \sum_{p \in P(\mathit{url})} F(p) \qquad (1)$$
In formula (1), S(url) is the URL topic score of the URL, F(p) is the topic score of webpage p, and P(url) denotes the set of crawled webpages that contain the URL. That is, the topic score of a URL is the sum of the topic scores of all crawled webpages that contain it.
Here:
$$F(p) = a \times \mathit{numFileLink} \times \mathit{FactorLink} + b \times \mathit{numKeyWord} \times \mathit{FactorWord} \qquad (2)$$
In formula (2), F(p) is the calculated webpage topic score of a webpage that contains the URL;
numFileLink is the number of topic file URLs the webpage contains;
FactorLink is the score factor for URL links;
numKeyWord is the number of topic keywords the webpage contains;
FactorWord is the score factor for topic keywords;
a and b are weight factors, with a + b = 1;
In other words, the topic score of a webpage depends on the number of topic files and the number of topic keywords it contains: the more topic files and topic keywords a webpage contains, the higher its topic score.
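Formulas (1) and (2) translate almost directly into code. The sketch below is illustrative only: `ScoreConfig`, `page_topic_score` and `UrlScoreTable` are hypothetical names, and the default weights and factors are assumed values, since the patent requires only that a + b = 1.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScoreConfig:
    a: float = 0.5             # weight of the topic file link term (a + b = 1)
    b: float = 0.5             # weight of the topic keyword term
    factor_link: float = 10.0  # FactorLink, assumed value
    factor_word: float = 1.0   # FactorWord, assumed value


def page_topic_score(num_file_link, num_key_word, cfg):
    """Formula (2): topic score F(p) of a single webpage."""
    return (cfg.a * num_file_link * cfg.factor_link
            + cfg.b * num_key_word * cfg.factor_word)


class UrlScoreTable:
    """Formula (1): S(url) accumulated over the crawled pages containing url."""

    def __init__(self):
        self.scores = defaultdict(float)

    def add_page(self, page_score, urls_on_page):
        # Each page contributes its score once per distinct URL it contains.
        for url in set(urls_on_page):
            self.scores[url] += page_score
```

Because S(url) is a running sum, it can be updated incrementally each time a newly crawled page is parsed, without revisiting earlier pages.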
URL queue storage module 15: holds several URL queues of different priorities, and URLs to be crawled are placed in the different priority queues according to the value of their URL topic score. For example, three queues are kept: a first, a second and a third priority queue. URLs are divided into three intervals by topic score. The first priority queue ranks highest and stores the URLs to be crawled whose topic scores fall in the highest interval, the second priority queue comes next, and the third priority queue ranks lowest. The webpage and file download module 11 first crawls the URLs in the first priority queue. Only when the first priority queue is empty (a crawled URL is deleted from its queue, so the queue becomes empty once all its URLs have been crawled) does it crawl, in order, the URLs in the second and then the third priority queue;
The number of URL queues stored in the URL queue storage module 15 can be set freely and is not limited by the invention.
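The multi-queue storage described above can be sketched as a list of FIFO queues indexed by priority level. The class name `UrlQueueStore` and the score thresholds that separate the intervals are assumptions; the patent leaves both the number of queues and the interval boundaries open.

```python
from collections import deque


class UrlQueueStore:
    """Holds one FIFO queue per priority level; level 0 is the highest."""

    def __init__(self, thresholds=(50.0, 10.0)):
        # Assumed interval boundaries, in descending order.
        # n thresholds give n + 1 queues (three queues by default).
        self.thresholds = tuple(thresholds)
        self.queues = [deque() for _ in range(len(self.thresholds) + 1)]

    def priority_for(self, score):
        """Map a URL topic score to a priority level."""
        for level, threshold in enumerate(self.thresholds):
            if score >= threshold:
                return level
        return len(self.thresholds)

    def put(self, url, score):
        self.queues[self.priority_for(score)].append(url)

    def get(self):
        """Pop from the highest-priority non-empty queue, or return None."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None
```

Since `get` always scans from the highest level down, a lower-priority queue is served only while every higher one is empty, which matches the ordering rule in the text.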
Based on the crawler system 1 described above, the invention provides a topic file search method. As shown in Fig. 4, its flow comprises:
Step S11: the webpage parsing module parses the webpages downloaded by the webpage and file download module, extracts the URLs they contain and sends them to the URL filtering module;
Step S12: the URL filtering module judges whether the current URL has been crawled, or whether it meets the filter condition and must be filtered out. If the URL has been crawled or meets the filter condition, it is discarded and the flow returns to step S11, where the webpage parsing module continues to extract the other URLs contained in the webpage. If the URL has not been crawled and does not meet the filter condition, it is sent to the acquisition control module and the following steps continue;
Step S13: the acquisition control module calculates the URL topic score of the URL according to the topic file collection algorithm (for example, the algorithm defined by formulas (1) and (2) above);
Step S14: the acquisition control module determines the priority of the URL from the configured mapping between URL topic scores and priorities, and stores the URL in the corresponding priority queue of the URL queue storage module;
Step S15: the webpage and file download module reads URLs in turn, starting from the highest-priority queue, and downloads them. Downloaded webpages are sent to the webpage parsing module for processing, and downloaded files are sent to the indexing system of the search engine.
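Steps S11 to S15 can be tied together in a compact, self-contained loop. The sketch below is a simplification under stated assumptions rather than the patented implementation: pages are obtained through an injected `fetch` function instead of real HTTP or FTP downloads, links are pulled out with a regular expression, topic files are recognized by an assumed `.mp3` suffix, and the keywords, weights and thresholds are illustrative values.

```python
import re
from collections import defaultdict, deque

TOPIC_FILE_SUFFIX = ".mp3"          # assumption: how a topic file is recognized
TOPIC_KEYWORDS = ("music", "song")  # assumption: topic keywords
A, B = 0.5, 0.5                     # weight factors, a + b = 1
FACTOR_LINK, FACTOR_WORD = 10.0, 1.0
THRESHOLDS = (10.0, 2.0)            # assumed score boundaries between queues
HREF_RE = re.compile(r'href="([^"]+)"')


def page_topic_score(html, urls):
    """Formula (2) for one page."""
    num_file_link = sum(u.endswith(TOPIC_FILE_SUFFIX) for u in urls)
    text = html.lower()
    num_key_word = sum(text.count(word) for word in TOPIC_KEYWORDS)
    return A * num_file_link * FACTOR_LINK + B * num_key_word * FACTOR_WORD


def crawl(seeds, fetch, max_downloads=100):
    """Return (download order, topic files found)."""
    # Seeds start in the highest-priority queue.
    queues = [deque(seeds)] + [deque() for _ in THRESHOLDS]
    crawled, url_scores = set(), defaultdict(float)
    order, files = [], []
    while len(crawled) < max_downloads:
        queue = next((q for q in queues if q), None)  # S15: highest queue first
        if queue is None:
            break
        url = queue.popleft()
        if url in crawled:
            continue
        crawled.add(url)
        order.append(url)
        if url.endswith(TOPIC_FILE_SUFFIX):
            files.append(url)  # S15: a real system hands the file to the indexer
            continue
        html = fetch(url)
        urls = HREF_RE.findall(html)            # S11: extract URLs
        score = page_topic_score(html, urls)
        for link in dict.fromkeys(urls):        # distinct links, document order
            url_scores[link] += score           # S13: formula (1), running sum
            if link in crawled:                 # S12: skip crawled URLs
                continue
            level = next((i for i, t in enumerate(THRESHOLDS)
                          if url_scores[link] >= t), len(THRESHOLDS))
            queues[level].append(link)          # S14: enqueue by priority
    return order, files
```

A URL discovered on several pages may be enqueued more than once as its score grows; the check against `crawled` when a URL is popped keeps it from being downloaded twice. Passing `fetch` as a parameter keeps the sketch testable against an in-memory dictionary of pages.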
In summary, the invention parses downloaded webpages and extracts the URLs they contain, calculates a topic score for each URL using the URL topic score calculation method, determines its priority according to a predetermined rule, places it in the corresponding priority queue, and crawls higher-priority URLs first when searching for the required topic files. Because higher-priority URLs are more closely related to topic files, they are more likely to lead to relevant topic files, so the invention can improve search efficiency.
In addition, the invention can search across the whole Internet rather than being limited to certain specific websites, so the search is comprehensive and fully meets users' needs.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (7)

1. An Internet topic file search method, characterized by comprising:
A. parsing a downloaded webpage and extracting the uniform resource locators (URLs) contained in the webpage;
B. calculating the webpage topic score of each crawled webpage that contains the URL, and accumulating these webpage topic scores as the URL topic score of the URL;
determining the priority of the URL according to the value of its URL topic score;
C. crawling the URLs in order of priority from high to low, building an index, and searching for the required Internet topic files.
2. The Internet topic file search method of claim 1, characterized by further comprising:
saving a history record of crawled URLs;
in step B, judging from the history record whether each URL contained in the downloaded webpage has been crawled, and determining a priority only for URLs that have not been crawled.
3. The Internet topic file search method of claim 2, characterized by further comprising:
setting a URL filter condition, and determining a priority only for uncrawled URLs that do not meet the filter condition.
4. The Internet topic file search method of claim 3, characterized in that the webpage topic score is calculated as:
$$F(p) = a \times \mathit{numFileLink} \times \mathit{FactorLink} + b \times \mathit{numKeyWord} \times \mathit{FactorWord}$$
where F(p) is the calculated webpage topic score;
numFileLink is the number of topic file URLs the webpage contains;
FactorLink is the score factor for URL links;
numKeyWord is the number of topic keywords the webpage contains;
FactorWord is the score factor for topic keywords;
a and b are weight factors, with a + b = 1.
5. A crawler system of a search engine, characterized by comprising: a URL queue storage module, a webpage and file download module, a webpage parsing module and an acquisition control module;
the URL queue storage module stores the URLs to be crawled in order of priority;
the webpage and file download module downloads webpages or files one by one in order of URL priority from high to low, sends downloaded webpages to the webpage parsing module, and sends downloaded files to the indexing system of the search engine for processing;
the webpage parsing module parses webpages, extracts the URLs they contain and sends them to the acquisition control module;
the acquisition control module calculates the webpage topic score of each crawled webpage that contains the URL and accumulates these webpage topic scores as the URL topic score of the URL; it determines the priority of the URL according to the value of the URL topic score and stores the URL, by priority, in the corresponding priority queue in the URL queue storage module.
6. The crawler system of claim 5, characterized by further comprising a URL filtering module connected between the webpage parsing module and the acquisition control module;
the URL filtering module judges whether each URL parsed out by the webpage parsing module has been crawled and keeps only the URLs that have not; it further judges whether each uncrawled URL meets the set URL filter condition, and sends only the uncrawled URLs that do not meet the filter condition to the acquisition control module.
7. A search engine comprising a crawler system, an indexing system and a retrieval system, characterized in that the crawler system comprises: a URL queue storage module, a webpage and file download module, a webpage parsing module and an acquisition control module;
the URL queue storage module stores the URLs to be crawled in order of priority;
the webpage and file download module downloads webpages or files one by one in order of URL priority from high to low, sends downloaded webpages to the webpage parsing module, and sends downloaded files to the indexing system of the search engine for processing;
the webpage parsing module parses webpages, extracts the URLs they contain and sends them to the acquisition control module;
the acquisition control module calculates the webpage topic score of each crawled webpage that contains the URL and accumulates these webpage topic scores as the URL topic score of the URL; it determines the priority of the URL according to the value of the URL topic score and stores the URL, by priority, in the corresponding priority queue in the URL queue storage module.
CNB200610099277XA 2006-07-25 2006-07-25 Internet topics file searching method, reptile system and search engine Active CN100520778C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200610099277XA CN100520778C (en) 2006-07-25 2006-07-25 Internet topics file searching method, reptile system and search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB200610099277XA CN100520778C (en) 2006-07-25 2006-07-25 Internet topics file searching method, reptile system and search engine

Publications (2)

Publication Number Publication Date
CN101114285A CN101114285A (en) 2008-01-30
CN100520778C true CN100520778C (en) 2009-07-29

Family

ID=39022634

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200610099277XA Active CN100520778C (en) 2006-07-25 2006-07-25 Internet topics file searching method, reptile system and search engine

Country Status (1)

Country Link
CN (1) CN100520778C (en)

Also Published As

Publication number Publication date
CN101114285A (en) 2008-01-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131021

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518044 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20131021

Address after: 518057 Tencent Building, 16, Nanshan District hi tech park, Guangdong, Shenzhen

Patentee after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: 2, 518044, East 410 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.