CN101826110A - Method for crawling BitTorrent torrent files - Google Patents

Publication number
CN101826110A
Authority
CN
China
Prior art keywords
address
seed file
issue page
page address
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010147527
Other languages
Chinese (zh)
Other versions
CN101826110B (en)
Inventor
宋维佳
马皓
张建宇
张缘
杨加
张蓓
周渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN2010101475279A
Publication of CN101826110A
Application granted
Publication of CN101826110B
Legal status: Expired - Fee Related

Abstract

The invention relates to a method for crawling BitTorrent torrent files and belongs to the field of computer networks. The method comprises the following steps: 1) according to configured characteristic keywords of BT servers, a detection module calls a search engine interface to search for BT publishing web sites and sends their publishing page addresses to a crawler module; 2) the crawler module downloads the corresponding web pages according to the publishing page addresses it receives; 3) the crawler module parses the downloaded web pages to obtain the addresses of torrent files and downloads the torrent files to a torrent file library according to those addresses; and 4) a torrent file parser parses the torrent files to obtain the address of the index server, converts that address into a publishing page address, and sends it to the crawler module, after which steps 2 to 4 are repeated. Compared with the prior art, the torrent resources crawled are more complete and abundant, and the torrent resources in the torrent file library are greatly increased.

Description

A BitTorrent torrent file crawling method
Technical field
The present invention relates to a method for crawling BitTorrent torrent files that discovers and downloads BitTorrent torrents quickly and effectively. It belongs to the field of computer networks.
Background
Since Napster appeared in 1999, P2P file-sharing systems have evolved continuously and produced a number of transfer protocols, the best known being BitTorrent, eDonkey, and Gnutella. Their core idea is to make full use of the upload bandwidth of downloaders (usually called peers), so that each peer uploads the parts it has already downloaded to other peers while it is still downloading. Taking the BitTorrent protocol as an example, a torrent file records the name and size of the content files and the address of the index server. From the torrent file, a user can locate the index server (also called the tracker server), then find the peers that hold the corresponding content, connect to them, and download the data. This transfer model exploits the available network bandwidth effectively and speeds up file downloads.
Because they are so widely used, P2P file-sharing systems bring problems along with convenience. The egress bandwidth of many institutions is largely consumed by P2P traffic, to the point of affecting normal business traffic. Some harmful content, damaging to physical and mental health, also spreads through P2P file-sharing systems. To address these problems, one first needs to know what content is shared on P2P file-sharing systems, so that its propagation patterns can be understood and managed effectively. For the BitTorrent network, this can be achieved by collecting torrent files.
Torrent files are published in several ways. One is centralized publication on large, well-organized sites such as mininova and btchina; another is small private BT servers set up by individuals. To obtain a reasonably complete picture of the torrent files on the network, the existing approach uses a web crawler that actively crawls torrent files according to the structure of the large torrent publishing sites. This approach is efficient, but it generally requires manual configuration, adapts poorly, and handles dynamic web pages awkwardly. Later crawlers detect torrent files on a page automatically and embed a browser engine in the crawler to process dynamic page scripts. This approach is more adaptable than the first, but less efficient. Both approaches address the crawling of large torrent publishing sites, but neither handles small private BT servers well: such servers are highly dynamic, and existing methods have difficulty obtaining their addresses. In summary, existing methods cannot effectively track content distributed through highly dynamic private BT servers.
Summary of the invention
To address the problem that existing BitTorrent torrent crawlers cannot effectively track highly dynamic private BT servers, the purpose of the present invention is to provide a BitTorrent torrent file crawling method that discovers private BT servers promptly and directs the crawler to download their torrents.
We observe that most private BT servers are built with a small number of software packages, and that servers built with the same software have publishing pages with similar features. By extracting characteristic keywords, a search engine can be used to detect a large number of publishing pages. (A characteristic keyword of a software package is defined as a distinctive phrase that appears on every BT tracker publishing page built with that software. It can generally be determined by inspecting the publishing page; for example, "BNBT Tracker Info" is the characteristic keyword of BNBT.) In addition, on a small private BT server the index server (tracker server) doubles as the web publishing server, so the index server recorded inside a torrent file may itself be a torrent publishing page. Based on these two observations, the technical concept of the invention is as follows. First, the crawler uses the characteristic keywords of commonly used BT server software to probe periodically for new BitTorrent torrent publishing sites with a general-purpose search engine. Second, each time a torrent file is crawled, the index server address is parsed from it and tested to see whether it is a torrent publishing page; if so, the torrent files on that page are crawled as well.
The technical scheme of the present invention is:
A BitTorrent torrent file crawling method, comprising the steps of:
1) according to the configured characteristic keywords of BT servers, a detection module calls a search engine interface to search for BT publishing web sites and sends their publishing page addresses to a crawler module;
2) the crawler module downloads the corresponding pages according to the publishing page addresses it receives;
3) the crawler module parses torrent file addresses from the downloaded pages and downloads the torrent files to a torrent file library according to those addresses;
4) a torrent file parser parses the index server address from each torrent file, converts it into a publishing page address, and sends it to the crawler module; steps 2) to 4) are then repeated.
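The feedback loop formed by steps 2) to 4) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function `crawl`, its parameters `fetch_torrent_urls` and `fetch_tracker`, and the `max_pages` limit are hypothetical names introduced here, and the network and parsing work is passed in as plain callables so that the loop itself stays visible.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple


def crawl(
    seed_pages: Iterable[str],
    fetch_torrent_urls: Callable[[str], List[str]],
    fetch_tracker: Callable[[str], str],
    max_pages: int = 100,
) -> Tuple[List[str], List[str]]:
    """Repeat steps 2) to 4) until no unvisited publishing page is left.

    seed_pages         -- publishing page addresses produced by step 1)
    fetch_torrent_urls -- hypothetical: page address -> torrent addresses on it
    fetch_tracker      -- hypothetical: torrent address -> tracker announce URL
    """
    pending = deque(seed_pages)
    visited: List[str] = []
    torrents: List[str] = []
    while pending and len(visited) < max_pages:
        page = pending.popleft()
        if page in visited:
            continue
        visited.append(page)                          # step 2): download the page
        for torrent_url in fetch_torrent_urls(page):  # step 3): find torrent files
            if torrent_url in torrents:
                continue
            torrents.append(torrent_url)
            announce = fetch_tracker(torrent_url)     # step 4): read the tracker address
            if announce.endswith("announce"):
                # Feed the candidate publishing page back into the crawl.
                pending.append(announce[: -len("announce")])
    return visited, torrents
```

The point of the sketch is that every downloaded torrent can contribute a new publishing page, so the crawl can reach servers that the search engine never returned.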
Further, the detection module calls the search engine interface periodically to search for BT publishing web sites.
Further, the detection module provides a keyword input interface for receiving newly configured BT server characteristic keywords.
Further, the crawler module comprises a new publishing page address queue, which caches publishing page addresses that have not yet been crawled; an old publishing page address queue, which caches publishing page addresses that have already been crawled; and a torrent file address set, which stores the link addresses of torrent files that have already been downloaded.
Further, the crawler module downloads the corresponding pages as follows: for each publishing page address received, the crawler module first checks whether the address is already in the new or the old publishing page address queue; if so, it discards the address, and otherwise it appends the address to the tail of the new publishing page address queue. The crawler module then takes an address from the head of the new publishing page address queue, puts it into the old publishing page address queue, and downloads the page at that address.
Further, the crawler module parses the addresses of other web pages from the downloaded page and checks whether each address is already in the new or the old publishing page address queue; if so, it discards the address, and otherwise it appends the address to the tail of the new publishing page address queue.
Further, the crawler module downloads torrent files to the torrent file library as follows: for each torrent file address received, the crawler module first checks whether the address is already in the torrent file address set; if so, it skips the download, and otherwise it follows the address and downloads the torrent file.
Further, the addresses cached in the new and old publishing page address queues are hashes of the publishing page addresses; each hash is marked with the time it was last processed, and the crawler module deletes publishing page address hashes whose marked time is older than a set period. The addresses stored in the torrent file address set are hashes of the torrent file addresses; each hash is marked with the time it was last processed, and the crawler module deletes torrent file address hashes whose marked time is older than a set period.
Further, the torrent file library is a database or a file system. The detection module, crawler module, torrent file parser, and torrent file library either run on different hosts connected by a network, or all run on the same host.
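The hashed, time-stamped address records described above might look like the following sketch. The class name `SeenAddresses` and its methods are hypothetical. The text does not name a hash function, so BLAKE2b truncated to 16 bytes is used here purely as an assumed way to obtain a 128-bit fingerprint, and the default retention period of 3 days follows the value mentioned in the embodiment.

```python
import hashlib
import time
from typing import Dict, Optional

THREE_DAYS = 3 * 24 * 3600  # retention period in seconds


class SeenAddresses:
    """Remember address hashes together with the time they were last processed."""

    def __init__(self, ttl_seconds: float = THREE_DAYS) -> None:
        self.ttl = ttl_seconds
        self._last_processed: Dict[bytes, float] = {}

    @staticmethod
    def _digest(address: str) -> bytes:
        # Assumption: any 128-bit hash would do; BLAKE2b with a 16-byte digest is one choice.
        return hashlib.blake2b(address.encode("utf-8"), digest_size=16).digest()

    def mark(self, address: str, now: Optional[float] = None) -> None:
        """Record (or refresh) the last-processed time of an address."""
        self._last_processed[self._digest(address)] = time.time() if now is None else now

    def __contains__(self, address: str) -> bool:
        return self._digest(address) in self._last_processed

    def expire(self, now: Optional[float] = None) -> int:
        """Delete hashes older than the retention period; return how many were removed."""
        now = time.time() if now is None else now
        stale = [h for h, t in self._last_processed.items() if now - t > self.ttl]
        for h in stale:
            del self._last_processed[h]
        return len(stale)
```

Storing only fixed-size digests bounds the memory used per address, and expiring old entries lets a page be crawled again once its record has aged out.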
Logically, the crawler system is divided into four parts:
1) Detection module: this software module discovers BT torrent publishing sites. Using characteristic keywords (such as "BNBT Tracker Info"), it calls a general-purpose search engine (such as Baidu) to find BT torrent publishing sites, organizes them into a list of publishing pages, and sends the list to the crawler module.
2) Crawler module: this software module crawls torrent files. It reads the publishing page list sent by the detection module, downloads the content of each page from the Internet, parses the torrent file addresses in the page, downloads the torrent files, and stores them in the torrent file library.
3) Torrent file parser: this software module parses the index server address from a torrent file, converts it into the URL of a publishing page, and passes that URL to the crawler module.
4) Torrent file library: the torrent file library stores torrent files for later analysis and processing. Depending on the storage method chosen, it can be a database or a file system.
The four logical modules above may run on the same host, or may be distributed across different hosts to improve performance, in which case the modules cooperate over the network.
As shown in Fig. 1, the working steps of the crawler system, following the data flow, are as follows:
In the first step, the detection module assembles the preset characteristic keywords into an HTTP request and sends it to the search engine. After receiving the results, the detection module extracts the page URLs from them and sends these to the crawler module. In this step, the detection process may be triggered periodically or manually; characteristic keywords may be added while the system is running, to cope with the appearance of new software; and the interface between the detection module and the crawler module may use an inter-process communication mechanism or a socket.
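As a rough illustration of this first step, the sketch below assembles a search request URL from a characteristic keyword and pulls candidate page URLs out of the returned HTML. The endpoint and the `q` parameter follow the Google request quoted in the embodiment; the function names and the `href` pattern are assumptions made for the example, since the actual regular expression is not given in the text.

```python
import re
from typing import List
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://www.google.cn/search"  # endpoint quoted in the embodiment


def build_search_url(keyword: str, endpoint: str = SEARCH_ENDPOINT) -> str:
    """Assemble the search request URL for one characteristic keyword."""
    return endpoint + "?" + urlencode({"q": keyword})


# Hypothetical pattern: absolute http(s) links inside href attributes.
HREF_RE = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)


def extract_page_urls(result_html: str) -> List[str]:
    """Return the distinct linked URLs in a result page, in order of appearance."""
    seen = set()
    urls = []
    for url in HREF_RE.findall(result_html):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

A real result page also links to the search engine's own pages, so a deployment would likely need an extra filter before passing the URLs to the crawler module.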
In the second step, the crawler module downloads and analyzes the web pages at the URLs supplied by the previous step, obtains the torrent file links from them, and then downloads the torrent files, as shown in Fig. 2. This work is carried out in sequence by four submodules inside the crawler: the URL filtering submodule, the page download submodule, the web page analysis submodule, and the torrent file download submodule. In addition, to avoid crawling the same content repeatedly, the crawler module maintains two URL queues, new and old, and a torrent file URL set. The new URL queue caches the URLs of web pages that have not yet been crawled, the old URL queue caches the URLs of web pages that have already been processed, and the torrent file URL set stores the links of torrent files that have already been processed.
When an input URL arrives, the URL filtering submodule first checks whether the URL already appears in either the new or the old URL queue. If it does, the URL is discarded; otherwise it is put into the new URL queue.
The page download submodule takes the URL at the head of the new URL queue (that is, the URL that has been in the queue longest) and puts it into the old URL queue. It then downloads the HTML page content and hands it to the web page analysis submodule. The web page analysis submodule extracts the URLs of other web pages and the URLs of torrent files from the page content, and sends them to the URL filtering submodule and the torrent file download submodule respectively.
The torrent file download submodule checks whether the input torrent file URL is already in the torrent file URL set. If it is, the URL is discarded; otherwise the torrent file is downloaded and stored in the torrent file library.
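The bookkeeping shared by the URL filtering, page download, and torrent file download submodules can be sketched as below. The class `CrawlerState` and its method names are hypothetical. The old URL "queue" is modelled as a set because it is only ever used for membership tests, which is a simplification made for this example.

```python
from collections import deque
from typing import Deque, Optional, Set


class CrawlerState:
    """New and old page URL queues plus the set of torrent URLs already handled."""

    def __init__(self) -> None:
        self.new_urls: Deque[str] = deque()   # pages waiting to be crawled
        self.old_urls: Set[str] = set()       # pages already handed to the downloader
        self.torrent_urls: Set[str] = set()   # torrent links already downloaded

    def filter_url(self, url: str) -> bool:
        """URL filtering submodule: enqueue url unless it is already known."""
        if url in self.old_urls or url in self.new_urls:
            return False
        self.new_urls.append(url)
        return True

    def next_page(self) -> Optional[str]:
        """Page download submodule: move the oldest new URL to the old queue."""
        if not self.new_urls:
            return None
        url = self.new_urls.popleft()
        self.old_urls.add(url)
        return url

    def accept_torrent(self, url: str) -> bool:
        """Torrent download submodule: True only the first time a link is seen."""
        if url in self.torrent_urls:
            return False
        self.torrent_urls.add(url)
        return True
```

Membership tests on a `deque` are linear in its length, so a larger deployment would probably keep a companion set for the new queue as well.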
In the third step, the torrent file parser analyzes the files crawled in the second step and extracts the tracker server address from each (note: the torrent file format is an open standard and is not described further here), for example http://btfans.3322.org:6969/announce. It removes the trailing announce string from the address to obtain the publishing page address, for example http://btfans.3322.org:6969/, and sends that address to the crawler module for processing.
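A minimal sketch of this third step follows. Torrent files are bencoded, and the tracker address is stored under the top-level `announce` key. The decoder `bdecode` and the helper `tracker_to_publish_page` are hypothetical names written for this illustration; the sketch ignores multi-tracker extensions and does no validation beyond what Python raises on malformed input.

```python
from typing import Any, Tuple


def bdecode(data: bytes, i: int = 0) -> Tuple[Any, int]:
    """Decode one bencoded value starting at data[i]; return (value, next index)."""
    lead = data[i:i + 1]
    if lead == b"i":                       # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if lead == b"l" or lead == b"d":       # list: l...e, dictionary: d...e
        items = []
        i += 1
        while data[i:i + 1] != b"e":
            value, i = bdecode(data, i)
            items.append(value)
        if lead == b"l":
            return items, i + 1
        # Dictionary entries alternate key, value, key, value, ...
        return dict(zip(items[0::2], items[1::2])), i + 1
    colon = data.index(b":", i)            # byte string: <length>:<bytes>
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length


def tracker_to_publish_page(torrent_bytes: bytes) -> str:
    """Read the announce URL and strip the trailing 'announce' string."""
    meta, _ = bdecode(torrent_bytes)
    announce = meta[b"announce"].decode("utf-8")
    if announce.endswith("announce"):
        return announce[: -len("announce")]
    return announce
```

For a torrent whose announce URL is http://btfans.3322.org:6969/announce, `tracker_to_publish_page` returns http://btfans.3322.org:6969/. Whether that address really serves a publishing page still has to be checked by the crawler when it downloads it.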
Compared with the prior art, the positive effects of the present invention are:
The method of the invention mines not only the torrent resources of large BT servers but also those of the many small private BT servers, whose number grows daily, and it discovers and downloads BitTorrent torrents quickly and effectively. By downloading a large number of torrent files it enriches the torrent file library, which strengthens the capability of the search engine and makes retrieval results more comprehensive and richer.
Description of drawings
Fig. 1 shows the data flow of the system;
Fig. 2 shows the data flow of the crawler;
Fig. 3 shows the system structure of the embodiment;
Fig. 4 shows a fragment of the results returned by the search engine.
Embodiment
An embodiment of the scheme is now described using a concrete example.
The hardware environment of the implementation is shown in Fig. 3. The working environment of the crawler system comprises two local area networks: LAN 1, built on network switch 1, can access the Internet; LAN 2, built on network switch 2, is an internal network that cannot access the Internet. One detection host, dedicated to running the detection module software, is connected to LAN 1. One crawler host runs the crawler module software and the torrent file parser; it has two network interface cards, one connected to LAN 1 and the other to LAN 2. The file storage server is an NFS server that serves as the torrent file library and is connected to LAN 2; the crawler host mounts the shared directory of the NFS server and writes torrent files to it. Note that, as stated earlier, besides NFS the torrent file library may use other kinds of file storage, such as GFS (Global File System), or a database. In this example all hosts run the Linux operating system; the scheme may also use other operating systems.
The operating mechanism of each software module is explained in the "Summary of the invention" section; the technical details of the embodiment are described below.
1) Interaction between the detection module and the search engine. Taking the Google search engine and the BNBT server as an example, the detection module puts the characteristic keyword "BNBT Tracker Info" into the query field of a Google request, giving the search HTTP request URL http://www.google.cn/search?q=BNBT+Tracker+Info. The detection module accesses this URL to obtain the response. The HTML fragment shown in Fig. 4 is a search result, in which the part in bold is the address of a torrent publishing page. The addresses of publishing pages are easily extracted from the search results with a regular expression.
2) Unbounded growth of the old URL queue in the crawler module. The old URL queue grows continuously while the system runs, and over time it consumes a large amount of memory. Moreover, because of the old URL queue, each URL can be processed only once, so changes to the page behind a URL are missed. In our implementation we address these two problems in two ways. First, the queue stores a hash of the URL string (usually 128 bits) to reduce memory overhead. Second, each hash is marked with the time it was last processed, and it is released once it has stayed in the system longer than a certain period (generally 3 days, on the assumption that publishing pages change about every 3 days on average).
3) Unbounded growth of the torrent file URL set is solved in a similar way.
4) Improving crawler efficiency. Although only one copy of the crawler module software runs in this example, the crawler can run in a distributed fashion in a concrete implementation: every submodule except the URL filtering submodule can run as multiple copies to make full use of the host, and copies can even be distributed across different hosts for greater effect.
5) Complexity of page parsing. Dynamic scripts are now widely used in web pages, and scripts that load content dynamically in particular make it difficult to analyze page content. We use a browser engine such as Firefox's Gecko to load the page (for the specific method, see the Gecko documentation), let Gecko process the dynamic effects on the page automatically, and then analyze the page's DOM tree.
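Once a page has been rendered, the analysis step reduces to separating ordinary page links from torrent links. The sketch below shows that split using Python's standard `html.parser`, which does not execute scripts and therefore stands in for Gecko only on static HTML. The names `LinkCollector` and `split_links`, and the rule that a torrent link is one whose path ends in `.torrent`, are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> element, resolved against the page URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def split_links(base_url: str, html: str) -> Tuple[List[str], List[str]]:
    """Return (page URLs, torrent URLs) found in the HTML."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    pages: List[str] = []
    torrents: List[str] = []
    for link in collector.links:
        # Assumption: a torrent link is one whose path ends in ".torrent".
        if urlsplit(link).path.lower().endswith(".torrent"):
            torrents.append(link)
        else:
            pages.append(link)
    return pages, torrents
```

In the data flow of the crawler, the first list would go to the URL filtering submodule and the second to the torrent file download submodule.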

Claims (9)

1. A BitTorrent torrent file crawling method, comprising the steps of:
1) according to the configured characteristic keywords of BT servers, a detection module calls a search engine interface to search for BT publishing web sites and sends their publishing page addresses to a crawler module;
2) the crawler module downloads the corresponding pages according to the publishing page addresses it receives;
3) the crawler module parses torrent file addresses from the downloaded pages and downloads the torrent files to a torrent file library according to those addresses;
4) a torrent file parser parses the index server address from each torrent file, converts it into a publishing page address, and sends it to the crawler module; steps 2) to 4) are then repeated.
2. The method of claim 1, characterized in that the detection module calls the search engine interface periodically to search for BT publishing web sites.
3. The method of claim 1, characterized in that the detection module provides a keyword input interface for receiving newly configured BT server characteristic keywords.
4. The method of claim 1, characterized in that the crawler module comprises a new publishing page address queue, which caches publishing page addresses that have not yet been crawled; an old publishing page address queue, which caches publishing page addresses that have already been crawled; and a torrent file address set, which stores the link addresses of torrent files that have already been downloaded.
5. The method of claim 4, characterized in that the crawler module downloads the corresponding pages as follows: for each publishing page address received, the crawler module first checks whether the address is already in the new or the old publishing page address queue; if so, it discards the address, and otherwise it appends the address to the tail of the new publishing page address queue; the crawler module then takes an address from the head of the new publishing page address queue, puts it into the old publishing page address queue, and downloads the page at that address.
6. The method of claim 5, characterized in that the crawler module parses the addresses of other web pages from the downloaded page and checks whether each address is already in the new or the old publishing page address queue; if so, it discards the address, and otherwise it appends the address to the tail of the new publishing page address queue.
7. The method of claim 4, characterized in that the crawler module downloads torrent files to the torrent file library as follows: for each torrent file address received, the crawler module first checks whether the address is already in the torrent file address set; if so, it skips the download, and otherwise it follows the address and downloads the torrent file.
8. The method of claim 4, characterized in that the addresses cached in the new and old publishing page address queues are hashes of the publishing page addresses; each hash is marked with the time it was last processed, and the crawler module deletes publishing page address hashes whose marked time is older than a set period; the addresses stored in the torrent file address set are hashes of the torrent file addresses; each hash is marked with the time it was last processed, and the crawler module deletes torrent file address hashes whose marked time is older than a set period.
9. The method of claim 1, characterized in that the torrent file library is a database or a file system; the detection module, crawler module, torrent file parser, and torrent file library run on different hosts connected by a network; or the detection module, crawler module, torrent file parser, and torrent file library run on the same host.
CN2010101475279A 2010-04-13 2010-04-13 Method for crawling BitTorrent torrent files Expired - Fee Related CN101826110B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101475279A CN101826110B (en) 2010-04-13 2010-04-13 Method for crawling BitTorrent torrent files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010101475279A CN101826110B (en) 2010-04-13 2010-04-13 Method for crawling BitTorrent torrent files

Publications (2)

Publication Number Publication Date
CN101826110A (en) 2010-09-08
CN101826110B CN101826110B (en) 2011-12-21

Family

ID=42690030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101475279A Expired - Fee Related CN101826110B (en) 2010-04-13 2010-04-13 Method for crawling BitTorrent torrent files

Country Status (1)

Country Link
CN (1) CN101826110B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101443751A (en) * 2004-11-22 2009-05-27 特鲁维奥公司 Method and apparatus for an application crawler
CN101046806A (en) * 2006-03-30 2007-10-03 腾讯科技(深圳)有限公司 Search engine system and method
CN101101601A (en) * 2007-07-10 2008-01-09 北京大学 Subject crawling method based on link hierarchical classification in network search

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
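Jia Yang, Hao Ma, Weijia Song, Jian Cui, and Changling Zhou, "Crawling the eDonkey Network", Proceedings of the Fifth International Conference on Grid and Cooperative Computing Workshops (GCCW'06), IEEE Computer Society, 2006-12-30, 2 *
方启明, 杨广文, 武永卫, 朱安平, 郑纬民, "Customizable focused web crawler for P2P search" (面向P2P搜索的可定制聚焦网络爬虫), Journal of Huazhong University of Science and Technology (华中科技大学学报), vol. 35, 2007-10-31, 2 *
肖建勇, 张武生, "Clair: a P2P-based BitTorrent keyword search system" (Clair: 一种基于P2P的BitTorrent关键词检索系统), Computer Engineering and Applications (计算机工程与应用), vol. 42, no. 18, 2006-06-30, 2 *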

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102355488A (en) * 2011-08-15 2012-02-15 北京星网锐捷网络技术有限公司 Crawler seed obtaining method and equipment and crawler crawling method and equipment
CN102355488B (en) * 2011-08-15 2014-01-22 北京星网锐捷网络技术有限公司 Crawler seed obtaining method and equipment and crawler crawling method and equipment
CN105025064B (en) * 2014-04-30 2019-07-02 腾讯科技(深圳)有限公司 Download the method, apparatus and system of file
CN105025064A (en) * 2014-04-30 2015-11-04 腾讯科技(深圳)有限公司 Method, device and system for downloading files
CN103942335B (en) * 2014-05-07 2017-04-26 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
CN103942335A (en) * 2014-05-07 2014-07-23 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
CN106598984A (en) * 2015-10-16 2017-04-26 北京国双科技有限公司 Data processing method and device of web crawler
CN107102997A (en) * 2016-02-22 2017-08-29 北京国双科技有限公司 data crawling method and device
CN105760550A (en) * 2016-03-23 2016-07-13 江苏物联网研究发展中心 Big data storage center-oriented internet data acquisition system and acquisition method
CN107526833A (en) * 2017-09-05 2017-12-29 广东科杰通信息科技有限公司 A kind of URL management methods, system
CN107526833B (en) * 2017-09-05 2020-03-24 广东科杰通信息科技有限公司 URL management method and system
WO2020211351A1 (en) * 2019-04-19 2020-10-22 平安科技(深圳)有限公司 Method and device for obtaining external data by using crawler
CN111260223A (en) * 2020-01-17 2020-06-09 山东省计算中心(国家超级计算济南中心) Intelligent identification and early warning method, system, medium and equipment for trial and judgment risk
CN113987146A (en) * 2021-10-22 2022-01-28 国网江苏省电力有限公司镇江供电分公司 Dedicated novel intelligence of electric power intranet system of asking for answering
CN113987146B (en) * 2021-10-22 2023-01-31 国网江苏省电力有限公司镇江供电分公司 Dedicated intelligent question-answering system of electric power intranet

Also Published As

Publication number Publication date
CN101826110B (en) 2011-12-21

Similar Documents

Publication Publication Date Title
CN101826110B (en) Method for crawling BitTorrent torrent files
CN108206802B (en) Method and device for detecting webpage backdoor
JP4668567B2 (en) System and method for client-based web crawling
US9183214B2 (en) Method and apparatus for data storage and downloading
CN102968591B (en) Malicious-software characteristic clustering analysis method and system based on behavior segment sharing
CN102073683A (en) Distributed real-time news information acquisition system
CN101046806B (en) Search engine system and method
CN104516982A (en) Method and system for extracting Web information based on Nutch
CN104363251B (en) Website security detection method and device
CN102957571A (en) Method and system for monitoring network flows
WO2020024903A1 (en) Method and device for searching for blockchain data, and computer readable storage medium
CN107688568A (en) Acquisition method and device based on web page access behavior record
CN101211340A (en) Dynamic network crawler based on client end /service end
US20230359627A1 (en) Sharing compiled code for executing queries across query engines
US8560521B2 (en) System, method, and computer program product for processing a prefix tree file utilizing a selected agent
Loo et al. Distributed web crawling over DHTs
CN105677921A (en) Method and system for acquiring Internet public opinion data
CN103440454B (en) A kind of active honeypot detection method based on search engine keywords
Deka NoSQL web crawler application
CN101763392A (en) Retrieval architecture and retrieval method
Leng et al. PyBot: an algorithm for web crawling
Krupa et al. On-demand web search using browser-based volunteer computing
Agrawal et al. A survey on content based crawling for deep and surface web
JP5165717B2 (en) Dead link determination apparatus and method
Liu et al. WRT: Constructing Users' Web Request Trees from HTTP Header Logs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20111221

Termination date: 20160413
