CN101826110A - Method for crawling BitTorrent torrent files

Info

Publication number: CN101826110A
Application number: CN 201010147527
Authority: CN
Inventors: 宋维佳; 马皓; 张建宇; 张缘; 杨加; 张蓓; 周渊
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2010-04-13
Filing date: 2010-04-13
Publication date: 2010-09-08
Anticipated expiration: 2030-04-13
Also published as: CN101826110B

Abstract

The invention relates to a method for crawling BitTorrent torrent files and belongs to the field of computer networks. The method comprises the following steps: 1) according to preset characteristic keywords of BT server software, a detection module calls a search engine interface to search for BT release websites and sends their release page addresses to a crawler module; 2) the crawler module downloads the corresponding pages according to the release page addresses it receives; 3) the crawler module parses the downloaded pages to obtain torrent file addresses and downloads the torrent files to a torrent file library according to those addresses; and 4) a torrent file parser parses the torrent files to obtain the address of the index server, converts the index server address into a release page address and sends it to the crawler module, after which steps 2) to 4) are repeated. Compared with the prior art, the method crawls torrent resources more completely and greatly increases the torrent resources held in the torrent file library.

Description

Method for crawling BitTorrent torrent files

Technical field

The present invention relates to a method for crawling BitTorrent torrent files (seed files) that can discover and download BitTorrent torrents quickly and effectively, and belongs to the field of computer networks.

Background art

P2P file-sharing systems have evolved continuously since Napster appeared in 1999, and various transfer protocols have been developed, among them the well-known BitTorrent, eDonkey and Gnutella. Their core idea is to make full use of the upload bandwidth of downloaders (usually called peers), so that while downloading they can upload the parts they have already obtained to other peers. Taking the BitTorrent protocol as an example, a BitTorrent torrent file (seed file) records the name and size of the content file and the address of the index server. Through the torrent file a user can find the index server (also called the tracker server), then find the peers holding the corresponding content file, connect to them and download data. This transfer mode exploits the potential of network bandwidth effectively and speeds up file downloads.

Because they are so widely used, P2P file-sharing systems bring problems as well as convenience. The egress bandwidth of many organizations is largely consumed by P2P traffic, to the point of affecting normal service traffic. Some harmful content also spreads through P2P file-sharing systems, to the detriment of physical and mental health. To address these problems, one must first understand what content is shared on P2P file-sharing systems, so as to grasp how that content propagates and manage it effectively; for the BitTorrent network, this goal can be achieved by collecting torrent files.

Torrent files are released in several ways. One is centralized release on large, regularly structured websites such as mininova and btchina; another is release on small private BT servers set up by individuals. To gain a reasonably comprehensive picture of the torrent files on the network, the existing approach uses a web crawler that actively crawls torrent files according to the structure of large torrent release sites. This approach is efficient, but it generally requires manual configuration, adapts poorly, and handles dynamic web pages awkwardly. Later crawlers detect torrent files on a page automatically and embed a browser engine in the crawler to process dynamic page scripts. This approach is more adaptable than the former, but less efficient. Although both approaches address the crawling of large torrent release sites, they do not handle small private BT servers well: because such servers are highly dynamic, existing methods have difficulty obtaining their addresses. In summary, existing methods cannot effectively track content propagated through highly dynamic private BT servers.

Summary of the invention

In view of the problem that existing BitTorrent torrent crawlers cannot effectively track highly dynamic private BT servers, the object of the present invention is to provide a method for crawling BitTorrent torrent files that can discover private BT servers in time and guide the crawler to download their torrents.

We observe that most private BT servers are built with a small number of software packages, and that servers built with the same software have release pages with similar features. By extracting characteristic keywords, a search engine can therefore be used to detect a large number of release pages. (A characteristic keyword of a software package is defined as a special phrase that appears on every BT tracker release page built with that software. Characteristic keywords can generally be determined by inspecting the release pages; for example, "BNBT Tracker Info" is a characteristic keyword of BNBT.) In addition, on a small private BT server the index server (tracker server) doubles as the web release server; the index server recorded inside a torrent file may therefore itself host a torrent release page. Based on these two observations, the technical concept of the invention is as follows: first, the crawler uses the characteristic keywords of common BT server software to periodically detect new BitTorrent torrent release sites through a general-purpose search engine; second, each time a torrent file is crawled, the index server is parsed from it and tested to see whether it is a torrent release page, and if so the torrent files on that page are crawled.

The technical solution of the present invention is:

A method for crawling BitTorrent torrent files, comprising the steps of:

1) according to preset BT server characteristic keywords, a detection module calls a search engine interface to search for BT release websites and sends their release page addresses to a crawler module;

2) the crawler module downloads the corresponding pages according to the release page addresses it receives;

3) the crawler module parses torrent file addresses from the downloaded pages and downloads the torrent files to a torrent file library according to those addresses;

4) a torrent file parser parses the address of the index server from each torrent file, converts the index server address into a release page address and sends it to the crawler module; steps 2) to 4) are then repeated.

Further, the detection module calls the search engine interface periodically to search for BT release websites.

Further, the detection module provides a keyword input interface for receiving newly set BT server characteristic keywords.

Further, the crawler module comprises a new release page address queue for caching release page addresses that have not yet been crawled; an old release page address queue for caching release page addresses that have been crawled; and a torrent file address set for storing the link addresses of torrent files that have been downloaded.

Further, the crawler module downloads the corresponding page for a received release page address as follows: the crawler module first checks whether the address is already in the new release page address queue or the old release page address queue; if so, it discards the address, otherwise it appends the address to the tail of the new release page address queue. The crawler module then takes an address from the head of the new release page address queue, places it in the old release page address queue, and downloads the page at that address.

Further, the crawler module parses the release page addresses of other web pages from the downloaded page and checks whether each address is already in the new or the old release page address queue; if so, it discards the address, otherwise it appends the address to the tail of the new release page address queue.

Further, the crawler module downloads a torrent file to the torrent file library according to its address as follows: for a received torrent file address, the crawler module first checks whether the address exists in the torrent file address set; if it does, the download is refused, otherwise the crawler module connects to the address and downloads the torrent file.
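The queue and set bookkeeping described in the preceding paragraphs can be illustrated with a minimal Python sketch. The class name `CrawlerState` and its method names are hypothetical and do not come from the patent; the actual downloading is left out, and only the duplicate checks are shown.

```python
from collections import deque


class CrawlerState:
    """New/old release page queues plus the set of downloaded torrent URLs."""

    def __init__(self):
        self.new_pages = deque()    # release pages not yet crawled (FIFO)
        self.old_pages = deque()    # release pages already crawled
        self.torrent_urls = set()   # torrent file URLs already downloaded

    def offer_page(self, url):
        """Enqueue url unless it is already in the new or the old queue."""
        if url in self.new_pages or url in self.old_pages:
            return False            # duplicate: discard the address
        self.new_pages.append(url)  # otherwise append to the tail
        return True

    def next_page(self):
        """Take the head of the new queue and record it in the old queue."""
        if not self.new_pages:
            return None
        url = self.new_pages.popleft()
        self.old_pages.append(url)
        return url                  # the caller downloads this page

    def offer_torrent(self, url):
        """Return True if the torrent URL is new and should be downloaded."""
        if url in self.torrent_urls:
            return False            # already downloaded: refuse
        self.torrent_urls.add(url)
        return True
```

Membership tests on a `deque` are linear in its length, so a larger deployment would probably pair each queue with a set for lookups; the sketch keeps the plain queues to stay close to the wording of the text.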

Further, the addresses cached in the new and old release page address queues are hashes of the release page addresses; each release page address hash is marked with the time it was last processed, and the crawler module deletes release page address hashes whose marked time is older than a set period. The addresses stored in the torrent file address set are hashes of the torrent file addresses; each torrent file address hash is marked with the time it was last processed, and the crawler module deletes torrent file address hashes whose marked time is older than a set period.
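A rough sketch of such a hashed, time-stamped address store follows. The patent does not name a hash function; MD5 is used here only as an assumption, because it yields the 128-bit digests mentioned in the embodiment. The class name `ExpiringHashSet` and the `max_age_seconds` parameter are hypothetical.

```python
import hashlib
import time


class ExpiringHashSet:
    """Stores address hashes together with their last processing time."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self._last_seen = {}  # 16-byte digest -> last processing time

    @staticmethod
    def _digest(url):
        # Assumption: MD5 as the 128-bit hash; any 128-bit hash would do.
        return hashlib.md5(url.encode("utf-8")).digest()

    def add(self, url, now=None):
        """Record (or refresh) the processing time of an address."""
        stamp = time.time() if now is None else now
        self._last_seen[self._digest(url)] = stamp

    def __contains__(self, url):
        return self._digest(url) in self._last_seen

    def purge(self, now=None):
        """Delete hashes older than max_age; return how many were removed."""
        now = time.time() if now is None else now
        expired = [h for h, t in self._last_seen.items()
                   if now - t > self.max_age]
        for h in expired:
            del self._last_seen[h]
        return len(expired)
```

Once an entry has been purged, the same address passes the duplicate check again, so a page that has changed in the meantime can be crawled a second time.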

Further, the torrent file library is a database or a file system; the detection module, crawler module, torrent file parser and torrent file library run on different hosts connected by a network, or all run on the same host.

Logically, the crawler system is divided into four parts:

1) Detection module: this software module discovers BT torrent release sites. Using characteristic keywords (such as "BNBT Tracker Info"), it calls a general-purpose search engine (such as Baidu) to find BT torrent release sites, compiles them into a list of release pages and sends the list to the crawler module.

2) Crawler module: this software module crawls torrent files. It reads the release page list sent by the detection module, downloads the content of each page from the Internet, parses the torrent file addresses in the page and downloads the files, then stores the torrents in the torrent file library.

3) Torrent file parser: this software module parses the address of the index server from a torrent file, converts the index server address into the URL of a release page, and passes it to the crawler module.

4) Torrent file library: the torrent file library stores torrent files for later analysis and processing. Depending on the storage method chosen, it may be a database or a file system.

The four logical modules above may run on the same host, or be distributed across different hosts to improve performance, with the modules cooperating over the network.

As shown in Fig. 1, following the data flow, the crawler system works in the following steps:

In the first step, the detection module assembles the preset characteristic keywords into an HTTP request and sends it to the search engine. After receiving the results, the detection module extracts the page URLs from them and sends these to the crawler module. In this step, the detection process may be triggered periodically or manually; characteristic keywords may be added while the system is running, to cope with the appearance of new software; and the interface between the detection module and the crawler module may use an inter-process communication mechanism or sockets.
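As an illustration of this first step, the sketch below builds a search request URL from a characteristic keyword and pulls absolute links out of a result page with a regular expression. The endpoint is the one quoted in the embodiment; the function names and the `href` pattern are assumptions, since the real markup of a search result page is not given in the text and would need its own pattern.

```python
import re
from urllib.parse import urlencode

# Endpoint taken from the embodiment's example; other engines differ.
SEARCH_ENDPOINT = "http://www.google.cn/search"

# Assumed pattern: absolute http(s) links inside double-quoted href attributes.
HREF_RE = re.compile(r'href="(https?://[^"]+)"')


def build_search_url(keyword, endpoint=SEARCH_ENDPOINT):
    """Assemble the characteristic keyword into a search request URL."""
    return endpoint + "?" + urlencode({"q": keyword})


def extract_result_urls(html):
    """Return the distinct absolute URLs found in html, in order of appearance."""
    seen, urls = set(), []
    for url in HREF_RE.findall(html):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Calling `build_search_url("BNBT Tracker Info")` reproduces the request URL used in the embodiment. Fetching the URL and handing the extracted list to the crawler module are omitted here.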

In the second step, the crawler module downloads and analyzes the web pages at the URLs passed in from the previous step, obtains torrent file links from them, and then downloads the torrent files, as shown in Fig. 2. This work is carried out in sequence by four submodules inside the crawler: the URL filtering submodule, the page download submodule, the web page analysis submodule and the torrent file download submodule. In addition, to prevent repeated crawling, the crawler module maintains a new URL queue, an old URL queue and a torrent file URL set: the new URL queue caches web page URLs that have not yet been crawled, the old URL queue caches web page URLs that have been processed, and the torrent file URL set stores the links of torrent files that have been processed.

When an input URL arrives, the URL filtering submodule first checks whether the URL already appears in either the new or the old URL queue; if so, it discards the URL, otherwise it puts the URL into the new URL queue.

The page download submodule takes the URL at the head of the new URL queue (the URL that has been in the queue longest) and puts it into the old URL queue; it then downloads the HTML page content and hands it to the web page analysis submodule. The web page analysis submodule extracts the URLs of other web pages and the URLs of torrent files from the page content, and sends them to the URL filtering submodule and the torrent file download submodule respectively.

The torrent file download submodule checks whether the input torrent file URL already exists in the torrent file URL set; if it does, the URL is discarded, otherwise the torrent file is downloaded and stored in the torrent file library.
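The web page analysis submodule can be sketched with Python's standard `html.parser`. The names `LinkCollector` and `analyze_page` are hypothetical, and treating any link whose path ends in `.torrent` as a torrent file is an assumed heuristic; the patent does not say how the two kinds of URL are told apart.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def analyze_page(page_url, html):
    """Split the links on a page into (other page URLs, torrent file URLs)."""
    collector = LinkCollector()
    collector.feed(html)
    pages, torrents = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)   # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue                         # skip mailto:, javascript:, etc.
        # Assumed heuristic: torrent links have a path ending in ".torrent".
        if parsed.path.lower().endswith(".torrent"):
            torrents.append(absolute)
        else:
            pages.append(absolute)
    return pages, torrents
```

The first list would go back to the URL filtering submodule and the second to the torrent file download submodule, each of which applies its own duplicate check.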

In the third step, the torrent file parser analyzes the files crawled in the second step and extracts the tracker server address from each (note: the torrent file format is an open standard and is not described further here), which has a form such as http://btfans.3322.org:6969/announce. The parser removes the trailing "announce" string from the address to obtain the release page address, of a form such as http://btfans.3322.org:6969/, and sends it to the crawler module for processing.
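The third step can be sketched as follows. Torrent files are bencoded dictionaries whose `announce` key holds the tracker URL; the small decoder below is a simplified illustration rather than a hardened parser, and the function names `bdecode` and `release_page_url` are hypothetical. Returning `None` when the address does not end in "announce" is likewise an assumption, since the text does not cover that case.

```python
def bdecode(data, i=0):
    """Decode one bencoded value at data[i]; return (value, next_index)."""
    c = data[i:i + 1]
    if c == b"i":                            # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                            # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                            # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            value, i = bdecode(data, i)
            result[key] = value
        return result, i + 1
    colon = data.index(b":", i)              # byte string: <length>:<bytes>
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length


def release_page_url(torrent_bytes):
    """Derive the candidate release page URL from a torrent's tracker URL."""
    meta, _ = bdecode(torrent_bytes)
    announce = meta[b"announce"].decode("utf-8")
    suffix = "announce"
    if announce.endswith(suffix):
        return announce[:-len(suffix)]       # strip the trailing "announce"
    return None                              # assumption: no candidate otherwise
```

For a torrent whose tracker is http://btfans.3322.org:6969/announce, `release_page_url` yields http://btfans.3322.org:6969/, which the crawler module then treats like any other incoming release page address.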

Compared with the prior art, the advantageous effects of the present invention are:

The method of the present invention can mine torrent resources not only from large BT servers but also from the many small private BT servers that change from day to day, and can discover and download BitTorrent torrents quickly and effectively. Downloading a large number of torrent files enriches the torrent file library, which in turn strengthens the search engine's capability and makes retrieval results more comprehensive and abundant.

Brief description of the drawings

Fig. 1 is the system data flow diagram;

Fig. 2 is the crawler data flow diagram;

Fig. 3 is the system structure diagram of the embodiment;

Fig. 4 is a fragment of the results returned by the search engine.

Embodiment

The implementation of the scheme is now described using a concrete example.

The hardware environment of the system is shown in Fig. 3. The working environment of the crawler system comprises two local area networks: LAN 1, built on network switch 1, can access the Internet; LAN 2, built on network switch 2, is an internal network and cannot access the Internet. A detection host is dedicated to running the detection module software and is connected to LAN 1. A crawler host runs the crawler module software and the torrent file parser; it has two network interface cards, one connected to LAN 1 and the other to LAN 2. The file storage server is an NFS server that serves as the torrent file library and is connected to LAN 2; the crawler host mounts the shared directory of the NFS server in order to write torrent files. Note that, as stated earlier, besides the NFS file storage service the torrent file library may use other kinds of file storage, such as GFS (Global File System), or a database. In this example all hosts run the Linux operating system; other operating systems may also be used.

The operating mechanism of each software module has been explained in the "Summary of the invention" section; the technical details of the embodiment are described below.

1) Interaction between the detection module and the search engine. Taking the Google search engine and the BNBT server as an example, the detection module puts the characteristic keyword "BNBT Tracker Info" into the query field of a Google request, obtaining the search request URL http://www.google.cn/search?q=BNBT+Tracker+Info; the detection module accesses this URL to obtain the response. The HTML code fragment shown in Fig. 4 is a search result, in which the bold parts are the addresses of torrent release pages. The release page addresses are easily extracted from the search results with a regular expression.

2) Unbounded growth of the old URL queue in the crawler module. The old URL queue grows continuously while the system runs, and over time it can consume a large amount of memory. Moreover, because of the old URL queue, each URL can be processed only once, so changes to the page behind a URL would be missed. To solve these two problems, the implementation takes two measures: first, the queue stores a hash of each URL string (usually 128 bits) to reduce memory overhead; second, each hash is marked with the time it was last processed, and the system releases it once it is older than a certain period (generally 3 days, on the assumption that a release page changes on average every 3 days).

3) Unbounded growth of the torrent file URL set is solved in a similar way.

4) Improving crawler efficiency. Although only one copy of the crawler module software runs in this example, the crawler can be distributed in a concrete implementation: apart from the URL filtering submodule, every submodule can run as multiple copies to make full use of the host, and copies can even be distributed across different hosts for greater effect.

5) Complexity of page parsing. Dynamic scripts are now widely used in web pages; scripts that load content dynamically in particular make page content hard to analyze. We load the page with a browser engine such as Gecko, the engine of Firefox (see the Gecko documentation for the specific method), let Gecko process the dynamic effects on the page automatically, and then analyze the page's DOM tree.

Claims

1. A method for crawling BitTorrent torrent files, comprising the steps of:

2. The method of claim 1, characterized in that the detection module calls the search engine interface periodically to search for BT release websites.

3. The method of claim 1, characterized in that the detection module provides a keyword input interface for receiving newly set BT server characteristic keywords.

4. The method of claim 1, characterized in that the crawler module comprises a new release page address queue for caching release page addresses that have not yet been crawled; an old release page address queue for caching release page addresses that have been crawled; and a torrent file address set for storing the link addresses of torrent files that have been downloaded.

5. The method of claim 4, characterized in that the crawler module downloads the corresponding page for a received release page address as follows: the crawler module first checks whether the address is already in the new release page address queue or the old release page address queue; if so, it discards the address, otherwise it appends the address to the tail of the new release page address queue; the crawler module then takes an address from the head of the new release page address queue, places it in the old release page address queue, and downloads the page at that address.

6. The method of claim 5, characterized in that the crawler module parses the release page addresses of other web pages from the downloaded page and checks whether each address is already in the new release page address queue or the old release page address queue; if so, it discards the address, otherwise it appends the address to the tail of the new release page address queue.

7. The method of claim 4, characterized in that the crawler module downloads a torrent file to the torrent file library according to its address as follows: for a received torrent file address, the crawler module first checks whether the address exists in the torrent file address set; if it does, the download is refused, otherwise the crawler module connects to the address and downloads the torrent file.

8. The method of claim 4, characterized in that the addresses cached in the new release page address queue and the old release page address queue are hashes of the release page addresses; each release page address hash is marked with the time it was last processed, and the crawler module deletes release page address hashes whose marked time is older than a set period; the addresses stored in the torrent file address set are hashes of the torrent file addresses; each torrent file address hash is marked with the time it was last processed, and the crawler module deletes torrent file address hashes whose marked time is older than a set period.

9. The method of claim 1, characterized in that the torrent file library is a database or a file system; the detection module, crawler module, torrent file parser and torrent file library run on different hosts connected by a network, or all run on the same host.