CN108829792A - Scrapy-based distributed darknet resource mining system and method - Google Patents
Scrapy-based distributed darknet resource mining system and method
- Publication number
- CN108829792A CN108829792A CN201810558520.2A CN201810558520A CN108829792A CN 108829792 A CN108829792 A CN 108829792A CN 201810558520 A CN201810558520 A CN 201810558520A CN 108829792 A CN108829792 A CN 108829792A
- Authority
- CN
- China
- Prior art keywords
- darknet
- task
- crawler
- clearnet
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention relates to the field of data mining and discloses a Scrapy-based distributed darknet resource mining system and method that improve the efficiency, coverage, and flexibility of darknet resource mining. The system comprises a central-node control module and slave-node crawl modules. The central-node control module includes a crawler seed task queue, a task preprocessing module, a darknet task queue, and a clearnet (surface-web) task queue; each slave-node crawl module includes a darknet crawl module, a clearnet crawl module, and a crawler manager. Starting from darknet domain names that are supplied manually or harvested from the clearnet, the darknet and clearnet crawl modules extract further darknet domain names from darknet and clearnet pages, thereby collecting darknet domain names at scale and storing the corresponding darknet web pages. The invention is suited to darknet resource mining.
Description
Technical field
The present invention relates to the field of data mining, and in particular to a Scrapy-based distributed darknet resource mining system and method.
Background art
A darknet is a network that can only be accessed through special software or via non-standard communication protocols and ports. Tor is currently the most widely used anonymous darknet communication system; because of the darknet's strong anonymity and the difficulty of policing it, a large volume of illegal trade has bred there. Research into mining darknet resources is therefore of real importance. Traditional search engines and crawler techniques can only reach the small fraction of web content exposed on the public internet, i.e. clearnet information, and cannot mine darknet resources. Most existing research targets the deep web, i.e. surface-web content that standard search engines cannot index, rather than the darknet proper; the few studies that do crawl the darknet did not consider crawl efficiency, coverage, or flexibility in their design.
Scrapy is currently the most popular crawler framework. It is built on the Twisted asynchronous networking library, crawls faster than most other crawlers, and is highly customizable. However, the download module Scrapy provides is based on HTTP, whereas the darknet uses the SOCKS protocol. In addition, because web crawlers place heavy demands on I/O, Scrapy keeps the URLs waiting to be crawled in memory rather than on the hard disk. When crawling the darknet at scale, once the number of crawled pages reaches the tens of thousands, the number of URLs that must be stored can exceed millions or even tens of millions. Python is moreover a scripting language whose objects typically occupy far more memory than those of compiled languages such as C/C++, and Python's garbage collector does not release memory immediately when an object is no longer referenced. A single machine crawling with Scrapy is therefore very likely to exhaust its memory; memory becomes the bottleneck.
Summary of the invention
The technical problem to be solved by the present invention is to provide a Scrapy-based distributed darknet resource mining system and method that improve the efficiency, coverage, and flexibility of darknet resource mining.
To solve the above problem, the invention adopts the following technical solution:
A Scrapy-based distributed darknet resource mining system comprises a central-node control module and slave-node crawl modules. The central-node control module includes a crawler seed task queue, a task preprocessing module, a darknet task queue, and a clearnet task queue; each slave-node crawl module includes a darknet crawl module, a clearnet crawl module, and a crawler manager.
The crawler seed task queue stores the seed crawl tasks supplied by the user as well as the new seed crawl tasks extracted by the slave-node crawl modules. The task preprocessing module matches and de-duplicates the tasks in the crawler seed task queue, placing darknet tasks into the darknet task queue and clearnet tasks into the clearnet task queue.
The darknet crawler inside the darknet crawl module reads darknet crawl tasks from the darknet task queue, downloads the corresponding darknet pages, extracts new darknet domain names from those pages, and writes the extracted domain names back into the crawler seed task queue. The clearnet crawler inside the clearnet crawl module reads clearnet crawl tasks from the clearnet task queue, downloads the corresponding clearnet pages, extracts new clearnet and darknet domain names from them, and writes the extracted domain names into the crawler seed task queue. The crawler manager manages crawler processes according to request messages sent by the central-node control module.
Further, the invention also includes a Redis database that stores the darknet task queue and the clearnet task queue.
Further, the invention also includes a Kafka message system that stores the crawler seed task queue.
Further, the invention also includes a MongoDB database that stores the darknet pages downloaded by the darknet crawl module and the clearnet pages downloaded by the clearnet crawl module.
Further, the request messages sent by the central-node control module include crawler start requests and crawler stop requests; the crawler manager creates a crawler process on a start request and terminates the crawler process on a stop request.
Further, the darknet crawl module of the present invention downloads darknet pages through a Tor proxy. It enables the proxy by creating a proxychains configuration file and launching Scrapy through proxychains, which forces the darknet crawler's traffic through Tor.
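The configuration described above can be sketched as follows; the file contents are illustrative and assume a local Tor client exposing its default SOCKS port 9050:

```
# proxychains.conf (illustrative): route all crawler traffic through local Tor
strict_chain
proxy_dns                # resolve .onion names through the proxy chain
[ProxyList]
socks5 127.0.0.1 9050    # Tor's default SOCKS listener
```

The spider would then be launched through the proxy, e.g. `proxychains4 -f proxychains.conf scrapy crawl darknet_spider`, where `darknet_spider` is a hypothetical spider name.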
Further, the task preprocessing module of the present invention includes a darknet plug-in and a clearnet plug-in. The darknet plug-in matches darknet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for darknet tasks; the clearnet plug-in matches clearnet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for clearnet tasks.
Further, the darknet plug-in of the present invention matches darknet crawl tasks with the following regular expression:
^https?://(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9]{16}\.onion)(:|\/|$)
and the clearnet plug-in matches clearnet crawl tasks with the following regular expression:
((http|https)://)(([a-zA-Z0-9\._-]+\.(com|cn)))(\/[a-zA-Z0-9&%_\.\/~-]*)?.
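Because the published text garbles the escapes in both expressions, the versions below are hedged reconstructions; a small routine shows how a seed URL would be matched against them:

```python
import re

# Hedged reconstructions of the two expressions above (the exact character
# classes and escapes are assumptions; the published text garbles them).
DARKNET_RE = re.compile(r"^https?://(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9]{16}\.onion)(:|/|$)")
CLEARNET_RE = re.compile(r"((http|https)://)(([a-zA-Z0-9._-]+\.(com|cn)))(/[a-zA-Z0-9&%_./~-]*)?")

def classify(url):
    """Label a seed URL the way the two plug-ins would."""
    if DARKNET_RE.match(url):
        return "darknet"
    if CLEARNET_RE.match(url):
        return "clearnet"
    return "unmatched"

print(classify("http://abcdefgh23456789.onion/index"))  # darknet
print(classify("https://forum.example.com/thread/42"))  # clearnet
print(classify("ftp://example.org/file"))               # unmatched
```

The sample hostnames are illustrative 16-character onion addresses and clearnet domains, not addresses from the patent.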
Based on the above system, the invention also provides a Scrapy-based distributed darknet resource mining method comprising the following steps:
S1. Place a set of clearnet URLs that are likely to contain darknet domain names, together with a set of manually collected darknet domain names, into the crawler seed task queue of the central-node control module.
S2. Match and de-duplicate the tasks in the crawler seed task queue via the plug-ins of the task preprocessing module, storing the resulting darknet tasks in the darknet task queue and the clearnet tasks in the clearnet task queue.
S3. The darknet crawler reads darknet crawl tasks from the darknet task queue, downloads the corresponding darknet pages, extracts new darknet URLs from them, and writes the extracted URLs into the crawler seed task queue. The clearnet crawler reads clearnet crawl tasks from the clearnet task queue, downloads the corresponding clearnet pages, extracts new clearnet and darknet URLs from them, and writes the extracted URLs into the crawler seed task queue.
S4. Repeat steps S2-S3 until all darknet URLs and darknet pages have been collected.
Further, the clearnet URLs in step S1 refer to social-network sites, grey-market public forums, or websites hosting anonymously published content.
Further, the crawler seed task queue of step S1 is stored in the Kafka message system, and the darknet task queue and clearnet task queue of step S2 are stored in the Redis database.
The beneficial effects of the invention are as follows. Hidden-service domain names are rarely published openly and are therefore hard to collect. By fusing a darknet crawler and a clearnet crawler while keeping them independent, the invention crawls the clearnet both broadly and deeply to obtain more darknet domain names: starting from manually supplied domain names and those harvested from the clearnet, it extracts further darknet domain names from both darknet and clearnet pages. When harvesting darknet domain names from the clearnet is no longer needed, the central-node control module can send an HTTP request to the slave-node crawl modules to stop clearnet crawling without affecting the darknet crawl. When many darknet tasks are pending, the central-node control module can send an HTTP request to the slave-node crawl modules to start additional darknet crawlers, raising darknet crawl speed; the system is thus highly flexible. Because crawling places heavy demands on I/O and bandwidth, a single machine crawling the darknet and clearnet at scale would exhaust its memory and crawl slowly; the invention therefore stores the darknet and clearnet task queues in the in-memory database Redis, deployed on the central node, so that crawler processes on multiple machines can crawl in a distributed fashion, improving crawl speed. To resolve the mismatch between the rate at which slave-node crawlers emit new URL links and the rate at which the task preprocessing module can process them, the invention uses the Kafka message middleware to store the crawler seed task queue; since Kafka persists its messages, pending crawl tasks are not lost even if the central node goes down, ensuring the system's reliability. The invention thus collects darknet domain names at scale and stores darknet web pages, providing data support for subsequent analysis of the darknet.
Detailed description of the invention
Fig. 1 is the architecture diagram of the Scrapy-based distributed darknet data mining system;
Fig. 2 is the processing flow chart of the task preprocessing module;
Fig. 3 is the processing flow chart of the darknet crawl module.
Specific embodiment
Because hidden-service domain names are rarely published openly and are hard to collect, the invention fuses a darknet crawler and a clearnet crawler while keeping them independent, crawling the clearnet both broadly and deeply to obtain more darknet domain names: starting from manually supplied domain names and those harvested from the clearnet, it extracts further darknet domain names from darknet pages. When harvesting darknet domain names from the clearnet is no longer needed, the central-node control module can send an HTTP request to the slave-node crawl modules to stop clearnet crawling without affecting the darknet crawl. When many darknet tasks are pending, the central control node can send an HTTP request to the slave nodes to start additional darknet crawlers, raising darknet crawl speed.
The present invention is further described below with reference to an embodiment.
The embodiment provides a Scrapy-based distributed darknet resource mining system which, as shown in Fig. 1, comprises a central-node control module and slave-node crawl modules. The central-node control module includes a crawler seed task queue, a task preprocessing module, a darknet task queue, and a clearnet task queue; each slave-node crawl module includes a darknet crawl module, a clearnet crawl module, and a crawler manager. Each functional module is described in turn below.
The crawler seed task queue stores the seed crawl tasks supplied by the user as well as the new seed crawl tasks extracted by the slave-node crawl modules. The embodiment stores this queue in the Kafka message middleware; because Kafka persists its messages, pending crawl tasks are not lost even if the central node goes down, ensuring the system's reliability.
The task preprocessing module matches and de-duplicates the tasks in the crawler seed task queue, storing darknet tasks in the darknet task queue and clearnet tasks in the clearnet task queue. Specifically, the task preprocessing module includes a darknet plug-in and a clearnet plug-in. The darknet plug-in matches darknet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for darknet tasks; the clearnet plug-in matches clearnet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for clearnet tasks. The processing flow of the task preprocessing module is shown in Fig. 2. Because the task preprocessing module provides a dedicated plug-in for each task category, every task can quickly find its corresponding plug-in: each plug-in defines a regular expression for matching task URLs; if a URL matches a plug-in's expression, that plug-in parses it, otherwise the URL is handed to whichever other plug-in matches. Darknet tasks differ from clearnet tasks in that the darknet top-level domain has the form ABC.onion, where ABC is a 16-character string of digits and letters, whereas the clearnet top-level domain has the form XXX.com or XXX.cn. In the embodiment the darknet plug-in's regular expression may therefore be ^https?://(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9]{16}\.onion)(:|\/|$) and the clearnet plug-in's regular expression ((http|https)://)(([a-zA-Z0-9\._-]+\.(com|cn)))(\/[a-zA-Z0-9&%_\.\/~-]*)?. The darknet plug-in first checks whether a pending task has already been crawled; if not, the task is stored in the darknet task queue, otherwise it is discarded, avoiding repeated collection of the same page and the resulting waste of resources. It should be noted that the "?" appearing in this paragraph is a symbol in the regular expressions and does not indicate any doubt or uncertainty.
The darknet crawler inside the darknet crawl module reads darknet crawl tasks from the darknet task queue, downloads the corresponding darknet pages, extracts new darknet domain names from them, and writes the extracted domain names into the crawler seed task queue. The darknet crawl module comprises a scheduler submodule, a darknet page download submodule, a darknet URL extraction submodule, and a data pipeline submodule. The scheduler submodule obtains tasks from the darknet queue of the central-node control module and submits new crawl tasks to the central node's crawler seed task queue. The darknet page download submodule rewrites Scrapy's download component so that it can enter the Tor network and download darknet pages. The darknet URL extraction submodule rewrites Scrapy's spider component to extract the darknet links in a darknet page. The data pipeline submodule writes the extracted darknet links into the central node's crawler seed task queue and stores the darknet pages in the MongoDB database. Fig. 3 shows the crawl flow of the darknet crawl module; the specific steps are as follows:
a. The scheduler module takes a pending darknet task from the Redis key for darknet tasks and passes it through the engine to the darknet page download module.
b. The darknet page download module downloads the page through the Tor network and passes it through the engine to the darknet URL extraction module. The darknet page download module differs from the downloader of the stock Scrapy framework; the invention implements it by creating a proxychains configuration file and launching Scrapy through proxychains, forcing the darknet crawler's traffic through the Tor proxy.
c. The darknet URL extraction module extracts new darknet URLs from the page and passes them through the engine to the data pipeline module.
d. The data pipeline module writes the darknet URLs into the Kafka message queue and stores the darknet page in the MongoDB database.
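Steps c and d can be sketched with the standard library as follows; the `OnionLinkExtractor` name is illustrative, and plain lists stand in for the Kafka seed queue and the MongoDB page collection:

```python
from html.parser import HTMLParser

# Pull .onion links out of a downloaded page (step c) and route them to the
# seed queue while the page body goes to the page store (step d).
class OnionLinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and ".onion" in value:
                    self.links.append(value)

seed_queue, page_store = [], []   # stand-ins for Kafka and MongoDB

def pipeline(url, html):
    parser = OnionLinkExtractor()
    parser.feed(html)
    seed_queue.extend(parser.links)                 # new darknet URLs -> Kafka
    page_store.append({"url": url, "body": html})   # page body -> MongoDB

pipeline("http://abcdefgh23456789.onion/",
         '<a href="http://zyxwvuts98765432.onion/market">m</a>')
print(seed_queue)
```

The real extraction submodule rewrites Scrapy's spider component rather than using `HTMLParser` directly; this sketch only shows the dataflow.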
The clearnet crawler inside the clearnet crawl module reads clearnet crawl tasks from the clearnet task queue, downloads the corresponding clearnet pages, extracts new clearnet and darknet domain names from them, and writes the extracted clearnet and darknet domain names into the crawler seed task queue.
Because the master and slave nodes cannot communicate directly, the central node cannot continuously monitor the running state of each slave-node crawler or control the starting and stopping of slave-node crawlers. The invention therefore uses the Twisted Application framework to implement an HTTP-based crawler manager deployed on each slave node. During operation, each slave node repeatedly polls its queue of crawlers waiting to be launched and its queue of running crawler processes through the TimerService interface provided by the Twisted framework, and writes the current process state of the crawler node into Redis. The central node monitors each slave node by reading Redis, and remotely controls each slave-node crawler through HTTP requests.
The crawler manager manages crawler processes according to the request messages sent by the central-node control module. Internally it maintains three queues: crawlers waiting to be launched, running crawler processes, and finished crawler processes. The crawler manager receives HTTP start requests from the central node; a crawler start request carries the type of crawler to start (darknet crawler or clearnet crawler) and a task number identifying the crawler process to be started. Received requests are placed in the waiting queue, from which they are taken in turn to create crawler processes. On receiving a request to cancel a crawler process, the manager first parses the crawler name and task number to cancel from the request parameters. If the corresponding start task is still in the waiting queue, it is simply deleted from the database; if it cannot be found there, the manager traverses the running-crawler queue, finds the crawler process with the same task number, and calls Twisted's signalProcess interface to send that process a termination signal.
To create and terminate crawler processes, the embodiment calls the spawnProcess interface provided by the Twisted framework to create the corresponding process. The first parameter of this interface is a processProtocol object responsible for listening to all events related to the crawler process (such as the process being created successfully or the process ending). When it observes that a crawler process has been created successfully, it adds that process to the running-process queue; when it observes that a crawler process has ended, it removes the process from the running-process queue and stores the process information in the finished-process queue.
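The lifecycle above can be sketched with a standard-library analogue (Python's `subprocess` standing in for Twisted's `spawnProcess` and `signalProcess`; the queue names and the dummy child process are illustrative):

```python
import subprocess
import sys

waiting, running, finished = [], {}, []   # the manager's three queues

def start_crawler(task_no, crawler_type):
    """Analogue of handling a crawler start request: enqueue, then spawn."""
    waiting.append((task_no, crawler_type))
    task_no, crawler_type = waiting.pop(0)
    # A trivial sleeping child stands in for `scrapy crawl <crawler_type>`.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    running[task_no] = proc

def cancel_crawler(task_no):
    """Analogue of a cancel request: drop from waiting, or signal the process."""
    for item in waiting:
        if item[0] == task_no:
            waiting.remove(item)           # start task not yet launched
            return
    proc = running.pop(task_no)            # find the running process by task no.
    proc.terminate()                       # ~ signalProcess("TERM")
    proc.wait()
    finished.append(task_no)

start_crawler(1, "darknet")
cancel_crawler(1)
print(len(running), finished)
```

In the real manager these transitions are driven by the processProtocol callbacks rather than performed inline.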
Based on the above system, the embodiment also provides a Scrapy-based distributed darknet resource mining method comprising the following steps:
S1. Place a set of clearnet URLs that are likely to contain darknet domain names, together with a set of manually collected darknet domain names, into the crawler seed task queue of the central-node control module. The clearnet URLs here refer to social-network sites, grey-market public forums, and websites hosting anonymously published content, on which users are quite likely to have posted some Tor darknet domain names. The Kafka message system stores the crawler seed task queue and serves as the entry point into the system; because Kafka persists its messages and offers large storage capacity, seed crawl tasks are not lost, and as message middleware it also resolves the mismatch between producer and consumer processing speeds.
S2. Match and de-duplicate the tasks in the crawler seed task queue via the plug-ins of the task preprocessing module, storing the resulting darknet tasks in the darknet task queue and the clearnet tasks in the clearnet task queue. These two queues are stored in the in-memory database Redis, replacing the local in-memory task queue that the stock Scrapy framework creates, so that Scrapy instances on multiple machines can fetch tasks from the same database and crawl in a distributed fashion.
S3. The darknet crawler reads darknet crawl tasks from the darknet task queue, downloads the corresponding darknet pages, extracts new darknet URLs from them, and writes the extracted URLs into the crawler seed task queue. The clearnet crawler reads clearnet crawl tasks from the clearnet task queue, downloads the corresponding clearnet pages, extracts new clearnet and darknet URLs from them, and writes the extracted URLs into the crawler seed task queue.
S4. Repeat steps S2-S3 until all darknet URLs and darknet pages have been collected, thereby collecting darknet URLs at scale and storing the darknet web pages.
Claims (10)
1. A Scrapy-based distributed darknet resource mining system, characterized by comprising a central-node control module and slave-node crawl modules, wherein the central-node control module includes a crawler seed task queue, a task preprocessing module, a darknet task queue, and a clearnet task queue, and each slave-node crawl module includes a darknet crawl module, a clearnet crawl module, and a crawler manager;
the crawler seed task queue stores the seed crawl tasks supplied by the user and the new seed crawl tasks extracted by the slave-node crawl modules; the task preprocessing module matches and de-duplicates the tasks in the crawler seed task queue, storing darknet tasks in the darknet task queue and clearnet tasks in the clearnet task queue;
the darknet crawler in the darknet crawl module reads darknet crawl tasks from the darknet task queue, downloads the corresponding darknet pages, extracts new darknet domain names from them, and writes the extracted darknet domain names into the crawler seed task queue; the clearnet crawler in the clearnet crawl module reads clearnet crawl tasks from the clearnet task queue, downloads the corresponding clearnet pages, extracts new clearnet and darknet domain names from them, and writes the extracted domain names into the crawler seed task queue; the crawler manager manages crawler processes according to the request messages sent by the central-node control module.
2. The Scrapy-based distributed darknet resource mining system of claim 1, characterized by further comprising a Redis database that stores the darknet task queue and the clearnet task queue.
3. The Scrapy-based distributed darknet resource mining system of claim 1, characterized by further comprising a Kafka message system that stores the crawler seed task queue.
4. The Scrapy-based distributed darknet resource mining system of claim 1, characterized in that the darknet crawl module downloads darknet pages through a Tor proxy, enabled by creating a proxychains configuration file and launching Scrapy through proxychains, thereby forcing the darknet crawler's traffic through Tor.
5. The Scrapy-based distributed darknet resource mining system of claim 1, characterized in that the task preprocessing module includes a darknet plug-in and a clearnet plug-in; the darknet plug-in matches darknet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for darknet tasks; the clearnet plug-in matches clearnet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for clearnet tasks.
6. The Scrapy-based distributed darknet resource mining system of claim 5, characterized in that the darknet plug-in matches darknet crawl tasks with the following regular expression:
^https?://(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9]{16}\.onion)(:|\/|$)
and the clearnet plug-in matches clearnet crawl tasks with the following regular expression:
((http|https)://)(([a-zA-Z0-9\._-]+\.(com|cn)))(\/[a-zA-Z0-9&%_\.\/~-]*)?.
7. The Scrapy-based distributed darknet resource mining system of claim 1, characterized in that the request messages sent by the central-node control module include crawler start requests and crawler stop requests; the crawler manager creates a crawler process on a start request and terminates the crawler process on a stop request.
8. A Scrapy-based distributed darknet resource mining method, characterized by comprising the following steps:
S1. placing clearnet URLs that are likely to contain darknet domain names, together with manually collected darknet domain names, into the crawler seed task queue of the central-node control module;
S2. matching and de-duplicating the tasks in the crawler seed task queue via the plug-ins of the task preprocessing module, and storing the resulting darknet tasks and clearnet tasks in the darknet task queue and the clearnet task queue respectively;
S3. the darknet crawler reading darknet crawl tasks from the darknet task queue, downloading the corresponding darknet pages, extracting new darknet URLs from them, and writing the extracted URLs into the crawler seed task queue; the clearnet crawler reading clearnet crawl tasks from the clearnet task queue, downloading the corresponding clearnet pages, extracting new clearnet and darknet URLs from them, and writing the extracted URLs into the crawler seed task queue;
S4. repeating steps S2-S3 until all darknet URLs and darknet pages have been collected.
9. The Scrapy-based distributed darknet resource mining method of claim 8, characterized in that the clearnet URLs in step S1 refer to social-network sites, grey-market public forums, or websites hosting anonymously published content.
10. The Scrapy-based distributed darknet resource mining method of claim 8, characterized in that the crawler seed task queue of step S1 is stored in the Kafka message system, and the darknet task queue and clearnet task queue of step S2 are stored in the Redis database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810558520.2A CN108829792A (en) | 2018-06-01 | 2018-06-01 | Scrapy-based distributed darknet resource mining system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810558520.2A CN108829792A (en) | 2018-06-01 | 2018-06-01 | Scrapy-based distributed darknet resource mining system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108829792A true CN108829792A (en) | 2018-11-16 |
Family
ID=64145694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810558520.2A Pending CN108829792A (en) | 2018-06-01 | 2018-06-01 | Distributed darknet excavating resource system and method based on scrapy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108829792A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902212A (en) * | 2019-01-25 | 2019-06-18 | 中国电子科技集团公司第三十研究所 | A kind of darknet crawler system of customized dynamic expansion |
CN110119469A (en) * | 2019-05-22 | 2019-08-13 | 北京计算机技术及应用研究所 | A kind of data collection and transmission and method towards darknet |
CN110457559A (en) * | 2019-08-05 | 2019-11-15 | 深圳乐信软件技术有限公司 | Distributed data crawls system, method and storage medium |
CN110909178A (en) * | 2019-11-22 | 2020-03-24 | 上海交通大学 | System and method for collecting threat information of darknet and associating information |
CN111209460A (en) * | 2019-12-27 | 2020-05-29 | 青岛海洋科学与技术国家实验室发展中心 | Data acquisition system and method based on script crawler framework |
WO2020207022A1 (en) * | 2019-04-12 | 2020-10-15 | 深圳壹账通智能科技有限公司 | Scrapy-based data crawling method and system, terminal device, and storage medium |
CN112115331A (en) * | 2020-09-21 | 2020-12-22 | 朱彤 | Capital market public opinion monitoring method based on distributed web crawler and NLP |
CN112148956A (en) * | 2020-09-30 | 2020-12-29 | 上海交通大学 | Hidden net threat information mining system and method based on machine learning |
CN112417242A (en) * | 2020-11-09 | 2021-02-26 | 深圳市宝视佳科技有限公司 | Centralized management system of distributed crawlers |
CN112667873A (en) * | 2020-12-16 | 2021-04-16 | 北京华如慧云数据科技有限公司 | Crawler system and method suitable for general data acquisition of most websites |
CN113923193A (en) * | 2021-10-27 | 2022-01-11 | 北京知道创宇信息技术股份有限公司 | Network domain name association method, device, storage medium and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101051313A (en) * | 2007-05-09 | 2007-10-10 | 崔志明 | Integrated data source finding method for deep layer net page data source |
US20110113103A1 (en) * | 2006-03-17 | 2011-05-12 | Macrovision Corporation | Peer to peer gateway |
CN103116635A (en) * | 2013-02-07 | 2013-05-22 | 中国科学院计算技术研究所 | Field-oriented method and system for collecting invisible web resources |
CN105138561A (en) * | 2015-07-23 | 2015-12-09 | 中国测绘科学研究院 | Deep web space data acquisition method and apparatus |
CN107808000A (en) * | 2017-11-13 | 2018-03-16 | 哈尔滨工业大学(威海) | A kind of hidden web data collection and extraction system and method |
Non-Patent Citations (2)
Title |
---|
Liu Zehua et al.: "Design and Optimization of a Distributed Crawler Based on scrapy Technology", Wanfang Data *
Yang Yi et al.: "Detection of Darknet Space Resources Based on Tor", Communications Technology *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108829792A (en) | Distributed darknet excavating resource system and method based on scrapy | |
CN103297475A (en) | Mock service system and processing method of Mock service | |
CN107832468B (en) | Demand recognition methods and device | |
CN103942281B (en) | The method and device that a kind of object to persistent storage is operated | |
CN110351381A (en) | A kind of Distributed data share method that Internet of Things based on block chain is credible | |
CN102760151B (en) | Implementation method of open source software acquisition and searching system | |
CN101408877B (en) | System and method for loading tree node | |
CN107885777A (en) | A kind of control method and system of the crawl web data based on collaborative reptile | |
US20040230667A1 (en) | Loosely coupled intellectual capital processing engine | |
CN104933188B (en) | A kind of data synchronous system and method in patent personalization storehouse | |
CN101473628A (en) | Systems and methods for accelerating delivery of a computing environment to remote user | |
CN102073683A (en) | Distributed real-time news information acquisition system | |
CN112116488A (en) | Water conservancy big data comprehensive maintenance system | |
CN108959244A (en) | The method and apparatus of address participle | |
CN104376063A (en) | Multithreading web crawler method based on sort management and real-time information updating system | |
CN103618652A (en) | Audit and depth analysis system and audit and depth analysis method of business data | |
CN104363253A (en) | Website security detecting method and device | |
CN113918793A (en) | Multi-source scientific and creative resource data acquisition method | |
US7428756B2 (en) | Access control over dynamic intellectual capital content | |
CN104363251A (en) | Website security detecting method and device | |
CN104378389A (en) | Website security detecting method and device | |
CN104363252A (en) | Website security detecting method and device | |
CN104618410B (en) | Resource supplying method and apparatus | |
Wang et al. | A novel blockchain oracle implementation scheme based on application specific knowledge engines | |
US20040230982A1 (en) | Assembly of business process using intellectual capital processing |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181116 |