CN108829792A - Scrapy-based distributed darknet resource mining system and method - Google Patents
Scrapy-based distributed darknet resource mining system and method
- Publication number
- CN108829792A CN108829792A CN201810558520.2A CN201810558520A CN108829792A CN 108829792 A CN108829792 A CN 108829792A CN 201810558520 A CN201810558520 A CN 201810558520A CN 108829792 A CN108829792 A CN 108829792A
- Authority
- CN
- China
- Prior art keywords
- darknet
- task
- crawler
- clearnet
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention relates to the field of data mining and discloses a Scrapy-based distributed darknet resource mining system and method that improve the efficiency, coverage, and flexibility of darknet resource mining. The system comprises a central-node control module and slave-node crawl modules. The central-node control module includes a crawler seed task queue, a task preprocessing module, a darknet task queue, and a clearnet (surface-web) task queue; each slave-node crawl module includes a darknet crawl module, a clearnet crawl module, and a crawler manager. Starting from darknet domain names that are supplied manually or harvested from the clearnet, the darknet and clearnet crawl modules extract further darknet domain names from darknet and clearnet pages, thereby collecting darknet domain names at scale and storing the corresponding darknet web pages. The invention is suited to darknet resource mining.
Description
Technical field
The present invention relates to the field of data mining, and in particular to a Scrapy-based distributed darknet resource mining system and method.
Background art
A darknet is a network that can only be accessed through special software or via non-standard communication protocols and ports. Tor is currently the most widely used anonymous darknet communication system; because of the darknet's strong anonymity and the difficulty of policing it, a large volume of illegal trade has bred there. Research into mining darknet resources is therefore of real importance. Traditional search engines and crawler techniques can only reach the small fraction of web content exposed on the public internet, i.e. clearnet information, and cannot mine darknet resources. Most existing research targets the deep web, i.e. surface-web content that standard search engines cannot index, rather than the darknet proper; the few studies that do crawl the darknet did not consider crawl efficiency, coverage, or flexibility in their design.
Scrapy is currently the most popular crawler framework. It is built on the Twisted asynchronous networking library, crawls faster than most other crawlers, and is highly customizable. However, the download module Scrapy provides is based on HTTP, whereas the darknet uses the SOCKS protocol. In addition, because web crawlers place heavy demands on I/O, Scrapy keeps the URLs waiting to be crawled in memory rather than on the hard disk. When crawling the darknet at scale, once the number of crawled pages reaches the tens of thousands, the number of URLs that must be stored can exceed millions or even tens of millions. Python is moreover a scripting language whose objects typically occupy far more memory than those of compiled languages such as C/C++, and Python's garbage collector does not release memory immediately when an object is no longer referenced. A single machine crawling with Scrapy is therefore very likely to exhaust its memory; memory becomes the bottleneck.
Summary of the invention
The technical problem to be solved by the present invention is to provide a Scrapy-based distributed darknet resource mining system and method that improve the efficiency, coverage, and flexibility of darknet resource mining.
To solve the above problem, the invention adopts the following technical solution:
A Scrapy-based distributed darknet resource mining system comprises a central-node control module and slave-node crawl modules. The central-node control module includes a crawler seed task queue, a task preprocessing module, a darknet task queue, and a clearnet task queue; each slave-node crawl module includes a darknet crawl module, a clearnet crawl module, and a crawler manager.
The crawler seed task queue stores the seed crawl tasks supplied by the user as well as the new seed crawl tasks extracted by the slave-node crawl modules. The task preprocessing module matches and de-duplicates the tasks in the crawler seed task queue, placing darknet tasks into the darknet task queue and clearnet tasks into the clearnet task queue.
The darknet crawler inside the darknet crawl module reads darknet crawl tasks from the darknet task queue, downloads the corresponding darknet pages, extracts new darknet domain names from those pages, and writes the extracted domain names back into the crawler seed task queue. The clearnet crawler inside the clearnet crawl module reads clearnet crawl tasks from the clearnet task queue, downloads the corresponding clearnet pages, extracts new clearnet and darknet domain names from them, and writes the extracted domain names into the crawler seed task queue. The crawler manager manages crawler processes according to request messages sent by the central-node control module.
Further, the invention also includes a Redis database that stores the darknet task queue and the clearnet task queue.
Further, the invention also includes a Kafka message system that stores the crawler seed task queue.
Further, the invention also includes a MongoDB database that stores the darknet pages downloaded by the darknet crawl module and the clearnet pages downloaded by the clearnet crawl module.
Further, the request messages sent by the central-node control module include crawler start requests and crawler stop requests; the crawler manager creates a crawler process on a start request and terminates the crawler process on a stop request.
Further, the darknet crawl module of the present invention downloads darknet pages through a Tor proxy. It enables the proxy by creating a proxychains configuration file and launching Scrapy through proxychains, which forces the darknet crawler's traffic through Tor.
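The configuration described above can be sketched as follows; the file contents are illustrative and assume a local Tor client exposing its default SOCKS port 9050:

```
# proxychains.conf (illustrative): route all crawler traffic through local Tor
strict_chain
proxy_dns                # resolve .onion names through the proxy chain
[ProxyList]
socks5 127.0.0.1 9050    # Tor's default SOCKS listener
```

The spider would then be launched through the proxy, e.g. `proxychains4 -f proxychains.conf scrapy crawl darknet_spider`, where `darknet_spider` is a hypothetical spider name.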
Further, the task preprocessing module of the present invention includes a darknet plug-in and a clearnet plug-in. The darknet plug-in matches darknet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for darknet tasks; the clearnet plug-in matches clearnet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for clearnet tasks.
Further, the darknet plug-in of the present invention matches darknet crawl tasks with the following regular expression:
^https?://(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9]{16}\.onion)(:|\/|$)
and the clearnet plug-in matches clearnet crawl tasks with the following regular expression:
((http|https)://)(([a-zA-Z0-9\._-]+\.(com|cn)))(\/[a-zA-Z0-9&%_\.\/~-]*)?.
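Because the published text garbles the escapes in both expressions, the versions below are hedged reconstructions; a small routine shows how a seed URL would be matched against them:

```python
import re

# Hedged reconstructions of the two expressions above (the exact character
# classes and escapes are assumptions; the published text garbles them).
DARKNET_RE = re.compile(r"^https?://(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9]{16}\.onion)(:|/|$)")
CLEARNET_RE = re.compile(r"((http|https)://)(([a-zA-Z0-9._-]+\.(com|cn)))(/[a-zA-Z0-9&%_./~-]*)?")

def classify(url):
    """Label a seed URL the way the two plug-ins would."""
    if DARKNET_RE.match(url):
        return "darknet"
    if CLEARNET_RE.match(url):
        return "clearnet"
    return "unmatched"

print(classify("http://abcdefgh23456789.onion/index"))  # darknet
print(classify("https://forum.example.com/thread/42"))  # clearnet
print(classify("ftp://example.org/file"))               # unmatched
```

The sample hostnames are illustrative 16-character onion addresses and clearnet domains, not addresses from the patent.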
Based on the above system, the invention also provides a Scrapy-based distributed darknet resource mining method comprising the following steps:
S1. Place a set of clearnet URLs that are likely to contain darknet domain names, together with a set of manually collected darknet domain names, into the crawler seed task queue of the central-node control module.
S2. Match and de-duplicate the tasks in the crawler seed task queue via the plug-ins of the task preprocessing module, storing the resulting darknet tasks in the darknet task queue and the clearnet tasks in the clearnet task queue.
S3. The darknet crawler reads darknet crawl tasks from the darknet task queue, downloads the corresponding darknet pages, extracts new darknet URLs from them, and writes the extracted URLs into the crawler seed task queue. The clearnet crawler reads clearnet crawl tasks from the clearnet task queue, downloads the corresponding clearnet pages, extracts new clearnet and darknet URLs from them, and writes the extracted URLs into the crawler seed task queue.
S4. Repeat steps S2-S3 until all darknet URLs and darknet pages have been collected.
Further, the clearnet URLs in step S1 refer to social-network sites, grey-market public forums, or websites hosting anonymously published content.
Further, the crawler seed task queue of step S1 is stored in the Kafka message system, and the darknet task queue and clearnet task queue of step S2 are stored in the Redis database.
The beneficial effects of the invention are as follows. Hidden-service domain names are rarely published openly and are therefore hard to collect. By fusing a darknet crawler and a clearnet crawler while keeping them independent, the invention crawls the clearnet both broadly and deeply to obtain more darknet domain names: starting from manually supplied domain names and those harvested from the clearnet, it extracts further darknet domain names from both darknet and clearnet pages. When harvesting darknet domain names from the clearnet is no longer needed, the central-node control module can send an HTTP request to the slave-node crawl modules to stop clearnet crawling without affecting the darknet crawl. When many darknet tasks are pending, the central-node control module can send an HTTP request to the slave-node crawl modules to start additional darknet crawlers, raising darknet crawl speed; the system is thus highly flexible. Because crawling places heavy demands on I/O and bandwidth, a single machine crawling the darknet and clearnet at scale would exhaust its memory and crawl slowly; the invention therefore stores the darknet and clearnet task queues in the in-memory database Redis, deployed on the central node, so that crawler processes on multiple machines can crawl in a distributed fashion, improving crawl speed. To resolve the mismatch between the rate at which slave-node crawlers emit new URL links and the rate at which the task preprocessing module can process them, the invention uses the Kafka message middleware to store the crawler seed task queue; since Kafka persists its messages, pending crawl tasks are not lost even if the central node goes down, ensuring the system's reliability. The invention thus collects darknet domain names at scale and stores darknet web pages, providing data support for subsequent analysis of the darknet.
Detailed description of the invention
Fig. 1 is the architecture diagram of the Scrapy-based distributed darknet data mining system;
Fig. 2 is the processing flow chart of the task preprocessing module;
Fig. 3 is the processing flow chart of the darknet crawl module.
Specific embodiment
Because hidden-service domain names are rarely published openly and are hard to collect, the invention fuses a darknet crawler and a clearnet crawler while keeping them independent, crawling the clearnet both broadly and deeply to obtain more darknet domain names: starting from manually supplied domain names and those harvested from the clearnet, it extracts further darknet domain names from darknet pages. When harvesting darknet domain names from the clearnet is no longer needed, the central-node control module can send an HTTP request to the slave-node crawl modules to stop clearnet crawling without affecting the darknet crawl. When many darknet tasks are pending, the central control node can send an HTTP request to the slave nodes to start additional darknet crawlers, raising darknet crawl speed.
The present invention is further described below with reference to an embodiment.
The embodiment provides a Scrapy-based distributed darknet resource mining system which, as shown in Fig. 1, comprises a central-node control module and slave-node crawl modules. The central-node control module includes a crawler seed task queue, a task preprocessing module, a darknet task queue, and a clearnet task queue; each slave-node crawl module includes a darknet crawl module, a clearnet crawl module, and a crawler manager. Each functional module is described in turn below.
The crawler seed task queue stores the seed crawl tasks supplied by the user as well as the new seed crawl tasks extracted by the slave-node crawl modules. The embodiment stores this queue in the Kafka message middleware; because Kafka persists its messages, pending crawl tasks are not lost even if the central node goes down, ensuring the system's reliability.
The task preprocessing module matches and de-duplicates the tasks in the crawler seed task queue, storing darknet tasks in the darknet task queue and clearnet tasks in the clearnet task queue. Specifically, the task preprocessing module includes a darknet plug-in and a clearnet plug-in. The darknet plug-in matches darknet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for darknet tasks; the clearnet plug-in matches clearnet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for clearnet tasks. The processing flow of the task preprocessing module is shown in Fig. 2. Because the task preprocessing module provides a dedicated plug-in for each task category, every task can quickly find its corresponding plug-in: each plug-in defines a regular expression for matching task URLs; if a URL matches a plug-in's expression, that plug-in parses it, otherwise the URL is handed to whichever other plug-in matches. Darknet tasks differ from clearnet tasks in that the darknet top-level domain has the form ABC.onion, where ABC is a 16-character string of digits and letters, whereas the clearnet top-level domain has the form XXX.com or XXX.cn. In the embodiment the darknet plug-in's regular expression may therefore be ^https?://(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9]{16}\.onion)(:|\/|$) and the clearnet plug-in's regular expression ((http|https)://)(([a-zA-Z0-9\._-]+\.(com|cn)))(\/[a-zA-Z0-9&%_\.\/~-]*)?. The darknet plug-in first checks whether a pending task has already been crawled; if not, the task is stored in the darknet task queue, otherwise it is discarded, avoiding repeated collection of the same page and the resulting waste of resources. It should be noted that the "?" appearing in this paragraph is a symbol in the regular expressions and does not indicate any doubt or uncertainty.
The darknet crawler inside the darknet crawl module reads darknet crawl tasks from the darknet task queue, downloads the corresponding darknet pages, extracts new darknet domain names from them, and writes the extracted domain names into the crawler seed task queue. The darknet crawl module comprises a scheduler submodule, a darknet page download submodule, a darknet URL extraction submodule, and a data pipeline submodule. The scheduler submodule obtains tasks from the darknet queue of the central-node control module and submits new crawl tasks to the central node's crawler seed task queue. The darknet page download submodule rewrites Scrapy's download component so that it can enter the Tor network and download darknet pages. The darknet URL extraction submodule rewrites Scrapy's spider component to extract the darknet links in a darknet page. The data pipeline submodule writes the extracted darknet links into the central node's crawler seed task queue and stores the darknet pages in the MongoDB database. Fig. 3 shows the crawl flow of the darknet crawl module; the specific steps are as follows:
a. The scheduler module takes a pending darknet task from the Redis key for darknet tasks and passes it through the engine to the darknet page download module.
b. The darknet page download module downloads the page through the Tor network and passes it through the engine to the darknet URL extraction module. The darknet page download module differs from the downloader of the stock Scrapy framework; the invention implements it by creating a proxychains configuration file and launching Scrapy through proxychains, forcing the darknet crawler's traffic through the Tor proxy.
c. The darknet URL extraction module extracts new darknet URLs from the page and passes them through the engine to the data pipeline module.
d. The data pipeline module writes the darknet URLs into the Kafka message queue and stores the darknet page in the MongoDB database.
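Steps c and d can be sketched with the standard library as follows; the `OnionLinkExtractor` name is illustrative, and plain lists stand in for the Kafka seed queue and the MongoDB page collection:

```python
from html.parser import HTMLParser

# Pull .onion links out of a downloaded page (step c) and route them to the
# seed queue while the page body goes to the page store (step d).
class OnionLinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and ".onion" in value:
                    self.links.append(value)

seed_queue, page_store = [], []   # stand-ins for Kafka and MongoDB

def pipeline(url, html):
    parser = OnionLinkExtractor()
    parser.feed(html)
    seed_queue.extend(parser.links)                 # new darknet URLs -> Kafka
    page_store.append({"url": url, "body": html})   # page body -> MongoDB

pipeline("http://abcdefgh23456789.onion/",
         '<a href="http://zyxwvuts98765432.onion/market">m</a>')
print(seed_queue)
```

The real extraction submodule rewrites Scrapy's spider component rather than using `HTMLParser` directly; this sketch only shows the dataflow.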
The clearnet crawler inside the clearnet crawl module reads clearnet crawl tasks from the clearnet task queue, downloads the corresponding clearnet pages, extracts new clearnet and darknet domain names from them, and writes the extracted clearnet and darknet domain names into the crawler seed task queue.
Because the master and slave nodes cannot communicate directly, the central node cannot continuously monitor the running state of each slave-node crawler or control the starting and stopping of slave-node crawlers. The invention therefore uses the Twisted Application framework to implement an HTTP-based crawler manager deployed on each slave node. During operation, each slave node repeatedly polls its queue of crawlers waiting to be launched and its queue of running crawler processes through the TimerService interface provided by the Twisted framework, and writes the current process state of the crawler node into Redis. The central node monitors each slave node by reading Redis, and remotely controls each slave-node crawler through HTTP requests.
The crawler manager manages crawler processes according to the request messages sent by the central-node control module. Internally it maintains three queues: crawlers waiting to be launched, running crawler processes, and finished crawler processes. The crawler manager receives HTTP start requests from the central node; a crawler start request carries the type of crawler to start (darknet crawler or clearnet crawler) and a task number identifying the crawler process to be started. Received requests are placed in the waiting queue, from which they are taken in turn to create crawler processes. On receiving a request to cancel a crawler process, the manager first parses the crawler name and task number to cancel from the request parameters. If the corresponding start task is still in the waiting queue, it is simply deleted from the database; if it cannot be found there, the manager traverses the running-crawler queue, finds the crawler process with the same task number, and calls Twisted's signalProcess interface to send that process a termination signal.
To create and terminate crawler processes, the embodiment calls the spawnProcess interface provided by the Twisted framework to create the corresponding process. The first parameter of this interface is a processProtocol object responsible for listening to all events related to the crawler process (such as the process being created successfully or the process ending). When it observes that a crawler process has been created successfully, it adds that process to the running-process queue; when it observes that a crawler process has ended, it removes the process from the running-process queue and stores the process information in the finished-process queue.
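The lifecycle above can be sketched with a standard-library analogue (Python's `subprocess` standing in for Twisted's `spawnProcess` and `signalProcess`; the queue names and the dummy child process are illustrative):

```python
import subprocess
import sys

waiting, running, finished = [], {}, []   # the manager's three queues

def start_crawler(task_no, crawler_type):
    """Analogue of handling a crawler start request: enqueue, then spawn."""
    waiting.append((task_no, crawler_type))
    task_no, crawler_type = waiting.pop(0)
    # A trivial sleeping child stands in for `scrapy crawl <crawler_type>`.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    running[task_no] = proc

def cancel_crawler(task_no):
    """Analogue of a cancel request: drop from waiting, or signal the process."""
    for item in waiting:
        if item[0] == task_no:
            waiting.remove(item)           # start task not yet launched
            return
    proc = running.pop(task_no)            # find the running process by task no.
    proc.terminate()                       # ~ signalProcess("TERM")
    proc.wait()
    finished.append(task_no)

start_crawler(1, "darknet")
cancel_crawler(1)
print(len(running), finished)
```

In the real manager these transitions are driven by the processProtocol callbacks rather than performed inline.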
Based on the above system, the embodiment also provides a Scrapy-based distributed darknet resource mining method comprising the following steps:
S1. Place a set of clearnet URLs that are likely to contain darknet domain names, together with a set of manually collected darknet domain names, into the crawler seed task queue of the central-node control module. The clearnet URLs here refer to social-network sites, grey-market public forums, and websites hosting anonymously published content, on which users are quite likely to have posted some Tor darknet domain names. The Kafka message system stores the crawler seed task queue and serves as the entry point into the system; because Kafka persists its messages and offers large storage capacity, seed crawl tasks are not lost, and as message middleware it also resolves the mismatch between producer and consumer processing speeds.
S2. Match and de-duplicate the tasks in the crawler seed task queue via the plug-ins of the task preprocessing module, storing the resulting darknet tasks in the darknet task queue and the clearnet tasks in the clearnet task queue. These two queues are stored in the in-memory database Redis, replacing the local in-memory task queue that the stock Scrapy framework creates, so that Scrapy instances on multiple machines can fetch tasks from the same database and crawl in a distributed fashion.
S3. The darknet crawler reads darknet crawl tasks from the darknet task queue, downloads the corresponding darknet pages, extracts new darknet URLs from them, and writes the extracted URLs into the crawler seed task queue. The clearnet crawler reads clearnet crawl tasks from the clearnet task queue, downloads the corresponding clearnet pages, extracts new clearnet and darknet URLs from them, and writes the extracted URLs into the crawler seed task queue.
S4. Repeat steps S2-S3 until all darknet URLs and darknet pages have been collected, thereby collecting darknet URLs at scale and storing the darknet web pages.
Claims (10)
1. A Scrapy-based distributed darknet resource mining system, characterized by comprising a central-node control module and slave-node crawl modules, wherein the central-node control module includes a crawler seed task queue, a task preprocessing module, a darknet task queue, and a clearnet task queue, and each slave-node crawl module includes a darknet crawl module, a clearnet crawl module, and a crawler manager;
the crawler seed task queue stores the seed crawl tasks supplied by the user and the new seed crawl tasks extracted by the slave-node crawl modules; the task preprocessing module matches and de-duplicates the tasks in the crawler seed task queue, storing darknet tasks in the darknet task queue and clearnet tasks in the clearnet task queue;
the darknet crawler in the darknet crawl module reads darknet crawl tasks from the darknet task queue, downloads the corresponding darknet pages, extracts new darknet domain names from them, and writes the extracted darknet domain names into the crawler seed task queue; the clearnet crawler in the clearnet crawl module reads clearnet crawl tasks from the clearnet task queue, downloads the corresponding clearnet pages, extracts new clearnet and darknet domain names from them, and writes the extracted domain names into the crawler seed task queue; the crawler manager manages crawler processes according to the request messages sent by the central-node control module.
2. The Scrapy-based distributed darknet resource mining system of claim 1, characterized by further comprising a Redis database that stores the darknet task queue and the clearnet task queue.
3. The Scrapy-based distributed darknet resource mining system of claim 1, characterized by further comprising a Kafka message system that stores the crawler seed task queue.
4. The Scrapy-based distributed darknet resource mining system of claim 1, characterized in that the darknet crawl module downloads darknet pages through a Tor proxy, enabled by creating a proxychains configuration file and launching Scrapy through proxychains, thereby forcing the darknet crawler's traffic through Tor.
5. The Scrapy-based distributed darknet resource mining system of claim 1, characterized in that the task preprocessing module includes a darknet plug-in and a clearnet plug-in; the darknet plug-in matches darknet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for darknet tasks; the clearnet plug-in matches clearnet crawl tasks and de-duplicates them, storing tasks that have not yet been crawled in the Redis key reserved for clearnet tasks.
6. The Scrapy-based distributed darknet resource mining system of claim 5, characterized in that the darknet plug-in matches darknet crawl tasks with the following regular expression:
^https?://(([a-z0-9_-]{1,64}\.){0,4}[a-z0-9]{16}\.onion)(:|\/|$)
and the clearnet plug-in matches clearnet crawl tasks with the following regular expression:
((http|https)://)(([a-zA-Z0-9\._-]+\.(com|cn)))(\/[a-zA-Z0-9&%_\.\/~-]*)?.
7. The Scrapy-based distributed darknet resource mining system of claim 1, characterized in that the request messages sent by the central-node control module include crawler start requests and crawler stop requests; the crawler manager creates a crawler process on a start request and terminates the crawler process on a stop request.
8. A Scrapy-based distributed darknet resource mining method, characterized by comprising the following steps:
S1. placing clearnet URLs that are likely to contain darknet domain names, together with manually collected darknet domain names, into the crawler seed task queue of the central-node control module;
S2. matching and de-duplicating the tasks in the crawler seed task queue via the plug-ins of the task preprocessing module, and storing the resulting darknet tasks and clearnet tasks in the darknet task queue and the clearnet task queue respectively;
S3. the darknet crawler reading darknet crawl tasks from the darknet task queue, downloading the corresponding darknet pages, extracting new darknet URLs from them, and writing the extracted URLs into the crawler seed task queue; the clearnet crawler reading clearnet crawl tasks from the clearnet task queue, downloading the corresponding clearnet pages, extracting new clearnet and darknet URLs from them, and writing the extracted URLs into the crawler seed task queue;
S4. repeating steps S2-S3 until all darknet URLs and darknet pages have been collected.
9. The Scrapy-based distributed darknet resource mining method of claim 8, characterized in that the clearnet URLs in step S1 refer to social-network sites, grey-market public forums, or websites hosting anonymously published content.
10. The Scrapy-based distributed darknet resource mining method of claim 8, characterized in that the crawler seed task queue of step S1 is stored in the Kafka message system, and the darknet task queue and clearnet task queue of step S2 are stored in the Redis database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810558520.2A CN108829792A (en) | 2018-06-01 | 2018-06-01 | Scrapy-based distributed darknet resource mining system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810558520.2A CN108829792A (en) | 2018-06-01 | 2018-06-01 | Scrapy-based distributed darknet resource mining system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108829792A true CN108829792A (en) | 2018-11-16 |
Family
ID=64145694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810558520.2A Pending CN108829792A (en) | 2018-06-01 | 2018-06-01 | Distributed darknet excavating resource system and method based on scrapy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108829792A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902212A (en) * | 2019-01-25 | 2019-06-18 | 中国电子科技集团公司第三十研究所 | A kind of darknet crawler system of customized dynamic expansion |
CN110119469A (en) * | 2019-05-22 | 2019-08-13 | 北京计算机技术及应用研究所 | A kind of data collection and transmission and method towards darknet |
CN110457559A (en) * | 2019-08-05 | 2019-11-15 | 深圳乐信软件技术有限公司 | Distributed data crawls system, method and storage medium |
CN110909178A (en) * | 2019-11-22 | 2020-03-24 | 上海交通大学 | System and method for collecting threat information of darknet and associating information |
CN111209460A (en) * | 2019-12-27 | 2020-05-29 | 青岛海洋科学与技术国家实验室发展中心 | Data acquisition system and method based on script crawler framework |
WO2020207022A1 (en) * | 2019-04-12 | 2020-10-15 | 深圳壹账通智能科技有限公司 | Scrapy-based data crawling method and system, terminal device, and storage medium |
CN112115331A (en) * | 2020-09-21 | 2020-12-22 | 朱彤 | Capital market public opinion monitoring method based on distributed web crawler and NLP |
CN112148956A (en) * | 2020-09-30 | 2020-12-29 | 上海交通大学 | Hidden net threat information mining system and method based on machine learning |
CN112417242A (en) * | 2020-11-09 | 2021-02-26 | 深圳市宝视佳科技有限公司 | Centralized management system of distributed crawlers |
CN112667873A (en) * | 2020-12-16 | 2021-04-16 | 北京华如慧云数据科技有限公司 | Crawler system and method suitable for general data acquisition of most websites |
CN113923193A (en) * | 2021-10-27 | 2022-01-11 | 北京知道创宇信息技术股份有限公司 | Network domain name association method, device, storage medium and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101051313A (en) * | 2007-05-09 | 2007-10-10 | 崔志明 | Integrated data source finding method for deep layer net page data source |
US20110113103A1 (en) * | 2006-03-17 | 2011-05-12 | Macrovision Corporation | Peer to peer gateway |
CN103116635A (en) * | 2013-02-07 | 2013-05-22 | 中国科学院计算技术研究所 | Field-oriented method and system for collecting invisible web resources |
CN105138561A (en) * | 2015-07-23 | 2015-12-09 | 中国测绘科学研究院 | Deep web space data acquisition method and apparatus |
CN107808000A (en) * | 2017-11-13 | 2018-03-16 | 哈尔滨工业大学(威海) | A kind of hidden web data collection and extraction system and method |
Non-Patent Citations (2)
Title |
---|
Liu Zehua et al.: "Design and Optimization of a Distributed Crawler Based on scrapy Technology", Wanfang Data *
Yang Yi et al.: "Detection of Darknet Space Resources Based on Tor", Communications Technology *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108829792A (en) | Distributed darknet excavating resource system and method based on scrapy | |
CN103297475A (en) | Mock service system and processing method of Mock service | |
CN107832468B (en) | Demand recognition methods and device | |
CN103942281B (en) | The method and device that a kind of object to persistent storage is operated | |
CN110351381A (en) | A kind of Distributed data share method that Internet of Things based on block chain is credible | |
CN102760151B (en) | Implementation method of open source software acquisition and searching system | |
CN101408877B (en) | System and method for loading tree node | |
CN107885777A (en) | A kind of control method and system of the crawl web data based on collaborative reptile | |
US20040230667A1 (en) | Loosely coupled intellectual capital processing engine | |
CN104933188B (en) | A kind of data synchronous system and method in patent personalization storehouse | |
CN101473628A (en) | Systems and methods for accelerating delivery of a computing environment to remote user | |
CN102073683A (en) | Distributed real-time news information acquisition system | |
CN112116488A (en) | Water conservancy big data comprehensive maintenance system | |
CN108959244A (en) | The method and apparatus of address participle | |
CN104376063A (en) | Multithreading web crawler method based on sort management and real-time information updating system | |
CN103618652A (en) | Audit and depth analysis system and audit and depth analysis method of business data | |
CN104363253A (en) | Website security detecting method and device | |
CN113918793A (en) | Multi-source scientific and creative resource data acquisition method | |
US7428756B2 (en) | Access control over dynamic intellectual capital content | |
CN104363251A (en) | Website security detecting method and device | |
CN104378389A (en) | Website security detecting method and device | |
CN104363252A (en) | Website security detecting method and device | |
CN104618410B (en) | Resource supplying method and apparatus | |
Wang et al. | A novel blockchain oracle implementation scheme based on application specific knowledge engines | |
US20040230982A1 (en) | Assembly of business process using intellectual capital processing |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181116 |