CN109829094A - Distributed crawler system - Google Patents
Distributed crawler system
- Publication number
- CN109829094A (application CN201910065959.6A)
- Authority
- CN
- China
- Prior art keywords
- module
- url
- distributed crawler
- distributed
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a distributed crawler system comprising a URL read/write module, a URL fetching module, a document parsing module, and a persistence module. The URL read/write module reads URLs from an input stream on the Map side of MapReduce and writes them to an output stream. The URL fetching module takes each URL written to the output stream as an access address and, on the Map side, downloads the target document at that address according to a preset network access mode. The document parsing module extracts, on the Map side, the target data from the target document in a preset manner. The persistence module stores, on the Map side, the target data in the Hadoop Distributed File System according to a preset path and persistence rule. The scheme modularizes the distributed crawler system, and the modules exchange information by passing data to one another. This improves the scalability, availability, and maintainability of the system, improves its scheduling capability, and allows its crawling performance to be fully used.
Description
Technical field
The present invention relates to the technical field of web crawlers, and in particular to a distributed crawler system.
Background
The Internet era has brought a rapid expansion in the volume of information, and big data and cloud computing have emerged in response. Internet companies, large telecommunications companies, sales companies, and similar businesses generate enormous volumes of logs and user-behavior information every day. Big data is characterized by huge volume, complex data types, low value density, and a need for fast processing. As a result, a traditional centralized web crawler is limited by bottlenecks in web page coverage and crawl time, and its scheduling capability is insufficient, so the crawling performance of the system is poor.
Summary of the invention
In view of this, the object of the present invention is to provide a distributed crawler system that overcomes the problems of current traditional centralized web crawlers: they are limited by bottlenecks in web page coverage and crawl time, their scheduling capability is insufficient, and their crawling performance is therefore poor.
To achieve this object, the present invention adopts the following technical solution:
A distributed crawler system, comprising: a Uniform Resource Locator (URL) read/write module, a URL fetching module, a document parsing module, and a persistence module;
The URL read/write module is configured to read URLs from an input stream on the Map side of MapReduce and to write the URLs to an output stream;
The URL fetching module is configured to take each URL written to the output stream as an access address and, on the Map side, to download the target document corresponding to the access address according to a preset network access mode;
The document parsing module is configured to extract, on the Map side, the target data from the target document in a preset manner and to detect the type of the target data; if the type is a target type, the target data is sent to the corresponding persistence module on the Reduce side of the MapReduce;
The persistence module is configured to store, on the Map side, the target data in the Hadoop Distributed File System according to a preset path and persistence rule.
Further, in the distributed crawler system described above, the document parsing module is also configured to:
If the type is not a target type, send the target data to the corresponding URL read/write module on the Reduce side of the MapReduce.
Further, the distributed crawler system described above also includes a central scheduling module;
The central scheduling module is configured to: configure data and send a start instruction when the distributed crawler system is initialized; delete temporary data, check the data results, and send a stop instruction after the distributed crawler system has finished working; receive the information reported by the nodes of the distributed crawler system and coordinate the work of the nodes according to that information; and maintain the connection between the distributed crawler system and the client and exchange data with the client.
Further, the distributed crawler system described above also includes a secondary scheduling module;
The secondary scheduling module is configured to take over from the central scheduling module when the central scheduling module fails.
Further, the distributed crawler system described above also includes a configuration class module;
The configuration class module is configured to: based on the CrawlerConfig class, automatically inject and parse the key-value pairs of different attributes and copy the configuration class to different nodes; based on the Configurable class, let the components of the crawler system that need configuration items obtain the corresponding functions of the CrawlerConfig class; and, based on a remote procedure call (RPC) class, implement remote calls using asynchronous input/output streams.
Further, the distributed crawler system described above also includes a driver module;
The driver module is configured to shield the client from the operating procedure of the configuration class module.
Further, in the distributed crawler system described above, the URL fetching module is also configured to:
Classify the target documents on the Reduce side according to a preset classification mode;
The preset classification mode includes at least one of initial link address, document type, crawl depth, and document subject.
Further, in the distributed crawler system described above, the URL read/write module is also configured to:
Determine, based on a Bloom filter, whether the URL has already been accessed;
If so, delete the URL that has already been accessed.
Further, in the distributed crawler system described above, the URL read/write module is also configured to:
Maintain the queue formed by the URLs, based on URL queuing.
Further, the distributed crawler system described above also includes a communication module, which is configured to:
Adjust the working rates of the modules so that the working rates of the modules match one another.
The distributed crawler system of the present invention includes a Uniform Resource Locator (URL) read/write module, a URL fetching module, a document parsing module, and a persistence module. The URL read/write module reads URLs from an input stream on the Map side of MapReduce and writes them to an output stream. The URL fetching module takes each URL written to the output stream as an access address and, on the Map side, downloads the target document corresponding to the access address according to a preset network access mode. The document parsing module extracts, on the Map side, the target data from the target document in a preset manner and detects the type of the target data; if the type is a target type, the target data is sent to the corresponding persistence module on the Reduce side of MapReduce. The persistence module stores, on the Map side, the target data in the Hadoop Distributed File System according to a preset path and persistence rule. The scheme modularizes the distributed crawler system: different modules implement specific functions and exchange information by passing data to one another. This reduces coupling within the system while improving its scalability, availability, and maintainability, improves its scheduling capability, and allows its crawling performance to be fully used.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a structural diagram of Embodiment 1 of the distributed crawler system of the present invention;
Fig. 2 is a structural diagram of Embodiment 2 of the distributed crawler system of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the invention are described in detail below. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a structural diagram of Embodiment 1 of the distributed crawler system of the present invention. As shown in Fig. 1, the distributed crawler system of this embodiment may include: a Uniform Resource Locator (URL) read/write module 11, a URL fetching module 12, a document parsing module 13, and a persistence module 14.
Specifically, the distributed crawler system of this embodiment can be implemented on the Hadoop Distributed File System. The Hadoop Distributed File System can serve as the data storage center, a unified resource scheduling system is provided on top of the file system, and many computing frameworks such as MapReduce can run on this system. Hadoop supports the MapReduce programming model, a simple abstraction that is easy to understand, can handle many problems in the big data field, and simplifies the analysis and processing of data. Further, the Map side of MapReduce reads and transforms data, converting the input data into the required key-value pairs; the Reduce side integrates the data produced by the previous step and analyzes and processes it.
The execution logic of MapReduce is as follows:
InputFormat: specifies the input data path, which can be the Hadoop file system or even a network stream, and the data processing format; it turns the data into the specified key-value pairs as input for the Map side;
Map: the number of Map tasks depends on the data size and the split size. Each Map task processes a different split and persists its output to local disk for Reduce to read later. A reasonable split size keeps the completion times of the Map tasks aligned and avoids computation skew that would lengthen the Map-side processing time;
Shuffle/Sort: the output of the Map side is transferred to the designated Reduce side by hashing or by a specified partitioning scheme, and the keys of each Reduce task are kept in order. Distributing the Reduce data volume sensibly avoids an unbalanced computing load;
Reduce: aggregates the values that share the same key, applies the same operation to them, and outputs the newly computed key-value pairs;
OutputFormat: checks the validity of the output path and persists the output key-value pairs of Reduce to the specified directory.
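The five stages listed above can be imitated in a single process. The following is only an illustrative sketch in plain Python, not Hadoop code: the function names `run_mapreduce`, `host_mapper`, and `count_reducer` and the example URLs are assumptions made for this illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer, num_reducers=2):
    """Single-process imitation of the Map -> Shuffle/Sort -> Reduce flow."""
    # Map: every input record yields zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle: route each pair to a reducer partition by hashing its key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped:
        partitions[hash(key) % num_reducers][key].append(value)
    # Sort + Reduce: each partition visits its keys in order and aggregates
    # all values that share a key.
    output = {}
    for partition in partitions:
        for key in sorted(partition):
            output[key] = reducer(key, partition[key])
    return output


def host_mapper(url):
    # Emit (host, 1); the host is the text between "//" and the next "/".
    host = url.split("//", 1)[-1].split("/", 1)[0]
    yield host, 1


def count_reducer(key, values):
    return sum(values)


counts = run_mapreduce(
    ["http://a.example/x", "http://a.example/y", "http://b.example/"],
    host_mapper,
    count_reducer,
)
```

Here `counts` maps each host to the number of URLs seen for it. In a real Hadoop job the partitions would live on different nodes and the Map output would be persisted to local disk before the shuffle.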
The MapReduce programming model of Hadoop is feature-rich and can handle many complex problems. On the Map side, for example, new input file types can be handled easily by defining a new input format class; between the Map side and the Reduce side, adding a Combiner stage and a Partitioner stage greatly reduces the network transfer load between nodes. The distributed crawler system of this embodiment is based on the MapReduce programming model and implements the core function of each part on the Map side or the Reduce side, so that the whole system is simply abstracted and easy to use and extend.
Specifically, in this embodiment, the URL read/write module 11 can read URLs from the input stream on the Map side of MapReduce and write them to the output stream. It also supplies URLs to the URL fetching module 12 at regular intervals according to the crawling strategy and the crawl time interval, checks the correctness and validity of the URLs, repairs incorrect URLs, and refreshes the URL library.
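The source does not say how URLs are checked or repaired. As one possible reading, the sketch below uses Python's standard `urllib.parse`; the function name `repair_url`, the default scheme, and the specific repair rules (add a missing scheme, lower-case the host, default the path to "/", drop the fragment) are assumptions, not the patent's method.

```python
from urllib.parse import urlsplit, urlunsplit


def repair_url(raw, default_scheme="http"):
    """Return a normalized URL, or None if the input cannot be repaired."""
    candidate = raw.strip()
    # Assumed repair rule: a URL without a scheme gets the default scheme.
    if "://" not in candidate:
        candidate = f"{default_scheme}://{candidate}"
    parts = urlsplit(candidate)
    # Only HTTP(S) URLs with a host are considered valid crawl targets.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    # Lower-case the host, default the path to "/", and drop the fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, "")
    )
```

For example, `repair_url(" Example.com/a#top ")` yields `http://example.com/a`, while a non-HTTP URL such as `ftp://x.org/f` is rejected.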
The URL fetching module 12 takes each URL written to the output stream as an access address and, on the Map side, downloads the target document corresponding to the access address according to a preset network access mode. Specifically, in this embodiment, the URL fetching module 12 can use the fetch processor interface FetchProcessor, whose default implementation can be Jsoup Fetch Processor, which can use Jsoup internally. The different target documents that are fetched can be wrapped into a uniform Document entity, which the document parsing module 13 is responsible for processing. Jsoup is an excellent page parsing tool that also provides basic HyperText Transfer Protocol (HTTP) download functionality. Its application programming interface (API) is easy to use, and it can parse online or local documents, so this embodiment uses Jsoup as the HyperText Markup Language (HTML) document parser.
The document parsing module 13 extracts, on the Map side, the target data from the target document in a preset manner and detects the type of the target data; if the type is a target type, the target data is sent to the corresponding persistence module 14 on the Reduce side of MapReduce. Specifically, in this embodiment, the parse processor interface Resolve Processor can be used, with Jsoup Resolve Processor as the default handler. The way in which target data is extracted from the target document can be defined by the user.
The persistence module 14 stores, on the Map side, the target data in the Hadoop Distributed File System according to a preset path and persistence rule, and can also check the validity of the output path. Specifically, in this embodiment, the store processor interface Store Processor can be used, and the data can be written to the file system according to the path and persistence rule selected by the user. Optionally, the persistence module 14 can be integrated into the URL fetching module 12.
This scheme modularizes the distributed crawler system: different modules implement specific functions and exchange information by passing data to one another. This reduces coupling within the system while improving its scalability, availability, and maintainability, improves its scheduling capability, and allows its crawling performance to be fully used.
Fig. 2 is a structural diagram of Embodiment 2 of the distributed crawler system of the present invention. This embodiment describes the technical solution of the invention on the basis of the embodiment above. Specifically, the document parsing module 13 of this embodiment can also be used to extract, on the Map side, the target data from the target document in a preset manner and detect its type; if the type is not the target type, the target data, which in that case is a new URL, can be sent to the corresponding URL read/write module on the Reduce side of MapReduce.
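The routing rule described above (target-type data goes to the persistence module, newly discovered URLs go back to the URL read/write module) can be sketched as follows. This is a hedged illustration using Python's standard `html.parser` rather than Jsoup; treating the page title as the "target type" is an assumption, and the names `LinkAndTitleParser` and `parse_and_route` are hypothetical.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collects the page title (assumed target data) and all link targets."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_and_route(html):
    """Split extracted data into (for persistence, for the URL module)."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    # Target-type data is routed to persistence; new URLs are fed back.
    to_persist = [("title", parser.title)] if parser.title else []
    to_url_module = [("url", link) for link in parser.links]
    return to_persist, to_url_module
```

Feeding the URLs back into the read/write module is what closes the crawl loop: each parsed document can seed the next round of fetching.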
The distributed crawler system of this embodiment may also include a central scheduling module 15. The central scheduling module 15 is the scheduling and coordination center of the distributed crawler system of this embodiment and can be used to: configure data when the distributed crawler system is initialized, forward each item of configuration data to each submodule, and send a start instruction to notify each module to start working;
After the distributed crawler system has finished working, delete temporary data, check the data results, and send a stop instruction;
Receive the information reported by the nodes of the distributed crawler system and coordinate the work of the nodes according to that information, replace failed nodes, coordinate the working rates of the modules, and maintain the data consistency of the whole distributed system so that the modules do not behave inconsistently;
Maintain the connection between the distributed crawler system and the client, exchange data with the client, report the progress of the system in real time, and receive operation instructions from the client.
Optionally, the URL read/write module 11 and the document parsing module 13 usually handle compute-intensive tasks, so they should be assigned to nodes with ample CPU resources. The persistence module 14 and the URL fetching module 12 need a large amount of disk I/O and network I/O, so they need stable, high-performance disk and network support.
The distributed crawler system of this embodiment may also include a secondary scheduling module 16, which takes over from the central scheduling module 15 when the central scheduling module 15 fails, so that a failure of the central scheduling module 15 does not paralyze the system.
The distributed crawler system of this embodiment may also include a configuration class module 17. A crawler involves many configuration items, such as crawl depth, crawl rules, and document parsing mode. In a distributed environment in particular, the different nodes must keep the values of the configuration items synchronized, which usually requires an efficient, thread-safe configuration class; this embodiment provides the CrawlerConfig class. The configuration class module 17 of this embodiment can, based on the crawler configuration class CrawlerConfig, automatically inject and parse the key-value pairs of different attributes, which keeps configuration logic from intruding into business logic and keeps shared data safe in a multithreaded environment. By implementing a copy function, it copies the configuration class to different nodes;
The CrawlerConfig class also implements a serialization mechanism to ensure efficient transmission over the network between nodes; in the distributed crawler system of this embodiment this is mainly used for remote procedure call (RPC) calls. The configuration system also has a Configurable class; every component of the crawler system that needs configuration items can implement this interface to obtain the corresponding functions of the CrawlerConfig class easily;
For communication between multiple nodes, the distributed crawler system of this embodiment can use the RPC class library of Hadoop. RPC is a communication system built on top of the Transmission Control Protocol (TCP) that provides services to the layers above it. The Socket class library that ships with Java is too low-level to meet the requirements of the system, and the serialization mechanism it provides occupies considerable network bandwidth, which makes it unsuitable for data communication in a distributed system. Hadoop RPC uses new asynchronous input/output streams and can implement efficient remote calls.
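The source names the CrawlerConfig and Configurable classes but gives no signatures. The following is a minimal Python sketch of the properties the text attributes to the configuration class (key-value storage, thread safety, copying, and serialization for transport). Every method name here, the `Fetcher` component, the `crawl_depth` key, and the use of JSON as the wire format are assumptions for illustration only.

```python
import copy
import json
import threading


class CrawlerConfig:
    """Thread-safe key-value configuration that can be copied and serialized."""

    def __init__(self, **items):
        self._lock = threading.Lock()
        self._items = dict(items)

    def get(self, key, default=None):
        with self._lock:
            return self._items.get(key, default)

    def set(self, key, value):
        with self._lock:
            self._items[key] = value

    def copy(self):
        # An independent copy, e.g. to hand to another node.
        with self._lock:
            return CrawlerConfig(**copy.deepcopy(self._items))

    def to_bytes(self):
        # JSON is an assumed wire format, not the one used by Hadoop RPC.
        with self._lock:
            return json.dumps(self._items, sort_keys=True).encode("utf-8")

    @classmethod
    def from_bytes(cls, payload):
        return cls(**json.loads(payload.decode("utf-8")))


class Configurable:
    """Mixin for components that need access to configuration items."""

    def configure(self, config):
        self.config = config


class Fetcher(Configurable):
    def max_depth(self):
        return self.config.get("crawl_depth", 1)
```

A component such as `Fetcher` never parses configuration itself; it only asks the injected `CrawlerConfig` for the values it needs, which is one way to keep configuration logic out of business logic.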
The distributed crawler system of this embodiment may also include a driver module 18, which shields the client from the operating procedure of the configuration class module 17. Specifically, the distributed crawler system of this embodiment has many configuration items, a complex startup, and many dependency rules between modules, especially in a distributed environment. Exposing the program interface of the crawler configuration class module 17 directly to the client would tend to make the system hard to use and would make it difficult to hide operational details. The distributed crawler system of this embodiment therefore provides, on top of its modular design, the driver module 18 based on the startup class CrawlerBooter, which acts as the crawler run driver. The driver module 18 first sets up the crawler runtime environment and checks the configuration content and the configuration class module 17, then builds the data stream and inserts both the default processors and the processors supplied by the user into it. The configuration system is integrated into the driver, which hides the concrete implementation of the configuration class from the user and offers configuration functions with clear meanings. The client can use the crawler system by initializing the driver class and applying a simple configuration.
The URL fetching module 12 of this embodiment can also classify the target documents on the Reduce side according to a preset classification mode, which includes at least one of initial link address, document type, crawl depth, and document subject.
The URL read/write module 11 of this embodiment can also determine, based on a Bloom filter, whether a URL has already been accessed; if so, the URL that has already been accessed can be deleted. Removing duplicate URLs substantially reduces the workload of the distributed crawler system of this embodiment and thus improves its efficiency.
Specifically, a Bloom filter is a binary vector data structure that can be used to determine whether a URL has already been accessed. A Bloom filter stores set information in a bit array of a certain length and uses a certain number of hash functions to map data into the bit array space. Its basic principle is as follows: all elements of the bit array are initially set to 0; each member of the set is evaluated with each of the specified hash functions, and the bit array position corresponding to each result is set to 1. With h hash functions, adding one member therefore sets at most h positions in the bit array to 1. To query whether an element is in the set, the element is evaluated with the same h functions in turn; if the value at any of the corresponding positions is 0, the element is definitely not in the set. If all of the positions are 1, the element is treated as present, although false positives are possible.
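The principle just described can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the class name `BloomFilter`, the default sizes, and the choice of salted SHA-256 digests as the h hash functions are all assumptions.

```python
import hashlib


class BloomFilter:
    """Bit-array set membership with h hash functions (false positives possible)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; all positions start at 0.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive h positions by salting one hash function with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Any 0 among the h positions proves the item was never added.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("http://a.example/")
```

After the `add` call, `"http://a.example/" in seen` is true, and at most `num_hashes` positions of the bit array are set. The memory cost is fixed by `num_bits` regardless of how many URLs are added, which is why the structure suits very large URL sets.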
The URL read/write module 11 of this embodiment can also maintain the queue formed by the URLs, based on the URL queuing class QueueURLProcessor. The QueueURLProcessor class manages URLs by maintaining a queue internally. After its thread starts, the URL read/write module 11 continuously reads URLs from the input stream, first filters out duplicate URLs with the Bloom filter, then verifies the validity of each URL, and finally forms the URL queue. The QueueURLProcessor class maintains and manages the URL queue and writes the URLs to the output stream in a certain order.
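The filter, validate, and enqueue sequence described above can be sketched as follows. Only the class name QueueURLProcessor comes from the source; the methods `offer` and `drain`, the first-in-first-out ordering, and the use of a plain Python set as a stand-in for the Bloom filter are assumptions made to keep the sketch self-contained.

```python
from collections import deque
from urllib.parse import urlsplit


class QueueURLProcessor:
    """Deduplicates and validates URLs, then releases them in FIFO order."""

    def __init__(self):
        self._seen = set()     # stand-in for the Bloom filter
        self._queue = deque()

    def offer(self, url):
        """Try to enqueue a URL; return True if it was accepted."""
        # Step 1: drop URLs that were already seen.
        if url in self._seen:
            return False
        # Step 2: verify validity (assumed rule: HTTP(S) scheme plus a host).
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            return False
        # Step 3: remember the URL and append it to the queue.
        self._seen.add(url)
        self._queue.append(url)
        return True

    def drain(self):
        """Yield queued URLs in arrival order, emptying the queue."""
        while self._queue:
            yield self._queue.popleft()
```

Swapping the set for a Bloom filter would trade exact deduplication for bounded memory, at the cost of occasionally skipping a URL that was never actually visited.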
Specifically, the distributed crawler system of this embodiment may also include a communication module 19, which can adjust the working rates of the modules so that the working rates of the modules match one another.
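The source does not describe how the communication module matches working rates. One common technique, shown here purely as an assumed illustration, is a bounded queue between two stages: a fast producer blocks when the queue is full, so it cannot outrun a slower consumer. The function name `run_pipeline` and the capacity value are hypothetical.

```python
import queue
import threading


def run_pipeline(urls, capacity=2):
    """Pass URLs from a producer stage to a consumer stage with backpressure."""
    # put() blocks while the queue holds `capacity` items, which throttles
    # the producer to the consumer's pace.
    channel = queue.Queue(maxsize=capacity)
    fetched = []

    def producer():
        for url in urls:
            channel.put(url)
        channel.put(None)  # sentinel: no more work

    def consumer():
        while True:
            url = channel.get()
            if url is None:
                break
            fetched.append(url)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return fetched
```

With a single producer and a single consumer the items arrive in the order they were sent, so the call returns the input list unchanged while never holding more than `capacity` items in flight.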
Specifically, the distributed crawler system of this embodiment can maintain its URL pool in an on-disk data structure, which accommodates today's huge data volumes and avoids heap overflow caused by excessive data volume. Identifiers can be provided for the data pipes in the data stream to identify the data type, which makes it convenient to pass more information between modules. The distributed crawler system of this embodiment can configure its processors through the Spring container, moving configuration details out of the code and making the crawler system flexible to configure.
Further, a run test was carried out on the distributed crawler system of this embodiment. The running environment was a Hadoop cluster of 4 CentOS virtual machines, and the test analyzed the availability and scalability of the crawler system. The test data were accessible URLs; the system selected a certain number of URLs at random from the URL library for initialization. The test results are shown in Table 1.
Crawl depth | Parallel thread count | Test result |
---|---|---|
4 | 5 | Fetched 100,000 records; speed stable |
5 | 5 | Fetched 1,000,000 records; speed unchanged |
6 | 5 | Fetched 40,000,000 records; speed decreased |
5 | 5 | Fetch speed stable |
5 | 10 | Fetch speed increased |
5 | 15 | Fetch speed decreased |
Table 1
With the number of crawl threads fixed, the crawl depth has a huge effect on the data volume; the relationship between the two is exponential. With the crawl depth fixed, moderately increasing the thread count speeds up fetching, but when there are too many fetch threads the speed drops instead, because a large number of threads increases the burden of thread switching and resource contention on the system. The test results are consistent with expectations. This scheme modularizes the distributed crawler system: different modules implement specific functions and exchange information by passing data to one another. This reduces coupling within the system while improving its scalability, availability, and maintainability, improves its scheduling capability, and allows its crawling performance to be fully used.
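The exponential relationship between crawl depth and data volume can be made concrete under a simplifying assumption that is not stated in the source: suppose each fetched page contributes on average b previously unseen links, where b is a hypothetical branching factor. Starting from a single seed URL, the number of pages reachable within depth d is then a geometric series:

```latex
N(d) = \sum_{k=0}^{d} b^{k} = \frac{b^{d+1} - 1}{b - 1}, \qquad b > 1
```

Each additional level therefore multiplies the data volume by roughly b. The record counts in Table 1 grow by factors of 10 and 40 between successive depths, which suggests that the real branching factor is not constant across levels.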
It should be understood that the same or similar parts of the above embodiments may refer to one another; content not detailed in one embodiment may be found in the same or similar content of other embodiments.
It should be noted that, in the description of the present invention, the terms "first", "second", and so on are used for description only and are not to be understood as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise stated, "multiple" means at least two.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, may each exist physically on their own, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In this specification, a description referring to "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that the specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that they are exemplary and are not to be construed as limiting the invention; a person of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention.
Claims (10)
1. A distributed crawler system, characterized by comprising: a Uniform Resource Locator (URL) read/write module, a URL fetching module, a document parsing module, and a persistence module;
The URL read/write module is configured to read URLs from an input stream on the Map side of MapReduce and to write the URLs to an output stream;
The URL fetching module is configured to take each URL written to the output stream as an access address and, on the Map side, to download the target document corresponding to the access address according to a preset network access mode;
The document parsing module is configured to extract, on the Map side, the target data from the target document in a preset manner and to detect the type of the target data; if the type is a target type, the target data is sent to the corresponding persistence module on the Reduce side of the MapReduce;
The persistence module is configured to store, on the Map side, the target data in the Hadoop Distributed File System according to a preset path and persistence rule.
2. The distributed crawler system according to claim 1, characterized in that the document parsing module is also configured to:
If the type is not a target type, send the target data to the corresponding URL read/write module on the Reduce side of the MapReduce.
3. The distributed crawler system according to claim 1, characterized by further comprising a central scheduling module;
The central scheduling module is configured to: configure data and send a start instruction when the distributed crawler system is initialized; delete temporary data, check the data results, and send a stop instruction after the distributed crawler system has finished working; receive the information reported by the nodes of the distributed crawler system and coordinate the work of the nodes according to that information; and maintain the connection between the distributed crawler system and the client and exchange data with the client.
4. The distributed crawler system according to claim 3, characterized by further comprising a secondary scheduling module;
The secondary scheduling module is configured to take over from the central scheduling module when the central scheduling module fails.
5. The distributed crawler system according to claim 3, characterized by further comprising a configuration class module;
The configuration class module is configured to: based on the CrawlerConfig class, automatically inject and parse the key-value pairs of different attributes and copy the configuration class to different nodes; based on the Configurable class, let the components of the crawler system that need configuration items obtain the corresponding functions of the CrawlerConfig class; and, based on a remote procedure call (RPC) class, implement remote calls using asynchronous input/output streams.
6. The distributed crawler system according to claim 5, characterized by further comprising a driver module;
The driver module is configured to shield the client from the operating procedure of the configuration class module.
7. distributed reptile system according to claim 1, which is characterized in that the URL handling module is also used to:
Based on the end Reduce, classify according to default mode classification to the destination document;
The default mode classification includes initial link address, Doctype, at least one of crawls depth, document subject matter.
8. The distributed crawler system according to claim 1, characterized in that the URL module for reading and writing is further configured to:
Judge, based on a Bloom filter, whether the URL has already been accessed;
If so, delete the URL that has been accessed.
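The Bloom-filter check in claim 8 can be sketched as follows. This is an illustrative sketch under assumptions, not the patent's implementation: the class `BloomFilter`, the helper `drop_visited`, the bit-array size and the number of hash functions are all hypothetical choices, and the bit positions are derived from SHA-256 by double hashing.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; sizes are illustrative assumptions."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def drop_visited(urls, seen):
    """Return only URLs the filter has not seen, recording them as visited."""
    fresh = []
    for url in urls:
        if url in seen:
            continue  # reported as already accessed: delete it from the batch
        seen.add(url)
        fresh.append(url)
    return fresh
```

A Bloom filter never misses a URL that was added, but it can occasionally report an unseen URL as visited, so a crawler built this way trades a small chance of skipping a page for a compact, constant-size record of visited URLs.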
9. The distributed crawler system according to claim 1, characterized in that the URL module for reading and writing is further configured to:
Maintain the URL queue formed from the queued URLs.
10. The distributed crawler system according to any one of claims 1-9, characterized in that it further comprises a communication module, the communication module being configured to:
Adjust the operating rates of the modules so that the operating rates of the modules match one another.
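Claim 10 does not say how the operating rates are matched. One common technique, shown here only as an assumed illustration, is a bounded queue between two stages: a fast producer blocks when the queue is full, so it cannot outrun a slower consumer. The function `run_pipeline` and its parameters `fetch`, `parse` and `buffer_size` are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(urls, fetch, parse, buffer_size=4):
    """Run a fetch stage and a parse stage in two threads joined by a bounded queue."""
    channel = queue.Queue(maxsize=buffer_size)
    results = []

    def fetch_stage():
        try:
            for url in urls:
                channel.put(fetch(url))  # blocks while the queue is full
        finally:
            channel.put(_DONE)  # always signal the end, even if fetch raised

    def parse_stage():
        while True:
            item = channel.get()  # blocks while the queue is empty
            if item is _DONE:
                break
            results.append(parse(item))

    workers = [threading.Thread(target=fetch_stage),
               threading.Thread(target=parse_stage)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

With one producer and one consumer the results keep the input order. The sketch assumes `parse` does not raise; a production version would also need to handle failures in the consuming stage.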
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910065959.6A CN109829094A (en) | 2019-01-23 | 2019-01-23 | Distributed reptile system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910065959.6A CN109829094A (en) | 2019-01-23 | 2019-01-23 | Distributed reptile system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109829094A (en) | 2019-05-31 |
Family
ID=66862353
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910065959.6A Withdrawn CN109829094A (en) | 2019-01-23 | 2019-01-23 | Distributed reptile system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109829094A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753162A (en) * | 2020-06-29 | 2020-10-09 | 平安国际智慧城市科技股份有限公司 | Data crawling method, device, server and storage medium |
CN112231320A (en) * | 2020-10-16 | 2021-01-15 | 南京信息职业技术学院 | Web data acquisition method, system and storage medium based on MapReduce algorithm |
CN114528069A (en) * | 2022-01-27 | 2022-05-24 | 西安电子科技大学 | Method and equipment for providing limited supervision internet access service in information security competition |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8707313B1 (en) * | 2003-07-03 | 2014-04-22 | Google Inc. | Scheduler for search engine crawler |
CN104156389A (en) * | 2014-07-04 | 2014-11-19 | 重庆邮电大学 | Deep packet detecting system and method based on Hadoop platform |
CN104598536A (en) * | 2014-12-29 | 2015-05-06 | 浙江大学 | Structured processing method of distributed network information |
US9152716B1 (en) * | 2012-12-31 | 2015-10-06 | Emc Corporation | Techniques for verifying search results over a distributed collection |
CN105260388A (en) * | 2015-09-11 | 2016-01-20 | 广州极数宝数据服务有限公司 | Optimization method of distributed vertical crawler service system |
US9614869B2 (en) * | 2013-11-23 | 2017-04-04 | Universidade da Coruña—OTRI | System and server for detecting web page changes |
CN108073693A (en) * | 2017-12-07 | 2018-05-25 | 国家计算机网络与信息安全管理中心 | A kind of distributed network crawler system based on Hadoop |
- 2019-01-23: CN application CN201910065959.6A, published as CN109829094A (en); status: Withdrawn
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8707313B1 (en) * | 2003-07-03 | 2014-04-22 | Google Inc. | Scheduler for search engine crawler |
US9152716B1 (en) * | 2012-12-31 | 2015-10-06 | Emc Corporation | Techniques for verifying search results over a distributed collection |
US9614869B2 (en) * | 2013-11-23 | 2017-04-04 | Universidade da Coruña—OTRI | System and server for detecting web page changes |
CN104156389A (en) * | 2014-07-04 | 2014-11-19 | 重庆邮电大学 | Deep packet detecting system and method based on Hadoop platform |
CN104598536A (en) * | 2014-12-29 | 2015-05-06 | 浙江大学 | Structured processing method of distributed network information |
CN105260388A (en) * | 2015-09-11 | 2016-01-20 | 广州极数宝数据服务有限公司 | Optimization method of distributed vertical crawler service system |
CN108073693A (en) * | 2017-12-07 | 2018-05-25 | 国家计算机网络与信息安全管理中心 | A kind of distributed network crawler system based on Hadoop |
Non-Patent Citations (2)
Title |
---|
YASSERG: "crawler4j-4.3", 《HTTPS://GITHUB.COM/YASSERG/CRAWLER4J/BLOB/CRAWLER4J-4.3/SRC/MAIN/JAVA/EDU/UCI/ICS/CRAWLER4J》 * |
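Zhang Xiaotian: "Research on Bloom Filters in Distributed Crawler Applications", China Master's Theses Full-text Database, Information Science and Technology * |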
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753162A (en) * | 2020-06-29 | 2020-10-09 | 平安国际智慧城市科技股份有限公司 | Data crawling method, device, server and storage medium |
CN112231320A (en) * | 2020-10-16 | 2021-01-15 | 南京信息职业技术学院 | Web data acquisition method, system and storage medium based on MapReduce algorithm |
CN112231320B (en) * | 2020-10-16 | 2024-02-20 | 南京信息职业技术学院 | Web data acquisition method, system and storage medium based on MapReduce algorithm |
CN114528069A (en) * | 2022-01-27 | 2022-05-24 | 西安电子科技大学 | Method and equipment for providing limited supervision internet access service in information security competition |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10447772B2 (en) | Managed function execution for processing data streams in real time | |
US20230004434A1 (en) | Automated reconfiguration of real time data stream processing | |
CN108681569B (en) | Automatic data analysis system and method thereof | |
US9582541B2 (en) | Systems, methods, and computer program products to ingest, process, and output large data | |
CN101661494B (en) | Data interactive method for distributed middleware and database | |
WO2017071134A1 (en) | Distributed tracking system | |
CN103246749B (en) | The matrix database system and its querying method that Based on Distributed calculates | |
US8959519B2 (en) | Processing hierarchical data in a map-reduce framework | |
CN106980669A (en) | A kind of storage of data, acquisition methods and device | |
CN108510082A (en) | The method and device that machine learning model is handled | |
CN104317928A (en) | Service ETL (extraction-transformation-loading) method and service ETL system both based on distributed database | |
CN105677251B (en) | Storage system based on Redis cluster | |
CN109829094A (en) | Distributed reptile system | |
CN108881485A (en) | The method for ensureing the high concurrent system response time under big data packet | |
CN107092627A (en) | The column-shaped storage of record is represented | |
CN105183470A (en) | Natural language processing systematic service platform | |
Vega et al. | Loginson: a transform and load system for very large-scale log analysis in large IT infrastructures | |
CN111984505A (en) | Operation and maintenance data acquisition engine and acquisition method | |
Koo et al. | IoT-enabled directed acyclic graph in spark cluster | |
CN106570151A (en) | Data collection processing method and system for mass files | |
CN109614241A (en) | The method and system of more cluster multi-tenant resource isolations are realized based on Yarn queue | |
CN109614380A (en) | Log processing method, system, computer equipment and readable medium | |
CN109800271A (en) | A kind of information collecting method based on big data | |
CN109324892A (en) | Distribution management method, distributed management system and device | |
CN110347731A (en) | Obtain the method and system of data |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20190531 |