CN109829094A

CN109829094A - Distributed crawler system

Info

Publication number: CN109829094A
Application number: CN201910065959.6A
Authority: CN
Inventors: 张跃进; 胡勇; 喻蒙; 王猛; 王娟; 杜飞
Original assignee: Zhongxiang Bo Qian Mdt Infotech Ltd
Current assignee: Zhongxiang Bo Qian Mdt Infotech Ltd
Priority date: 2019-01-23
Filing date: 2019-01-23
Publication date: 2019-05-31

Abstract

The present invention relates to a distributed crawler system comprising a URL read/write module, a URL fetching module, a document parsing module and a persistence module. The URL read/write module reads URLs from an input stream on the Map side of MapReduce and writes them to an output stream. The URL fetching module takes each URL written to the output stream as an access address and, on the Map side, downloads the target document at that address according to a preset network access mode. The document parsing module, on the Map side, extracts the target data from the target document in a preset manner. The persistence module, on the Map side, stores the target data in the Hadoop Distributed File System according to a preset path and persistence rule. This scheme modularizes the distributed crawler system: the modules exchange information by passing data to one another, which improves the scalability, availability and maintainability of the system, improves its scheduling capability, and allows its crawling performance to be fully used.

Description

Distributed crawler system

Technical field

The present invention relates to the technical field of web crawlers, and in particular to a distributed crawler system.

Background art

With the arrival of the Internet era, the volume of information has expanded rapidly, and big data and cloud computing have emerged in response. Internet companies, large telecommunications companies, sales companies and the like generate massive amounts of logs and user behavior information every day. Big data is characterized by huge volume, complex data types, low value density and a demand for fast processing. As a result, traditional centralized web crawlers are limited by bottlenecks in web page coverage and crawl time, and their scheduling capability is insufficient, so their crawling performance is poor.

Summary of the invention

In view of this, the object of the present invention is to provide a distributed crawler system that overcomes the problems of current traditional centralized web crawlers: bottlenecks in web page coverage and crawl time, insufficient scheduling capability, and poor crawling performance.

To achieve the above object, the present invention adopts the following technical solution:

A distributed crawler system, comprising: a uniform resource locator (URL) read/write module, a URL fetching module, a document parsing module and a persistence module;

The URL read/write module is configured to read URLs from an input stream on the Map side of MapReduce and write the URLs to an output stream;

The URL fetching module is configured to take each URL written to the output stream as an access address and, on the Map side, download the target document corresponding to the access address according to a preset network access mode;

The document parsing module is configured to, on the Map side, extract the target data from the target document in a preset manner and detect the type of the target data; if the type is the target type, the target data is sent to the corresponding persistence module via the Reduce side of the MapReduce;

The persistence module is configured to, on the Map side, store the target data in the Hadoop Distributed File System according to a preset path and persistence rule.

Further, in the distributed crawler system described above, the document parsing module is also configured to:

If the type is not the target type, send the target data to the corresponding URL read/write module via the Reduce side of the MapReduce.

Further, the distributed crawler system described above also includes a central scheduling module;

The central scheduling module is configured to: when the distributed crawler system is initialized, configure data and send a start instruction; after the distributed crawler system finishes working, delete temporary data, check the data results and send a stop instruction; receive the information reported by the nodes of the distributed crawler system and coordinate the work of the nodes according to that information; and maintain the connection between the distributed crawler system and a client and exchange data with the client.

Further, the distributed crawler system described above also includes a secondary scheduling module;

The secondary scheduling module is configured to take over from the central scheduling module when the central scheduling module fails.

Further, the distributed crawler system described above also includes a configuration class module;

The configuration class module is configured to: based on a CrawlerConfig class, automatically inject and parse key-value pairs of different attributes, and copy the configuration class to different nodes; based on a Configurable class, let the components of the crawler system that use configuration items obtain the corresponding functions of the CrawlerConfig class; and, based on a remote procedure call (RPC) class, implement remote calls using asynchronous input/output streams.

Further, the distributed crawler system described above also includes a driver module;

The driver module is configured to hide the operating procedure of the configuration class module from the client.

Further, in the distributed crawler system described above, the URL fetching module is also configured to:

On the Reduce side, classify the target documents according to a preset classification mode;

The preset classification mode includes at least one of initial link address, document type, crawl depth and document topic.

Further, in the distributed crawler system described above, the URL read/write module is also configured to:

Determine, based on a Bloom filter, whether the URL has already been visited;

If so, delete the visited URL.

Further, in the distributed crawler system described above, the URL read/write module is also configured to:

Maintain the queue formed by the URLs based on a URL queue processing class.

Further, the distributed crawler system described above also includes a communication module, and the communication module is configured to:

Adjust the working rates of the modules so that the working rates of the modules match one another.

The distributed crawler system of the present invention includes a uniform resource locator (URL) read/write module, a URL fetching module, a document parsing module and a persistence module. The URL read/write module reads URLs from an input stream on the Map side of MapReduce and writes them to an output stream. The URL fetching module takes each URL written to the output stream as an access address and, on the Map side, downloads the target document at that address according to a preset network access mode. The document parsing module, on the Map side, extracts the target data from the target document in a preset manner and detects the type of the target data; if the type is the target type, the target data is sent to the corresponding persistence module via the Reduce side of MapReduce. The persistence module, on the Map side, stores the target data in the Hadoop Distributed File System according to a preset path and persistence rule. This scheme modularizes the distributed crawler system: different modules implement specific functions and exchange information by passing data to one another. This reduces the coupling of the system while improving its scalability, availability and maintainability, improves its scheduling capability, and allows its crawling performance to be fully used.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a structural diagram of embodiment one of the distributed crawler system of the present invention;

Fig. 2 is a structural diagram of embodiment two of the distributed crawler system of the present invention.

Detailed description of the embodiments

To make the object, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described in detail below. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Fig. 1 is a structural diagram of embodiment one of the distributed crawler system of the present invention. As shown in Fig. 1, the distributed crawler system of this embodiment may include: a uniform resource locator (URL, Uniform Resource Locator) read/write module 11, a URL fetching module 12, a document parsing module 13 and a persistence module 14.

Specifically, the distributed crawler system of this embodiment can be implemented on the Hadoop Distributed File System. The Hadoop Distributed File System serves as the data storage center; a unified resource scheduling system is provided on top of the file system, and many computing frameworks such as MapReduce can run on it. Hadoop supports the MapReduce programming model, a simple abstraction that is easy to understand and can handle many problems in the big data field; the model simplifies the analysis and processing of data. The Map side of MapReduce reads and transforms data, converting the input data into the required key-value pairs; the Reduce side consolidates, analyzes and processes the data produced by the previous step. The execution logic of MapReduce is as follows:

InputFormat: specifies the input data path, which can be the Hadoop file system or even a network stream, and specifies the data processing format; the data is converted into the specified key-value pairs as input for the Map side;

Map: the number of Map tasks depends on the data size and the split size. Each Map task processes a different split, and its output is persisted to local disk to be read later by Reduce. A reasonable split size keeps the completion times of the Map tasks in step and avoids computation skew that would lengthen the Map-side processing time;

Shuffle/Sort: the output of the Map side is transferred to the designated Reduce side by hashing or by a specified partitioning scheme, with the keys of each Reduce task kept in order. Distributing the Reduce data volume reasonably avoids unbalanced computing load;

Reduce: Reduce aggregates the values that share the same key, performs the same operation on them, and outputs the new key-value pairs after the computation;

OutputFormat: checks the legality of the output path and persists the key-value pairs output by Reduce to the specified directory.
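The execution logic listed above can be illustrated with a small in-memory sketch. It is not Hadoop code: the function names, the per-host counting job and the byte-sum partitioner are assumptions chosen only to make the Map, Shuffle/Sort and Reduce stages visible.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_phase(record):
    """Map: convert one input record (a URL) into a (key, value) pair."""
    yield urlsplit(record).netloc, 1


def shuffle(pairs, num_reducers):
    """Shuffle/Sort: partition pairs by key and keep each partition's keys ordered."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Deterministic toy partitioner (assumption): byte sum modulo reducer count.
        index = sum(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    return [sorted(partition.items()) for partition in partitions]


def reduce_phase(key, values):
    """Reduce: aggregate all values that share the same key."""
    return key, sum(values)


def run_job(records, num_reducers=2):
    """Run the three stages in sequence and collect the reducer output."""
    pairs = [pair for record in records for pair in map_phase(record)]
    output = {}
    for partition in shuffle(pairs, num_reducers):
        for key, values in partition:
            out_key, out_value = reduce_phase(key, values)
            output[out_key] = out_value
    return output
```

Calling `run_job` on a list of URLs returns the number of URLs per host; in a real cluster each stage would run on separate nodes and the output would be written by an OutputFormat.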

The MapReduce programming model of Hadoop is feature-rich and can handle many complex problems. On the Map side, for example, a wide range of input file types can be handled easily by defining a new input format class; between the Map side and the Reduce side, adding a Combiner stage and a Partitioner stage greatly reduces the network transfer load between nodes. Based on the MapReduce programming model, the distributed crawler system of this embodiment uses the Map side or the Reduce side to implement the core function of each part, so that the whole system is simple in abstraction and easy to use and extend.

Specifically, in this embodiment, the URL read/write module 11 reads URLs from the input stream on the Map side of MapReduce and writes them to the output stream. It also supplies URLs to the URL fetching module 12 periodically according to the crawler's fetching strategy and crawl interval, checks the correctness and legality of the URLs, repairs incorrect URLs, and refreshes the URL library.
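The source does not say how URLs are checked or repaired. The following is a minimal sketch of one plausible approach; the function name `repair_url`, the allowed schemes and the repair rules (adding a missing scheme, lower-casing the host, dropping the fragment) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web URLs are crawled


def repair_url(raw):
    """Return a cleaned-up URL, or None if the input cannot be repaired."""
    candidate = raw.strip()
    if not candidate:
        return None
    if "://" not in candidate:
        # Assumed repair rule: a bare host/path is treated as plain HTTP.
        candidate = "http://" + candidate
    parts = urlsplit(candidate)
    if parts.scheme.lower() not in ALLOWED_SCHEMES or not parts.netloc:
        return None
    # Normalize: lower-case scheme and host, default path "/", no fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )
```

Normalizing before deduplication means that two spellings of the same address are treated as one URL by later stages.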

The URL fetching module 12 takes each URL written to the output stream as an access address and, on the Map side, downloads the target document at that address according to a preset network access mode. Specifically, in this embodiment, the URL fetching module 12 can use the fetch processor interface FetchProcessor; the default implementation is JsoupFetchProcessor, which uses Jsoup internally. The different target documents fetched can be wrapped into a unified Document entity, which the document parsing module 13 is responsible for processing. Jsoup is an excellent page parsing tool that also provides basic HyperText Transfer Protocol (HTTP) download functions; its application programming interface (API) is easy to use, and it can parse online or local documents. This embodiment therefore uses Jsoup as the HyperText Markup Language (HTML) document parser.
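The idea of a pluggable fetch processor that wraps every download in one unified Document entity can be sketched as follows. The patent's implementation is Java with Jsoup; this Python sketch only mirrors the structure, and the class `TransportFetchProcessor`, the `fake_transport` stub and the fields of `Document` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Tuple


@dataclass(frozen=True)
class Document:
    """Unified wrapper for any fetched target document (fields are assumed)."""
    url: str
    content_type: str
    body: str


class FetchProcessor(Protocol):
    """Anything with a fetch(url) -> Document method can be plugged in."""

    def fetch(self, url: str) -> Document: ...


class TransportFetchProcessor:
    """Hypothetical default processor that delegates the download to a transport."""

    def __init__(self, transport: Callable[[str], Tuple[str, str]]):
        self._transport = transport

    def fetch(self, url: str) -> Document:
        content_type, body = self._transport(url)
        return Document(url=url, content_type=content_type, body=body)


def fake_transport(url: str) -> Tuple[str, str]:
    """Stand-in for a real HTTP client so the sketch runs offline."""
    return "text/html", "<html><title>stub for " + url + "</title></html>"
```

A user-supplied processor only has to provide `fetch`; the parsing stage then works against `Document` without knowing how the content was downloaded.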

The document parsing module 13, on the Map side, extracts the target data from the target document in a preset manner and detects the type of the target data; if the type is the target type, the target data is sent to the corresponding persistence module 14 via the Reduce side of MapReduce. Specifically, in this embodiment, the parse processor interface ResolveProcessor can be used, with JsoupResolveProcessor handling the processing by default; the way target data is extracted from the target document can be defined by the user.

The persistence module 14, on the Map side, stores the target data in the Hadoop Distributed File System according to a preset path and persistence rule, and can also check the validity and legality of the output path. Specifically, in this embodiment, the store processor interface StoreProcessor can be used, and data is written to the file system according to the path and persistence rule selected by the user. Optionally, the persistence module 14 can be integrated into the URL fetching module 12.
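A minimal sketch of the persistence step follows, with the local file system standing in for the Hadoop Distributed File System. The function name `persist`, the JSON-lines format and the two rules `append` and `overwrite` are assumptions; the source only states that a path and a persistence rule are chosen by the user and that the output path is checked.

```python
import json
from pathlib import Path

WRITE_MODES = {"append": "a", "overwrite": "w"}  # assumed persistence rules


def persist(records, base_dir, relative_path, rule="append"):
    """Write records as JSON lines under base_dir and return the file path."""
    mode = WRITE_MODES[rule]
    base = Path(base_dir).resolve()
    target = (base / relative_path).resolve()
    # Path check: refuse any output path that leaves the storage root.
    if base not in target.parents:
        raise ValueError("output path escapes the storage root")
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open(mode, encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return target
```

With the `append` rule, repeated calls for the same path accumulate records in one file; `overwrite` replaces the file on each call.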

This scheme modularizes the distributed crawler system: different modules implement specific functions and exchange information by passing data to one another. This reduces the coupling of the system while improving its scalability, availability and maintainability, improves its scheduling capability, and allows its crawling performance to be fully used.

Fig. 2 is a structural diagram of embodiment two of the distributed crawler system of the present invention. This embodiment further describes the technical solution of the present invention on the basis of the embodiment above. Specifically, the document parsing module 13 of this embodiment can also be used as follows: after extracting the target data from the target document on the Map side in the preset manner and detecting its type, if the type is not the target type, the target data, which in this case is a new URL, can be sent to the corresponding URL read/write module 11 via the Reduce side of MapReduce.
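The extract-then-route behavior of the document parsing module can be sketched as below. The patent uses Jsoup in Java; this sketch uses Python's standard `html.parser` instead, and the type tags `"record"` and `"url"`, the choice of the page title as target data, and the function names are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the page title and the absolute form of every link."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(base_url, html):
    """Return (type, value) items: one record for the page plus one item per link."""
    parser = LinkAndTitleParser(base_url)
    parser.feed(html)
    items = [("record", {"url": base_url, "title": parser.title.strip()})]
    items.extend(("url", link) for link in parser.links)
    return items


def route(items, target_type="record"):
    """Target-type items go to persistence; everything else goes back as new URLs."""
    to_persistence, to_url_module = [], []
    for item_type, value in items:
        (to_persistence if item_type == target_type else to_url_module).append(value)
    return to_persistence, to_url_module
```

The second list returned by `route` corresponds to the new URLs that are fed back to the URL read/write module, which is what closes the crawl loop.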

The distributed crawler system of this embodiment can also include a central scheduling module 15. The central scheduling module 15 is the scheduling and coordination center of the distributed crawler system of this embodiment. When the distributed crawler system is initialized, it configures data, forwards each item of configuration data to each submodule, and sends a start instruction to notify each module to start working;

After the distributed crawler system finishes working, it deletes temporary data, checks the data results and sends a stop instruction;

It receives the information reported by the nodes of the distributed crawler system and coordinates the work of the nodes according to that information: it replaces failed nodes, coordinates the working rates of the modules, and maintains the data consistency of the entire distributed system so that the modules do not behave inconsistently;

It maintains the connection between the distributed crawler system and the client and exchanges data with the client, reporting the system's progress in real time and receiving operating instructions from the client.

Optionally, the URL read/write module 11 and the document parsing module 13 usually handle compute-intensive tasks and should therefore be assigned to nodes with ample CPU resources, whereas the persistence module 14 and the URL fetching module 12 require a large amount of disk I/O and network I/O and therefore need stable, high-performance disk and network support.

The distributed crawler system of this embodiment can also include a secondary scheduling module 16, which takes over from the central scheduling module 15 when the central scheduling module 15 fails, so that a failure of the central scheduling module 15 does not paralyze the system.
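The source does not describe how a failure of the central scheduling module is detected. One common approach is a heartbeat timeout, sketched below purely as an assumption; the class names `Scheduler` and `FailoverMonitor` and the explicit `now` argument (used instead of a real clock so the behavior is deterministic) are hypothetical.

```python
class Scheduler:
    """A scheduling module that reports liveness through heartbeats."""

    def __init__(self, name):
        self.name = name
        self.last_heartbeat = None

    def heartbeat(self, now):
        self.last_heartbeat = now


class FailoverMonitor:
    """Promotes the secondary scheduler when the primary stops sending heartbeats."""

    def __init__(self, primary, secondary, timeout):
        self.primary = primary
        self.secondary = secondary
        self.timeout = timeout
        self.active = primary

    def check(self, now):
        last = self.primary.last_heartbeat
        primary_failed = last is None or now - last > self.timeout
        if self.active is self.primary and primary_failed:
            # Assumed policy: switch once and stay on the secondary.
            self.active = self.secondary
        return self.active
```

In this sketch the switch is one-way; whether the primary resumes control after recovery is a design decision the source leaves open.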

The distributed crawler system of this embodiment can also include a configuration class module 17. A crawler involves many configuration items such as crawl depth, crawl rules and document parsing mode. In a distributed environment in particular, different nodes must keep the values of the configuration items synchronized, which usually calls for an efficient, thread-safe configuration class; this embodiment provides the CrawlerConfig class for that purpose. The configuration class module 17 of this embodiment can, based on the crawler configuration class CrawlerConfig, automatically inject and parse key-value pairs of different attributes, which keeps configuration logic from intruding into business logic and guarantees safe data sharing in a multi-threaded environment; by implementing a copy function, it copies the configuration class to different nodes.

The CrawlerConfig class also implements a serialization mechanism to guarantee efficient transmission over the network between nodes; in the distributed crawler system of this embodiment it is mainly used for remote procedure call (RPC, Remote Procedure Call Protocol) invocations. The configuration system also has a Configurable class: every component in the crawler system that needs configuration items can easily obtain the corresponding functions of the CrawlerConfig class by implementing this interface.

For communication between multiple nodes, the distributed crawler system of this embodiment can use the remote procedure call (RPC) class library of Hadoop. RPC is a communication system built on top of the Transmission Control Protocol (TCP) to provide services to the upper layer. The Socket class library that ships with Java is too low-level to meet the system's requirements, and the serialization mechanism it provides occupies more network bandwidth, so it is not suitable for data communication in a distributed system. Hadoop RPC uses new asynchronous input/output streams and can implement efficient remote calls.
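The configuration class described above (thread-safe key-value storage, a copy function, serialization for transfer between nodes, and a Configurable hook for components) can be sketched as follows. The names CrawlerConfig and Configurable come from the source, but the method names, the JSON encoding and the key `crawl.depth` are assumptions; the patent's actual class is Java and its API is not disclosed.

```python
import json
import threading


class CrawlerConfig:
    """Thread-safe key-value configuration that can be copied and serialized."""

    def __init__(self, values=None):
        self._lock = threading.RLock()
        self._values = dict(values or {})

    def set(self, key, value):
        with self._lock:
            self._values[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._values.get(key, default)

    def copy(self):
        """Independent copy, for example one per node."""
        with self._lock:
            return CrawlerConfig(self._values)

    def to_bytes(self):
        """Serialize for transfer between nodes (JSON is an assumed format)."""
        with self._lock:
            return json.dumps(self._values, sort_keys=True).encode("utf-8")

    @classmethod
    def from_bytes(cls, payload):
        return cls(json.loads(payload.decode("utf-8")))


class Configurable:
    """Mixin for components that need access to configuration items."""

    def configure(self, config):
        self.config = config
        return self
```

A component such as a fetch processor would inherit from `Configurable` and read its settings through `self.config.get(...)`, which keeps configuration handling out of the business logic.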

The distributed crawler system of this embodiment can also include a driver module 18, which hides the operating procedure of the configuration class module 17 from the client. Specifically, the distributed crawler system of this embodiment has many configuration items, a complex startup sequence and many dependency rules between modules, especially in a distributed environment. If the program interface of the crawler's configuration class module 17 were exposed directly to the client, the system would tend to be hard to use and its operating details could not be hidden. Therefore, on top of the modular design, the distributed crawler system of this embodiment provides a driver module 18 based on the startup class CrawlerBooter, which serves as the driver for running the crawler. In the driver module 18, the crawler's running environment is set up first, the configuration content and the configuration class module 17 are checked, and then the data flow is established, with the default processors and the user-supplied processors inserted into the data flow. The configuration system is integrated into the driver, so the concrete implementation of the configuration class is hidden from the user; the driver provides configuration functions with clear meanings, and the client can use the crawler system simply by initializing the driver class and performing a simple configuration.

The URL fetching module 12 of this embodiment can also, on the Reduce side, classify the target documents according to a preset classification mode; the preset classification mode includes at least one of initial link address, document type, crawl depth and document topic.

The URL read/write module 11 of this embodiment can also determine, based on a Bloom filter, whether a URL has already been visited; if so, the visited URL can be deleted. Removing duplicate URLs greatly reduces the workload of the distributed crawler system of this embodiment and thereby improves its working efficiency.

Specifically, a Bloom filter is a binary vector data structure that can be used to determine whether a URL has already been visited. A Bloom filter stores set information in a bit array of fixed length and uses a certain number of hash functions to map data into the bit array. The basic principle is as follows: all elements of the bit array are initialized to 0; each member of the set is hashed with each of the specified hash functions, and the bit at each resulting index is set to 1. With h hash functions, each member therefore sets at most h bits of the bit array to 1. To query whether an element is in the set, the same h functions are evaluated on the element in turn; if the bit at any of the resulting indexes is 0, the element is definitely not in the set. If all of those bits are 1, the element is treated as present, with a small probability of a false positive.
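The principle described above can be sketched in a few lines. The source does not name the hash functions; deriving h functions from SHA-256 with different seed prefixes is an assumption of this sketch, as are the class and method names.

```python
import hashlib


class BloomFilter:
    """Bit array of num_bits bits probed by num_hashes seeded hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            # Assumed construction: one hash function per seed prefix.
            digest = hashlib.sha256(seed.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Any 0 bit proves absence; all 1 bits means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

A crawler would test `url in seen` before fetching and call `seen.add(url)` afterwards. A false positive only causes a page to be skipped; a visited URL is never reported as unvisited.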

The URL read/write module 11 of this embodiment can also maintain the queue formed by the URLs based on the URL queue processing class QueueURLProcessor. The QueueURLProcessor class manages URLs by maintaining a queue internally. After the thread starts, the URL read/write module 11 continuously reads URLs from the input stream, first filters out duplicate URLs with the Bloom filter, then verifies the legality of the URLs, and finally forms the URL queue. The QueueURLProcessor class maintains and manages the URL queue and writes the URLs to the output stream in a certain order.
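The read, deduplicate, validate and enqueue sequence can be sketched as follows. The class name QueueURLProcessor is from the source, but the methods `offer` and `drain`, the first-in-first-out order and the validity rule are assumptions; a plain Python set stands in for the Bloom filter to keep the sketch self-contained.

```python
from collections import deque
from urllib.parse import urlsplit


class QueueURLProcessor:
    """Keeps an internal queue of unique, valid URLs and emits them in FIFO order."""

    def __init__(self):
        self._seen = set()  # stand-in for the Bloom filter
        self._queue = deque()

    def offer(self, url):
        """Try to enqueue one URL read from the input stream."""
        if url in self._seen:
            return False  # duplicate, filtered first
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            return False  # fails the (assumed) legality check
        self._seen.add(url)
        self._queue.append(url)
        return True

    def drain(self, output):
        """Write all queued URLs to the output stream in queue order."""
        while self._queue:
            output.append(self._queue.popleft())
        return output
```

Other orderings, such as prioritizing by crawl depth, could replace the FIFO queue without changing the interface.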

Specifically, the distributed crawler system of this embodiment can also include a communication module 19, which adjusts the working rates of the modules so that the working rates of the modules match one another.
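The source does not say how the working rates are matched. One simple mechanism, shown here only as an assumption, is a bounded buffer between two stages: a fast producer blocks when the buffer is full, so it cannot run ahead of a slower consumer. The function name `run_pipeline` and the doubling step are placeholders.

```python
import queue
import threading


def run_pipeline(items, capacity=2):
    """Pass items from a producer stage to a consumer stage through a bounded buffer."""
    buffer = queue.Queue(maxsize=capacity)
    results = []

    def producer():
        for item in items:
            buffer.put(item)  # blocks while the buffer is full (backpressure)
        buffer.put(None)  # sentinel: no more items

    def consumer():
        while True:
            item = buffer.get()
            if item is None:
                break
            results.append(item * 2)  # placeholder for real processing

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The `capacity` parameter bounds how far the upstream stage may run ahead, which also bounds memory use between stages.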

Specifically, the distributed crawler system of this embodiment can use a disk-based data structure for the URL maintenance pool, to cope with today's huge data volumes and avoid heap overflow caused by an excessive amount of data. Identifiers can be provided for the data pipes in the data flow to verify data types, which makes it easier to pass more information between modules. The distributed crawler system of this embodiment can configure its processors based on the Spring container, separating configuration details from the code and making the crawler system flexible to configure.

Further, this embodiment ran an operational test of the distributed crawler system. The running environment was a Hadoop cluster of 4 CentOS virtual machines, and the test analyzed the availability and scalability of the crawler system. The test data were accessible URLs; the system randomly selected a certain number of URLs from the URL library for initialization. The test results are shown in Table 1.

Crawl depth	Parallel threads	Test result
4	5	100,000 records fetched; speed stable
5	5	1,000,000 records fetched; speed unchanged
6	5	40,000,000 records fetched; speed declines
5	5	Fetch speed stable
5	10	Fetch speed increases
5	15	Fetch speed decreases

Table 1

With a fixed number of crawl threads, crawl depth has a huge effect on data volume; the relationship between the two is exponential. With a fixed crawl depth, increasing the number of threads appropriately speeds up fetching, but when there are too many crawl threads the speed drops instead, because a large number of threads increases the cost of thread switching and resource contention. The test results are consistent with expectations. This scheme modularizes the distributed crawler system: different modules implement specific functions and exchange information by passing data to one another. This reduces the coupling of the system while improving its scalability, availability and maintainability, improves its scheduling capability, and allows its crawling performance to be fully used.

It should be understood that the same or similar parts of the above embodiments may refer to one another; content not described in detail in one embodiment may refer to the same or similar content in other embodiments.

It should be noted that, in the description of the present invention, the terms "first", "second" and the like are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise stated, "multiple" means at least two.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software function module. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; a person of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the present invention.

Claims

1. A distributed crawler system, characterized by comprising: a uniform resource locator (URL) read/write module, a URL fetching module, a document parsing module and a persistence module;

The URL read/write module is configured to read URLs from an input stream on the Map side of MapReduce and write the URLs to an output stream;

The URL fetching module is configured to take each URL written to the output stream as an access address and, on the Map side, download the target document corresponding to the access address according to a preset network access mode;

The document parsing module is configured to, on the Map side, extract the target data from the target document in a preset manner and detect the type of the target data; if the type is the target type, the target data is sent to the corresponding persistence module via the Reduce side of the MapReduce;

The persistence module is configured to, on the Map side, store the target data in the Hadoop Distributed File System according to a preset path and persistence rule.

2. The distributed crawler system according to claim 1, characterized in that the document parsing module is also configured to:

If the type is not the target type, send the target data to the corresponding URL read/write module via the Reduce side of the MapReduce.

3. The distributed crawler system according to claim 1, characterized by further comprising a central scheduling module;

The central scheduling module is configured to: when the distributed crawler system is initialized, configure data and send a start instruction; after the distributed crawler system finishes working, delete temporary data, check the data results and send a stop instruction; receive the information reported by the nodes of the distributed crawler system and coordinate the work of the nodes according to that information; and maintain the connection between the distributed crawler system and a client and exchange data with the client.

4. The distributed crawler system according to claim 3, characterized by further comprising a secondary scheduling module;

5. The distributed crawler system according to claim 3, characterized by further comprising a configuration class module;

The configuration class module is configured to: based on a CrawlerConfig class, automatically inject and parse key-value pairs of different attributes, and copy the configuration class to different nodes; based on a Configurable class, let the components of the crawler system that use configuration items obtain the corresponding functions of the CrawlerConfig class; and, based on a remote procedure call (RPC) class, implement remote calls using asynchronous input/output streams.

6. The distributed crawler system according to claim 5, characterized by further comprising a driver module;

7. The distributed crawler system according to claim 1, characterized in that the URL fetching module is also configured to:

The preset classification mode includes at least one of initial link address, document type, crawl depth and document topic.

8. The distributed crawler system according to claim 1, characterized in that the URL read/write module is also configured to:

Determine, based on a Bloom filter, whether the URL has already been visited;

If so, delete the visited URL.

9. The distributed crawler system according to claim 1, characterized in that the URL read/write module is also configured to:

Maintain the queue formed by the URLs based on a URL queue processing class.

10. The distributed crawler system according to any one of claims 1-9, characterized by further comprising a communication module, the communication module being configured to: