CN107193960A

CN107193960A - Distributed crawler system and periodic incremental crawling method

Info

Publication number: CN107193960A
Application number: CN201710372282.1A
Authority: CN
Inventors: 张雷; 韩建军; 张文哲; 谭龙海; 王崇骏
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2017-09-22
Anticipated expiration: 2037-05-24
Also published as: CN107193960B

Abstract

The present invention discloses a distributed crawler system consisting of three parts: ZooKeeper-based distributed services, system components, and databases. The system components comprise a system monitoring component (Monitor), a coordination component (Coordinator), a log collection component (Logger), and a basic crawler component (Spider). The databases include a Redis in-memory database, which uses a key-value storage format and holds a distributed URL task queue and a distributed BloomFilter. The invention further discloses a periodic incremental crawling method based on this system, comprising: the Coordinator periodically imports tasks into the distributed URL task queue and wakes up any sleeping Spider components; each Spider component either sleeps or performs periodic incremental crawling, depending on the state of the distributed URL task queue. The system and method address how to combine single-machine crawlers effectively into a distributed crawler that offers high availability, high stability, and high throughput in a cluster environment, and that supports periodic incremental crawling.

Description

Distributed crawler system and periodic incremental crawling method

Technical field

The present invention relates to the technical field of efficient data acquisition for Internet big data, and in particular to a distributed crawler system and a periodic incremental crawling method.

Background art

A web crawler starts from the URLs (Uniform Resource Locators) of one or more initial pages and obtains the URLs found on those pages. While crawling, it continually extracts new URLs from the current page according to its crawl strategy and places them in a task queue, until the system's stop condition is met.

With the rapid development of the Internet, network data is growing explosively and its sources are becoming increasingly diverse. Faced with Internet data this large and varied, it is essential to improve crawl efficiency and to support customizable crawl strategies for different data sources.

Compared with a conventional single-machine crawler, a distributed crawler can significantly improve crawl efficiency, but it also introduces new problems: task scheduling across multiple nodes, load balancing, duplicate web pages, and periodic incremental crawling in a distributed environment, among others.

In summary, the main problem in the prior art is how to resolve the issues that a distributed crawler introduces while still effectively increasing crawl speed, and, on that basis, how to achieve periodic incremental crawling with a distributed crawler.

Summary of the invention

The technical problem to be solved by the present invention is to provide a distributed crawler system and a periodic incremental crawling method that combine single-machine crawlers effectively, yielding a distributed crawler with high availability, high stability, and high throughput in a cluster environment, and that support periodic incremental crawling. The technical solution adopted by the present invention to solve this problem is as follows:

The present invention discloses a distributed crawler system consisting of three parts: ZooKeeper-based distributed services, system components, and databases. The system components comprise a system monitoring component (Monitor), a coordination component (Coordinator), a log collection component (Logger), and a basic crawler component (Spider). The databases include a Redis in-memory database, which uses a key-value storage format and holds a distributed URL task queue and a distributed BloomFilter. The ZooKeeper-based distributed services provide distributed coordination for each system component. The Monitor is responsible for dynamic configuration and status monitoring of the system. The Coordinator is responsible for one or more of the following: importing seed URLs into the Redis-based distributed task queue, periodically collecting the state of each node into ZooKeeper, dynamically assigning log sources to the Logger, and detecting and managing cluster nodes. The Logger is responsible for collecting log data from each Spider in the cluster. The Spider is responsible for handling web page crawl tasks. The Redis-based distributed URL task queue stores the task URLs waiting to be crawled. The Redis-based distributed BloomFilter serves the URL deduplication requests of all Spider components in the cluster.

Further, by working in coordination with each system component, the ZooKeeper-based distributed services provide each component with one or more of the following: dynamic configuration, cluster node detection and management, Master election, distributed locks, and global URL ID generation.

Further, the Monitor has a Monitor interface through which the user can modify the system configuration parameters stored on ZooKeeper. The Coordinator, Logger, and Spider components in the cluster watch the corresponding data nodes on ZooKeeper, are notified when the content of a data node is modified, and then adjust themselves according to the modified configuration parameters.

Further, the Monitor interface can also display in real time the system status parameters and component status parameters stored on ZooKeeper.

Further, the Spider component has multiple component kernels, and the crawl strategies of the kernels are not all the same.

Further, the Spider component is highly extensible, which makes it easy to write a new component kernel for a new data source.

Further, tasks in the distributed URL task queue are distributed in pull mode: each Spider pulls its own tasks.

Further, the distributed BloomFilter uses a segmentation scheme that stores the bit vector in segments under different Redis keys, and uses segmented optimistic locking to synchronize access by the Spider components.

The invention further discloses a periodic incremental crawling method based on the above distributed crawler system, comprising: the Coordinator periodically imports tasks into the distributed URL task queue and wakes up any sleeping Spider components; each Spider component either sleeps or performs periodic incremental crawling, depending on the state of the distributed URL task queue. When there are no crawl tasks, a Spider component enters the sleep state; once woken by another Spider component or by the Coordinator, it resumes crawling.

Further, the method comprises the following steps:

S1. The Coordinator periodically imports tasks into the distributed URL task queue and wakes up any sleeping Spider components. That is, the system's Coordinator periodically imports tasks into the distributed URL queue; once the import completes, it wakes all sleeping Spider components to begin a new round of incremental crawling. Crawling thus runs in recurring cycles, and each cycle begins with the import of seed tasks.

S2. The Spider component determines whether the system's periodic incremental crawling should end. If so, go to S6; otherwise, go to S3. That is, a crawl thread in the Spider component checks the corresponding data node in ZooKeeper, whose content is set by the Monitor. If it indicates that periodic incremental crawling should end, the Spider component performs cleanup and saves its state, then terminates its own process; otherwise, it continues periodic incremental crawling.

S3. Determine whether the distributed task queue is currently empty. If so, go to S4; otherwise, go to S5. That is, a crawl thread in the Spider component checks whether the distributed task queue in Redis holds tasks waiting to be crawled. If it does, the thread obtains a task and enters the crawl phase; otherwise, it enters the sleep phase.

S4. Enter the Spider sleep phase, which mainly comprises: 1) blocking the crawl thread or putting the Spider component to sleep; 2) waking the thread.

Specifically, this comprises the following steps: a) determine whether all crawl threads of the current Spider component other than the current one are already blocked; if so, go to step b); otherwise, go to step c); b) create a sleep flag node in ZooKeeper, which indicates that the current Spider component is asleep; another component that needs to wake this Spider component only has to delete the node; c) block the current crawl thread; d) the crawl thread is now blocked and waits to be woken by another thread; e) the crawl thread is woken by another thread and goes to S2.

When there are no crawl tasks, the Spider component enters this phase and puts itself to sleep, avoiding wasted system resources. When another Spider component adds new tasks to the task queue, or when a new round of incremental crawling begins, the sleeping Spider component is woken by another Spider component or by the Coordinator and resumes crawling.

S5. Enter the Spider crawl phase, which mainly comprises: 1) obtaining a task from the distributed URL task queue; 2) the Spider executing the crawl task; 3) waking blocked crawl threads or sleeping Spider components.

Specifically, this comprises the following steps: a) obtain a crawl task from the distributed URL queue; b) crawl the corresponding web page for that task and store the result; c) analyze the hyperlinks in the crawled page to obtain a set of new tasks; d) send the new tasks to the distributed BloomFilter for deduplication; e) add the deduplicated new tasks to the distributed task queue; f) determine whether any crawl thread of the current Spider component is blocked; if so, go to step g); otherwise, go to step h); g) wake the blocked crawl threads of the current Spider component; h) determine whether any other Spider component in the cluster is asleep; if so, wake the corresponding sleeping Spider components; otherwise, go to S2.

S6. End. That is, when a component detects that the system needs to stop, it performs the necessary cleanup and then terminates its own process.

For large-scale, diverse Internet data, the distributed crawler system and periodic incremental crawling method of the present invention have the following advantages over the prior art:

1) Simple to implement: the present invention builds the distributed crawler system on the open-source distributed coordination service ZooKeeper and the open-source distributed in-memory database Redis, and develops further on top of these frameworks, which both meets the specific requirements and reduces development cost.

2) High performance: crawl tasks run on multiple nodes with multiple threads each, which gives high-performance web crawling and supports linear scaling of the Spider components.

3) High availability: based on ZooKeeper and Redis, every component of the system runs as a cluster, which avoids single-node failures and provides a highly available, highly stable web crawling service.

4) Automated periodic incremental crawling: once the initial tasks and relevant system parameters have been set, the system performs periodic incremental crawling automatically, without human intervention.

5) Customizable crawl strategies: the Spider component contains multiple component kernels, each corresponding to a different crawl strategy, and the Spider is designed to be highly extensible, so a new component kernel can easily be written for a new data source.

6) Good scalability: all system components are loosely coupled, so bringing any single node online or offline has very little effect on the system, and every component supports linear scaling.

As can be seen, the present invention has the advantages of sound design, a simple architecture, high availability, high stability, high performance, and good scalability.

Brief description of the drawings

Fig. 1 is an architecture diagram of the distributed crawler system.

Fig. 2 is the main flow chart of the periodic incremental crawling method.

Fig. 3 is a flow chart of the Spider crawl phase.

Fig. 4 is a flow chart of the Spider sleep phase.

Detailed description of the embodiments

For a better understanding of the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings.

Fig. 1 is an architecture diagram of the distributed crawler system of the present invention. The system consists of three parts: ZooKeeper-based distributed services, system components, and databases. The ZooKeeper-based distributed services provide distributed coordination for each system component. The system components comprise the system monitoring component (Monitor), the coordination component (Coordinator), the log collection component (Logger), and the basic crawler component (Spider). The databases include a Redis in-memory database and other databases that store the crawled web pages; the Redis in-memory database holds the distributed URL task queue and the distributed BloomFilter.

By working in coordination with each system component, the ZooKeeper-based distributed services provide distributed coordination services including dynamic configuration, cluster node detection and management, Master election, distributed locks, and global URL ID generation. ZooKeeper maintains in memory a tree data structure similar to a file system, and these distributed services are implemented by creating, querying, deleting, and watching the data nodes that correspond to each component in that structure.

The Monitor is responsible for dynamic configuration and status monitoring of the system. Through the Monitor interface, the user can modify the system configuration parameters stored on ZooKeeper (for example, the parameters of each Spider). Each relevant component in the cluster (Spider, Coordinator, and Logger) watches the corresponding data node on ZooKeeper; when the content of a data node is modified, ZooKeeper sends a data change notification to each such component, which then adjusts itself according to the modified configuration parameters. The Monitor interface can also display in real time the system status parameters and component status parameters stored on ZooKeeper, so that the user can monitor the system, spot problems promptly, and take corrective action. The system configuration parameters mainly include the seed import period, regular-expression constraints, the number of crawl threads, the crawl depth, and the maximum number of errors, along with many other finer-grained parameters.
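The watch-and-notify behaviour described above can be illustrated with a minimal in-memory sketch. It does not use a real ZooKeeper client: `ConfigStore`, `SpiderConfig`, and the path `/config/spider/thread_count` are hypothetical names chosen for illustration, and the callbacks here stay registered permanently, whereas standard ZooKeeper watches fire once and must be re-registered.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class ConfigStore:
    """In-memory stand-in for ZooKeeper data nodes with change notifications."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[Callable[[str, str], None]]] = defaultdict(list)

    def watch(self, path: str, callback: Callable[[str, str], None]) -> None:
        # Register a component's interest in one data node.
        self._watchers[path].append(callback)

    def set(self, path: str, value: str) -> None:
        # Store the new value, then notify every watcher of that node.
        self._data[path] = value
        for callback in self._watchers[path]:
            callback(path, value)


class SpiderConfig:
    """Hypothetical Spider-side holder that adjusts itself when notified."""

    THREADS_PATH = "/config/spider/thread_count"  # assumed node path

    def __init__(self, store: ConfigStore) -> None:
        self.thread_count = 1
        store.watch(self.THREADS_PATH, self._on_change)

    def _on_change(self, path: str, value: str) -> None:
        self.thread_count = int(value)


store = ConfigStore()
spider = SpiderConfig(store)
# The Monitor interface would perform this write on behalf of the user.
store.set(SpiderConfig.THREADS_PATH, "8")
print(spider.thread_count)
```

In a real deployment the write would go through a ZooKeeper client and the notification would arrive asynchronously; the sketch only shows the direction of the data flow from Monitor to component.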

The Coordinator is responsible for importing seed page URLs into the distributed task queue, periodically collecting the state of each node into ZooKeeper, dynamically assigning log sources to the Logger, and detecting and managing cluster nodes.

The Logger is responsible for collecting log data from each Spider in the cluster for subsequent log analysis.

The Spider is responsible for the actual web page crawl tasks. It contains multiple component kernels, each corresponding to a different crawl strategy, and it is designed to be highly extensible, so a new component kernel can easily be written for a new data source. During crawling, the Spider component first initializes itself according to the system configuration. It then repeatedly requests a URL from the distributed task queue, switches to the crawl strategy that matches the URL, crawls the page, extracts the page features and body text, stores the extraction result, analyzes the page's hyperlinks, and adds the newly obtained URLs to the distributed task queue after deduplication by the distributed BloomFilter. This continues until the distributed task queue is empty.
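One possible way to switch crawl strategy by URL is a small registry that maps URL patterns to kernels. The text does not specify how kernels are selected, so the sketch below is an assumption throughout: the kernel functions, the `KERNELS` table, and the example.com patterns are hypothetical.

```python
import re
from typing import Callable

Kernel = Callable[[str], str]


def news_kernel(url: str) -> str:
    # Placeholder strategy for a news-style data source.
    return f"news strategy for {url}"


def forum_kernel(url: str) -> str:
    # Placeholder strategy for a forum-style data source.
    return f"forum strategy for {url}"


def default_kernel(url: str) -> str:
    # Fallback strategy when no pattern matches.
    return f"default strategy for {url}"


# Adding a new data source means adding one (pattern, kernel) pair here.
KERNELS = [
    (re.compile(r"^https?://news\.example\.com/"), news_kernel),
    (re.compile(r"^https?://forum\.example\.com/"), forum_kernel),
]


def select_kernel(url: str) -> Kernel:
    """Return the first kernel whose pattern matches the URL."""
    for pattern, kernel in KERNELS:
        if pattern.match(url):
            return kernel
    return default_kernel


print(select_kernel("http://news.example.com/a")("http://news.example.com/a"))
```

Keeping the registry separate from the crawl loop is what makes the component easy to extend: the loop never changes when a kernel is added.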

The Redis-based distributed URL task queue stores the task URLs waiting to be crawled. Tasks are distributed in pull mode: when a Spider finishes its current crawl task, it actively pulls a new task from the distributed queue and begins the next round of work. Notably, given the Redis-based distributed queue currently used, pull mode is the simplest and most suitable choice. Push mode, or a combination of push and pull, could also be implemented, but both would require additional implementation work, whereas pull mode requires none.
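Pull-mode distribution can be sketched with an in-memory queue standing in for Redis. The assumption here is that the shared queue behaves like a Redis list (push at the tail, pop at the head); `TaskQueue` and `run_spider` are hypothetical names, and no Redis client is used.

```python
from collections import deque
from typing import Deque, List, Optional


class TaskQueue:
    """In-memory stand-in for a Redis list used as the shared URL queue."""

    def __init__(self) -> None:
        self._items: Deque[str] = deque()

    def push(self, url: str) -> None:
        # Analogous to appending at the tail of a Redis list.
        self._items.append(url)

    def pull(self) -> Optional[str]:
        # Analogous to popping from the head; None means the queue is empty.
        return self._items.popleft() if self._items else None


def run_spider(name: str, queue: TaskQueue, max_tasks: int) -> List[str]:
    """Pull up to max_tasks URLs; each pull happens only after the previous task ends."""
    done = []
    for _ in range(max_tasks):
        url = queue.pull()
        if url is None:
            break
        done.append(f"{name}:{url}")
    return done


queue = TaskQueue()
for u in ("http://example.com/a", "http://example.com/b", "http://example.com/c"):
    queue.push(u)

first = run_spider("spider-1", queue, max_tasks=2)
second = run_spider("spider-2", queue, max_tasks=2)
print(first, second)
```

Because each Spider asks for work only when it is free, no central dispatcher needs to track which Spider is busy, which is the reason the text gives for preferring pull mode.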

The Redis-based distributed BloomFilter serves the URL deduplication requests of all Spider components in the cluster. Redis uses a key-value storage format. The distributed BloomFilter uses a segmentation scheme that stores the bit vector in segments under different Redis keys, and uses segmented optimistic locking to synchronize access by the Spiders. The segmented optimistic lock works as follows. For each deduplication request, the Spider first computes the keys of the bit-vector segments it needs to access and places a watch on each of those keys. It then issues a Redis transaction that updates the bit vector of each segment. When the transaction executes, it first checks whether the bit vector under any watched key has been modified since the watch was placed. If so, the transaction is abandoned and the Spider spins, reissuing the deduplication request; otherwise, the update succeeds and the URL is added to the BloomFilter. A distributed BloomFilter built on segmentation and optimistic locking not only serves deduplication requests at high throughput but also scales linearly with the Redis cluster, so it is not subject to a fixed capacity limit.
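The segmented optimistic-lock mechanism can be sketched without a Redis server by using per-key version counters to play the role of the watch-then-transact check. Everything named below is an assumption for illustration: `VersionedStore`, the `bloom:<n>` key naming, the segment sizes, and the SHA-256-based hashing are not taken from the text.

```python
import hashlib
from typing import Dict, List, Tuple

SEGMENT_BITS = 1024  # bits per segment (assumed)
NUM_SEGMENTS = 8     # number of segment keys (assumed)
NUM_HASHES = 3       # hash functions per URL (assumed)


class VersionedStore:
    """In-memory stand-in for Redis keys; versions mimic watch-based conflict detection."""

    def __init__(self) -> None:
        self.values: Dict[str, int] = {}    # key -> segment bits packed in an int
        self.versions: Dict[str, int] = {}  # key -> modification counter

    def read(self, key: str) -> Tuple[int, int]:
        return self.values.get(key, 0), self.versions.get(key, 0)

    def commit(self, expected: Dict[str, int], updates: Dict[str, int]) -> bool:
        """Apply updates only if no watched key changed since it was read."""
        if any(self.versions.get(k, 0) != v for k, v in expected.items()):
            return False
        for key, value in updates.items():
            self.values[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1
        return True


def bit_positions(url: str) -> List[Tuple[str, int]]:
    """Map a URL to (segment key, bit offset) pairs, one per hash function."""
    positions = []
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
        index = int.from_bytes(digest[:8], "big") % (SEGMENT_BITS * NUM_SEGMENTS)
        positions.append((f"bloom:{index // SEGMENT_BITS}", index % SEGMENT_BITS))
    return positions


def add_if_new(store: VersionedStore, url: str, max_retries: int = 10) -> bool:
    """Return True if the URL was absent and is now recorded, False if probably seen."""
    positions = bit_positions(url)
    for _ in range(max_retries):
        # "Watch": snapshot value and version of every segment involved.
        snapshot = {key: store.read(key) for key, _ in positions}
        if all((snapshot[key][0] >> off) & 1 for key, off in positions):
            return False
        updates = {key: value for key, (value, _) in snapshot.items()}
        for key, off in positions:
            updates[key] |= 1 << off
        expected = {key: version for key, (_, version) in snapshot.items()}
        if store.commit(expected, updates):
            return True
        # A conflicting writer got in first: retry with a fresh snapshot.
    raise RuntimeError("too many conflicting updates")


store = VersionedStore()
print(add_if_new(store, "http://example.com/"))
print(add_if_new(store, "http://example.com/"))
```

Only the segments that a given URL touches are checked for conflicts, so two Spiders updating different segments never force each other to retry; that is the benefit of segmenting the lock along with the bit vector.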

The embodiment also discloses a periodic incremental crawling method based on the above distributed crawler system, which is described in detail below with reference to Figs. 2 to 4.

Fig. 2 is the main flow chart of the periodic incremental crawling method in the embodiment, described as follows:

Step 1-0: initial state of the periodic incremental crawling method;

Step 1-1: the Coordinator periodically imports tasks into the distributed URL task queue;

Step 1-2: determine whether the system's periodic incremental crawling should end; if so, go to step 1-9; otherwise, go to step 1-3;

Step 1-3: determine whether the distributed task queue is currently empty; if so, enter the Spider sleep phase and perform steps 1-4 and 1-5; otherwise, enter the Spider crawl phase and perform steps 1-6, 1-7, and 1-8;

Step 1-4: block the crawl thread or put the Spider component to sleep;

Step 1-5: the blocked thread is woken; go to step 1-2;

Step 1-6: obtain a task from the distributed queue;

Step 1-7: the Spider executes the crawl task;

Step 1-8: wake blocked crawl threads or sleeping Spider components, then go to step 1-2;

Step 1-9: end state.
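The control flow of steps 1-2 through 1-8 can be sketched for a single crawl thread as follows. This is a simplified, single-threaded illustration: `run_cycle`, the `stop_requested` callback, and the `LINKS` table are hypothetical, a local deque stands in for the distributed queue, and the sleep and wake-up steps are reduced to returning a status.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

# Hypothetical link graph standing in for real pages (kept acyclic for the demo).
LINKS: Dict[str, List[str]] = {"seed": ["a", "b"], "a": [], "b": []}


def run_cycle(
    queue: Deque[str],
    stop_requested: Callable[[], bool],
    crawl: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Run one thread until it must stop or sleep; return the reason and crawled URLs."""
    crawled: List[str] = []
    while True:
        if stop_requested():         # step 1-2: has the Monitor asked to end?
            return "stopped", crawled
        if not queue:                # step 1-3: queue empty, enter sleep phase
            return "sleep", crawled  # steps 1-4 and 1-5 would block here
        url = queue.popleft()        # step 1-6: obtain a task
        queue.extend(crawl(url))     # step 1-7: crawl and enqueue new tasks
        crawled.append(url)          # step 1-8 (wake-ups) omitted in this sketch


queue = deque(["seed"])              # step 1-1: the Coordinator imports seed tasks
state, crawled = run_cycle(queue, lambda: False, lambda u: LINKS.get(u, []))
print(state, crawled)
```

When the Coordinator imports the next batch of seeds, the same loop would be re-entered from step 1-2, which is what makes the crawl periodic and incremental.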

Fig. 3 is a flow chart of the Spider crawl phase in the embodiment, described as follows:

Step 2-0: start state of the Spider crawl phase, which follows step 1-3;

Step 2-1: obtain a task from the distributed queue;

Step 2-2: crawl the corresponding web page for that task and store the result;

Step 2-3: analyze the hyperlinks in the crawled page to obtain a set of new tasks;

Step 2-4: send the new tasks to the distributed BloomFilter for deduplication;

Step 2-5: add the deduplicated new tasks to the distributed task queue;

Step 2-6: determine whether any crawl thread of this Spider component is blocked; if so, go to step 2-7; otherwise, go to step 2-8;

Step 2-7: wake the blocked crawl threads of this Spider component;

Step 2-8: determine whether any other Spider component in the cluster is asleep; if so, go to step 2-9; otherwise, go to step 2-10;

Step 2-9: wake the sleeping Spider components;

Step 2-10: end state of the Spider crawl phase, after which step 1-2 is performed.
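Steps 2-1 through 2-5 can be sketched with the Python standard library alone. The assumptions are labelled in the code: a plain set stands in for the distributed BloomFilter, a deque for the distributed queue, and the `fetch` callable and `PAGES` table replace real HTTP requests. The wake-up steps 2-6 to 2-9 are left out.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Deque, Dict, List, Set
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl_step(
    queue: Deque[str],
    seen: Set[str],
    results: Dict[str, str],
    fetch: Callable[[str], str],
) -> List[str]:
    """Process one task and return the new URLs that were enqueued."""
    url = queue.popleft()        # step 2-1: obtain a task
    html = fetch(url)            # step 2-2: crawl the page...
    results[url] = html          # ...and store the result
    parser = LinkParser(url)     # step 2-3: analyze hyperlinks
    parser.feed(html)
    added = []
    for link in parser.links:
        if link not in seen:     # step 2-4: set stands in for the BloomFilter
            seen.add(link)
            queue.append(link)   # step 2-5: enqueue the deduplicated task
            added.append(link)
    return added


# Hypothetical page content used instead of a network request.
PAGES = {"http://example.com/": '<a href="/a">A</a><a href="/a">again</a><a href="/">home</a>'}
queue = deque(["http://example.com/"])
seen = {"http://example.com/"}
results: Dict[str, str] = {}
added = crawl_step(queue, seen, results, PAGES.__getitem__)
print(added)
```

Unlike a Bloom filter, the set used here never reports a false positive; a real deployment would trade that exactness for the bounded memory of the bit vector.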

Fig. 4 is a flow chart of the Spider sleep phase in the embodiment, described as follows:

Step 3-0: start state of the Spider sleep phase, which follows step 1-3;

Step 3-1: determine whether all crawl threads of this Spider component other than the current one are already blocked; if so, go to step 3-2; otherwise, go to step 3-3;

Step 3-2: create a sleep flag node in ZooKeeper, which indicates that the corresponding Spider component is asleep; another component that needs to wake this Spider component only has to delete the node;

Step 3-3: block the current crawl thread;

Step 3-4: the crawl thread is now blocked and waits to be woken by another thread;

Step 3-5: the crawl thread is woken by another thread;

Step 3-6: end state of the Spider sleep phase, after which step 1-2 is performed.
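The sleep phase can be sketched with `threading.Condition`. The sketch makes several assumptions: `SleepCoordinator` is a hypothetical class, a Python set stands in for the sleep flag nodes in ZooKeeper, and waking is done by a direct method call rather than by reacting to a node deletion.

```python
import threading
import time
from typing import List, Set, Tuple


class SleepCoordinator:
    """Tracks blocked crawl threads of one Spider and its sleep flag."""

    def __init__(self, name: str, thread_count: int, flag_nodes: Set[str]) -> None:
        self.name = name
        self.thread_count = thread_count
        self.flag_nodes = flag_nodes  # stand-in for ZooKeeper sleep flag nodes
        self.blocked = 0
        self._generation = 0
        self._cond = threading.Condition()

    def enter_sleep(self) -> None:
        with self._cond:
            # Step 3-1: are all other crawl threads already blocked?
            if self.blocked == self.thread_count - 1:
                self.flag_nodes.add(self.name)  # step 3-2: mark Spider asleep
            self.blocked += 1                   # step 3-3: block this thread
            generation = self._generation
            # Step 3-4: wait until a wake-up advances the generation counter.
            self._cond.wait_for(lambda: self._generation != generation)
            self.blocked -= 1                   # step 3-5: woken by another thread

    def wake_all(self) -> None:
        with self._cond:
            self.flag_nodes.discard(self.name)  # deleting the flag wakes the Spider
            self._generation += 1
            self._cond.notify_all()


def demo() -> Tuple[bool, int, List[str]]:
    flags: Set[str] = set()
    spider = SleepCoordinator("spider-1", thread_count=2, flag_nodes=flags)
    threads = [threading.Thread(target=spider.enter_sleep) for _ in range(2)]
    for t in threads:
        t.start()
    deadline = time.monotonic() + 5
    while spider.blocked < 2 and time.monotonic() < deadline:
        time.sleep(0.01)  # wait until both crawl threads are blocked
    flagged_while_asleep = "spider-1" in flags
    spider.wake_all()
    for t in threads:
        t.join(timeout=5)
    return flagged_while_asleep, spider.blocked, sorted(flags)


print(demo())
```

Only the last thread to block creates the flag, so the flag's presence means the whole Spider component is idle, not just one of its threads.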

Although embodiments of the present invention have been described above with reference to the accompanying drawings, the invention is not limited to the specific embodiments and fields of application described. The above embodiments are merely illustrative and instructive, not restrictive. Guided by this specification, a person of ordinary skill in the art may devise many other forms without departing from the scope protected by the claims, and all of these fall within the protection of the present invention.

Claims

1. A distributed crawler system, characterized in that the system consists of three parts: ZooKeeper-based distributed services, system components, and databases, wherein the system components comprise a system monitoring component Monitor, a coordination component Coordinator, a log collection component Logger, and a basic crawler component Spider; the databases include a Redis in-memory database, and the Redis in-memory database holds a distributed URL task queue and a distributed BloomFilter; wherein,

the ZooKeeper-based distributed services provide distributed coordination services for each system component,

the system monitoring component Monitor is responsible for dynamic configuration and status monitoring of the system,

the coordination component Coordinator is responsible for one or more of: importing seed URLs into the Redis-based distributed task queue, periodically collecting the state of each node into ZooKeeper, dynamically assigning log sources to the log collection component Logger, and detecting and managing cluster nodes,

the log collection component Logger is responsible for collecting log data from each basic crawler component Spider in the cluster,

the basic crawler component Spider is responsible for handling web page crawl tasks,

the Redis-based distributed URL task queue is responsible for storing the task URLs waiting to be crawled,

the Redis-based distributed BloomFilter is responsible for the URL deduplication requests of all basic crawler components Spider in the cluster.

2. The distributed crawler system according to claim 1, characterized in that the ZooKeeper-based distributed services, by working in coordination with each system component, provide each system component with one or more of the following distributed services: dynamic configuration, cluster node detection and management, Master election, distributed locks, and global URL ID generation.

3. The distributed crawler system according to claim 1, characterized in that the system monitoring component Monitor has a Monitor interface through which the user can modify the system configuration parameters stored on ZooKeeper; the coordination component Coordinator, the log collection component Logger, and the basic crawler component Spider in the cluster watch the corresponding data nodes on ZooKeeper, are notified when the content of a data node is modified, and then adjust themselves according to the modified configuration parameters.

4. The distributed crawler system according to claim 3, characterized in that the Monitor interface can also display in real time the system status parameters and component status parameters stored on ZooKeeper.

5. The distributed crawler system according to claim 1, characterized in that the basic crawler component Spider has multiple component kernels, and the crawl strategies of the component kernels are not all the same.

6. The distributed crawler system according to claim 1, characterized in that the basic crawler component Spider is highly extensible, so that a new component kernel can easily be written for a new data source.

7. The distributed crawler system according to claim 1, characterized in that tasks in the distributed URL task queue are distributed in pull mode by the basic crawler component Spider.

8. The distributed crawler system according to claim 1, characterized in that the distributed BloomFilter uses a segmentation scheme that stores the bit vector in segments under different Redis keys, and uses segmented optimistic locking to synchronize access by each basic crawler component Spider.

9. A periodic incremental crawling method, characterized in that it is based on the distributed crawler system according to any one of claims 1 to 8 and comprises: the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up any sleeping Spider components; each Spider component either sleeps or performs periodic incremental crawling, depending on the state of the distributed URL task queue; when there are no crawl tasks, a Spider component enters the sleep state, and a sleeping Spider component resumes crawling once woken by another basic crawler component Spider or by the Coordinator.

10. The periodic incremental crawling method according to claim 9, characterized in that it comprises the following steps:

S1. the coordination component Coordinator periodically imports tasks into the distributed URL task queue and wakes up any sleeping Spider components;

S2. determine whether the system's periodic incremental crawling should end; if so, go to S6; otherwise, go to S3;

S3. determine whether the distributed task queue is currently empty; if so, go to S4; otherwise, go to S5;

S4. enter the sleep phase of the basic crawler component (Spider), comprising the following steps: a) determine whether all crawl threads of the current Spider component other than the current one are already blocked; if so, go to step b); otherwise, go to step c); b) create a sleep flag node in ZooKeeper, which indicates that the current Spider component is asleep; another component that needs to wake this Spider component only has to delete the node; c) block the current crawl thread; d) the crawl thread is now blocked and waits to be woken by another thread; e) the crawl thread is woken by another thread and goes to S2;

S5. enter the crawl phase of the basic crawler component Spider, comprising: a) obtain a crawl task from the distributed URL queue; b) crawl the corresponding web page for that task and store the result; c) analyze the hyperlinks in the crawled page to obtain a set of new tasks; d) send the new tasks to the distributed BloomFilter for deduplication; e) add the deduplicated new tasks to the distributed task queue; f) determine whether any crawl thread of the current Spider component is blocked; if so, go to step g); otherwise, go to step h); g) wake the blocked crawl threads of the current Spider component; h) determine whether any other Spider component in the cluster is asleep; if so, wake the corresponding sleeping Spider components; otherwise, go to S2;

S6. end.