CN109063216A

CN109063216A - Distributed vertical service search crawler framework

Info

Publication number: CN109063216A
Application number: CN201811208977.7A
Authority: CN
Inventors: 邓炽成
Original assignee: Zhuhai Zhitu Digital Research Information Technology Co Ltd
Current assignee: Zhuhai Zhitu Digital Research Information Technology Co Ltd
Priority date: 2018-10-17
Filing date: 2018-10-17
Publication date: 2018-12-21

Abstract

The invention discloses a distributed vertical service search crawler framework, with the following specific steps. Step 1: using a static crawler distribution scheme, crawlers for the same target are copied to different network computers according to configuration, and crawl requests are then issued from different IP resources. Step 2: target page URLs are aggregated into a central repository through a message queue pipeline. Step 3: the central repository loader program schedules the URLs and pushes them to multiple network computer terminals through the message queue pipeline, and the monitoring crawlers dynamically schedule and execute the URL download tasks. By using distributed vertical service search crawlers, the invention distributes tasks sensibly across multiple IP resources and spreads crawl tasks over multiple networked computers, providing a low-cost solution for resource-intensive web crawlers. A distributed crawler is therefore a good technical means of achieving online business search and continuous, uninterrupted crawling.

Description

A distributed vertical service search crawler framework

Technical field

The present invention relates to the technical field of search crawler frameworks, and in particular to a distributed vertical service search crawler framework.

Background art

Most websites adopt some defensive measures against web crawlers, to prevent overly frequent requests from consuming excessive network and I/O resources and degrading site performance. To cope with anti-crawler rules, a crawler needs multiple IP resources so that tasks can be distributed sensibly and crawl tasks spread over multiple networked computers. Public clouds are increasingly common and the cost of Internet resources keeps falling, which provides a low-cost solution for resource-intensive web crawlers. A distributed crawler is therefore a good technical means of achieving online business search and continuous, uninterrupted crawling.

Therefore, it is necessary to invent a distributed vertical service search crawler framework to solve the above problems.

Summary of the invention

The purpose of the present invention is to provide a distributed vertical service search crawler framework. Using a static crawler distribution scheme, crawlers for the same target are copied to different network computers according to configuration, and crawl requests are issued from different IP resources. Target page URLs are aggregated into a central repository through a message queue pipeline, the central repository loader program pushes the URLs to the network computer terminals through the message queue pipeline, and the monitoring crawlers execute the URL download tasks. By using distributed vertical service search crawlers, the invention distributes tasks sensibly across multiple IP resources and spreads crawl tasks over multiple networked computers. Because Internet resources are inexpensive, this provides a low-cost solution for resource-intensive web crawlers. A distributed crawler is therefore a good technical means of achieving online business search and continuous, uninterrupted crawling, which solves the problems raised in the background art above.

To achieve the above object, the invention provides the following technical solution: a distributed vertical service search crawler framework, with the following specific steps:

Step 1: using a static crawler distribution scheme, crawlers for the same target are copied to different network computers according to configuration, and crawl requests are then issued from different IP resources;

Step 2: target page URLs are aggregated into the central repository through the message queue pipeline;

Step 3: the central repository loader program schedules the URLs and pushes them to multiple network computer terminals through the message queue pipeline, and the monitoring crawlers dynamically schedule and execute the URL download tasks;

Step 4: when access to a URL is not denied, the page is indexed and submitted to the central repository, and the central repository marks the URL as having had its task executed;

Step 5: when a network computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository then schedules the URL again, returns it to the message queue, and pushes it to other network computer terminals, where the monitoring crawlers again dynamically schedule and execute the URL download task;

Step 6: when the message that a URL download task could not be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding network computer terminal are recorded together; the next time this URL is searched, that network computer terminal is not called on to execute the URL download task.
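The scheduling loop of steps 2 to 6 can be illustrated with a minimal in-memory sketch. This is not the patented implementation: the class `CentralRepository`, its methods, and the `download` callback are hypothetical names, a `deque` stands in for the message queue pipeline, and the sketch assumes the exclusion of step 6 is applied as soon as the failure is recorded.

```python
from collections import deque


class CentralRepository:
    """Hypothetical in-memory stand-in for the central repository."""

    def __init__(self, terminals):
        self.terminals = list(terminals)  # names of network computer terminals
        self.pending = deque()            # stand-in for the message queue pipeline
        self.done = set()                 # URLs marked as executed (step 4)
        self.failed_pairs = set()         # (url, terminal) records kept in step 6

    def submit(self, url):
        """Step 2: a discovered URL arrives at the repository."""
        self.pending.append(url)

    def pick_terminal(self, url):
        """Steps 3 and 6: choose a terminal that has not failed on this URL."""
        for terminal in self.terminals:
            if (url, terminal) not in self.failed_pairs:
                return terminal
        return None  # every terminal has already failed on this URL

    def run(self, download):
        """Dispatch until the queue is empty.

        `download(terminal, url)` is assumed to return True on success.
        Returns the URLs that no terminal could download.
        """
        abandoned = []
        while self.pending:
            url = self.pending.popleft()
            terminal = self.pick_terminal(url)
            if terminal is None:
                abandoned.append(url)
                continue
            if download(terminal, url):
                self.done.add(url)                      # step 4: mark as executed
            else:
                self.failed_pairs.add((url, terminal))  # step 6: record the pair
                self.pending.append(url)                # step 5: schedule again
        return abandoned
```

Because every failure adds a new (URL, terminal) record and recorded pairs are never retried, the loop terminates once each URL has either succeeded or failed on every terminal.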

Preferably, the message queue is managed and scheduled centrally by the central repository. When a crawler on a network computer discovers a URL, the URL is pushed to the central repository through the message queue. The central repository performs deduplication by checking whether the URL has already been downloaded: if it has, the URL is discarded; if it has not, the central repository adds the new URL to the to-be-crawled message queue so that a network computer terminal can execute the URL download task. When a network computer terminal fails to execute a URL download task, the URL is returned to the message queue and downloaded again under dynamic scheduling by the monitoring crawlers. When a network computer terminal completes a URL download task, the central repository records the URL as crawled, which avoids repeated crawling.
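The deduplication and state tracking described in this paragraph might look like the following sketch. The names `DedupRepository`, `UrlState`, `offer`, and `report` are illustrative only. The sketch also assumes that a URL already waiting in the queue counts as a duplicate, which the text does not state explicitly.

```python
from collections import deque
from enum import Enum


class UrlState(Enum):
    QUEUED = "queued"    # waiting in the to-be-crawled queue
    CRAWLED = "crawled"  # download task completed


class DedupRepository:
    """Hypothetical central repository that deduplicates discovered URLs."""

    def __init__(self):
        self.state = {}          # url -> UrlState
        self.to_crawl = deque()  # stand-in for the to-be-crawled message queue

    def offer(self, url):
        """Called when a crawler discovers a URL; True if it was enqueued."""
        if url in self.state:    # assumption: queued URLs are duplicates too
            return False
        self.state[url] = UrlState.QUEUED
        self.to_crawl.append(url)
        return True

    def report(self, url, ok):
        """Called by a terminal after it attempted the download task."""
        if ok:
            self.state[url] = UrlState.CRAWLED  # recorded as crawled
        else:
            self.to_crawl.append(url)           # failed: back into the queue
```

A terminal would take a URL with `to_crawl.popleft()`, attempt the download, and then call `report` with the outcome.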

Technical effects and advantages of the invention:

Using a static crawler distribution scheme, the invention copies crawlers for the same target to different network computers according to configuration and issues crawl requests from different IP resources. Target page URLs are aggregated into the central repository through the message queue pipeline, the central repository loader program pushes the URLs to the network computer terminals through the message queue pipeline, and the monitoring crawlers execute the URL download tasks. By using distributed vertical service search crawlers, the invention distributes tasks sensibly across multiple IP resources and spreads crawl tasks over multiple networked computers. Because Internet resources are inexpensive, this provides a low-cost solution for resource-intensive web crawlers. A distributed crawler is therefore a good technical means of achieving online business search and continuous, uninterrupted crawling.

Brief description of the drawings

Fig. 1 is an overall structural diagram of the invention.

Specific embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Embodiment 1:

A distributed vertical service search crawler framework as shown in Fig. 1, with the following specific steps:

Using a static crawler distribution scheme, crawlers for the same target are copied to different network computers according to configuration, and crawl requests are issued from different IP resources. Target page URLs are aggregated into the central repository through the message queue pipeline, the central repository loader program pushes the data to the network computer terminals through the message queue pipeline, and the monitoring crawlers execute the URL download tasks. By using distributed vertical service search crawlers, the invention distributes tasks sensibly across multiple IP resources and spreads crawl tasks over multiple networked computers. Because Internet resources are inexpensive, this provides a low-cost solution for resource-intensive web crawlers. A distributed crawler is therefore a good technical means of achieving online business search and continuous, uninterrupted crawling.
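As an illustration of the static distribution in this embodiment, the sketch below derives one crawler copy per configured host, each with its own outgoing IP address. The configuration layout, the `CrawlerCopy` type, and the host names are assumptions made for the example. The addresses come from the reserved documentation ranges and do not refer to real machines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerCopy:
    target: str     # site the crawler is written for
    host: str       # network computer that runs this copy
    egress_ip: str  # IP address this copy sends crawl requests from


def static_distribution(target, hosts):
    """Return one crawler copy per configured host (host name -> egress IP)."""
    return [CrawlerCopy(target, host, ip) for host, ip in sorted(hosts.items())]


# Hypothetical configuration: the same crawler is copied to three hosts.
CONFIG = {
    "target": "example.com",
    "hosts": {
        "node-a": "192.0.2.10",
        "node-b": "198.51.100.20",
        "node-c": "203.0.113.30",
    },
}

copies = static_distribution(CONFIG["target"], CONFIG["hosts"])
```

The distribution is static in the sense that the set of copies is fixed by configuration before crawling starts; only the assignment of URLs to copies is decided at run time.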

Embodiment 2:

According to the distributed vertical service search crawler framework shown in Fig. 1, the message queue pipeline includes two message queue managers: one serves as the primary message queue manager, and the other is a standby that is enabled when the primary message queue manager fails.
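The primary and standby arrangement of this embodiment can be sketched as follows. `QueueManager` and `FailoverPipeline` are hypothetical classes, and a failure of the primary is simulated with a `healthy` flag rather than a real broker outage.

```python
from collections import deque


class QueueManager:
    """Hypothetical in-memory message queue manager."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.messages = deque()

    def publish(self, message):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.messages.append(message)


class FailoverPipeline:
    """Publishes to the primary manager and falls back to the standby."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def publish(self, message):
        """Return the name of the manager that accepted the message."""
        try:
            self.primary.publish(message)
            return self.primary.name
        except ConnectionError:
            self.standby.publish(message)  # standby enabled on primary failure
            return self.standby.name
```

If the standby is also down, the `ConnectionError` propagates to the caller, which matches a two-manager setup with no further fallback.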

Embodiment 3:

According to the distributed vertical service search crawler framework shown in Fig. 1, the message queue pipeline includes three message queue managers: one serves as the primary message queue manager, and the other two are enabled when the primary message queue manager fails or becomes congested, serving respectively as a standby and as a relief queue.

Embodiment 4:

According to the distributed vertical service search crawler framework shown in Fig. 1, the message queue pipeline includes more than three message queue managers: one serves as the primary message queue manager; one is a standby for when the primary message queue manager fails; and the remaining message queue managers are enabled when the primary message queue manager becomes congested, are allocated evenly according to the number of crawlers, and serve to relieve the congestion.
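One possible reading of "allocated evenly according to the number of crawlers" is that crawlers are spread over the relief queues so that each relief queue serves roughly the same number of crawlers. The sketch below follows that reading. The function names, the queue-depth congestion test, and the threshold parameter are assumptions, not details given in the text.

```python
def assign_crawlers(crawlers, relief_queues):
    """Spread crawlers over relief queues; group sizes differ by at most one."""
    if not relief_queues:
        raise ValueError("at least one relief queue is required")
    assignment = {queue: [] for queue in relief_queues}
    for index, crawler in enumerate(crawlers):
        queue = relief_queues[index % len(relief_queues)]  # round-robin
        assignment[queue].append(crawler)
    return assignment


def route(primary_depth, congestion_threshold, crawler, crawler_to_queue):
    """Use the primary queue unless it is congested (depth at or above threshold)."""
    if primary_depth < congestion_threshold:
        return "primary"
    return crawler_to_queue[crawler]


# Hypothetical setup: five crawlers share two relief queues.
assignment = assign_crawlers(["c1", "c2", "c3", "c4", "c5"], ["relief-1", "relief-2"])
crawler_to_queue = {
    crawler: queue for queue, members in assignment.items() for crawler in members
}
```

Round-robin assignment is only one way to obtain an even split; a hash of the crawler identifier would serve the same purpose when the set of crawlers changes over time.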

Working principle of the invention:

Referring to Fig. 1 of the specification, in use, crawlers for the same target are first copied to different network computers according to configuration using a static crawler distribution scheme, and crawl requests are then issued from different IP resources. Target page URLs are aggregated into the central repository through the message queue pipeline. The central repository loader program then schedules the URLs and pushes them to multiple network computer terminals through the message queue pipeline, and the monitoring crawlers dynamically schedule and execute the URL download tasks. When access to a URL is not denied, the page is indexed and submitted to the central repository, and the central repository marks the URL as having had its task executed. When a network computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository schedules the URL again and pushes it to other network computer terminals through the message queue pipeline, the monitoring crawlers again dynamically schedule and execute the URL download task, and the URL is finally submitted to the URL library. When a URL whose download task could not be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding network computer terminal are recorded together; the next time this URL is searched, that network computer terminal is not called on to execute the URL download task. The message queue pipeline includes two or more message queue managers: one serves as the primary message queue manager, and the others are enabled when the primary message queue manager fails or becomes congested, serving as standby and relief queues.

Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A distributed vertical service search crawler framework, characterized in that the specific steps are as follows:

Step 1: using a static crawler distribution scheme, crawlers for the same target are copied to different network computers according to configuration, and crawl requests are then issued from different IP resources;

Step 2: target page URLs are aggregated into the central repository through the message queue pipeline;

Step 3: the central repository loader program schedules the URLs and pushes them to multiple network computer terminals through the message queue pipeline, and the monitoring crawlers dynamically schedule and execute the URL download tasks;

Step 4: when access to a URL is not denied, the page is indexed and submitted to the central repository, and the central repository marks the URL as having had its task executed;

Step 5: when a network computer terminal cannot execute a URL download task, the URL is fed back to the central repository through the message queue pipeline; the central repository then schedules the URL again, returns it to the message queue, and pushes it to other network computer terminals, where the monitoring crawlers again dynamically schedule and execute the URL download task;

Step 6: when the message that a URL download task could not be executed is fed back to the central repository through the message queue pipeline, the URL and the corresponding network computer terminal are recorded together; the next time this URL is searched, that network computer terminal is not called on to execute the URL download task.

2. The distributed vertical service search crawler framework according to claim 1, characterized in that: the message queue is managed and scheduled centrally by the central repository; when a crawler on a network computer discovers a URL, the URL is pushed to the central repository through the message queue; the central repository performs deduplication by checking whether the URL has already been downloaded, discarding the URL if it has and, if it has not, adding the new URL to the to-be-crawled message queue so that a network computer terminal can execute the URL download task; when a network computer terminal fails to execute a URL download task, the URL is returned to the message queue and downloaded again under dynamic scheduling by the monitoring crawlers; and when a network computer terminal completes a URL download task, the central repository records the URL as crawled, which avoids repeated crawling.