CN111125487A

CN111125487A - Crawling method and device for web crawler

Info

Publication number: CN111125487A
Application number: CN201911343121.5A
Authority: CN
Inventors: 陈建欣; 王海军; 梁晓; 刘沐芸
Original assignee: Individualized Cell Therapy Technology National And Local Joint Engineering Laboratory (shenzhen)
Current assignee: Individualized Cell Therapy Technology National And Local Joint Engineering Laboratory (shenzhen)
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2020-05-08

Abstract

The invention discloses a crawling method and a crawling device for a web crawler, involving a plurality of target task allocation servers and a target crawling server. The crawling method comprises the following steps: acquiring a target crawling task list, wherein the target crawling task list comprises a plurality of crawling tasks and a crawling address corresponding to each crawling task; allocating the plurality of crawling tasks to the target task allocation servers according to a preset allocation rule, and performing deduplication processing on the crawling tasks received by each target task allocation server; and delivering the deduplicated crawling tasks of each target task allocation server, together with their corresponding crawling addresses, to the target crawling server according to a preset delivery rule. The invention addresses the following problems in the prior art: a single-machine web crawler is prone to single-machine crashes, a multi-threaded web crawler is confined to one running server, and duplicate crawling results readily occur.

Description

Crawling method and device for web crawler

Technical Field

The invention relates to the technical field of web crawler application, in particular to a crawling method and device for a web crawler.

Background

A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically captures web information according to certain rules. Other, less commonly used names are ant, automatic indexer, emulator, and worm. The web crawlers currently in use all run on a single machine. Although such crawlers offer a multi-threaded mode, they are confined to one running server, and a large-scale crawling task places extremely high performance demands on that single machine. To improve crawling efficiency, existing crawler tools have been extended to crawl with multiple processes. Despite this improvement, multi-process crawlers have the following disadvantages: 1. the processes cannot communicate with one another, so the crawling targets must be clearly partitioned by hand; 2. when the partitioning is unclear or the URLs of crawling targets overlap, multi-process crawling often produces duplicate results, and deduplication is still required afterwards; 3. the crawler remains limited by the resources of one crawling server, so large crawling tasks require higher-specification, more expensive servers; 4. running on a single machine always carries the risk of a single-machine crash.

Disclosure of Invention

The embodiments of the invention provide a crawling method and a crawling device for a web crawler, which aim to solve the following problems in the prior art: a single-machine web crawler is prone to single-machine crashes, a multi-threaded web crawler is confined to one running server, and duplicate crawling results readily occur.

In order to solve the above technical problem, a first technical solution adopted in the embodiments of the present invention is as follows:

a crawling method of a web crawler, which relates to a plurality of target task allocation servers and a target crawling server, comprises the following steps: acquiring a target crawling task list, wherein the target crawling task list comprises a plurality of crawling tasks and crawling addresses corresponding to the crawling tasks; distributing a plurality of crawling tasks to each target task distribution server according to a preset distribution rule, and performing duplicate removal processing on the crawling tasks received by each target task distribution server; and delivering the crawling task corresponding to each target task distribution server after the duplication removal processing and the corresponding crawling address to the target crawling server according to a preset delivery rule.

Optionally, after the delivering the crawling task and the crawling address corresponding to each target task allocation server after the deduplication processing to the target crawling server according to a preset delivery rule, the method includes: and obtaining a crawling result of each target crawling server executing the crawling task corresponding to the target crawling server, and storing the crawling result to a specified address.

Optionally, after saving the crawling result to the specified address, the method includes: and analyzing the page content corresponding to the crawling result, and identifying a new crawling address according to the page content.

Optionally, the allocating, according to a preset allocation rule, the plurality of crawling tasks to each target task allocation server includes: judging whether a signal of the crawling task in the target crawling task list is received by the target task allocation server or not; if yes, setting a pickup mark for the picked crawling task, and distributing the crawling task after the pickup mark is set to the target crawling server.

Optionally, after setting a pickup mark for the crawl tasks picked up and allocating the crawl tasks after setting the pickup mark to the target crawl server, the method includes: judging whether the crawling task with the pickup mark is still not allocated to the corresponding target crawling server after a preset time length; and if so, distributing the crawling task with the pickup mark to the corresponding target crawling server.

Optionally, the delivering, according to a preset delivery rule, the crawling task and the crawling address corresponding to each target task allocation server after the deduplication processing to the target crawling server includes: acquiring running state data of each target crawling server, wherein the running state data comprises a CPU (Central processing Unit) utilization rate, a memory utilization rate and the current task number; and delivering the crawling task and the crawling address to the target crawling server meeting preset requirements according to the running state data.

Optionally, the delivering the crawling task and the crawling address to the target crawling server meeting preset requirements according to the running state data includes: acquiring the target crawling server with the minimum current task number; and delivering the crawling task and the crawling address to the target crawling server with the minimum number of current tasks.

In order to solve the above technical problem, a second technical solution adopted in the embodiments of the present invention is as follows:

a crawling device of a web crawler, which relates to a plurality of target task allocation servers and target crawling servers, comprises: the task obtaining module is used for obtaining a target crawling task list, and the target crawling task list comprises a plurality of crawling tasks and crawling addresses corresponding to the crawling tasks; the task allocation module is used for allocating a plurality of crawling tasks to each target task allocation server according to a preset allocation rule and performing duplicate removal processing on the crawling tasks received by each target task allocation server; and the task delivery module is used for delivering the crawling task corresponding to each target task distribution server after the duplication removal processing and the corresponding crawling address to the target crawling server according to a preset delivery rule.

In order to solve the above technical problem, a third technical solution adopted in the embodiments of the present invention is as follows:

a computer-readable storage medium having stored thereon a computer program which, when executed, implements a crawling method of a web crawler as described above.

In order to solve the above technical problem, a fourth technical solution adopted in the embodiments of the present invention is as follows:

a computer device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing a crawling method of a web crawler as described above when executing the computer program.

The beneficial effects of the embodiments of the invention are as follows. Unlike the prior art, the embodiments of the present invention obtain the target crawling task list, allocate the plurality of crawling tasks to the target task allocation servers according to the preset allocation rule, perform deduplication processing, and finally deliver the deduplicated crawling tasks of each target task allocation server, together with their corresponding crawling addresses, to the target crawling server according to the preset delivery rule. This solves the following problems in the prior art: a single-machine web crawler is prone to single-machine crashes, a multi-threaded web crawler is confined to one running server, and duplicate crawling results readily occur.

Drawings

FIG. 1 is a flowchart illustrating an implementation of a crawling method for web crawlers according to a first embodiment of the present invention;

FIG. 2 is a partial structural framework diagram of an embodiment of a crawler crawling apparatus for web crawlers according to a second embodiment of the present invention;

FIG. 3 is a partial structural framework diagram of an embodiment of a computer-readable storage medium according to a third embodiment of the present invention;

FIG. 4 is a partial structural framework diagram of an embodiment of a computer device according to a fourth embodiment of the present invention.

Detailed Description

Example one

Referring to FIG. 1, which is a flowchart of an implementation of the crawling method for a web crawler according to the first embodiment of the present invention, the crawling method involves a plurality of target task allocation servers and a target crawling server, and includes:

Step S101: acquire a target crawling task list, wherein the target crawling task list comprises a plurality of crawling tasks and the crawling address corresponding to each crawling task. The target crawling task list is entered by staff in a designated interface.

Step S102: allocate the plurality of crawling tasks to the target task allocation servers according to a preset allocation rule, and perform deduplication processing on the crawling tasks received by each target task allocation server, that is, remove repeated crawling tasks.

Specifically, in this step each target task allocation server actively picks up a corresponding crawling task from the target crawling task list at a fixed time interval, determines whether the crawling task is a repeat, and removes duplicates. After a target task allocation server has delivered its crawling task, it picks up another crawling task from the target crawling task list, and continues to do so until the target crawling task fails and cannot be executed.
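The deduplication in step S102 can be illustrated with a minimal sketch. The text does not say how a repeated crawling task is recognized, so this example assumes that two tasks are duplicates when their crawling addresses match after a simple normalization (lowercased scheme and host, fragment dropped). The names `CrawlTask`, `normalize_address`, and `deduplicate` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass(frozen=True)
class CrawlTask:
    """Hypothetical task record: a task name plus its crawling address."""
    name: str
    address: str


def normalize_address(address: str) -> str:
    """Assumed duplicate rule: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(address)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",   # treat an empty path as the site root
        parts.query,
        "",                  # fragments never reach the server, so ignore them
    ))


def deduplicate(tasks: list[CrawlTask]) -> list[CrawlTask]:
    """Keep the first task seen for each normalized address, preserving order."""
    seen: set[str] = set()
    unique: list[CrawlTask] = []
    for task in tasks:
        key = normalize_address(task.address)
        if key not in seen:
            seen.add(key)
            unique.append(task)
    return unique
```

A real deployment would likely need a shared store for the set of seen addresses so that several task allocation servers agree on what counts as a repeat; the in-memory set here only covers the tasks received by one server.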

Step S103: deliver the deduplicated crawling tasks of each target task allocation server, together with their corresponding crawling addresses, to the target crawling server according to a preset delivery rule.

In this embodiment, optionally, after the delivering the crawling task and the crawling address corresponding to each target task allocation server after the deduplication processing to the target crawling server according to a preset delivery rule, the method includes:

obtaining the crawling result produced by each target crawling server in executing its corresponding crawling task, and saving the crawling result to a specified address.

In this embodiment, optionally, after saving the crawling result to the specified address, the method includes:

analyzing the page content corresponding to the crawling result, and identifying new crawling addresses according to the page content.
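Identifying new crawling addresses from page content can be sketched as follows, assuming the page content is HTML and that new addresses are the `href` targets of anchor tags. The patent does not specify the parsing method; `LinkCollector` and `find_new_addresses` are hypothetical names, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute http(s) addresses from <a href="..."> tags."""

    def __init__(self, base_address: str) -> None:
        super().__init__()
        self.base_address = base_address
        self.addresses: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                absolute = urljoin(self.base_address, value)
                if absolute.startswith(("http://", "https://")):
                    self.addresses.append(absolute)


def find_new_addresses(page_content: str, base_address: str,
                       known: set[str]) -> list[str]:
    """Return addresses found in the page that are not already known."""
    collector = LinkCollector(base_address)
    collector.feed(page_content)
    new_addresses: list[str] = []
    for address in collector.addresses:
        if address not in known and address not in new_addresses:
            new_addresses.append(address)
    return new_addresses
```

The addresses returned here could then be appended to the target crawling task list as further crawling tasks, which is one plausible reading of the step above.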

In this embodiment, optionally, the allocating, according to a preset allocation rule, the plurality of crawling tasks to each target task allocation server includes:

First, determine whether a signal has been received indicating that the target task allocation server is picking up a crawling task from the target crawling task list.

Second, if such a signal has been received, the target task allocation server acquires the crawling task, a pickup mark is set for the picked-up crawling task, and the crawling task bearing the pickup mark is allocated to the target crawling server.
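The pickup mark can be illustrated with a minimal single-process sketch. It assumes the target crawling task list is held in memory and guarded by a lock so that two task allocation servers cannot pick up the same task; in a real multi-server deployment the list would live in a shared store, which the patent does not name. `TaskEntry`, `TaskList`, and `pick_up` are hypothetical names.

```python
import threading
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskEntry:
    """One crawling task plus its pickup mark (assumed fields)."""
    address: str
    picked_up_by: Optional[str] = None    # the pickup mark: which server took it
    picked_up_at: Optional[float] = None  # when the mark was set


class TaskList:
    """In-memory stand-in for the target crawling task list."""

    def __init__(self, addresses: list[str]) -> None:
        self._lock = threading.Lock()
        self._entries = [TaskEntry(address) for address in addresses]

    def pick_up(self, server_id: str) -> Optional[TaskEntry]:
        """Mark and return the first unmarked task, or None if none remain."""
        with self._lock:
            for entry in self._entries:
                if entry.picked_up_by is None:
                    entry.picked_up_by = server_id
                    entry.picked_up_at = time.monotonic()
                    return entry
        return None
```

Setting the mark and returning the task inside the same locked section is what keeps a task from being handed to two allocation servers at once.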

In this embodiment, optionally, after setting a pickup mark for the crawl tasks picked up and allocating the crawl tasks after setting the pickup mark to the target crawl server, the method includes:

First, determine whether a crawling task bearing the pickup mark has still not been allocated to the corresponding target crawling server after a preset time length.

Second, if the crawling task bearing the pickup mark has still not been allocated to the corresponding target crawling server after the preset time length, allocate that crawling task to the corresponding target crawling server.
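The timeout check can be sketched as follows, assuming each marked task records when its pickup mark was set and whether it has reached a crawling server. The `MarkedTask` type, the `delivered` flag, and the `deliver` callback are hypothetical; the patent only states that a marked task still unallocated after the preset time length is allocated.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MarkedTask:
    """A crawling task that already carries a pickup mark (assumed fields)."""
    address: str
    picked_up_at: float      # seconds, when the pickup mark was set
    delivered: bool = False  # True once a crawling server has received it


def overdue_tasks(tasks: list[MarkedTask], now: float,
                  timeout: float) -> list[MarkedTask]:
    """Marked tasks still undelivered after the preset time length."""
    return [t for t in tasks
            if not t.delivered and now - t.picked_up_at >= timeout]


def redeliver_overdue(tasks: list[MarkedTask], now: float, timeout: float,
                      deliver: Callable[[MarkedTask], None]) -> list[str]:
    """Deliver every overdue task and return the addresses that were sent."""
    sent: list[str] = []
    for task in overdue_tasks(tasks, now, timeout):
        deliver(task)
        task.delivered = True
        sent.append(task.address)
    return sent
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test; a caller would typically supply `time.monotonic()`.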

In this embodiment, optionally, the delivering, according to a preset delivery rule, the crawling task and the crawling address corresponding to each target task allocation server after the deduplication processing to the target crawling server includes:

First, obtain running state data of each target crawling server, wherein the running state data comprises a CPU (central processing unit) utilization rate, a memory utilization rate and a current task number.

Second, deliver the crawling task and the crawling address to the target crawling server that meets preset requirements according to the running state data.

In this embodiment, optionally, the target task allocation server scores the acquired running state of each target crawling server in real time and submits the crawling task to the target crawling server with the lowest score for execution. For example, if the CPU utilization rate of a target crawling server is 30%, its memory utilization rate is 50%, and the number of crawling tasks it has yet to execute is 6, the target crawling server is scored as 30 × 0.3 + 50 × 0.3 + 6 × 0.4 = 26.4.
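The scoring rule in the example can be written out as a short sketch. The weights 0.3, 0.3 and 0.4 are taken from the worked example above; whether they are fixed or configurable is not stated, so they appear here as default parameters. `ServerStatus`, `load_score`, and `pick_lowest_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerStatus:
    """Running state data for one target crawling server."""
    name: str
    cpu_percent: float     # CPU utilization rate, 0 to 100
    memory_percent: float  # memory utilization rate, 0 to 100
    current_tasks: int     # crawling tasks waiting to be executed


def load_score(status: ServerStatus, cpu_weight: float = 0.3,
               memory_weight: float = 0.3, task_weight: float = 0.4) -> float:
    """Weighted sum of the three running state figures; lower means less loaded."""
    return (status.cpu_percent * cpu_weight
            + status.memory_percent * memory_weight
            + status.current_tasks * task_weight)


def pick_lowest_score(statuses: list[ServerStatus]) -> ServerStatus:
    """Return the crawling server with the lowest score."""
    return min(statuses, key=load_score)
```

With the figures from the example (30% CPU, 50% memory, 6 tasks), `load_score` evaluates to approximately 26.4, subject to ordinary floating-point rounding.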

In this embodiment, optionally, the delivering the crawling task and the crawling address to the target crawling server meeting preset requirements according to the running state data includes:

First, identify the target crawling server with the fewest current tasks.

Second, deliver the crawling task and the crawling address to the target crawling server with the fewest current tasks. The target crawling server with the fewest current tasks generally also has the lowest CPU utilization rate and memory utilization rate, and is therefore the most suitable for receiving crawling tasks.
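Selecting the target crawling server with the fewest current tasks reduces to a minimum search. In this sketch the current task numbers are assumed to be available as a mapping from server name to count, and ties are broken by server name so the choice is deterministic; both the mapping and the tie-break are assumptions, and `pick_fewest_tasks` is a hypothetical name.

```python
def pick_fewest_tasks(task_counts: dict[str, int]) -> str:
    """Return the name of the crawling server with the fewest current tasks."""
    if not task_counts:
        raise ValueError("no target crawling servers available")
    # Compare by task count first, then by name to break ties predictably.
    return min(task_counts, key=lambda name: (task_counts[name], name))
```

For example, `pick_fewest_tasks({"crawl-1": 4, "crawl-2": 1, "crawl-3": 1})` selects `"crawl-2"`.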

By acquiring the target crawling task list, allocating the plurality of crawling tasks to the target task allocation servers according to the preset allocation rule, performing deduplication processing, and finally delivering the deduplicated crawling tasks of each target task allocation server, together with their corresponding crawling addresses, to the target crawling server according to the preset delivery rule, the embodiment of the invention solves the following problems in the prior art: a single-machine web crawler is prone to single-machine crashes, a multi-threaded web crawler is confined to one running server, and duplicate crawling results readily occur.

In addition, the invention also has the following beneficial effects:

First, compared with the single-machine operation of the past, a plurality of servers work cooperatively; when one server crashes, the other servers still complete the original crawling task, so availability is extremely high. Second, the invention exploits the advantage of multi-server cooperation: working servers can be added or removed as required, and when the crawling workload is small, part of the resources can be withdrawn and reassigned to other purposes. Third, because a plurality of servers work cooperatively, the execution efficiency of crawling tasks can be greatly improved, and because deduplication is performed on the crawling tasks during task scheduling, no subsequent post-processing of the crawling results is needed.

Example two

Referring to FIG. 2, which is a partial structural framework diagram of a crawling apparatus 100 for a web crawler according to the second embodiment of the present invention, the crawling apparatus 100 involves a plurality of target task allocation servers and target crawling servers, and comprises:

The task obtaining module 110, configured to obtain a target crawling task list, where the target crawling task list includes a plurality of crawling tasks and a crawling address corresponding to each crawling task.

The task allocation module 120, configured to allocate the plurality of crawling tasks to each target task allocation server according to a preset allocation rule, and to perform deduplication processing on the crawling tasks received by each target task allocation server.

The task delivery module 130, configured to deliver the deduplicated crawling tasks of each target task allocation server, together with their corresponding crawling addresses, to the target crawling server according to a preset delivery rule.
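How the three modules might fit together can be sketched in a single process. The preset rules are not specified in the text, so this example makes two assumptions: the allocation rule hashes each crawling address to pick a task allocation server, so that repeated addresses land on the same server and per-server deduplication removes them, and the delivery rule sends each task to the crawling server currently holding the fewest tasks. All class and function names are hypothetical.

```python
import hashlib

Task = tuple[str, str]  # (task name, crawling address)


class TaskObtainingModule:
    def obtain(self, task_list: list[Task]) -> list[Task]:
        """Stand-in for reading the task list entered by staff."""
        return list(task_list)


class TaskAllocationModule:
    def __init__(self, allocation_servers: list[str]) -> None:
        self.allocation_servers = allocation_servers

    def allocate(self, tasks: list[Task]) -> dict[str, list[Task]]:
        """Assumed rule: hash the address to a server, then dedupe per server."""
        received: dict[str, list[Task]] = {s: [] for s in self.allocation_servers}
        seen: dict[str, set[str]] = {s: set() for s in self.allocation_servers}
        for name, address in tasks:
            digest = hashlib.sha256(address.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(self.allocation_servers)
            server = self.allocation_servers[index]
            if address not in seen[server]:  # drop repeats on this server
                seen[server].add(address)
                received[server].append((name, address))
        return received


class TaskDeliveryModule:
    def __init__(self, crawl_servers: list[str]) -> None:
        self.assigned: dict[str, list[Task]] = {s: [] for s in crawl_servers}

    def deliver(self, allocated: dict[str, list[Task]]) -> dict[str, list[Task]]:
        """Assumed rule: send each task to the server with the fewest tasks."""
        for tasks in allocated.values():
            for task in tasks:
                target = min(self.assigned,
                             key=lambda s: (len(self.assigned[s]), s))
                self.assigned[target].append(task)
        return self.assigned


def run_device(task_list: list[Task], allocation_servers: list[str],
               crawl_servers: list[str]) -> dict[str, list[Task]]:
    tasks = TaskObtainingModule().obtain(task_list)
    allocated = TaskAllocationModule(allocation_servers).allocate(tasks)
    return TaskDeliveryModule(crawl_servers).deliver(allocated)
```

Hashing the address is only one way to make per-server deduplication effective across servers; a shared set of seen addresses would serve the same purpose.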

Example three

Referring to FIG. 3, a computer-readable storage medium 10 according to an embodiment of the present invention includes ROM/RAM, a magnetic disk, an optical disk, and the like, on which a computer program 11 is stored. When executed, the computer program 11 implements the crawling method of the web crawler described in the first embodiment. Since the crawling method of the web crawler has been described in detail in the first embodiment, the description is not repeated here.

The crawling method of the web crawler implemented by this embodiment of the invention obtains the target crawling task list, allocates the plurality of crawling tasks to the target task allocation servers according to the preset allocation rule, performs deduplication processing, and finally delivers the deduplicated crawling tasks of each target task allocation server, together with their corresponding crawling addresses, to the target crawling server according to the preset delivery rule, thereby solving the following problems in the prior art: a single-machine web crawler is prone to single-machine crashes, a multi-threaded web crawler is confined to one running server, and duplicate crawling results readily occur.

Example four

Referring to FIG. 4, a computer device 20 according to an embodiment of the present invention includes a processor 21, a memory 22, and a computer program 221 stored in the memory 22 and executable on the processor 21, where the crawling method of the web crawler described in the first embodiment is implemented when the processor 21 executes the computer program 221. Since the crawling method of the web crawler has been described in detail in the first embodiment, the description is not repeated here.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A crawling method of a web crawler, which relates to a plurality of target task allocation servers and a target crawling server, comprises the following steps:

acquiring a target crawling task list, wherein the target crawling task list comprises a plurality of crawling tasks and crawling addresses corresponding to the crawling tasks;

distributing a plurality of crawling tasks to each target task distribution server according to a preset distribution rule, and performing duplicate removal processing on the crawling tasks received by each target task distribution server;

and delivering the crawling task corresponding to each target task distribution server after the duplication removal processing and the corresponding crawling address to the target crawling server according to a preset delivery rule.

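2. The crawling method for web crawlers according to claim 1, wherein after delivering the crawl task and the crawl address corresponding to each target task allocation server after the deduplication processing to the target crawl server according to a preset delivery rule, the method comprises:

obtaining a crawling result of each target crawling server executing the crawling task corresponding to the target crawling server, and storing the crawling result to a specified address.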

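3. The crawling method for web crawlers according to claim 2, wherein after said saving the crawling result to a specified address, the method comprises:

analyzing the page content corresponding to the crawling result, and identifying a new crawling address according to the page content.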

4. The crawling method for web crawlers according to claim 1, wherein the step of allocating the plurality of crawling tasks to each target task allocation server according to a preset allocation rule comprises:

judging whether a signal of the crawling task in the target crawling task list is received by the target task allocation server or not;

if yes, setting a pickup mark for the picked crawling task, and distributing the crawling task after the pickup mark is set to the target crawling server.

5. The crawling method for web crawlers according to claim 4, wherein after setting a pickup mark for the crawl tasks picked up and allocating the crawl tasks after setting the pickup mark to the target crawl server, the method comprises:

judging whether the crawling task with the pickup mark is still not allocated to the corresponding target crawling server after a preset time length;

and if so, distributing the crawling task with the pickup mark to the corresponding target crawling server.

6. The crawling method for web crawlers according to claim 1, wherein the delivering the crawl task and the crawl address corresponding to each target task allocation server after the deduplication processing to the target crawl server according to a preset delivery rule comprises:

acquiring running state data of each target crawling server, wherein the running state data comprises a CPU (Central processing Unit) utilization rate, a memory utilization rate and the current task number;

and delivering the crawling task and the crawling address to the target crawling server meeting preset requirements according to the running state data.

7. The crawling method for web crawlers according to claim 1, wherein the delivering the crawling task and the crawling address to the target crawling server meeting preset requirements according to the running state data comprises:

acquiring the target crawling server with the minimum current task number;

and delivering the crawling task and the crawling address to the target crawling server with the minimum number of current tasks.

8. A crawling apparatus of web crawler, which relates to a plurality of target task assignment servers and target crawling servers, comprising:

the task obtaining module is used for obtaining a target crawling task list, and the target crawling task list comprises a plurality of crawling tasks and crawling addresses corresponding to the crawling tasks;

the task allocation module is used for allocating a plurality of crawling tasks to each target task allocation server according to a preset allocation rule and performing duplicate removal processing on the crawling tasks received by each target task allocation server;

and the task delivery module is used for delivering the crawling task corresponding to each target task distribution server after the duplication removal processing and the corresponding crawling address to the target crawling server according to a preset delivery rule.

9. A computer-readable storage medium, having stored thereon a computer program which, when executed, implements the crawling method of the web crawler of any of claims 1 to 7.

10. A computer device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor implements the crawling method of the web crawler according to any one of claims 1 to 7 when executing the computer program.