CN110516139B - Crawler system and method

Info

Publication number: CN110516139B
Application number: CN201910835029.4A
Authority: CN
Inventors: 宋海伟
Original assignee: Shanghai Ctrip Business Co Ltd
Current assignee: Shanghai Ctrip Business Co Ltd
Priority date: 2019-09-05
Filing date: 2019-09-05
Publication date: 2023-07-07
Anticipated expiration: 2039-09-05
Also published as: CN110516139A

Abstract

The invention discloses a crawler system and method. The crawler system comprises a client and a server side; the server side comprises a load balancing server and a cluster server. The client generates crawling parameters according to a preset crawling target and the access requirements of a target website, and sends the crawling parameters to the load balancing server. The load balancing server generates crawler tasks according to the crawling parameters and distributes them to the cluster server. The cluster server crawls target data of the target website according to the crawler tasks. With the invention, a user only needs to set the crawling parameters at the client according to the preset crawling target and the access requirements of the target website, and the generated crawler tasks are processed uniformly by the cluster server. The system is therefore easy to maintain, development time is shortened, duplicated development work is avoided, and development cost is reduced.

Description

Crawler system and method

Technical Field

The invention relates to the field of information retrieval, in particular to a crawler system and a crawler method.

Background

With the rapid development of networks, the internet has become a carrier of vast amounts of information, and how to efficiently extract and utilize such information has become a great challenge.

A web crawler is a program or script that automatically captures internet information according to certain rules. Crawlers are widely used by internet search engines and similar websites to automatically collect the content of all pages they can access, so as to acquire or update the content and retrieval methods of those websites.

At present, enterprises make increasing use of information on the internet. For cost reasons, however, a crawling function is generally developed independently whenever a specific business need arises. Because target websites differ in complexity and requirements, no general-purpose crawler solution is available, and a great deal of time and labor is wasted. The independently developed crawler functions are repetitive and cumbersome, which hinders upgrading and maintenance, while commercial crawler systems on the market are expensive and poorly tailored to an enterprise's specific business.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, namely that independently developed crawler functions are repetitive and cumbersome and hinder upgrading and maintenance, and that existing crawler systems are costly and poorly tailored to specific business needs.

The invention solves the technical problems by the following technical scheme:

A crawler system, comprising a client and a server side; the server side comprises a load balancing server and a cluster server;

the client is used for generating crawling parameters according to access requirements of a preset crawling target and a target website and sending the crawling parameters to the load balancing server;

the load balancing server is used for generating a crawler task according to the crawling parameters and distributing the crawler task to the cluster server;

and the cluster server is used for crawling target data of the target website according to the crawler task.

Preferably, the cluster server comprises a plurality of working servers, and the load balancing server is further used for generating prompt information when the number of the crawler tasks exceeds a preset number, wherein the prompt information comprises information for prompting to increase the number of the working servers.

Preferably, the cluster server is further configured to simulate an input operation of a real user by using an automated testing tool when receiving a verification code input request of the target website;

and/or the cluster server is further used for connecting the target website through ADSL dialing;

and/or the cluster server is further used for modifying the crawling parameters by utilizing a packet capture tool and generating the crawler task according to the modified crawling parameters.

Preferably, the client is further configured to set a preset time, and send the crawling parameter to the load balancing server according to the preset time;

and/or the client is further used for setting a crawling mode, the crawling mode comprises webpage element crawling and/or interface crawling, and the crawling parameters comprise the crawling mode.

Preferably, the server further comprises a monitoring server, the cluster server is used for generating process data according to the crawler task and sending the process data to the monitoring server, and the monitoring server is used for displaying the process data;

and/or the monitoring server is further used for sending a control command to the cluster server, and the cluster server is used for starting or stopping the crawler task according to the control command.

A crawler method, the crawler method comprising:

the client generates crawling parameters according to access requirements of a preset crawling target and a target website, and sends the crawling parameters to the load balancing server;

the load balancing server generates a crawler task according to the crawling parameters and distributes the crawler task to a cluster server;

and the cluster server crawls target data of the target website according to the crawler task.

Preferably, the cluster server includes a plurality of working servers, and the step of generating the crawler task by the load balancing server according to the crawling parameter further includes:

and the load balancing server generates prompt information when the number of the crawler tasks exceeds a preset number, and the prompt information comprises information for prompting to increase the number of the working servers.

Preferably, the step of crawling, by the cluster server, the target data of the target website according to the crawler task includes:

when receiving a verification code input request of the target website, the cluster server simulates input operation of a real user by using an automatic testing tool;

and/or, before the step in which the cluster server crawls the target data of the target website according to the crawler task, the method further comprises:

the cluster server is connected with the target website through ADSL (asymmetric digital subscriber line) dialing;

and/or the step of generating the crawler task by the load balancing server according to the crawling parameters comprises the following steps:

and the cluster server modifies the crawling parameters by utilizing a packet capture tool and generates the crawler task according to the modified crawling parameters.

Preferably, the step of sending the crawling parameter to the load balancing server includes: the client sets preset time and sends the crawling parameters to the load balancing server according to the preset time;

and/or the client side also sets a crawling mode, wherein the crawling mode comprises webpage element crawling and/or interface crawling, and the crawling parameters comprise the crawling mode.

Preferably, the step of crawling, by the cluster server, the target data of the target website according to the crawler task further includes:

the cluster server generates process data according to the crawler task and sends the process data to a monitoring server;

the monitoring server displays the process data; and/or the monitoring server sends a control command to the cluster server; and the cluster server starts or stops the crawler task according to the control command.

The invention has the positive progress effects that:

According to the invention, the client sets the crawling parameters according to the preset crawling target and the access requirements of the target website, so the target data of the target website can be crawled in a targeted manner and crawling speed is improved. The load balancing server generates the crawler task according to the crawling parameters and distributes it to the cluster server, and the cluster server crawls the target data of the target website according to the crawler task, which yields a unified crawler system. The user only needs to set the crawling parameters at the client, and the generated crawler tasks are processed uniformly by the cluster server, so the system is easy to maintain, development time is shortened, duplicated development work is avoided, and development cost is reduced.

Drawings

FIG. 1 is a block diagram of a crawler system according to embodiment 1 of the present invention.

FIG. 2 is a flow chart of the crawler method of embodiment 2 of the present invention.

FIG. 3 is a flowchart of step 205 of the crawler method of embodiment 2 of the present invention.

Detailed Description

The invention is further illustrated by means of the following examples, which are not intended to limit the scope of the invention.

Example 1

This embodiment provides a crawler system which, as shown in FIG. 1, comprises a client 1 and a server side 2; the server side 2 includes a load balancing server 21, a cluster server 22 and a monitoring server 23.

The server side 2 is configured as a service cluster based on Web Service.

The client 1 is configured to generate crawling parameters according to access requirements of a preset crawling target and a target website 3, and send the crawling parameters to the load balancing server 21.

The load balancing server 21 is configured to generate a crawler task according to the crawling parameter, and distribute the crawler task to the cluster server 22.

The cluster server 22 is used for crawling target data of the target website 3 according to the crawler task.
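The division of work among the client 1, the load balancing server 21 and the cluster server 22 can be sketched as follows. This is a minimal illustration only, not the implementation of the embodiment: the field names of `CrawlParams`, the `LoadBalancer` class and the shortest-queue dispatch policy are all assumptions, since the source does not state how tasks are distributed among work servers.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class CrawlParams:
    """Crawling parameters built by the client (field names are hypothetical)."""
    target_url: str
    fields: list[str]
    mode: str = "element"  # "element" or "interface"


@dataclass
class CrawlerTask:
    task_id: int
    params: CrawlParams


@dataclass
class WorkServer:
    name: str
    queue: list[CrawlerTask] = field(default_factory=list)


class LoadBalancer:
    """Turns crawling parameters into crawler tasks and hands them to work servers."""

    def __init__(self, workers: list[WorkServer]):
        self.workers = workers
        self._ids = count(1)

    def submit(self, params: CrawlParams) -> WorkServer:
        task = CrawlerTask(next(self._ids), params)
        # Assumed policy: pick the work server with the shortest queue
        # (the first such server wins a tie).
        worker = min(self.workers, key=lambda w: len(w.queue))
        worker.queue.append(task)
        return worker


workers = [WorkServer("work-1"), WorkServer("work-2")]
balancer = LoadBalancer(workers)
for page in range(3):
    balancer.submit(CrawlParams(f"https://example.com/list?page={page}", ["title"]))
```

With two work servers and three submissions, the first and third tasks land on `work-1` and the second on `work-2`. Any other policy (round robin, weighted, and so on) could be substituted without changing the client.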

The cluster server 22 includes a plurality of work servers 221, and the load balancing server 21 is further configured to generate prompt information when the number of crawler tasks exceeds a preset number, where the prompt information includes information prompting an increase in the number of work servers 221. In this way, as crawler tasks gradually increase, a reminder to add work servers is issued in good time, which relieves the crawling pressure on the cluster server.
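The threshold check described above might look like the sketch below. The function name `capacity_prompt` and the wording of the message are hypothetical; the source only requires that prompt information be generated when the task count exceeds the preset number.

```python
from typing import Optional


def capacity_prompt(pending_tasks: int, preset_limit: int, work_servers: int) -> Optional[str]:
    """Return prompt information once the task count exceeds the preset number."""
    if pending_tasks <= preset_limit:
        return None  # at or below the preset number: nothing to report
    return (
        f"{pending_tasks} crawler tasks exceed the preset number {preset_limit}; "
        f"consider adding work servers (currently {work_servers})."
    )
```

A caller would evaluate this each time a crawler task is generated and forward any non-empty result to an operator.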

The cluster server 22 is further configured to simulate an input operation of a real user by using the automated test tool when receiving the verification code input request of the target website 3.

In this embodiment, the cluster server 22 may simulate the input operation of a real user by means of Selenium (an automated testing tool), Firefox (a web browser), geckodriver (the driver for Firefox) and the like, for example to pass slider verification. When a scenario requires manual intervention, the manual input can be simulated by having Selenium invoke the browser kernel to open a headless browser directly and select page elements.
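For slider verification, one part of simulating a real user is producing a drag that does not move at constant speed. The sketch below computes per-step pixel offsets that start fast and slow down toward the end. The ease-out curve, the step count and the function name `drag_offsets` are assumptions made for illustration; the source does not describe how the drag is shaped.

```python
def drag_offsets(distance: int, steps: int = 20) -> list[int]:
    """Split a slider drag into per-step pixel offsets that start fast and slow down.

    The quadratic ease-out curve is an assumption; the source only says that
    a real user's input operation is simulated.
    """
    positions = []
    for i in range(1, steps + 1):
        t = i / steps
        eased = 1 - (1 - t) ** 2  # quadratic ease-out, rises from 0 to 1
        positions.append(round(distance * eased))
    # Convert absolute positions into the relative moves a driver would replay.
    return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]


offsets = drag_offsets(180)
```

The offsets always sum to the requested distance. With Selenium they could be replayed by calling `ActionChains.click_and_hold` on the slider element, then `move_by_offset(x, 0)` for each offset, then `release` and `perform`.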

The cluster server 22 is also used to connect to the target web site 3 via ADSL dial-up.

If the target website 3 monitors the IP (Internet Protocol) address of the accessing cluster server 22, a large number of accesses from the same IP address may be blocked by the target website 3. The crawler task therefore needs to be deployed on a work server 221 that supports ADSL dial-up, and a scheduled task that redials at fixed intervals switches the IP address within a certain time.
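The timed redial can be sketched as a small scheduler that redials once the current IP address has been in use for a set interval. The `redial` callable is a hypothetical hook standing in for whatever command hangs up and re-establishes the ADSL connection on the work server; the clock is injectable so that the logic can be exercised without waiting.

```python
import time
from typing import Callable


class RedialScheduler:
    """Triggers a redial whenever the current IP has been in use for too long."""

    def __init__(self, redial: Callable[[], str], interval_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._redial = redial          # hypothetical hook: redials and returns the new IP
        self._interval = interval_s
        self._clock = clock
        self.current_ip = redial()     # dial once at start-up
        self._dialed_at = clock()

    def ip_for_next_request(self) -> str:
        """Return the IP to crawl from, redialing first if the interval has elapsed."""
        if self._clock() - self._dialed_at >= self._interval:
            self.current_ip = self._redial()
            self._dialed_at = self._clock()
        return self.current_ip
```

A work server would call `ip_for_next_request()` before each batch of requests, so the switch happens between requests rather than in the middle of one.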

The cluster server 22 is further configured to modify the crawling parameters with a packet capture tool and to generate the crawler task according to the modified crawling parameters.

The request parameters are modified by Mitmproxy (a man-in-the-middle proxy tool) so that the characteristics of a non-manual request are removed or modified. Mitmproxy serves as the intermediate proxy server in forward proxy mode, so a request sent by the client 1 first passes through Mitmproxy for inspection. Mitmproxy modifies the content of the response returned by the target website, removes or modifies the characteristics of the non-manual request, generates new crawling parameters, and sends the new crawling parameters to the target website.
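As an illustration of removing or modifying the characteristics of a non-manual request, the sketch below rewrites request headers that commonly reveal automation. It covers only the request side. The marker strings, the replacement User-Agent and the default language header are assumed examples, not values taken from the source, and the function is written against a plain dictionary rather than any proxy API.

```python
# Substrings that commonly reveal an automated client (assumed examples).
AUTOMATION_MARKERS = ("HeadlessChrome", "python-requests", "Java/")
# Replacement User-Agent resembling an ordinary desktop browser (assumed value).
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0"


def sanitize_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the request headers with automation traits removed or replaced."""
    cleaned = dict(headers)
    ua = cleaned.get("User-Agent", "")
    if not ua or any(marker in ua for marker in AUTOMATION_MARKERS):
        cleaned["User-Agent"] = BROWSER_UA
    # Browsers normally send a language preference; many scripts do not.
    cleaned.setdefault("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8")
    return cleaned
```

In a Mitmproxy addon, a function of this kind could be called from the `request(flow)` event hook to update `flow.request.headers` before the request is forwarded.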

The client 1 is further configured to set a preset time, and send crawling parameters to the load balancing server 21 according to the preset time.
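Sending at a preset time reduces to computing how long the client must wait before it sends. The sketch below assumes that the preset time is a time of day that recurs daily; the source does not say whether it is a one-off moment or a recurring schedule, and the function name `seconds_until` is hypothetical.

```python
from datetime import datetime, time, timedelta


def seconds_until(preset: time, now: datetime) -> float:
    """Seconds from `now` until the next occurrence of the preset time of day."""
    run_at = datetime.combine(now.date(), preset)
    if run_at <= now:
        run_at += timedelta(days=1)  # today's slot has passed; use tomorrow's
    return (run_at - now).total_seconds()
```

The client could sleep for the returned number of seconds, or hand it to a timer, before sending the crawling parameters to the load balancing server 21.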

The client 1 is further configured to set a crawling mode, where the crawling mode includes web page element crawling and interface crawling, and the selected crawling mode is specified in the crawling parameters.

The cluster server 22 combines two processing logics: crawling the web page elements of the target website and crawling specific interfaces. For web page element crawling, the element code of the web page can be obtained directly through an address request tool such as HttpClient (a client programming toolkit supporting HTTP, the Hypertext Transfer Protocol) and then parsed with jsoup (an HTML (Hypertext Markup Language) parser for Java, an object-oriented programming language). For direct interface crawling, the JavaScript (an interpreted scripting language) file obtained from the target website is analyzed and then stored locally, where it is either run directly or translated into corresponding server-side code, and the crawling parameters are generated before the interface request is sent.
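Web page element crawling amounts to fetching the page source and extracting the elements of interest. The embodiment names HttpClient and jsoup for this; the sketch below substitutes Python's standard `html.parser` module purely for illustration and parses an inline sample page instead of fetching one. The class name `ClassTextExtractor`, the sample markup and the `hotel` class are hypothetical.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element carrying a given CSS class.

    Simplification: void tags written without a closing slash inside a
    matching element are not handled.
    """

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.results: list[str] = []
        self._depth = 0                 # greater than 0 while inside a matching element
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self.css_class in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:        # the matching element has just closed
                self.results.append("".join(self._buffer).strip())
                self._buffer.clear()

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


page = '<ul><li class="hotel">Bund <b>Hotel</b></li><li class="hotel">Lake Inn</li></ul>'
extractor = ClassTextExtractor("hotel")
extractor.feed(page)
extractor.close()
```

After parsing, `extractor.results` holds the text of the two list items, with nested markup flattened into plain text.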

The cluster server 22 crawls the data of the target website efficiently by supporting both crawling modes. Whether the target website simply returns content or requires a complex request containing header information, cookies (data stored on the user's computer to maintain state information) and the like, the cluster server can handle it, so the user does not need to care about the specific crawling logic and repeated development work is reduced.

The cluster server 22 is configured to generate process data according to the crawler task and send the process data to the monitoring server 23, and the monitoring server 23 is configured to display the process data.

The monitoring server 23 is further configured to send a control command to the cluster server 22, and the cluster server 22 is configured to start or stop the crawler task according to the control command.
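The interaction between the monitoring server 23 and a work server can be sketched as a small state holder that accepts control commands and reports process data. The command strings `"start"` and `"stop"`, the class name `TaskRunner` and the fields of the process data are assumptions; the source specifies only that tasks are started or stopped by control command and that process data is displayed.

```python
from dataclasses import dataclass


@dataclass
class TaskRunner:
    """State a work server keeps for one crawler task (names are hypothetical)."""
    task_id: int
    running: bool = False
    pages_done: int = 0

    def handle_command(self, command: str) -> None:
        """Apply a control command sent by the monitoring server."""
        if command == "start":
            self.running = True
        elif command == "stop":
            self.running = False
        else:
            raise ValueError(f"unknown control command: {command!r}")

    def crawl_one_page(self) -> None:
        """Stand-in for real crawling work; only counts pages while running."""
        if self.running:
            self.pages_done += 1

    def process_data(self) -> dict:
        """Process data reported to the monitoring server for display."""
        return {"task_id": self.task_id, "running": self.running,
                "pages_done": self.pages_done}
```

A stopped task ignores further crawl calls, so the process data shown by the monitoring server reflects only work done while the task was running.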

According to this embodiment, the client sets the crawling parameters according to the preset crawling target and the access requirements of the target website, so the target data of the target website can be crawled in a targeted manner and crawling speed is improved. The load balancing server generates the crawler task according to the crawling parameters and distributes it to the cluster server, and the cluster server crawls the target data of the target website according to the crawler task, which yields a unified crawler system. The user only needs to set the crawling parameters at the client, and the generated crawler tasks are processed uniformly by the cluster server, so the system is easy to maintain, development time is shortened, duplicated development work is avoided, and development cost is reduced.

Example 2

This embodiment provides a crawler method which, as shown in FIG. 2, includes:

Step 201, the client generates crawling parameters according to access requirements of a preset crawling target and a target website, and sends the crawling parameters to the load balancing server.

The server side is a service cluster based on Web Service.

The crawling parameters can be sent to the load balancing server immediately, or the client can set a preset time and then send the crawling parameters to the load balancing server at that preset time.

The crawling mode can be set through the client. The crawling mode comprises web page element crawling and/or interface crawling, and the crawling parameters include the selected crawling mode.
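One way to carry the crawling mode and the optional preset time inside the crawling parameters is a small serialized message, as sketched below. The JSON layout, the key names and the mode identifiers `"element"` and `"interface"` are assumptions made for illustration; the source does not define a wire format.

```python
import json
from typing import Optional

# Assumed identifiers for web page element crawling and interface crawling.
ALLOWED_MODES = {"element", "interface"}


def build_crawl_message(target_url: str, modes: list[str],
                        send_at: Optional[str] = None) -> str:
    """Serialize crawling parameters; `send_at=None` stands for immediate sending."""
    unknown = set(modes) - ALLOWED_MODES
    if not modes or unknown:
        raise ValueError(f"crawling mode must be drawn from {sorted(ALLOWED_MODES)}")
    return json.dumps({
        "target_url": target_url,
        "modes": sorted(set(modes)),   # deduplicated, stable order
        "send_at": send_at,
    })
```

Rejecting unknown modes at the client keeps malformed parameters from ever reaching the load balancing server.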

The cluster server combines two processing logics: crawling the web page elements of the target website and crawling specific interfaces. For web page element crawling, the element code of the web page can be obtained directly through an address request tool such as HttpClient, which supports the HTTP protocol, and then parsed with the jsoup tool. For direct interface crawling, the JavaScript file obtained from the target website is analyzed and then stored locally, where it is either run directly or translated into corresponding server-side code, and the crawling parameters are generated before the interface request is sent.

The cluster server crawls the data of the target website efficiently by supporting both crawling modes. Whether the target website simply returns content or requires a complex request containing header information, cookies and the like, the cluster server can handle it, so the user does not need to care about the specific crawling logic and repeated development work is reduced.

Step 202, the load balancing server generates a crawler task according to the crawling parameters and distributes the crawler task to the cluster server.

Wherein the cluster server comprises a plurality of working servers.

Step 203, the load balancing server generates prompt information when the number of crawler tasks exceeds the preset number, where the prompt information includes information prompting an increase in the number of work servers.

Step 204, the cluster server connects to the target website through ADSL dial-up.

If the target website monitors the IP address of the accessing cluster server, a large number of accesses from the same IP address may be blocked by the target website. The crawler task is therefore deployed on a work server that supports ADSL dial-up, and a scheduled task that redials at fixed intervals switches the IP address within a certain time.

Step 205, the cluster server crawls target data of the target website according to the crawler task.

As shown in FIG. 3, the specific crawling process includes:

Step 2050, determining whether the cluster server has received a verification code input request from the target website; if so, executing step 2051.

Step 2051, simulating an input operation of a real user by using an automated test tool.

In this embodiment, the cluster server may simulate the operation of a real user by means of Selenium, Firefox, geckodriver and the like, for example to pass slider verification. When a scenario requires manual intervention, the manual operation can be simulated by having Selenium open a headless browser through the browser kernel and select page elements.

Step 2052, the cluster server modifies the crawling parameters by using a packet capture tool, and generates a crawler task according to the modified crawling parameters.

The request parameters are modified by Mitmproxy so that the characteristics of a non-manual request are removed or modified. Mitmproxy serves as the intermediate proxy server in forward proxy mode, so a request sent by the client first passes through Mitmproxy for inspection. Mitmproxy modifies the content of the response returned by the target website, removes or modifies the characteristics of the non-manual request, generates new crawling parameters, and sends the new crawling parameters to the target website.

Step 2053, the cluster server generates process data according to the crawler task, and sends the process data to the monitoring server.

Step 2054, the monitoring server displays the process data.

Step 2055, the monitoring server sends a control command to the cluster server.

Step 2056, the cluster server starts or stops the crawler task according to the control command.

While specific embodiments of the invention have been described above, those skilled in the art will appreciate that they are given by way of example only, and that the scope of the invention is defined by the appended claims. Those skilled in the art may make various changes and modifications to these embodiments without departing from the principles and spirit of the invention, and all such changes and modifications fall within the scope of the invention.

Claims

1. A crawler system, characterized by comprising a client and a server side; the server side comprises a load balancing server and a cluster server;

the cluster server is used for crawling target data of the target website according to the crawler task; the cluster server is further used for modifying the crawling parameters by utilizing a packet capture tool, removing or modifying the characteristics of a non-manual request to obtain the modified crawling parameters, and generating the crawler task according to the modified crawling parameters.

2. The crawler system of claim 1, wherein the cluster server includes a plurality of work servers, the load balancing server is further configured to generate prompt information when the number of crawler tasks exceeds a preset number, and the prompt information includes information prompting an increase in the number of work servers.

3. The crawler system of claim 1, wherein the cluster server is further configured to simulate input operations of a real user with an automated testing tool upon receiving a verification code input request of the target website;

and/or the cluster server is further used for connecting the target website through ADSL dialing.

4. The crawler system of claim 1, wherein the client is further configured to set a preset time, and send the crawling parameters to the load balancing server according to the preset time;

5. The crawler system of claim 1, wherein the server further comprises a monitoring server, the cluster server is configured to generate process data according to the crawler task and send the process data to the monitoring server, and the monitoring server is configured to display the process data;

6. A crawler method, the crawler method comprising:

the cluster server crawls target data of the target website according to the crawler task; the cluster server further modifies the crawling parameters by utilizing a packet capture tool, removes or modifies the characteristics of a non-manual request to obtain the modified crawling parameters, and generates the crawler task according to the modified crawling parameters.

7. The crawler method of claim 6, wherein the cluster server comprises a plurality of work servers, the step of the load balancing server generating crawler tasks according to the crawling parameters further comprising:

8. The crawler method of claim 6, wherein crawling the target data of the target website by the cluster server according to the crawler task comprises:

the cluster server is connected with the target website through ADSL dialing.

9. The crawler method of claim 6, wherein the step of sending the crawling parameters to a load balancing server comprises: the client sets preset time and sends the crawling parameters to the load balancing server according to the preset time;

10. The crawler method of claim 6, wherein the step of the cluster server crawling the target data of the target website according to the crawler task further comprises: