CN111090798A

CN111090798A - Webpage data crawling method and system

Info

Publication number: CN111090798A
Application number: CN201911242411.0A
Authority: CN
Inventors: 林浩; 劳永聪; 郑志勇
Original assignee: Guangzhou Tiantu Network Technology Co Ltd
Current assignee: Guangzhou Tiantu Network Technology Co Ltd
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2020-05-01
Anticipated expiration: 2039-12-06
Also published as: CN111090798B

Abstract

The application provides a web page data crawling method and system, and relates to the technical field of data crawling. The method is applied to a web page crawling system comprising a first server and at least one second server. The first server sends an image file to the at least one second server, the image file comprising a Linux system, a crawler system and parameters to be crawled. The second server parses the image file to start the crawler system and obtain the parameters to be crawled, accesses a third server according to those parameters, crawls the corresponding target data from the third server, and finally sends the target data to the first server. The method and system reduce the performance overhead of the server, simplify server deployment, and remove regional restrictions on deployment.

Description

Webpage data crawling method and system

Technical Field

The application relates to the technical field of data crawling, in particular to a webpage data crawling method and system.

Background

With the development of the internet industry, there are more and more internet companies. In the course of operation, an internet company generally needs to crawl a large amount of data from servers. For example, it may crawl data from certain web pages to find possible bugs, or test the loading performance and file loss of its own web pages around the world by crawling the DNS resolution time, TCP connection time and download time of those pages and checking whether any resources are missing, and then maintain its own web pages accordingly.

At present, when data needs to be crawled, a server must actively crawl the data itself, which imposes a large performance overhead on that server and degrades user experience. Moreover, when the amount of data to be crawled is large, a large number of servers must be deployed; server deployment is relatively troublesome and subject to regional limitations, which makes data crawling cumbersome.

Disclosure of Invention

The application aims to provide a web page data crawling method and system to solve the problems in the prior art that, when data crawling is carried out, the performance overhead of the server is high and server deployment is troublesome.

In order to achieve the above purpose, the embodiments of the present application employ the following technical solutions:

In one aspect, an embodiment of the present application provides a web page data crawling method, which is applied to a web page crawling system, where the system includes a first server and at least one second server, the first server is communicatively connected to the at least one second server, and each second server is further configured to be communicatively connected to a plurality of third servers; the method comprises:

the first server sends an image file to the at least one second server, wherein the image file comprises a Linux system, a crawler system and parameters to be crawled;

the second server parses the image file to start the crawler system and acquire the parameters to be crawled;

the second server accesses the third server according to the parameters to be crawled and crawls target data corresponding to the parameters to be crawled from the third server;

the second server sends the target data to the first server.

In another aspect, the present application provides a web page data crawling system, including a first server and at least one second server, the first server being communicatively connected to the at least one second server, each of the second servers being further configured to be communicatively connected to a plurality of third servers; wherein,

the first server is used for sending an image file to the at least one second server, wherein the image file comprises a Linux system, a crawler system and parameters to be crawled;

the second server is used for parsing the image file so as to start the crawler system and acquire the parameters to be crawled;

the second server is also used for accessing the third server according to the parameter to be crawled and crawling target data corresponding to the parameter to be crawled from the third server;

the second server is further configured to send the target data to the first server.

Compared with the prior art, the method has the following beneficial effects:

The application provides a web page data crawling method applied to a web page crawling system. The system comprises a first server and at least one second server; the first server is in communication connection with the at least one second server, and each second server is also in communication connection with a plurality of third servers. The first server sends an image file to the second server, the image file comprising a Linux system, a crawler system and parameters to be crawled. The second server parses the image file to start the crawler system and acquire the parameters to be crawled, accesses a third server according to those parameters, crawls the corresponding target data from the third server, and finally sends the target data to the first server. On the one hand, when data needs to be crawled, the first server only sends an instruction to the second server and the second server performs the crawling, so the performance overhead of the first server is reduced and user experience is improved. On the other hand, the second servers can be provided by different server suppliers all over the world, so deployment is not restricted by region and data can be crawled worldwide.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and it will be apparent to those skilled in the art that other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is an interaction diagram of a web page data crawling system according to an embodiment of the present application.

Fig. 2 is a schematic block diagram of a first server and a second server according to an embodiment of the present disclosure.

Fig. 3 is a first schematic flowchart of a web page data crawling method according to an embodiment of the present disclosure.

Fig. 4 is a second schematic flowchart of a web page data crawling method according to an embodiment of the present application.

Fig. 5 is a third schematic flowchart of a web page data crawling method according to an embodiment of the present application.

Fig. 6 is a fourth schematic flowchart of a web page data crawling method according to an embodiment of the present application.

Fig. 7 is a fifth schematic flowchart of a web page data crawling method according to an embodiment of the present application.

In the figure: 100-web page data crawling system; 101-a memory; 102-a processor; 103-a communication interface; 110-a first server; 120-a second server.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

In the description of the present application, it should be noted that the terms "upper", "lower", "inner", "outer", and the like indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings or orientations or positional relationships conventionally found in use of products of the application, and are used only for convenience in describing the present application and for simplification of description, but do not indicate or imply that the referred devices or elements must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present application.

In the description of the present application, it is also to be noted that, unless otherwise explicitly specified or limited, the terms "disposed" and "connected" are to be interpreted broadly, e.g., as being either fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art.

Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.

As described in the background, at present, when data needs to be crawled, a server must actively crawl the data itself, which imposes a large performance overhead on that server and degrades user experience. Moreover, when the amount of data to be crawled is large, a large number of servers must be deployed; server deployment is relatively troublesome and subject to regional limitations, which makes data crawling cumbersome.

In view of this, the present application provides a web page data crawling method in which at least one second server crawls the data by means of an image file. This leaves the performance of the first server unaffected and avoids the deployment effort and regional limitations that arise when the first server itself must be deployed for crawling.

Referring to fig. 1, the web page data crawling method provided by the present application is applied to a web page data crawling system 100, which includes a first server 110 and at least one second server 120. The first server 110 is in communication connection with the at least one second server 120, and each second server 120 is further in communication connection with a plurality of third servers. For example, the first server 110 may be a server of an internet company and the second server 120 may be a server of an intermediary provider, such as an Alibaba Cloud server or an Amazon cloud server, so that the first server 110 can crawl data through the second server 120.

For example, the third servers include a server A and a server B located in different regions, e.g. server A in China and server B in the United States. In the prior art, when the first server 110 needs to crawl data from server A and server B, for example page layout data or user access data of the websites corresponding to server A and server B, first servers 110 must be deployed in both China and the United States, and the first servers 110 themselves crawl the data of server A and server B.

Therefore, the process of crawling the web page data in the prior art inevitably consumes the performance of the first server 110, and requires more first servers 110 to be deployed, and the deployment area is limited to a certain extent.

The following describes an exemplary method for crawling web page data provided by the present application, with the first server 110 and the second server 120 as execution subjects.

Referring to fig. 2, the first server 110 and the second server 120 include a memory 101, a processor 102, and a communication interface 103, and the memory 101, the processor 102, and the communication interface 103 are electrically connected to each other directly or indirectly to implement data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 101 may be used for storing software programs and modules, such as program instructions/modules corresponding to the web page data crawling method provided in the embodiments of the present application, and the processor 102 executes the software programs and modules stored in the memory 101, so as to execute various functional applications and data processing. The communication interface 103 may be used for communicating signaling or data with other node devices.

The Memory 101 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Read-Only Memory (EPROM), an electrically Erasable Read-Only Memory (EEPROM), and the like.

The processor 102 may be an integrated circuit chip having signal processing capabilities. The processor 102 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.

It is to be understood that the structure shown in fig. 2 is merely illustrative, and the first server 110 and the second server 120 may also include more or fewer components than shown in fig. 2, or have a different configuration than shown in fig. 2. The components shown in fig. 2 may be implemented in hardware, software, or a combination thereof.

Referring to fig. 3, the method for crawling web page data provided by the present application includes:

S102, the first server sends the image file to at least one second server, wherein the image file comprises a Linux system, a crawler system and parameters to be crawled.

S104, the second server parses the image file to start the crawler system and acquire the parameters to be crawled.

S106, the second server accesses the third server according to the parameters to be crawled, and crawls target data corresponding to the parameters to be crawled from the third server.

S108, the second server sends the target data to the first server 110.

It can be understood that, when data crawling needs to be performed, a worker first makes an image file comprising a Linux system, a crawler system and parameters to be crawled. For example, the parameters to be crawled include the DNS resolution time, TCP connection time and download time of a web page, and whether any resources are missing. The first server 110 then sends the image file to the second server 120.
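As an illustration of how parameters such as the DNS resolution time and the TCP connection time might be measured on the second server 120, the following is a minimal Python sketch using only the standard library. The function names are hypothetical and are not taken from the application, which does not specify how the crawler system performs these measurements.

```python
import socket
import time


def measure_dns_ms(hostname: str) -> float:
    """Time a single name lookup for `hostname`, in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return (time.perf_counter() - start) * 1000.0


def measure_tcp_connect_ms(host: str, port: int, timeout: float = 5.0) -> float:
    """Time a TCP handshake to (host, port), in milliseconds."""
    start = time.perf_counter()
    # create_connection returns once the connection is established.
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0
```

A crawler could call these two functions for each target and report the values alongside the download time. Cached lookups would make the DNS figure smaller on repeated calls, so a real measurement would need to account for resolver caching.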

The image file described herein is similar to a RAR or ZIP archive in that it packs a specific series of files, such as an operating system or games, into a single file in a certain format, which makes it convenient for the second server 120 to download and use. Its most important characteristic is that it can be recognized by specific software and burned directly onto an optical disc. An image file in the general sense can be extended to contain more information, such as system files, boot files and partition table information, so that it may contain all the information of a partition or even an entire hard disk. Ordinarily, burning software can write the content of a supported image file directly onto an optical disc. In effect, the image file is an "extract" of the optical disc.

After receiving the image file sent by the first server 110, the second server 120 parses the image file to obtain the Linux system, the crawler system and the parameters to be crawled, starts the crawler system, accesses the third server according to the parameters to be crawled, and crawls the corresponding target data from the third server.

After the second server 120 crawls the target data from the third server, it sends the target data back to the first server 110, so that the first server 110 obtains the target data.
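The flow of steps S102 to S108 can be sketched in memory as follows. This is only an illustrative model under stated assumptions: the class names, the `target` key and the injected `fetch` callable are hypothetical, the image file is reduced to its parameters, and no real network transfer or image parsing takes place.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ImageFile:
    """Stand-in for the image file: only the parameters to be crawled are modeled."""
    params: Dict[str, str]


@dataclass
class SecondServer:
    region: str
    fetch: Callable[[str], str]  # hypothetical: retrieves data from a third server

    def run(self, image: ImageFile) -> Dict[str, str]:
        # S104: "parse" the image and read the parameters to be crawled.
        target = image.params["target"]
        # S106: access the third server and crawl the target data.
        data = self.fetch(target)
        # S108: hand the target data back to the first server.
        return {"region": self.region, "target": target, "data": data}


@dataclass
class FirstServer:
    results: List[Dict[str, str]] = field(default_factory=list)

    def dispatch(self, image: ImageFile, workers: List[SecondServer]) -> None:
        # S102: send the image to each second server and collect what comes back.
        for worker in workers:
            self.results.append(worker.run(image))
```

In this model the first server never calls `fetch` itself, which mirrors the point of the method: the crawling work is carried out entirely by the second servers.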

On the one hand, when data needs to be crawled, the first server only sends an instruction to the second server and the second server performs the crawling, so the performance overhead of the first server is reduced and user experience is improved. On the other hand, the second servers can be provided by different server suppliers all over the world, so deployment is not restricted by region and data can be crawled worldwide. For example, when data from different regions needs to be crawled, a server platform can be built on Alibaba Cloud, Amazon or the like, and the internet company can then crawl the data through second servers in different regions.

As a possible implementation manner of the present application, the image file may be a Docker image file that includes the parameters to be crawled.

After receiving the Docker image file, the second server 120 can start the image using Docker. Once startup is complete, the crawler system starts automatically, obtains the parameters to be crawled, accesses the specified third server according to those parameters, and obtains the corresponding target data from the third server.
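A minimal sketch of the Docker variant is given below, assuming the parameters to be crawled are shipped inside the image as a JSON file. The application does not state how the parameters are stored, so the JSON layout, the file path passed by the caller and both function names are assumptions made for illustration. The first function only builds a `docker run -d` command line and does not execute it.

```python
import json
from typing import Dict, List


def build_docker_run(image: str) -> List[str]:
    """Build (but do not execute) a command that starts the image in the background."""
    return ["docker", "run", "-d", image]


def load_params(path: str) -> Dict[str, str]:
    """Crawler entry-point helper: read the parameters shipped inside the image."""
    with open(path, "r", encoding="utf-8") as handle:
        params = json.load(handle)
    # Assumption: the parameter file holds a single JSON object.
    if not isinstance(params, dict):
        raise ValueError("parameter file must contain a JSON object")
    return params
```

The second server would run the command returned by `build_docker_run`, and the crawler inside the container would call `load_params` at startup before accessing any third server.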

As an implementation manner, referring to fig. 4, before S106, the method further includes:

S105-1, the first server sends start-crawling or end-crawling information to the second server.

S105-2, the second server starts or stops crawling the target data from the third server according to the start-crawling or end-crawling information.

That is, the first server 110 can send a corresponding instruction to the second server 120 to control the second server 120 to start or stop crawling data. Of course, the second server 120 may also crawl data without an instruction from the first server 110; for example, the second server 120 may crawl data at a fixed period, such as once per hour.

In addition, the instruction may be sent to a plurality of second servers 120 or to a single second server 120, which is not limited in this application. For example, when page data of the Guangzhou region needs to be crawled, the first server 110 may send a start instruction to the second server 120 corresponding to the Guangzhou region, thereby controlling that server to crawl the data.
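The start and end control described above might look like the following on the second server side. This is a sketch only: the message values "start" and "stop", the class name and the `tick` method (meant to be called once per crawl period by some external scheduler) are hypothetical and not defined by the application.

```python
from typing import Callable, List


class CrawlController:
    """Tracks whether crawling is enabled and runs one crawl round per tick."""

    def __init__(self, crawl_once: Callable[[], str]) -> None:
        self._crawl_once = crawl_once
        self.active = False
        self.collected: List[str] = []

    def handle(self, message: str) -> None:
        # "start" / "stop" are assumed message values sent by the first server.
        if message == "start":
            self.active = True
        elif message == "stop":
            self.active = False
        else:
            raise ValueError(f"unknown control message: {message!r}")

    def tick(self) -> bool:
        """Called once per period (for example hourly); crawls only while active."""
        if not self.active:
            return False
        self.collected.append(self._crawl_once())
        return True
```

Sending the start instruction only to the controller of the Guangzhou second server, for instance, would restrict crawling to that region while the other second servers stay idle.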

As a possible implementation manner, the parameters to be crawled may include a target IP address, and S106 includes:

the second server determines a target third server according to the IP address and crawls target data from the target third server.

That is, the first server 110 can specify an IP address, so that the second server 120 crawls the web page data of the corresponding website according to that IP address.

Of course, in some other embodiments, the parameters to be crawled may not include a target IP address, in which case the second server 120 may crawl data randomly, for example crawling data of web page A in a first time period and data of web page B in a second time period.
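The two cases above (a target IP address is given, or it is absent) can be sketched as a single selection function. The `target_ip` key and the function name are hypothetical, and rotating through the fallback pages by period index is a deterministic stand-in for the "random" choice mentioned in the text, chosen here only so the behaviour is easy to follow.

```python
import ipaddress
from typing import Dict, Optional, Sequence


def choose_target(params: Dict[str, str],
                  fallback_pages: Sequence[str],
                  period_index: int) -> str:
    """Pick what to crawl in the given time period."""
    target_ip: Optional[str] = params.get("target_ip")  # hypothetical key
    if target_ip is not None:
        # ip_address raises ValueError if the string is not a valid IPv4/IPv6 address.
        return str(ipaddress.ip_address(target_ip))
    if not fallback_pages:
        raise ValueError("no target IP and no fallback pages")
    # No target specified: page A in the first period, page B in the second, and so on.
    return fallback_pages[period_index % len(fallback_pages)]
```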

Further, referring to fig. 5, the method further includes:

S107-1, the first server sends parameter change information to the second server.

S107-2, the second server changes the parameters to be crawled according to the parameter change information.

In this application, while the second server 120 is crawling data, the first server 110 can also send it parameter change information to change the parameters to be crawled in the image file, so that the second server 120 can crawl different data during the crawling process.

For example, suppose the parameters to be crawled in the image file sent by the first server 110 include parameter a, so that the second server 120 crawls data a from the third server. If, during operation, staff need different data for reference, parameter change information can be sent to the second server 120. If the parameter change information says that parameter a is changed to parameter b, the second server 120 changes the parameters to be crawled in the image file after receiving it, and crawls data according to the changed parameters.

As another possible implementation manner of the present application, when the parameters to be crawled need to be changed, the parameters may be changed on the first server 110 and then synchronized directly to the second server 120, or a new image file may be sent to the second server 120 so that the second server 120 crawls data according to the new image file, which is not limited in this application.
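One way to apply parameter change information while crawling is in progress is to keep the parameters behind a lock and let the crawl loop take a snapshot before each round. The sketch below assumes this design; the class name, the dictionary representation of the parameters and the version counter are illustrative and do not come from the application.

```python
import threading
from typing import Dict


class ParamStore:
    """Holds the current parameters to be crawled on the second server."""

    def __init__(self, initial: Dict[str, str]) -> None:
        self._lock = threading.Lock()
        self._params = dict(initial)
        self.version = 0

    def snapshot(self) -> Dict[str, str]:
        # The crawl loop copies the parameters so a round sees a consistent set.
        with self._lock:
            return dict(self._params)

    def apply_change(self, change: Dict[str, str]) -> int:
        # Parameter change information overrides the matching keys.
        with self._lock:
            self._params.update(change)
            self.version += 1
            return self.version
```

For the example in the text, a store created with parameter a would return parameter b from `snapshot` after the change information is applied, so the next crawl round uses the new parameter without restarting the crawler.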

Meanwhile, referring to fig. 6, after S108, the method further includes:

S110, the first server stores the target data in a database.

S112, when the first server receives a call instruction, the first server extracts the target data from the database, parses it, and displays it.

That is, in the present application, the first server 110 is further connected to a database, and after the first server 110 obtains the information fed back by the second server 120, the first server 110 stores the target data in the database. When the user needs to view and analyze the target data, the user may send a call instruction to the first server 110, and the first server 110 calls the target data from the database to analyze and display the target data.

It should be noted that the format of the data in the present application is determined by the display system. For example, if the display system recognizes data in the HAR (HTTP Archive) JSON format, then after crawling the target data the second server 120 packages the data in the HAR JSON format and sends it to the first server 110, and the first server 110 stores the data directly in the database.
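The packaging step could be sketched as below. The output follows the general shape of a HAR log (a `log` object with `version`, `creator` and `entries`), but it is deliberately simplified: a complete HAR 1.2 entry carries further fields, such as a full request and response description, that are omitted here. The function name and the creator name are hypothetical.

```python
import json
from typing import Dict


def to_har(url: str, started_iso: str, timings_ms: Dict[str, float]) -> str:
    """Wrap one measurement in a simplified HAR-style JSON document."""
    entry = {
        "startedDateTime": started_iso,
        # Negative timing values mean "not applicable" and are left out of the total.
        "time": sum(v for v in timings_ms.values() if v >= 0),
        "request": {"method": "GET", "url": url},
        "timings": timings_ms,
    }
    har = {
        "log": {
            "version": "1.2",
            "creator": {"name": "crawler-sketch", "version": "0.0"},
            "entries": [entry],
        }
    }
    return json.dumps(har)
```

Because the second server already emits the format the display system expects, the first server can store the returned string as it is, without any conversion step.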

As a possible implementation manner, referring to fig. 7, before S110, the method further includes:

S109, the second server records the region information and the time information of crawling the target data, and sends the region information and the time information to the first server 110.

S110 then comprises: the first server 110 generates identification information of the target data according to the region information and the time information, and stores the identification information and the target data in the database.

The amount of data acquired is large, and the crawled data originates from different regions and different time periods. Therefore, to facilitate later management of the data, in the present application the second server 120 records the region information and the time information of crawling the target data, and sends them to the first server 110. That is, when sending the target data to the first server 110, the second server 120 also sends the region and the time at which the target data was crawled. For example, if the second server 120 sends "Guangzhou", "0:00 on January 1" and the target data to the first server 110, this indicates that the second server 120 crawled the target data in Guangzhou at 0:00 on January 1.

When the first server 110 receives this information, it generates the identification information of the target data according to the region information and the time information, and stores the identification information and the target data in the database. When the data is later called, the region and the time corresponding to each piece of data can then be distinguished more conveniently.
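Storing the target data together with identification information derived from region and time can be sketched with an in-memory SQLite database. The "region@time" identifier layout, the table schema and all function names are assumptions made for illustration; the application does not specify the database or the form of the identification information.

```python
import sqlite3
from typing import List, Tuple


def make_identifier(region: str, crawled_at: str) -> str:
    # Assumed layout: region and crawl time joined by "@".
    return f"{region}@{crawled_at}"


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE crawl ("
        "ident TEXT PRIMARY KEY, region TEXT, crawled_at TEXT, payload TEXT)"
    )
    return conn


def store(conn: sqlite3.Connection, region: str, crawled_at: str, payload: str) -> str:
    """S110: generate the identifier and save it with the target data."""
    ident = make_identifier(region, crawled_at)
    conn.execute(
        "INSERT INTO crawl VALUES (?, ?, ?, ?)",
        (ident, region, crawled_at, payload),
    )
    conn.commit()
    return ident


def by_region(conn: sqlite3.Connection, region: str) -> List[Tuple[str, str]]:
    """S112: fetch the stored data for one region, oldest first."""
    rows = conn.execute(
        "SELECT ident, payload FROM crawl WHERE region = ? ORDER BY crawled_at",
        (region,),
    )
    return rows.fetchall()
```

Keeping region and time as separate columns in addition to the identifier lets a call instruction filter by either one, which is the kind of later distinction the identification information is meant to support.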

Second embodiment

Referring to fig. 1 again, the embodiment of the present application further provides a web page data crawling system 100, which includes a first server 110 and at least one second server 120, wherein the first server 110 is communicatively connected to the at least one second server 120, and each second server 120 is further configured to be communicatively connected to a plurality of third servers.

The first server 110 is configured to send an image file to at least one second server 120, where the image file includes a Linux system, a crawler system, and parameters to be crawled; the second server 120 is configured to parse the image file to start the crawler system and obtain the parameters to be crawled; the second server 120 is further configured to access a third server according to the parameters to be crawled, and crawl target data corresponding to the parameters to be crawled from the third server.

Further, the first server 110 is also used for storing the target data in a database; when the first server 110 receives the call instruction, the first server 110 is further configured to extract and parse the target data from the database, and display the data.

Further, the second server 120 is further configured to record area information and time information of the crawling target data, and send the area information and the time information to the first server 110; the step of the first server 110 further storing the target data in the database includes: the first server 110 is further configured to generate identification information of the target data according to the area information and the time information, and store the identification information and the target data in a database.

In summary, the present application provides a web page data crawling method applied to a web page crawling system. The system includes a first server and at least one second server; the first server is in communication connection with the at least one second server, and each second server is further in communication connection with a plurality of third servers. The first server sends an image file to the second server, the image file comprising a Linux system, a crawler system and parameters to be crawled. The second server parses the image file to start the crawler system and acquire the parameters to be crawled, accesses a third server according to those parameters, crawls the corresponding target data from the third server, and finally sends the target data to the first server. On the one hand, when data needs to be crawled, the first server only sends an instruction to the second server and the second server performs the crawling, so the performance overhead of the first server is reduced and user experience is improved. On the other hand, the second servers can be provided by different server suppliers all over the world, so deployment is not restricted by region and data can be crawled worldwide.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. A web page data crawling method, applied to a web page crawling system, wherein the system comprises a first server and at least one second server, the first server is in communication connection with the at least one second server, and each second server is also in communication connection with a plurality of third servers;

the second server sends the target data to the first server.

2. The web page data crawling method according to claim 1, wherein after the step of sending the target data to the first server by the second server, the method further comprises:

the first server stores the target data in a database;

and when the first server receives a calling instruction, the first server extracts and analyzes the target data from the database, and displays the data.

3. The web page data crawling method according to claim 2, wherein before the step of the first server storing the target data in a database, the method comprises:

the second server records region information and time information of crawling of the target data and sends the region information, the time information and the target data to the first server;

the step of the first server storing the target data in a database comprises:

and the first server generates identification information of the target data according to the area information and the time information, and stores the identification information and the target data into the database.

4. The web page data crawling method according to claim 1, wherein after the step of sending the image file from the first server to the second server, the method further comprises:

the first server sends parameter change information to the second server;

and the second server changes the parameters to be crawled according to the parameter change information.

5. The web page data crawling method according to claim 1, wherein the image file comprises a docker image file.

6. A method for web page data crawling according to claim 1, characterized in that the method further comprises:

the first server sends start or end crawling information to the second server;

and the second server crawls target data of the third server according to the starting or ending crawling information or finishes crawling the target data.

7. The web page data crawling method according to claim 1, wherein the parameter to be crawled includes a target IP address, the second server accesses the third server according to the parameter to be crawled, and the step of crawling the target data corresponding to the parameter to be crawled from the third server includes:

and the second server determines a target third server according to the IP address and crawls target data of the target third server.

8. A web page data crawling system, comprising a first server and at least one second server, wherein the first server is in communication connection with the at least one second server, and each second server is further used for being in communication connection with a plurality of third servers; wherein,

the first server is used for sending an image file to the at least one second server, wherein the image file comprises a linux system, a crawler system and parameters to be crawled;

9. The web page data crawling system of claim 8, wherein the first server is further configured to store the target data in a database;

when the first server receives a calling instruction, the first server is also used for extracting the target data from the database, analyzing the target data and displaying the target data.

10. The web page data crawling system according to claim 9, wherein the second server is further configured to record region information and time information for crawling the target data, and send the region information, the time information and the target data to the first server;

the step of the first server further storing the target data in a database comprises:

the first server is further configured to generate identification information of the target data according to the area information and the time information, and store the identification information and the target data in the database.