CN108011931B

CN108011931B - Web data acquisition method and Web data acquisition system

Info

Publication number: CN108011931B
Application number: CN201711174715.9A
Authority: CN
Inventors: 韦立鹏
Original assignee: Yonyou Fintech Information Technology Co ltd
Current assignee: Yonyou Fintech Information Technology Co ltd
Priority date: 2017-11-22
Filing date: 2017-11-22
Publication date: 2021-06-11
Anticipated expiration: 2037-11-22
Also published as: CN108011931A

Abstract

The invention provides a Web data acquisition method, a Web data acquisition system, computer equipment and a computer-readable storage medium. The Web data acquisition method comprises the following steps: deploying a crawler environment on a virtual machine to be added; acquiring the IP address of the virtual machine to be added, and adding the IP address to the configuration of the master node; controlling the host to update the run script so that the virtual machine to be added and the already-added virtual machines acquire the latest running code; and, when a task starting instruction for the virtual machine to be added is received, executing the task starting instruction according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition. When data sources increase substantially, the invention achieves horizontal scaling of Web data crawling and storage, improves the efficiency of crawling and storing data, and completes data acquisition within a limited time.

Description

Web data acquisition method and Web data acquisition system

Technical Field

The invention relates to the technical field of Web data acquisition, in particular to a Web data acquisition method, a Web data acquisition system, computer equipment and a computer readable storage medium.

Background

Whether for data analysis or for public opinion monitoring systems, data acquisition is the foundation. One source of data is the data an enterprise already owns; if the business involves data on the network, the enterprise must crawl that data itself. Single-machine crawling is too slow for processing large volumes of Web data and cannot meet business requirements, and traditional databases cannot meet current software requirements for storage and query performance on large data volumes.

Therefore, how to realize a Web data acquisition method and a Web data acquisition system that support unlimited horizontal scaling has become an urgent problem to be solved.

Disclosure of Invention

The present invention is directed to solving at least one of the problems of the prior art or the related art.

To this end, a first aspect of the present invention provides a Web data acquisition method.

A second aspect of the present invention provides a Web data acquisition system.

A third aspect of the present invention provides a computer device.

A fourth aspect of the present invention provides a computer-readable storage medium.

In view of the above, according to the first aspect of the present invention, a Web data acquisition method is provided, including: deploying a crawler environment on a virtual machine to be added; acquiring the IP address of the virtual machine to be added, and adding the IP address to the configuration of the master node; controlling the host to update the run script so that the virtual machine to be added and the already-added virtual machines acquire the latest running code; and, when a task starting instruction for the virtual machine to be added is received, executing the task starting instruction according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition.

The Web data acquisition method provided by the invention is managed through a scheduling platform. A crawler environment is deployed on the virtual machine to be added so that it can crawl data once it joins the data acquisition cluster. The IP address of the virtual machine on which the crawler environment has been built is acquired and added to the configuration of the master node. The host is then controlled to update the run script, so that all machines (the virtual machine to be added and the already-added virtual machines) obtain the latest running code from the open-source distributed version control system Git. When a task starting instruction for the virtual machine to be added is received, the instruction is executed according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition. When data sources increase substantially, a newly added worker node can join the data acquisition cluster merely by starting a task. This achieves horizontal scaling of Web data crawling and storage, improves the efficiency of crawling and storing data, and allows data acquisition to be completed within a limited time. In addition, the invention can perform data acquisition on a schedule and can easily switch between production and test environments.
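The four steps of adding a worker can be sketched as follows. This is only an illustrative model, not the implementation described by the invention: the names `Cluster` and `add_worker`, the command strings `deploy-crawler-env` and `start-crawl-worker`, and the IP addresses are all hypothetical, and the command runner merely logs what a real runner might execute remotely.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical command runner: takes a host IP and a command string.
Runner = Callable[[str, str], None]


@dataclass
class Cluster:
    """In-memory stand-in for the master node's configuration."""
    worker_ips: List[str] = field(default_factory=list)  # already-added VMs
    log: List[str] = field(default_factory=list)         # commands "executed"

    def record(self, host: str, command: str) -> None:
        # A real runner would execute the command on the host; here we only log it.
        self.log.append(f"{host}: {command}")


def add_worker(cluster: Cluster, new_ip: str, run: Runner) -> None:
    # Step 1: deploy the crawler environment on the VM to be added.
    run(new_ip, "deploy-crawler-env")       # hypothetical command name
    # Step 2: add the new VM's IP address to the master node's configuration.
    if new_ip not in cluster.worker_ips:
        cluster.worker_ips.append(new_ip)
    # Step 3: every machine, old and new, fetches the latest running code from Git.
    for ip in cluster.worker_ips:
        run(ip, "git pull")
    # Step 4: start the task on the new VM so that it joins the crawling cluster.
    run(new_ip, "start-crawl-worker")       # hypothetical command name


cluster = Cluster(worker_ips=["10.0.0.1"])
add_worker(cluster, "10.0.0.2", cluster.record)
print(cluster.worker_ips)
print("\n".join(cluster.log))
```

Passing the runner in as a parameter keeps the sketch testable without any remote machines; a deployment would presumably substitute a runner that executes the commands over the network.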

The Web data acquisition method according to the present invention may further have the following technical features:

In the above technical solution, preferably, after executing the task starting instruction according to the latest running code so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition, the method further includes: receiving a Web data acquisition request for a target website; establishing a task queue according to the acquisition request; and, when the cluster has idle resources, controlling the web crawler to use the idle resources to execute the task plan in the task queue so as to acquire the Web data of the target website.

In this technical solution, after the task starting instruction has been executed according to the latest running code, so that the virtual machine to be added has joined the website-crawling cluster and started Web data acquisition, a Web data acquisition request for a target website is received and data crawling begins. A task queue is established according to the acquisition request, and a scheduling framework allocates the crawler resources of the cluster. When the cluster has idle resources, the web crawler is controlled to use them to fetch a task plan from the message queue and execute the task of acquiring the target website's data. Redis serves as the transport for task messages, enabling real-time task queue processing and task scheduling.
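The interplay between an acquisition request, the task queue and idle crawler resources can be sketched as follows. This is a single-process approximation under stated assumptions: Python's standard `queue.Queue` stands in for the Redis-backed message queue, threads stand in for idle cluster resources, and the function names and the `https://example.com` URLs are hypothetical.

```python
import queue
import threading
from typing import List

# In-memory stand-in for the message queue (the source uses Redis as transport).
task_queue = queue.Queue()
results: List[str] = []
results_lock = threading.Lock()


def submit_request(target_site: str, pages: int) -> None:
    """Turn one acquisition request into task plans on the queue."""
    for page in range(1, pages + 1):
        task_queue.put(f"{target_site}/page/{page}")


def worker() -> None:
    """An idle crawler pulls task plans until the queue is drained."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return  # nothing left to do: the worker becomes idle again
        with results_lock:
            results.append(task)  # placeholder for the actual crawl
        task_queue.task_done()


submit_request("https://example.com", pages=5)
workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(sorted(results))
```

Because each worker takes a task only when it is free, adding more workers increases throughput without any change to the code that submits requests.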

In any of the above technical solutions, preferably, when the cluster has idle resources, controlling the web crawler to use the idle resources to execute the task plan in the task queue so as to acquire the Web data of the target website specifically includes: acquiring the URL (uniform resource locator) of the target website based on the idle resources; sending the URL to a downloader so that the downloader generates and returns the page data corresponding to the URL; and processing the page data and storing the processed page data.

In this technical solution, when the cluster has idle resources, the crawler is controlled to execute the task plan in the task queue using those idle resources. First, the URL of the target website is obtained and scheduled in the scheduler in the form of a Request. The URL is forwarded to the downloader through the download middleware; the downloader downloads the page data corresponding to the URL, generates a Response for the page, and returns the Response through the download middleware. The page data is then processed, cleaned, validated and persisted, and stored in the underlying data storage system. This controls the data flow and allows a large volume of data to be acquired.
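The data flow described here (scheduler, download middleware, downloader, then cleaning, validation and persistence) can be sketched as follows. This is a minimal model rather than the actual crawler framework: the `Request` and `Response` classes, `fake_downloader`, the canned pages and the in-memory `storage` list are all hypothetical stand-ins.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Request:
    url: str


@dataclass
class Response:
    url: str
    body: str


def download_middleware(request: Request,
                        downloader: Callable[[Request], Response]) -> Response:
    """Pass the Request to the downloader and hand its Response back."""
    return downloader(request)


def fake_downloader(request: Request) -> Response:
    # Stand-in for a real HTTP fetch; returns canned page data.
    pages = {"https://example.com/a": "  Hello  ", "https://example.com/b": "   "}
    return Response(request.url, pages.get(request.url, ""))


def process(response: Response) -> Optional[Dict[str, str]]:
    """Clean and validate page data; return None if the item is invalid."""
    text = response.body.strip()   # cleaning
    if not text:                   # validation
        return None
    return {"url": response.url, "text": text}


storage: List[Dict[str, str]] = []  # stand-in for the underlying storage system
scheduler = deque([Request("https://example.com/a"),
                   Request("https://example.com/b")])

while scheduler:
    request = scheduler.popleft()   # the scheduler emits a Request
    response = download_middleware(request, fake_downloader)
    item = process(response)
    if item is not None:
        storage.append(item)        # persistence

print(storage)
```

Only the first page survives validation in this example, since the second page is blank after cleaning.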

In any of the above technical solutions, preferably, after the task queue is established according to the acquisition request, the method further includes: acquiring the web crawler corresponding to the target website according to a preset correspondence between web crawlers and websites; and acquiring the crawl period set for the web crawler corresponding to the target website, so that the web crawler crawls the Web data according to the crawl period.

In this technical solution, the correspondence between web crawlers and target websites is set before the data crawling process starts, with each web crawler responsible for one or more specific target websites. According to this correspondence, the web crawler for the target website to be crawled is obtained, together with the crawl period set for that crawler. The crawl period may be daily, hourly or weekly, and data is crawled according to the set period. The time period for crawling data can thus be defined freely, data is crawled on a schedule, and data crawling is automated.
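The crawler-to-website correspondence and the per-crawler crawl period can be sketched as two lookup tables. The crawler names, site names and periods below are hypothetical examples, and the start time is an arbitrary illustrative value.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Preset correspondence between crawlers and the websites they handle.
CRAWLER_SITES: Dict[str, List[str]] = {
    "news_spider": ["news.example.com"],
    "forum_spider": ["forum.example.com", "bbs.example.com"],
}

# Crawl period configured per crawler: hourly, daily or weekly.
CRAWL_PERIOD: Dict[str, timedelta] = {
    "news_spider": timedelta(hours=1),
    "forum_spider": timedelta(days=1),
}


def crawler_for(site: str) -> str:
    """Look up the crawler responsible for a target website."""
    for crawler, sites in CRAWLER_SITES.items():
        if site in sites:
            return crawler
    raise KeyError(f"no crawler configured for {site}")


def next_runs(site: str, start: datetime, count: int) -> List[datetime]:
    """Compute the next `count` crawl times for a site after `start`."""
    period = CRAWL_PERIOD[crawler_for(site)]
    return [start + period * i for i in range(1, count + 1)]


start = datetime(2017, 11, 22, 8, 0)
print(crawler_for("bbs.example.com"))
print(next_runs("news.example.com", start, 3))
```

Keeping the period in configuration rather than in crawler code is one way to let the crawl schedule be redefined freely, as the text describes.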

In any of the above technical solutions, preferably, the task queue is an asynchronous task queue based on distributed message passing.

In this technical solution, the task queue established according to the acquisition request during data acquisition is an asynchronous task queue based on distributed message passing, so that a large number of messages can be processed in a distributed manner. This improves the efficiency of task processing and the flexibility and reliability of data processing.

According to a second aspect of the present invention, a Web data acquisition system is provided, which includes: an arrangement unit, configured to deploy a crawler environment on a virtual machine to be added; an adding unit, configured to acquire the IP address of the virtual machine to be added and add the IP address to the configuration of the master node; an updating unit, configured to control the host to update the run script so that the virtual machine to be added and the already-added virtual machines acquire the latest running code; and a starting unit, configured to execute, when a task starting instruction for the virtual machine to be added is received, the task starting instruction according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition.

The Web data acquisition system provided by the invention is managed through a scheduling platform. The arrangement unit deploys a crawler environment on the virtual machine to be added so that it can crawl data once it joins the data acquisition cluster. The adding unit acquires the IP address of the virtual machine on which the crawler environment has been built and adds the IP address to the configuration of the master node. The updating unit then controls the host to update the run script, so that all machines (the virtual machine to be added and the already-added virtual machines) obtain the latest running code from the open-source distributed version control system Git. When the starting unit receives a task starting instruction for the virtual machine to be added, it executes the instruction according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition. When data sources increase substantially, a newly added worker node can join the data acquisition cluster merely by starting a task. This achieves horizontal scaling of Web data crawling and storage, improves the efficiency of crawling and storing data, and allows data acquisition to be completed within a limited time. In addition, the invention can perform data acquisition on a schedule and can easily switch between production and test environments.

The Web data acquisition system according to the present invention may further have the following technical features:

In the above technical solution, preferably, the system further includes: a request unit, configured to receive a Web data acquisition request for a target website; an establishing unit, configured to establish a task queue according to the acquisition request; and a control unit, configured to control, when the cluster has idle resources, the web crawler to use the idle resources to execute the task plan in the task queue so as to acquire the Web data of the target website.

In this technical solution, after the task starting instruction has been executed according to the latest running code, so that the virtual machine to be added has joined the website-crawling cluster and started Web data acquisition, the request unit receives a Web data acquisition request for a target website and data crawling begins. The establishing unit establishes a task queue according to the acquisition request, and a scheduling framework allocates the crawler resources of the cluster. When the cluster has idle resources, the control unit controls the web crawler to use them to fetch a task plan from the message queue and execute the task of acquiring the target website's data. Redis serves as the transport for task messages, enabling real-time task queue processing and task scheduling.

In any of the above technical solutions, preferably, the control unit specifically includes: a first acquisition unit, configured to acquire the URL of the target website based on the idle resources; a downloading unit, configured to send the URL to a downloader so that the downloader generates and returns the page data corresponding to the URL; and a processing unit, configured to process the page data and store the processed page data.

In this technical solution, when the cluster has idle resources, the control unit controls the crawler to execute the task plan in the task queue using those idle resources. The first acquisition unit obtains the URL of the target website and schedules it in the scheduler in the form of a Request. The downloading unit forwards the URL to the downloader through the download middleware; the downloader downloads the page data corresponding to the URL, generates a Response for the page, and returns the Response through the download middleware. The processing unit then processes, cleans, validates and persists the page data, and stores it in the underlying data storage system. This controls the data flow and allows a large volume of data to be acquired.

In any of the above technical solutions, preferably, the system further includes: a second acquisition unit, configured to acquire the web crawler corresponding to the target website according to a preset correspondence between web crawlers and websites; and a third acquisition unit, configured to acquire the crawl period set for the web crawler corresponding to the target website, so that the web crawler crawls the Web data according to the crawl period.

In this technical solution, the correspondence between web crawlers and target websites is set before the data crawling process starts, with each web crawler responsible for one or more specific target websites. According to this correspondence, the second acquisition unit obtains the web crawler for the target website to be crawled, and the third acquisition unit obtains the crawl period set for that crawler. The crawl period may be daily, hourly or weekly, and data is crawled according to the set period. The time period for crawling data can thus be defined freely, data is crawled on a schedule, and data crawling is automated.

According to a third aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: deploying a crawler environment on a virtual machine to be added; acquiring the IP address of the virtual machine to be added, and adding the IP address to the configuration of the master node; controlling the host to update the run script so that the virtual machine to be added and the already-added virtual machines acquire the latest running code; and, when a task starting instruction for the virtual machine to be added is received, executing the task starting instruction according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition.

The invention provides a computer device in which the processor, when executing the computer program, implements the following. Management is performed through a scheduling platform. A crawler environment is deployed on the virtual machine to be added so that it can crawl data once it joins the data acquisition cluster. The IP address of the virtual machine on which the crawler environment has been built is acquired and added to the configuration of the master node. The host is then controlled to update the run script, so that all machines (the virtual machine to be added and the already-added virtual machines) obtain the latest running code from the open-source distributed version control system Git. When a task starting instruction for the virtual machine to be added is received, the instruction is executed according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition. When data sources increase substantially, a newly added worker node can join the data acquisition cluster merely by starting a task. This achieves horizontal scaling of Web data crawling and storage, improves the efficiency of crawling and storing data, and allows data acquisition to be completed within a limited time. In addition, the invention can perform data acquisition on a schedule and can easily switch between production and test environments.

According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the following steps: deploying a crawler environment on a virtual machine to be added; acquiring the IP address of the virtual machine to be added, and adding the IP address to the configuration of the master node; controlling the host to update the run script so that the virtual machine to be added and the already-added virtual machines acquire the latest running code; and, when a task starting instruction for the virtual machine to be added is received, executing the task starting instruction according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition.

The present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the following. Management is performed through a scheduling platform. A crawler environment is deployed on the virtual machine to be added so that it can crawl data once it joins the data acquisition cluster. The IP address of the virtual machine on which the crawler environment has been built is acquired and added to the configuration of the master node. The host is then controlled to update the run script, so that all machines (the virtual machine to be added and the already-added virtual machines) obtain the latest running code from the open-source distributed version control system Git. When a task starting instruction for the virtual machine to be added is received, the instruction is executed according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition. When data sources increase substantially, a newly added worker node can join the data acquisition cluster merely by starting a task. This achieves horizontal scaling of Web data crawling and storage, improves the efficiency of crawling and storing data, and allows data acquisition to be completed within a limited time. In addition, the invention can perform data acquisition on a schedule and can easily switch between production and test environments.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 shows a schematic flow diagram of a Web data collection method of one embodiment of the present invention;

FIG. 2 shows a schematic flow diagram of a Web data collection method according to another embodiment of the invention;

FIG. 3 shows a schematic flow diagram of a Web data collection method according to a further embodiment of the invention;

FIG. 4 shows a schematic block diagram of a Web data collection system of one embodiment of the present invention;

FIG. 5 shows a schematic block diagram of a Web data collection system of another embodiment of the present invention;

FIG. 6 illustrates a schematic diagram of Web data collection in accordance with one embodiment of the present invention;

FIG. 7 shows a schematic block diagram of a computer device of an embodiment of the present invention.

Detailed Description

So that the manner in which the above recited aspects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

An embodiment of the first aspect of the present invention provides a Web data acquisition method. FIG. 1 shows a schematic flow diagram of the Web data acquisition method according to one embodiment of the present invention. The method comprises the following steps:

Step 102: deploying a crawler environment on a virtual machine to be added;

Step 104: acquiring the IP address of the virtual machine to be added, and adding the IP address to the configuration of the master node;

Step 106: controlling the host to update the run script so that the virtual machine to be added and the already-added virtual machines acquire the latest running code;

Step 108: when a task starting instruction for the virtual machine to be added is received, executing the task starting instruction according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition.

The Web data acquisition method provided by this embodiment is managed through a scheduling platform. A crawler environment is deployed on the virtual machine to be added so that it can crawl data once it joins the data acquisition cluster. The IP address of the virtual machine on which the crawler environment has been built is acquired and added to the configuration of the master node. The host is then controlled to update the run script, so that all machines (the virtual machine to be added and the already-added virtual machines) obtain the latest running code from the open-source distributed version control system Git. When a task starting instruction for the virtual machine to be added is received, the instruction is executed according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition. When data sources increase substantially, a newly added worker node can join the data acquisition cluster merely by starting a task. This achieves horizontal scaling of Web data crawling and storage, improves the efficiency of crawling and storing data, and allows data acquisition to be completed within a limited time. In addition, the invention can perform data acquisition on a schedule and can easily switch between production and test environments.

Fig. 2 is a flowchart illustrating a Web data collection method according to another embodiment of the present invention. Wherein, the method comprises the following steps:

Step 202: deploying a crawler environment on a virtual machine to be added;

Step 204: acquiring the IP address of the virtual machine to be added, and adding the IP address to the configuration of the master node;

Step 206: controlling the host to update the run script so that the virtual machine to be added and the already-added virtual machines acquire the latest running code;

Step 208: when a task starting instruction for the virtual machine to be added is received, executing the task starting instruction according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition;

Step 210: receiving a Web data acquisition request for a target website;

Step 212: establishing a task queue according to the acquisition request;

Step 214: when the cluster has idle resources, controlling the web crawler to use the idle resources to execute the task plan in the task queue so as to acquire the Web data of the target website.

In this embodiment, management is performed through the scheduling platform. When data sources increase substantially, a newly added worker node can join the data acquisition cluster merely by starting a task. This achieves horizontal scaling of Web data crawling and storage, improves the efficiency of crawling and storing data, and allows data acquisition to be completed within a limited time. In addition, the invention can perform data acquisition on a schedule and can easily switch between production and test environments.

In this embodiment, after the task starting instruction has been executed according to the latest running code, so that the virtual machine to be added has joined the website-crawling cluster and started Web data acquisition, a Web data acquisition request for a target website is received and data crawling begins. A task queue is established according to the acquisition request, and a scheduling framework allocates the crawler resources of the cluster. When the cluster has idle resources, the web crawler is controlled to use them to fetch a task plan from the message queue and execute the task of acquiring the target website's data. Redis serves as the transport for task messages, enabling real-time task queue processing and task scheduling.

In this embodiment, the task queue established according to the acquisition request during data acquisition is an asynchronous task queue based on distributed message passing, so that a large number of messages can be processed in a distributed manner. This improves the efficiency of task processing and the flexibility and reliability of data processing.

Fig. 3 shows a flow diagram of a Web data collection method according to still another embodiment of the present invention. Wherein, the method comprises the following steps:

Step 302: deploying a crawler environment on a virtual machine to be added;

Step 304: acquiring the IP address of the virtual machine to be added, and adding the IP address to the configuration of the master node;

Step 306: controlling the host to update the run script so that the virtual machine to be added and the already-added virtual machines acquire the latest running code;

Step 308: when a task starting instruction for the virtual machine to be added is received, executing the task starting instruction according to the latest running code, so that the virtual machine to be added joins the website-crawling cluster and starts Web data acquisition;

Step 310: receiving a Web data acquisition request for a target website;

Step 312: establishing a task queue according to the acquisition request;

Step 314: acquiring the web crawler corresponding to the target website according to a preset correspondence between web crawlers and websites;

Step 316: acquiring the crawl period set for the web crawler corresponding to the target website, so that the web crawler crawls the Web data according to the crawl period;

Step 318: acquiring the URL of the target website based on the idle resources;

Step 320: sending the URL to a downloader, so that the downloader generates and returns the page data corresponding to the URL;

Step 322: processing the page data, and storing the processed page data.

In this embodiment, when the cluster has idle resources, the crawler is controlled to execute the task plan in the task queue using those idle resources. First, the URL of the target website is obtained and scheduled in the scheduler in the form of a Request. The URL is forwarded to the downloader through the download middleware; the downloader downloads the page data corresponding to the URL, generates a Response for the page, and returns the Response through the download middleware. The page data is then processed, cleaned, validated and persisted, and stored in the underlying data storage system. This controls the data flow and allows a large volume of data to be acquired.

In this embodiment, the correspondence between web crawlers and target websites is set before the data crawling process starts, with each web crawler responsible for one or more specific target websites. According to this correspondence, the web crawler for the target website to be crawled is obtained, together with the crawl period set for that crawler. The crawl period may be daily, hourly or weekly, and data is crawled according to the set period. The time period for crawling data can thus be defined freely, data is crawled on a schedule, and data crawling is automated.

An embodiment of the second aspect of the present invention provides a Web data acquisition system 400. FIG. 4 shows a schematic block diagram of the Web data acquisition system 400 according to one embodiment of the present invention. The Web data acquisition system 400 includes: an arrangement unit 10, an adding unit 12, an updating unit 14 and a starting unit 16.

The Web data collection system 400 provided by this embodiment is managed through a scheduling platform. The arrangement unit 10 arranges a crawler environment for the virtual machine to be added, so that the virtual machine can crawl data once it joins the data collection cluster. The adding unit 12 acquires the IP address of the virtual machine on which the crawler environment has been built and adds the IP address to the configuration of the main node. The updating unit 14 controls the host to update the running script, so that all machines (the virtual machine to be added and the virtual machines already added) acquire the latest running code from the open-source distributed version control system Git. The starting unit 16, upon receiving a task start instruction for the virtual machine to be added, executes the task start instruction according to the latest running code, so that the virtual machine to be added joins the cluster crawling the websites and starts Web data collection. Thus, when data sources increase greatly, a newly added worker node can join the data collection cluster simply by starting a task, which achieves horizontal scaling of Web data crawling and storage, improves the efficiency of crawling and storing data, and allows data collection to be completed within a limited time. In addition, the invention can also collect data on a schedule and can easily switch between production and test environments.

FIG. 5 shows a schematic block diagram of a Web data collection system 500 of another embodiment of the present invention. The Web data collection system 500 includes: an arrangement unit 20, an adding unit 22, an updating unit 24, a starting unit 26, a request unit 28, an establishing unit 30, a control unit 32, a second obtaining unit 34, and a third obtaining unit 36. The control unit 32 specifically includes: a first obtaining unit 322, a downloading unit 324, and a processing unit 326.

The Web data collection system 500 provided in this embodiment is managed through a scheduling platform. The arrangement unit 20 arranges a crawler environment for the virtual machine to be added, so that the virtual machine can crawl data once it joins the data collection cluster. The adding unit 22 acquires the IP address of the virtual machine on which the crawler environment has been built and adds the IP address to the configuration of the main node. The updating unit 24 controls the host to update the running script, so that all machines (the virtual machine to be added and the virtual machines already added) acquire the latest running code from the open-source distributed version control system Git. The starting unit 26, upon receiving a task start instruction for the virtual machine to be added, executes the task start instruction according to the latest running code, so that the virtual machine to be added joins the cluster crawling the websites and starts Web data collection. Thus, when data sources increase greatly, a newly added worker node can join the data collection cluster simply by starting a task, which achieves horizontal scaling of Web data crawling and storage, improves the efficiency of crawling and storing data, and allows data collection to be completed within a limited time. In addition, the invention can also collect data on a schedule and can easily switch between production and test environments.

In this embodiment, after the task start instruction has been executed according to the latest running code so that the virtual machine to be added joins the cluster crawling the websites and starts Web data acquisition, the request unit 28 receives a Web data acquisition request for a target website and data crawling begins. The establishing unit 30 establishes a task queue according to the acquisition request; the task queue is a distributed message-passing asynchronous task queue. A scheduling framework allocates the crawler resources of the cluster: when the cluster has idle resources, the control unit 32 controls the web crawler to use the idle resources to fetch a task plan from the message queue and execute the task of acquiring the target website's data. Redis is used as the transport for task messages, so that both real-time processing of the task queue and task scheduling are achieved.
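The idea of idle workers pulling task plans from a shared queue can be sketched with the Python standard library alone. This is only an in-process illustration: `queue.Queue` and threads stand in for the distributed task queue and the Redis transport named in the embodiment, and the function `run_cluster` and its parameters are hypothetical.

```python
import queue
import threading


def run_cluster(sites, worker_count, crawl):
    """Let each idle worker pull the next site from a shared task queue."""
    tasks = queue.Queue()
    for site in sites:
        tasks.put(site)

    results = {}
    lock = threading.Lock()

    def worker():
        # A worker that finishes a task is idle again and takes the next one.
        while True:
            try:
                site = tasks.get_nowait()
            except queue.Empty:
                return  # no task plan left in the queue
            data = crawl(site)
            with lock:
                results[site] = data
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Increasing `worker_count` plays the role of adding worker nodes: the queue contents do not change, only the number of consumers does.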

In this embodiment, when the cluster has idle resources, the control unit 32 controls the web crawler to execute the task plan in the task queue using those idle resources. The first obtaining unit 322 obtains the URL of the target website and schedules it in the scheduler in the form of a Request. The downloading unit 324 forwards the URL to the downloader through the downloading middleware; the downloader downloads the page data corresponding to the URL, generates a Response for the page, and returns it through the downloading middleware. The processing unit 326 processes the page data: the data is cleaned, verified and persisted, and stored in the underlying data storage system. In this way the data flow is controlled and a large amount of data acquisition is completed.

In this embodiment, before the data crawling process starts, a correspondence between web crawlers and target websites is set, with each web crawler responsible for processing one or several specific target websites. According to this correspondence, the second obtaining unit 34 obtains the web crawler for the target website of the current crawl, and the third obtaining unit 36 obtains the crawling period set for that web crawler; the period may be by day, by hour, or by week. Data is then crawled according to the set crawling period, so that the crawling time period can be freely defined, data is crawled on a schedule, and data crawling is automated.

FIG. 6 illustrates a schematic diagram of Web data collection in accordance with a specific embodiment of the present invention. The data acquisition device of this specific embodiment comprises a scheduling management module and a data crawling engine. The device is managed through the scheduling management module, can acquire data on a schedule, and can easily switch between production and test environments. Horizontal expansion is also straightforward: the IP address of the virtual machine on which the crawler environment has been built is added to the configuration of the main node, and the update script is executed on the host, so that all machines (including the newly added virtual machine) obtain the latest code from Git; the newly added worker node then joins the cluster crawling the web pages simply by starting the worker.
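The expansion procedure above can be sketched as two small functions that only build data and command strings, without executing anything. The configuration layout (a `workers` list), the use of `ssh` to reach each machine, and the repository path are all assumptions made for illustration; the embodiment states only that the IP address is added to the main node's configuration and that every machine obtains the latest code from Git.

```python
def add_worker(main_config: dict, ip: str) -> dict:
    """Return a copy of the main-node configuration with the new worker IP added."""
    workers = list(main_config.get("workers", []))
    if ip not in workers:
        workers.append(ip)
    return {**main_config, "workers": workers}


def update_plan(main_config: dict, repo_dir: str) -> list:
    """List the command each machine would run to fetch the latest code.

    Assumption: machines are reached over ssh and the code lives in repo_dir.
    """
    return [f"ssh {ip} 'cd {repo_dir} && git pull'" for ip in main_config["workers"]]
```

Keeping the plan as plain strings makes it easy to review before running; starting the worker on the new node would be a separate step after the update.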

In this embodiment, the scheduling management module uses Celery for scheduling and crawler resource allocation; based on a distributed message-passing asynchronous task queue, it uses Redis as the transport for sending and receiving messages. When a worker has free resources, it fetches a crawler task from the Redis message queue and executes it. The user can also set the crawl periods for different crawlers (by day, by hour, by week, etc.). The data crawling engine comprises the following components: the engine, which controls the flow of data among all components in the system and triggers events when the corresponding actions occur; the scheduler, which accepts requests from the engine and enqueues them so that they can be supplied to the engine later when the engine asks for them; the downloader, which fetches page data and provides it to the engine, which in turn provides it to the spiders; the Spiders, classes written by the crawler author to parse responses and extract items (i.e., the scraped items) or additional URLs to follow, each spider being responsible for processing a specific website (or several websites); and the Item Pipeline, which processes the items extracted by the spiders and can clean, verify and persist the data (storing it in the underlying data storage system, such as HBase, MySQL, Solr, Elasticsearch, etc.). The data flow is controlled by the engine.

As shown in fig. 6, the data flow control process in the data acquisition device of this embodiment is as follows: (1) the engine opens a website to be crawled, finds the Spider that handles the website, and requests the first URL to be crawled from the Spider; (2) the engine obtains the first URL to be crawled from the Spider and schedules it in the scheduler as a Request; (3) the engine requests the next URL to be crawled from the scheduler; (4) the scheduler returns the next URL to be crawled to the engine, and the engine forwards the URL to the downloader through the downloading middleware (request direction); (5) once the page is downloaded, the downloader generates a Response for the page and sends it to the engine through the downloading middleware (response direction); (6) the engine receives the Response from the downloader and sends it to the Spider for processing through the Spider middleware; (7) the Spider processes the Response and returns the crawled Items and new Requests to the engine; (8) the engine passes the Items returned by the Spider to the Item Pipeline; (9) when no request remains in the scheduler, the engine closes the crawl of the website.
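The nine steps above amount to a loop driven by the scheduler. The following is a simplified, single-threaded sketch of that loop, not the engine used in the embodiment: the callables `download` and `parse` are hypothetical stand-ins for the downloader and the Spider, middleware is omitted, and the `seen` set that avoids re-scheduling a URL is an added assumption.

```python
from collections import deque


def run_engine(start_url, download, parse):
    """Drive one website: schedule requests, download pages, collect items."""
    scheduler = deque([start_url])   # steps (1)-(2): the first URL is scheduled
    seen = {start_url}               # assumption: each URL is scheduled only once
    items = []                       # stands in for the Item Pipeline

    while scheduler:                 # step (9): stop when no request remains
        url = scheduler.popleft()    # steps (3)-(4): next URL from the scheduler
        response = download(url)     # step (5): the downloader returns a Response
        new_items, new_urls = parse(url, response)  # steps (6)-(7): Spider output
        items.extend(new_items)      # step (8): items go to the Item Pipeline
        for next_url in new_urls:    # new Requests go back to the scheduler
            if next_url not in seen:
                seen.add(next_url)
                scheduler.append(next_url)
    return items
```

With a first-in, first-out scheduler as here, pages are visited breadth-first; other queue disciplines would change only the visiting order, not the overall flow.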

The data acquisition device of this specific embodiment can crawl a large number of websites; at present, 300 websites are crawled on 3 machines. The crawling time period can be freely defined, and the cluster can be expanded horizontally without limit.

In a third aspect of the present invention, a computer device is provided, and fig. 7 is a schematic block diagram of a computer device 700 according to an embodiment of the present invention. Wherein the computer device 700 comprises:

a memory 702, a processor 704, and a computer program stored on the memory 702 and executable on the processor 704, the processor 704 when executing the computer program implementing the steps of: arranging a crawler environment for the virtual machine to be added; acquiring the IP address of the virtual machine to be added, and adding the IP address to the configuration of the main node; controlling the host to update the running script so that the virtual machine to be added and the added virtual machine acquire the latest running code; and when receiving a task starting instruction of the virtual machine to be added, executing the task starting instruction according to the latest running code so that the virtual machine to be added joins the cluster crawling the website and starts Web data acquisition.

In the computer device 700 provided by the present invention, the processor 704, when executing the computer program, implements the following. Management is performed through a scheduling platform. A crawler environment is arranged for the virtual machine to be added, so that the virtual machine can crawl data once it joins the data collection cluster. The IP address of the virtual machine on which the crawler environment has been built is acquired and added to the configuration of the main node. The host is then controlled to update the running script, so that all machines (the virtual machine to be added and the virtual machines already added) acquire the latest running code from the open-source distributed version control system Git. When a task start instruction for the virtual machine to be added is received, the task start instruction is executed according to the latest running code, so that the virtual machine to be added joins the cluster crawling the websites and starts Web data collection. Thus, when data sources increase greatly, a newly added worker node can join the data collection cluster simply by starting a task, which achieves horizontal scaling of Web data crawling and storage, improves the efficiency of crawling and storing data, and allows data collection to be completed within a limited time. In addition, the invention can also collect data on a schedule and can easily switch between production and test environments.

An embodiment of the fourth aspect of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of: arranging a crawler environment for the virtual machine to be added; acquiring the IP address of the virtual machine to be added, and adding the IP address to the configuration of the main node; controlling the host to update the running script so that the virtual machine to be added and the added virtual machine acquire the latest running code; and when receiving a task starting instruction of the virtual machine to be added, executing the task starting instruction according to the latest running code so that the virtual machine to be added joins the cluster crawling the website and starts Web data acquisition.

In the description herein, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A Web data acquisition method is characterized by comprising the following steps:

arranging a crawler environment for the virtual machine to be added;

acquiring the IP address of the virtual machine to be added, and adding the IP address into the configuration of the main node;

controlling the host to update the running script so that the virtual machine to be added and the added virtual machine acquire the latest running code;

when a task starting instruction of the virtual machine to be added is received, executing the task starting instruction according to the latest running code so as to enable the virtual machine to be added to a cluster of a crawling website and start Web data acquisition;

after executing the task starting instruction according to the latest running code so that the virtual machine to be added is added into a cluster of a crawling website and Web data collection is started, the method further comprises the following steps:

receiving a Web data acquisition request of a target website;

establishing a task queue according to the Web data acquisition request;

when the cluster has idle resources, controlling a Web crawler to execute a task plan in the task queue by using the idle resources so as to acquire Web data of the target website;

when the cluster has idle resources, controlling a Web crawler to execute a task plan in the task queue by using the idle resources to acquire the Web data of the target website, specifically including:

acquiring the URL of the target website based on the idle resources;

sending the URL to a downloader so that the downloader generates and returns page data corresponding to the URL;

processing the page data, and storing the processed page data;

after the task queue is established according to the acquisition request, the method further comprises the following steps:

acquiring a web crawler corresponding to the target website according to a preset corresponding relation between the web crawler and the website;

and acquiring a set crawling period of the Web crawler corresponding to the target website, so that the Web crawler crawls the Web data according to the crawling period.

2. The Web data collection method according to claim 1,

the task queue is a distributed messaging asynchronous task queue.

3. A Web data collection system, comprising:

the arrangement unit is used for arranging a crawler environment for the virtual machine to be added;

the adding unit is used for acquiring the IP address of the virtual machine to be added and adding the IP address to the configuration of the main node;

the updating unit is used for controlling the host to update the running script so that the virtual machine to be added and the added virtual machine acquire the latest running code;

the starting unit is used for executing the task starting instruction according to the latest running code when receiving the task starting instruction of the virtual machine to be added, so that the virtual machine to be added is added into a cluster of a crawling website and Web data collection is started;

the request unit is used for receiving a Web data acquisition request of a target website;

the establishing unit is used for establishing a task queue according to the Web data acquisition request;

the control unit is used for controlling a Web crawler to execute the task plan in the task queue by using the idle resources to acquire the Web data of the target website when the cluster has the idle resources;

the control unit specifically includes:

a first obtaining unit, configured to obtain a URL of the target website based on the idle resource;

the downloading unit is used for sending the URL to a downloader so that the downloader generates and returns page data corresponding to the URL;

the processing unit is used for processing the page data and storing the processed page data;

the second acquisition unit is used for acquiring the web crawler corresponding to the target website according to the preset corresponding relation between the web crawler and the website;

and the third acquisition unit is used for acquiring the set crawling cycle of the Web crawler corresponding to the target website so as to make the Web crawler crawl the Web data according to the crawling cycle.

4. The Web data collection system of claim 3,

the task queue is a distributed messaging asynchronous task queue.

5. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the Web data acquisition method as claimed in claim 1 or 2 are implemented by the processor when executing the computer program.

6. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the Web data acquisition method according to claim 1 or 2.