CN108011931A - Web data acquisition method and web data acquisition system

Info

Publication number: CN108011931A
Application number: CN201711174715.9A
Authority: CN
Inventors: 韦立鹏
Original assignee: Uf Financial Information Technology Ltd By Share Ltd
Current assignee: Uf Financial Information Technology Ltd By Share Ltd
Priority date: 2017-11-22
Filing date: 2017-11-22
Publication date: 2018-05-08
Anticipated expiration: 2037-11-22
Also published as: CN108011931B

Abstract

The present invention provides a web data acquisition method, a web data acquisition system, a computer device, and a computer-readable storage medium. The web data acquisition method includes: setting up the crawler environment of a virtual machine to be added; obtaining the IP address of the virtual machine to be added and adding the IP address to the master node configuration; controlling the host to update the run script, so that the virtual machine to be added and the virtual machines already added obtain the latest run code; and, when a task start instruction for the virtual machine to be added is received, executing the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition. When the number of data sources increases substantially, the invention allows web data crawling and storage to scale horizontally, improves the efficiency of crawling and storing data, and completes data acquisition within a limited time.

Description

Web data acquisition method and web data acquisition system

Technical field

The present invention relates to the technical field of web data acquisition, and in particular to a web data acquisition method, a web data acquisition system, a computer device, and a computer-readable storage medium.

Background

Both data analysis and public-opinion monitoring systems depend on data, so data acquisition is their foundation. Data may come from an enterprise's own holdings or from the network; if a business relies on network data, the enterprise has to crawl it itself. Crawling on a single machine is too slow to process a large volume of web data and cannot meet business needs, and the storage and query performance of traditional databases on massive data increasingly fails to meet the demands of current software operation.

Therefore, how to implement a web data acquisition method and a web data acquisition system that support unlimited horizontal scaling has become an urgent problem to be solved.

Summary of the invention

The present invention aims to solve at least one of the technical problems existing in the prior art or the related art.

To this end, a first aspect of the present invention proposes a web data acquisition method.

A second aspect of the present invention proposes a web data acquisition system.

A third aspect of the present invention proposes a computer device.

A fourth aspect of the present invention proposes a computer-readable storage medium.

In view of this, according to one aspect of the present invention, a web data acquisition method is proposed, including: setting up the crawler environment of a virtual machine to be added; obtaining the IP address of the virtual machine to be added and adding the IP address to the master node configuration; controlling the host to update the run script, so that the virtual machine to be added and the virtual machines already added obtain the latest run code; and, when a task start instruction for the virtual machine to be added is received, executing the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition.

In the web data acquisition method provided by the present invention, management is performed through a scheduling platform. The crawler environment of the virtual machine to be added is set up so that the virtual machine can crawl data once it joins the data acquisition cluster. The IP address of the virtual machine on which the crawler environment has been set up is obtained and added to the configuration of the master node. The master node is then controlled to update the run script, so that all machines (the virtual machine to be added and the virtual machines already added) obtain the latest run code from the open-source distributed version control system Git. When a task start instruction for the virtual machine to be added is received, the task start instruction is executed according to the latest run code, whereupon the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition. When the number of data sources increases substantially, a newly added worker node only needs to start a task in order to join the data acquisition cluster. This allows web data crawling and storage to scale horizontally, improves the efficiency of crawling and storing data, and completes data acquisition within a limited time. In addition, the present invention can perform data acquisition on a schedule and can easily switch between production and test environments.
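
As an illustration only, the four steps described above might be modelled as follows. This is a minimal in-memory sketch, not the disclosed implementation: the `MasterNode` class, its fields, and the `prepare_crawler_environment` helper are hypothetical names, and a real deployment would provision actual virtual machines and fetch code from a Git repository.

```python
from dataclasses import dataclass, field


@dataclass
class MasterNode:
    """Hypothetical in-memory stand-in for the master node and its configuration."""
    worker_ips: list = field(default_factory=list)  # master node configuration
    code_version: int = 0                           # latest run-code revision
    deployed: dict = field(default_factory=dict)    # ip -> revision held by that VM
    cluster: set = field(default_factory=set)       # IPs currently crawling

    def register(self, ip: str) -> None:
        # Step 2: add the new VM's IP address to the master node configuration.
        if ip not in self.worker_ips:
            self.worker_ips.append(ip)

    def update_run_script(self) -> None:
        # Step 3: publish a new revision; every configured VM obtains it.
        self.code_version += 1
        for ip in self.worker_ips:
            self.deployed[ip] = self.code_version

    def start_task(self, ip: str) -> bool:
        # Step 4: honour the start instruction only on the latest run code.
        if self.deployed.get(ip) != self.code_version:
            return False
        self.cluster.add(ip)
        return True


def prepare_crawler_environment(ip: str) -> str:
    # Step 1: placeholder for installing the crawler runtime on the VM.
    return ip


master = MasterNode(worker_ips=["10.0.0.1"])
new_vm = prepare_crawler_environment("10.0.0.2")
master.register(new_vm)
master.update_run_script()
joined = master.start_task(new_vm)
```

In this sketch a virtual machine that was never registered, and therefore never received the latest revision, is refused when it tries to start a task.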

The above web data acquisition method according to the present invention may further have the following technical features:

In the above technical solution, preferably, after executing the task start instruction according to the latest run code so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition, the method further includes: receiving a web data acquisition request for a target website; establishing a task queue according to the acquisition request; and, when there are idle resources in the cluster, controlling a web crawler to use the idle resources to execute the scheduled tasks in the task queue, so as to obtain the web data of the target website.

In this technical solution, after the task start instruction is executed according to the latest run code so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition, a web data acquisition request for the target website is received and data crawling begins. A task queue is established according to the acquisition request, and a scheduling framework allocates the crawler resources of the cluster. When there are idle resources in the cluster, the web crawler is controlled to use them to fetch scheduled tasks from the message queue and to execute the task of obtaining the target website's data. Redis serves as the transport for task messages, which enables real-time processing of the task queue and scheduling of tasks.
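
The dispatch of queued tasks to idle resources might look like the sketch below. It is an assumption-laden illustration: an in-process `queue.Queue` stands in for the Redis-backed message queue named in the text, threads stand in for cluster workers, and `run_cluster` and the placeholder page body are hypothetical.

```python
import queue
import threading


def run_cluster(urls, worker_count=3):
    """Dispatch queued crawl tasks to whichever worker is idle."""
    tasks = queue.Queue()   # in-process stand-in for the Redis-backed queue
    results = {}
    lock = threading.Lock()

    for url in urls:        # build the task queue from the acquisition requests
        tasks.put(url)

    def worker(name):
        while True:
            try:
                url = tasks.get_nowait()   # an idle worker takes the next task
            except queue.Empty:
                return                     # no tasks left for this worker
            page = f"<html>{url}</html>"   # placeholder for the real fetch
            with lock:
                results[url] = (name, page)
            tasks.task_done()

    threads = [threading.Thread(target=worker, args=(f"worker-{i}",))
               for i in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_cluster([f"https://example.com/page/{i}" for i in range(10)])
```

Because each worker pulls a task only when it has nothing else to do, adding workers raises throughput without changing how tasks are submitted.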

In any of the above technical solutions, preferably, controlling the web crawler to use the idle resources to execute the scheduled tasks in the task queue when there are idle resources in the cluster, so as to obtain the web data of the target website, specifically includes: obtaining the URL of the target website based on the idle resources; sending the URL to a downloader, so that the downloader generates and returns the page data corresponding to the URL; and processing the page data and storing the processed page data.

In this technical solution, when there are idle resources in the cluster, the process by which the crawler executes the scheduled tasks in the task queue is as follows. Based on the idle resources, the URL of the target website is first obtained and scheduled in the scheduler in the form of a Request. The URL is passed through the download middleware to the downloader, which downloads the page data corresponding to the URL and generates a Response for the page; the Response is then returned through the download middleware. The page data is processed, that is, cleaned, validated, and persisted, and is stored in the underlying data storage system. This controls the data flow and completes the acquisition of a large amount of data.
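
The data flow described above can be sketched in a few lines. The class and function names below (`Request`, `Response`, `Scheduler`, `middleware_outbound`, `middleware_inbound`, `downloader`, `crawl`) are hypothetical, the downloader returns a placeholder body instead of issuing an HTTP request, and a dictionary stands in for the underlying storage system.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    url: str


@dataclass
class Response:
    url: str
    body: str


class Scheduler:
    """FIFO scheduler that hands out pending requests."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, request):
        self._pending.append(request)

    def next_request(self):
        return self._pending.popleft() if self._pending else None


def middleware_outbound(request):
    # Hypothetical hook: normalise the request on its way to the downloader.
    return Request(request.url.strip())


def downloader(request):
    # Placeholder download; a real downloader would fetch the page over HTTP.
    return Response(request.url, f"  <title>{request.url}</title>  ")


def middleware_inbound(response):
    # Hypothetical hook: tidy the response on its way back from the downloader.
    return Response(response.url, response.body.strip())


def crawl(urls):
    scheduler, storage = Scheduler(), {}
    for url in urls:
        scheduler.enqueue(Request(url))          # URLs are scheduled as Requests
    while (request := scheduler.next_request()) is not None:
        response = middleware_inbound(downloader(middleware_outbound(request)))
        if response.body:                        # skip empty pages
            storage[response.url] = response.body
    return storage


stored = crawl([" https://example.com/a ", "https://example.com/b"])
```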

In any of the above technical solutions, preferably, after establishing the task queue according to the acquisition request, the method further includes: obtaining the web crawler corresponding to the target website according to a preset correspondence between web crawlers and websites; and obtaining the configured crawl period of the web crawler corresponding to the target website, so that the web crawler crawls web data according to the crawl period.

In this technical solution, before the overall data crawling process starts, the correspondence between web crawlers and target websites is set, with each web crawler responsible for one or several specific target websites. According to this correspondence, the web crawler for the target website of the current crawl is obtained, and the crawl period configured for that web crawler is obtained; crawling may be performed daily, hourly, or weekly. Data is crawled according to the configured crawl period. In this way, the time period for crawling data can be freely defined, data is crawled on a schedule, and data crawling is automated.
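
One possible representation of the preset correspondence and the crawl periods is a pair of lookup tables, as in this sketch. The crawler names, site names, and periods are invented for illustration; the original does not specify how the correspondence is stored.

```python
from datetime import timedelta

# Hypothetical crawler-to-site correspondence; each crawler handles one or more sites.
CRAWLER_SITES = {
    "news_crawler": ["news.example.com", "press.example.com"],
    "forum_crawler": ["forum.example.com"],
}

# Hypothetical crawl period configured for each crawler.
CRAWL_PERIODS = {
    "news_crawler": timedelta(hours=1),
    "forum_crawler": timedelta(days=1),
}


def crawler_for(site):
    """Return the crawler responsible for the given target website."""
    for crawler, sites in CRAWLER_SITES.items():
        if site in sites:
            return crawler
    raise KeyError(f"no crawler configured for {site}")


def crawl_period_for(site):
    """Return the crawl period configured for the site's crawler."""
    return CRAWL_PERIODS[crawler_for(site)]
```

For example, `crawl_period_for("forum.example.com")` yields a one-day period under these assumed tables.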

In any of the above technical solutions, preferably, the task queue is an asynchronous task queue based on distributed message passing.

In this technical solution, during data acquisition, the task queue established according to the acquisition request is an asynchronous task queue based on distributed message passing. This enables distributed processing of a large number of messages and improves the efficiency and flexibility of task processing and the reliability of data processing.

According to the second aspect of the present invention, a web data acquisition system is proposed, including: an arrangement unit, configured to set up the crawler environment of a virtual machine to be added; an adding unit, configured to obtain the IP address of the virtual machine to be added and add the IP address to the master node configuration; an updating unit, configured to control the host to update the run script, so that the virtual machine to be added and the virtual machines already added obtain the latest run code; and a start unit, configured to, when a task start instruction for the virtual machine to be added is received, execute the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition.

In the web data acquisition system provided by the present invention, management is performed through a scheduling platform. The arrangement unit sets up the crawler environment of the virtual machine to be added, so that the virtual machine can crawl data once it joins the data acquisition cluster. The adding unit obtains the IP address of the virtual machine on which the crawler environment has been set up and adds it to the configuration of the master node. The updating unit then controls the master node to update the run script, so that all machines (the virtual machine to be added and the virtual machines already added) obtain the latest run code from the open-source distributed version control system Git. When a task start instruction for the virtual machine to be added is received, the start unit executes the task start instruction according to the latest run code, whereupon the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition. When the number of data sources increases substantially, a newly added worker node only needs to start a task in order to join the data acquisition cluster. This allows web data crawling and storage to scale horizontally, improves the efficiency of crawling and storing data, and completes data acquisition within a limited time. In addition, the present invention can perform data acquisition on a schedule and can easily switch between production and test environments.

The above web data acquisition system according to the present invention may further have the following technical features:

In the above technical solution, preferably, the system further includes: a request unit, configured to receive a web data acquisition request for a target website; an establishing unit, configured to establish a task queue according to the acquisition request; and a control unit, configured to, when there are idle resources in the cluster, control a web crawler to use the idle resources to execute the scheduled tasks in the task queue, so as to obtain the web data of the target website.

In this technical solution, after the task start instruction is executed according to the latest run code so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition, the request unit receives a web data acquisition request for the target website and data crawling begins. The establishing unit establishes a task queue according to the acquisition request, and a scheduling framework allocates the crawler resources of the cluster. When there are idle resources in the cluster, the control unit controls the web crawler to use them to fetch scheduled tasks from the message queue and to execute the task of obtaining the target website's data. Redis serves as the transport for task messages, which enables real-time processing of the task queue and scheduling of tasks.

In any of the above technical solutions, preferably, the control unit specifically includes: a first acquisition unit, configured to obtain the URL of the target website based on the idle resources; a download unit, configured to send the URL to a downloader, so that the downloader generates and returns the page data corresponding to the URL; and a processing unit, configured to process the page data and store the processed page data.

In this technical solution, when there are idle resources in the cluster, the process by which the control unit controls the crawler to execute the scheduled tasks in the task queue is as follows. Based on the idle resources, the first acquisition unit obtains the URL of the target website, which is scheduled in the scheduler in the form of a Request. The download unit passes the URL through the download middleware to the downloader, which downloads the page data corresponding to the URL and generates a Response for the page; the Response is then returned through the download middleware. The processing unit processes the page data, that is, cleans, validates, and persists it, and stores it in the underlying data storage system. This controls the data flow and completes the acquisition of a large amount of data.

In any of the above technical solutions, preferably, the system further includes: a second acquisition unit, configured to obtain the web crawler corresponding to the target website according to a preset correspondence between web crawlers and websites; and a third acquisition unit, configured to obtain the configured crawl period of the web crawler corresponding to the target website, so that the web crawler crawls web data according to the crawl period.

In this technical solution, before the overall data crawling process starts, the correspondence between web crawlers and target websites is set, with each web crawler responsible for one or several specific target websites. According to this correspondence, the second acquisition unit obtains the web crawler for the target website of the current crawl, and the third acquisition unit obtains the crawl period configured for that web crawler; crawling may be performed daily, hourly, or weekly. Data is crawled according to the configured crawl period. In this way, the time period for crawling data can be freely defined, data is crawled on a schedule, and data crawling is automated.

According to the third aspect of the present invention, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the following steps: setting up the crawler environment of a virtual machine to be added; obtaining the IP address of the virtual machine to be added and adding the IP address to the master node configuration; controlling the host to update the run script, so that the virtual machine to be added and the virtual machines already added obtain the latest run code; and, when a task start instruction for the virtual machine to be added is received, executing the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition.

In the computer device provided by the present invention, the processor implements the following when executing the computer program. Management is performed through a scheduling platform. The crawler environment of the virtual machine to be added is set up so that the virtual machine can crawl data once it joins the data acquisition cluster. The IP address of the virtual machine on which the crawler environment has been set up is obtained and added to the configuration of the master node. The master node is then controlled to update the run script, so that all machines (the virtual machine to be added and the virtual machines already added) obtain the latest run code from the open-source distributed version control system Git. When a task start instruction for the virtual machine to be added is received, the task start instruction is executed according to the latest run code, whereupon the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition. When the number of data sources increases substantially, a newly added worker node only needs to start a task in order to join the data acquisition cluster. This allows web data crawling and storage to scale horizontally, improves the efficiency of crawling and storing data, and completes data acquisition within a limited time. In addition, the present invention can perform data acquisition on a schedule and can easily switch between production and test environments.

According to the fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps: setting up the crawler environment of a virtual machine to be added; obtaining the IP address of the virtual machine to be added and adding the IP address to the master node configuration; controlling the host to update the run script, so that the virtual machine to be added and the virtual machines already added obtain the latest run code; and, when a task start instruction for the virtual machine to be added is received, executing the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition.

The computer-readable storage medium provided by the present invention stores a computer program which, when executed by a processor, implements the following. Management is performed through a scheduling platform. The crawler environment of the virtual machine to be added is set up so that the virtual machine can crawl data once it joins the data acquisition cluster. The IP address of the virtual machine on which the crawler environment has been set up is obtained and added to the configuration of the master node. The master node is then controlled to update the run script, so that all machines (the virtual machine to be added and the virtual machines already added) obtain the latest run code from the open-source distributed version control system Git. When a task start instruction for the virtual machine to be added is received, the task start instruction is executed according to the latest run code, whereupon the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition. When the number of data sources increases substantially, a newly added worker node only needs to start a task in order to join the data acquisition cluster. This allows web data crawling and storage to scale horizontally, improves the efficiency of crawling and storing data, and completes data acquisition within a limited time. In addition, the present invention can perform data acquisition on a schedule and can easily switch between production and test environments.

Additional aspects and advantages of the present invention will become apparent in the following description, or will be learned through practice of the present invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 shows a schematic flowchart of the web data acquisition method of one embodiment of the present invention;

Fig. 2 shows a schematic flowchart of the web data acquisition method of another embodiment of the present invention;

Fig. 3 shows a schematic flowchart of the web data acquisition method of yet another embodiment of the present invention;

Fig. 4 shows a schematic block diagram of the web data acquisition system of one embodiment of the present invention;

Fig. 5 shows a schematic block diagram of the web data acquisition system of another embodiment of the present invention;

Fig. 6 shows a schematic diagram of web data acquisition according to a specific embodiment of the present invention;

Fig. 7 shows a schematic block diagram of the computer device of one embodiment of the present invention.

Detailed description of the embodiments

To make the above aspects, features, and advantages of the present invention more clearly understood, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another.

Many specific details are set forth in the following description to facilitate a thorough understanding of the present invention. However, the present invention may also be implemented in ways other than those described here; therefore, the scope of protection of the present invention is not limited by the specific embodiments disclosed below.

An embodiment of the first aspect of the present invention proposes a web data acquisition method. Fig. 1 shows a schematic flowchart of the web data acquisition method of one embodiment of the present invention:

Step 102, set up the crawler environment of the virtual machine to be added;

Step 104, obtain the IP address of the virtual machine to be added, and add the IP address to the master node configuration;

Step 106, control the host to update the run script, so that the virtual machine to be added and the virtual machines already added obtain the latest run code;

Step 108, when a task start instruction for the virtual machine to be added is received, execute the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition.

In the web data acquisition method provided by this embodiment, management is performed through a scheduling platform. The crawler environment of the virtual machine to be added is set up so that the virtual machine can crawl data once it joins the data acquisition cluster. The IP address of the virtual machine on which the crawler environment has been set up is obtained and added to the configuration of the master node. The master node is then controlled to update the run script, so that all machines (the virtual machine to be added and the virtual machines already added) obtain the latest run code from the open-source distributed version control system Git. When a task start instruction for the virtual machine to be added is received, the task start instruction is executed according to the latest run code, whereupon the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition. When the number of data sources increases substantially, a newly added worker node only needs to start a task in order to join the data acquisition cluster. This allows web data crawling and storage to scale horizontally, improves the efficiency of crawling and storing data, and completes data acquisition within a limited time. In addition, the present invention can perform data acquisition on a schedule and can easily switch between production and test environments.
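
The step in which every machine obtains the latest run code from Git could, for instance, be driven by a run script that issues one update command per configured IP address. The sketch below only builds the commands and does not execute them. The function name `build_update_commands`, the checkout path `/opt/crawler`, the branch name, and the use of `ssh` to reach each worker are all assumptions rather than details from the original.

```python
import shlex


def build_update_commands(worker_ips, repo_dir, branch="master"):
    """Return, per worker IP, an ssh command that fast-forwards its checkout."""
    # `git -C <dir>` runs git in that directory; `--ff-only` refuses to merge.
    remote_cmd = (
        f"git -C {shlex.quote(repo_dir)} pull --ff-only "
        f"origin {shlex.quote(branch)}"
    )
    return {ip: ["ssh", ip, remote_cmd] for ip in worker_ips}


commands = build_update_commands(["10.0.0.1", "10.0.0.2"], "/opt/crawler")
```

Each value is an argument list suitable for `subprocess.run`, so a run script could iterate over the master node configuration and update the new and existing virtual machines in the same pass.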

Fig. 2 shows a schematic flowchart of the web data acquisition method of another embodiment of the present invention. The method includes:

Step 202, set up the crawler environment of the virtual machine to be added;

Step 204, obtain the IP address of the virtual machine to be added, and add the IP address to the master node configuration;

Step 206, control the host to update the run script, so that the virtual machine to be added and the virtual machines already added obtain the latest run code;

Step 208, when a task start instruction for the virtual machine to be added is received, execute the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition;

Step 210, receive a web data acquisition request for a target website;

Step 212, establish a task queue according to the acquisition request;

Step 214, when there are idle resources in the cluster, control a web crawler to use the idle resources to execute the scheduled tasks in the task queue, so as to obtain the web data of the target website.

In this embodiment, management is performed through a scheduling platform. When the number of data sources increases substantially, a newly added worker node only needs to start a task in order to join the data acquisition cluster. This allows web data crawling and storage to scale horizontally, improves the efficiency of crawling and storing data, and completes data acquisition within a limited time. In addition, the present invention can perform data acquisition on a schedule and can easily switch between production and test environments.

In this embodiment, after the task start instruction is executed according to the latest run code so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition, a web data acquisition request for the target website is received and data crawling begins. A task queue is established according to the acquisition request, and a scheduling framework allocates the crawler resources of the cluster. When there are idle resources in the cluster, the web crawler is controlled to use them to fetch scheduled tasks from the message queue and to execute the task of obtaining the target website's data. Redis serves as the transport for task messages, which enables real-time processing of the task queue and scheduling of tasks.

In this embodiment, during data acquisition, the task queue established according to the acquisition request is an asynchronous task queue based on distributed message passing. This enables distributed processing of a large number of messages and improves the efficiency and flexibility of task processing and the reliability of data processing.
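
The asynchronous character of such a queue can be illustrated with a single-process sketch in which the producer enqueues messages without waiting for them to be handled. Here `asyncio.Queue` is only a stand-in for a distributed message transport, and the `worker` and `main` coroutines are hypothetical.

```python
import asyncio


async def worker(name, tasks, results):
    # Each worker repeatedly takes a message, handles it, and acknowledges it.
    while True:
        url = await tasks.get()
        await asyncio.sleep(0)          # placeholder for non-blocking I/O
        results.append((name, url))
        tasks.task_done()


async def main(urls, worker_count=2):
    tasks, results = asyncio.Queue(), []
    workers = [asyncio.create_task(worker(f"w{i}", tasks, results))
               for i in range(worker_count)]
    for url in urls:
        tasks.put_nowait(url)           # the producer does not wait for processing
    await tasks.join()                  # resume once every message is acknowledged
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


processed = asyncio.run(main(["a", "b", "c", "d"]))
```

The order in which workers pick up messages is not guaranteed, which is why only the set of processed messages, not their order, should be relied on.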

Fig. 3 shows a schematic flowchart of the web data acquisition method of yet another embodiment of the present invention. The method includes:

Step 302, set up the crawler environment of the virtual machine to be added;

Step 304, obtain the IP address of the virtual machine to be added, and add the IP address to the master node configuration;

Step 306, control the host to update the run script, so that the virtual machine to be added and the virtual machines already added obtain the latest run code;

Step 308, when a task start instruction for the virtual machine to be added is received, execute the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition;

Step 310, receive a web data acquisition request for a target website;

Step 312, establish a task queue according to the acquisition request;

Step 314, obtain the web crawler corresponding to the target website according to a preset correspondence between web crawlers and websites;

Step 316, obtain the configured crawl period of the web crawler corresponding to the target website, so that the web crawler crawls web data according to the crawl period;

Step 318, obtain the URL of the target website based on the idle resources;

Step 320, send the URL to a downloader, so that the downloader generates and returns the page data corresponding to the URL;

Step 322, process the page data and store the processed page data.

In this embodiment, when there are idle resources in the cluster, the process by which the crawler executes the scheduled tasks in the task queue is as follows. Based on the idle resources, the URL of the target website is first obtained and scheduled in the scheduler in the form of a Request. The URL is passed through the download middleware to the downloader, which downloads the page data corresponding to the URL and generates a Response for the page; the Response is then returned through the download middleware. The page data is processed, that is, cleaned, validated, and persisted, and is stored in the underlying data storage system. This controls the data flow and completes the acquisition of a large amount of data.
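
The processing stage (cleaning, validation, persistence) might be sketched as below. An in-memory SQLite database stands in for the underlying data storage system, which the original does not name; the `pages` table and the `clean`, `is_valid`, and `persist` helpers are hypothetical.

```python
import sqlite3


def clean(raw):
    """Collapse runs of whitespace and trim the page text."""
    return " ".join(raw.split())


def is_valid(record):
    """Accept only records that have both a URL and a non-empty body."""
    return bool(record["url"]) and bool(record["body"])


def persist(conn, record):
    """Insert the record, replacing any earlier copy of the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
        (record["url"], record["body"]),
    )


conn = sqlite3.connect(":memory:")  # stand-in for the underlying storage system
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, body TEXT)")

downloaded = [
    ("https://example.com/a", "  Hello \n world "),
    ("https://example.com/b", "   "),  # empty after cleaning, so it is rejected
]
for url, raw in downloaded:
    record = {"url": url, "body": clean(raw)}
    if is_valid(record):
        persist(conn, record)
conn.commit()

rows = conn.execute("SELECT url, body FROM pages").fetchall()
```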

In this embodiment, before the overall data crawling process starts, the correspondence between web crawlers and target websites is set, with each web crawler responsible for one or several specific target websites. According to this correspondence, the web crawler for the target website of the current crawl is obtained, and the crawl period configured for that web crawler is obtained; crawling may be performed daily, hourly, or weekly. Data is crawled according to the configured crawl period. In this way, the time period for crawling data can be freely defined, data is crawled on a schedule, and data crawling is automated.
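
Scheduled crawling according to a configured period reduces to checking, at each tick, which crawlers are due. The following sketch assumes a hypothetical `schedule` mapping from crawler name to its last run time and period; the crawler names and timestamps are illustrative only.

```python
from datetime import datetime, timedelta


def is_due(last_run, period, now):
    """A crawler is due if it has never run or its period has elapsed."""
    return last_run is None or now - last_run >= period


def due_crawlers(schedule, now):
    """schedule maps crawler name -> (last_run, period); return the names due now."""
    return sorted(
        name
        for name, (last_run, period) in schedule.items()
        if is_due(last_run, period, now)
    )


now = datetime(2017, 11, 22, 12, 0)
schedule = {
    "hourly": (now - timedelta(minutes=90), timedelta(hours=1)),
    "daily": (now - timedelta(hours=5), timedelta(days=1)),
    "weekly": (None, timedelta(weeks=1)),
}
due = due_crawlers(schedule, now)
```

With these sample values the hourly crawler (last run 90 minutes ago) and the weekly crawler (never run) are due, while the daily crawler is not.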

An embodiment of the second aspect of the present invention proposes a web data acquisition system 400. Fig. 4 shows a schematic block diagram of the web data acquisition system 400 of one embodiment of the present invention. The web data acquisition system 400 includes: an arrangement unit 10, an adding unit 12, an updating unit 14, and a start unit 16.

In the web data acquisition system 400 provided by this embodiment, management is performed through a scheduling platform. The arrangement unit 10 sets up the crawler environment of the virtual machine to be added, so that the virtual machine can crawl data once it joins the data acquisition cluster. The adding unit 12 obtains the IP address of the virtual machine on which the crawler environment has been set up and adds it to the configuration of the master node. The updating unit 14 then controls the master node to update the run script, so that all machines (the virtual machine to be added and the virtual machines already added) obtain the latest run code from the open-source distributed version control system Git. When a task start instruction for the virtual machine to be added is received, the start unit 16 executes the task start instruction according to the latest run code, whereupon the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition. When the number of data sources increases substantially, a newly added worker node only needs to start a task in order to join the data acquisition cluster. This allows web data crawling and storage to scale horizontally, improves the efficiency of crawling and storing data, and completes data acquisition within a limited time. In addition, the present invention can perform data acquisition on a schedule and can easily switch between production and test environments.

Fig. 5 shows a schematic block diagram of the web data acquisition system 500 of another embodiment of the present invention. The web data acquisition system 500 includes: an arrangement unit 20, an adding unit 22, an updating unit 24, a start unit 26, a request unit 28, an establishing unit 30, a control unit 32, a second acquiring unit 34, and a third acquiring unit 36. The control unit 32 specifically includes: a first acquiring unit 322, a download unit 324, and a processing unit 326.

The web data acquisition system 500 provided in this embodiment is managed through a scheduling platform. The arrangement unit 20 sets up the crawler environment of the virtual machine to be added, so that the virtual machine can crawl data once it joins the data acquisition cluster. The adding unit 22 obtains the IP address of the virtual machine on which the crawler environment has been set up and adds the IP address to the configuration of the master node. The updating unit 24 then controls the host to update the run script, so that all machines (the virtual machine to be added and the virtual machines already added) obtain the latest run code from the open-source distributed version control system Git. When the start unit 26 receives the task start instruction of the virtual machine to be added, it executes the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition. When the number of data sources increases substantially, a newly added worker node only needs to start its task to join the data acquisition cluster. This achieves horizontal scaling of web data crawling and storage, improves the efficiency of crawling and storing data, and allows data collection to be completed within a limited time. In addition, the present invention can perform data acquisition on a schedule and can easily switch between production and test environments.

In this embodiment, after the task start instruction is executed according to the latest run code so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition, the request unit 28 receives a web data acquisition request for a target website, and data crawling begins. The establishing unit 30 establishes a task queue according to the acquired data request. The task queue is an asynchronous task queue based on distributed message passing, which enables distributed processing of a large number of messages and improves the efficiency of task processing and the flexibility and reliability of data processing. A scheduling framework allocates the crawler resources of the cluster: when idle resources are available in the cluster, the control unit 32 controls a web crawler to use the idle resources to obtain a task plan from the message queue and execute the task of acquiring the target website's data. Redis serves as the transport for task messages, enabling processing of the real-time task queue and scheduling of tasks.
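The pull model described above, in which idle workers take tasks from a shared queue, can be illustrated with a small sketch. It deliberately uses Python's standard `queue` and `threading` modules as an in-process stand-in for the distributed queue and Redis transport of the embodiment; `run_cluster`, `fetch`, and the worker names are hypothetical.

```python
import queue
import threading


def fetch(url):
    """Placeholder for the real crawl of a target website (assumption)."""
    return f"<page for {url}>"


def run_cluster(requests, worker_count=3):
    """Queue one task per acquisition request; idle workers pull and run them."""
    tasks = queue.Queue()      # stand-in for the message queue
    results = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # A worker only asks for work when it is idle.
                url = tasks.get_nowait()
            except queue.Empty:
                return         # no task plans left in the queue
            page = fetch(url)
            with lock:
                results[url] = (name, page)
            tasks.task_done()

    # The task queue is established from the incoming requests.
    for url in requests:
        tasks.put(url)

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because workers pull tasks rather than having tasks pushed to them, adding another worker requires no change to the producer side, which is the property the embodiment relies on for horizontal scaling.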

In this embodiment, when idle resources are available in the cluster, the control unit 32 controls the crawler to execute the task plans in the task queue as follows. Based on the idle resources, the first acquiring unit 322 obtains the URL of the target website, and the URL is scheduled in the scheduler in the form of a Request. The download unit 324 forwards the URL to the downloader through the downloader middleware; the downloader downloads the page data corresponding to the URL, generates a Response for the page, and returns it through the downloader middleware. The processing unit 326 processes the page data, which is cleaned, validated, and persisted, that is, stored in the underlying data storage system. This realizes control of the data flow and completes large-scale data acquisition.

In this embodiment, before the data crawling process starts, a correspondence between web crawlers and target websites is set, with each web crawler responsible for handling one or several specific target websites. According to this correspondence, the second acquiring unit 34 obtains the web crawler for the target website of the current crawl, and the third acquiring unit 36 obtains the crawl cycle set for that web crawler; the cycle may be daily, hourly, or weekly. Data is then crawled according to the set crawl cycle. In this way, the time cycle for crawling data can be freely defined, data is crawled on a schedule, and data crawling is automated.

Fig. 6 shows a schematic diagram of web data acquisition in a specific embodiment of the present invention. The data acquisition device of this specific embodiment includes a scheduling management module and a data crawling engine. It is managed through the scheduling management module, can perform data acquisition on a schedule, and can easily switch between production and test environments. Horizontal scaling is also very easy: the IP address information of the virtual machine on which the crawler environment has been set up is added to the configuration of the master node, and the update script is executed on the host, so that all machines (including the newly added virtual machine) obtain the latest code from Git. A newly added worker node then only needs to start its worker to join the cluster that crawls web pages.
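The scale-out procedure above has three steps: register the IP address, run the update script, and start the worker. A minimal model of that bookkeeping is sketched below. It does not touch real machines or Git; `MasterConfig`, `add_worker`, `run_update_script`, and `can_start_worker` are hypothetical names, and the version strings stand in for whatever revision the machines pull.

```python
from dataclasses import dataclass, field


@dataclass
class MasterConfig:
    """Hypothetical model of the master node configuration."""
    worker_ips: list = field(default_factory=list)
    # Maps each machine's IP to the code version it last pulled.
    code_version: dict = field(default_factory=dict)


def add_worker(config, ip):
    """Step 1: add the new virtual machine's IP to the master configuration."""
    if ip not in config.worker_ips:
        config.worker_ips.append(ip)


def run_update_script(config, latest_version):
    """Step 2: stand-in for the update script run on the host.

    Every registered machine, old or new, ends up on the latest code.
    """
    for ip in config.worker_ips:
        config.code_version[ip] = latest_version


def can_start_worker(config, ip, latest_version):
    """Step 3: a worker may start once it is registered and up to date."""
    return ip in config.worker_ips and config.code_version.get(ip) == latest_version
```

In this model a machine that has not been registered, or that was registered after the last update run, is not yet allowed to start, which mirrors the ordering of the steps in the text.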

In this specific embodiment, the scheduling management module is based on an asynchronous task queue with distributed message passing. Celery is used for scheduling and for allocating crawler resources, and Redis is used as the message transport to send and receive messages. When a worker has idle resources, it obtains a crawler task from the Redis message queue and executes it. Users can also set a different crawl cycle for each crawler (daily, hourly, weekly, and so on). The data crawling engine includes the following components:

Engine: responsible for controlling the flow of data between all components of the system and for triggering events when the corresponding actions occur;

Scheduler: receives requests from the engine and enqueues them, so that they can be supplied to the engine when it later asks for them;

Downloader: responsible for fetching page data and providing it to the engine, which then provides it to the spider;

Spiders: classes written to analyze responses and extract items (the scraped items) or additional URLs to follow up; each spider is responsible for handling one (or several) specific websites;

Item Pipeline: responsible for processing the items extracted by the spiders; the data can be cleaned, validated, and persisted (stored in the underlying data storage system, such as HBase, MySQL, Solr, or Elasticsearch);

Data flow: the data flow is controlled by the engine.
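The three Item Pipeline stages named above (cleaning, validation, persistence) can be sketched as plain functions. This is an assumption-laden illustration: the item fields `url` and `title`, the validation rule, and the `InMemoryStore` class are hypothetical, and the store stands in for the underlying data storage system.

```python
def clean(item):
    """Cleaning: strip surrounding whitespace from every string field."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in item.items()
    }


def validate(item):
    """Validation: require non-empty 'url' and 'title' fields (assumed rule)."""
    return bool(item.get("url")) and bool(item.get("title"))


class InMemoryStore:
    """Stand-in for the underlying data storage system."""

    def __init__(self):
        self.rows = []

    def save(self, item):
        self.rows.append(item)


def process_item(item, store):
    """Run one scraped item through clean, validate, persist.

    Returns True if the item was stored, False if it was rejected.
    """
    cleaned = clean(item)
    if not validate(cleaned):
        return False
    store.save(cleaned)
    return True
```

Keeping the stages separate makes it straightforward to swap the store for a different backend without changing the cleaning or validation logic.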

As shown in the schematic diagram of web data acquisition in Fig. 6, the data flow control process in the data acquisition device of this specific embodiment is as follows:

(1) The engine opens a website to be crawled, finds the spider that handles the website, and requests the first URL to crawl from that spider;

(2) The engine obtains the first URL to crawl from the spider and schedules it in the scheduler as a Request;

(3) The engine requests the next URL to crawl from the scheduler;

(4) The scheduler returns the next URL to crawl to the engine, and the engine forwards the URL to the downloader through the downloader middleware (request direction);

(5) Once the page has been downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middleware (response direction);

(6) The engine receives the Response from the downloader and sends it to the spider for processing through the spider middleware;

(7) The spider processes the Response and returns the scraped Items and new Requests to the engine;

(8) The engine passes the Items returned by the spider to the Item Pipeline component;

(9) When there are no more requests in the scheduler, the engine closes the crawl of the website.
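The control loop described above can be sketched in a few lines. The sketch is a simplified, single-threaded model and not the engine of the embodiment: middleware is omitted, the `Request`, `Response`, and `LinkSpider` classes are hypothetical, the downloader is passed in as a plain function, and the `seen` set that skips already-scheduled URLs is an added assumption so that the example terminates.

```python
from collections import deque


class Request:
    def __init__(self, url):
        self.url = url


class Response:
    def __init__(self, url, body):
        self.url = url
        self.body = body


class LinkSpider:
    """Hypothetical spider: emits one item per page and follows listed URLs."""
    start_urls = ["http://site.example/"]

    def parse(self, response):
        yield {"url": response.url, "length": len(response.body)}
        for link in response.body.split():
            yield Request(link)


def crawl(spider, download, pipeline):
    """Engine loop: schedule, download, parse, route results."""
    # Steps (1)-(2): take the first URLs from the spider and schedule them.
    scheduler = deque(Request(url) for url in spider.start_urls)
    seen = {request.url for request in scheduler}  # assumption: skip repeats
    # Step (9): the crawl ends when the scheduler has no more requests.
    while scheduler:
        request = scheduler.popleft()                  # steps (3)-(4)
        body = download(request.url)                   # step (5)
        response = Response(request.url, body)
        for result in spider.parse(response):          # steps (6)-(7)
            if isinstance(result, Request):
                if result.url not in seen:
                    seen.add(result.url)
                    scheduler.append(result)
            else:
                pipeline(result)                       # step (8)
```

Passing `download` and `pipeline` in as callables keeps the loop independent of how pages are fetched and where items are stored, which makes it easy to exercise with canned pages.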

The data acquisition device of this specific embodiment can crawl a large number of websites; at present, 3 machines crawl more than 300 websites. The crawl cycle can be freely defined, and the cluster can be scaled horizontally without limit.

An embodiment of the third aspect of the present invention proposes a computer device. Fig. 7 shows a schematic block diagram of a computer device 700 of one embodiment of the present invention. The computer device 700 includes:

a memory 702, a processor 704, and a computer program stored on the memory 702 and executable on the processor 704. When executing the computer program, the processor 704 implements the following steps: arranging the crawler environment of the virtual machine to be added; obtaining the IP address of the virtual machine to be added and adding the IP address to the master node configuration; controlling the host to update the run script, so that the virtual machine to be added and the virtual machines already added obtain the latest run code; and, upon receiving the task start instruction of the virtual machine to be added, executing the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition.

In the computer device 700 provided by the present invention, the processor 704, when executing the computer program, implements the following: management is performed through a scheduling platform; the crawler environment of the virtual machine to be added is set up, so that the virtual machine can crawl data once it joins the data acquisition cluster; the IP address of the virtual machine on which the crawler environment has been set up is obtained and added to the configuration of the master node; the host is then controlled to update the run script, so that all machines (the virtual machine to be added and the virtual machines already added) obtain the latest run code from the open-source distributed version control system Git; and, when the task start instruction of the virtual machine to be added is received, the task start instruction is executed according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition. When the number of data sources increases substantially, a newly added worker node only needs to start its task to join the data acquisition cluster. This achieves horizontal scaling of web data crawling and storage, improves the efficiency of crawling and storing data, and allows data collection to be completed within a limited time. In addition, the present invention can perform data acquisition on a schedule and can easily switch between production and test environments.

An embodiment of the fourth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the following steps: arranging the crawler environment of the virtual machine to be added; obtaining the IP address of the virtual machine to be added and adding the IP address to the master node configuration; controlling the host to update the run script, so that the virtual machine to be added and the virtual machines already added obtain the latest run code; and, upon receiving the task start instruction of the virtual machine to be added, executing the task start instruction according to the latest run code, so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition.

In the description of this specification, terms such as "one embodiment", "some embodiments", and "specific embodiment" mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may be modified and varied in various ways. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A web data acquisition method, characterized by comprising:

arranging a crawler environment of a virtual machine to be added;

obtaining an IP address of the virtual machine to be added, and adding the IP address to a master node configuration;

controlling a host to update a run script, so that the virtual machine to be added and virtual machines already added obtain the latest run code;

upon receiving a task start instruction of the virtual machine to be added, executing the task start instruction according to the latest run code, so that the virtual machine to be added joins a cluster that crawls websites and starts web data acquisition.
2. The web data acquisition method according to claim 1, characterized in that, after executing the task start instruction according to the latest run code so that the virtual machine to be added joins the cluster that crawls websites and starts web data acquisition, the method further comprises:

receiving a web data acquisition request for a target website;

establishing a task queue according to the web data acquisition request;

when idle resources are available in the cluster, controlling a web crawler to use the idle resources to execute a task plan in the task queue, so as to obtain the web data of the target website.
3. The web data acquisition method according to claim 2, characterized in that, when idle resources are available in the cluster, controlling the web crawler to use the idle resources to execute the task plan in the task queue, so as to obtain the web data of the target website, specifically comprises:

obtaining a URL of the target website based on the idle resources;

sending the URL to a downloader, so that the downloader generates and returns the page data corresponding to the URL;

processing the page data, and storing the processed page data.
4. The web data acquisition method according to claim 2, characterized in that, after establishing the task queue according to the web data acquisition request, the method further comprises:

obtaining the web crawler corresponding to the target website according to a preset correspondence between web crawlers and websites;

obtaining the crawl cycle set for the web crawler corresponding to the target website, so that the web crawler crawls the web data according to the crawl cycle.
5. The web data acquisition method according to any one of claims 2 to 4, characterized in that

the task queue is an asynchronous task queue based on distributed message passing.
6. A web data acquisition system, characterized by comprising:

an arrangement unit, configured to arrange a crawler environment of a virtual machine to be added;

an adding unit, configured to obtain an IP address of the virtual machine to be added and add the IP address to a master node configuration;

an updating unit, configured to control a host to update a run script, so that the virtual machine to be added and virtual machines already added obtain the latest run code;

a start unit, configured to, upon receiving a task start instruction of the virtual machine to be added, execute the task start instruction according to the latest run code, so that the virtual machine to be added joins a cluster that crawls websites and starts web data acquisition.
7. The web data acquisition system according to claim 6, characterized by further comprising:

a request unit, configured to receive a web data acquisition request for a target website;

an establishing unit, configured to establish a task queue according to the web data acquisition request;

a control unit, configured to, when idle resources are available in the cluster, control a web crawler to use the idle resources to execute a task plan in the task queue, so as to obtain the web data of the target website.
8. The web data acquisition system according to claim 7, characterized in that the control unit specifically comprises:

a first acquiring unit, configured to obtain a URL of the target website based on the idle resources;

a download unit, configured to send the URL to a downloader, so that the downloader generates and returns the page data corresponding to the URL;

a processing unit, configured to process the page data and store the processed page data.
9. The web data acquisition system according to claim 7, characterized by further comprising:

a second acquiring unit, configured to obtain the web crawler corresponding to the target website according to a preset correspondence between web crawlers and websites;

a third acquiring unit, configured to obtain the crawl cycle set for the web crawler corresponding to the target website, so that the web crawler crawls the web data according to the crawl cycle.
10. The web data acquisition system according to any one of claims 7 to 9, characterized in that

the task queue is an asynchronous task queue based on distributed message passing.
11. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the web data acquisition method according to any one of claims 1 to 5.
12. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the web data acquisition method according to any one of claims 1 to 5.