CN107895009A

CN107895009A - A distributed internet data acquisition method and system

Info

Publication number: CN107895009A
Application number: CN201711105738.4A
Authority: CN
Inventors: 廖尚围; 刘遥; 周庚新
Original assignee: Beijing State Xin Hong Number Science And Technology Co Ltd
Current assignee: Zhang Wei
Priority date: 2017-11-10
Filing date: 2017-11-10
Publication date: 2018-04-10
Anticipated expiration: 2037-11-10
Also published as: CN107895009B

Abstract

The invention is a distributed internet data acquisition method. A user's request to create a data acquisition task is received and the task is created; the task is distributed to multiple crawler threads, and the crawler threads are started. Once a crawler thread has received the task and been started, it obtains URLs from the queue of URLs to be crawled, downloads web pages from the website specified by the task, and executes the data processing plug-ins specified by the task to extract data. The data processing plug-ins specified by the task are plug-ins with different functions, selected according to the type of the specified website. The extracted target data is stored in a specified database for subsequent processing, and the extracted URLs to be crawled are pushed to the queue to be crawled. The method enables multi-type, large-volume data crawling of complex websites.

Description

A distributed internet data acquisition method and system

Technical field

The present invention relates to the technical field of internet data acquisition, and more particularly to a distributed internet data acquisition method and system.

Background art

With the rapid development of networks, the internet has become a carrier of vast amounts of information, including public opinion, social events, reactions to policy, information on various industries, and the job market. This information is the data foundation for big-data public-opinion analysis systems and macroeconomic analysis systems. Extracting and using it efficiently is a major challenge. The web crawler is a very important part of a data analysis system: it collects web pages from the internet and gathers information, and this page information is used to build indexes that support search and analysis. The crawler determines whether the content of the whole data analysis system is rich and whether its information is timely, so its performance directly affects the quality of the data analysis.

As shown in Fig. 1, a typical crawler runs roughly as follows:

(1) The scheduler (Scheduler) takes a link (URL) from the queue of links (URLs) to be downloaded.

(2) The scheduler starts the acquisition module (Spiders).

(3) The acquisition module passes the URL to the downloader (Downloader), which downloads the resource.

(4) The target data is extracted as a target object (Item) and handed to the item pipeline (Item Pipeline) for further processing, such as storing it in a database or a file.

If what is parsed out is a link (URL), the link is inserted into the queue to be crawled.
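The generic flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PAGES` dictionary stands in for the network so the sketch runs offline, the regular expressions and all function names are hypothetical, and the `seen` set is an addition (not part of the flow described above) that keeps the loop from revisiting pages.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> HTML, so the sketch runs offline.
PAGES = {
    "http://example.test/": '<a href="http://example.test/a">A</a><h1>Home</h1>',
    "http://example.test/a": '<h1>Page A</h1><a href="http://example.test/">back</a>',
}

def download(url):
    """Downloader: fetch the resource for a URL (stubbed with PAGES)."""
    return PAGES.get(url, "")

def parse(html):
    """Spider: return (items, links) extracted from one page."""
    items = re.findall(r"<h1>(.*?)</h1>", html)
    links = re.findall(r'href="([^"]+)"', html)
    return items, links

def crawl(seed):
    queue = deque([seed])   # queue of URLs to be downloaded
    seen = {seed}           # added here so the loop terminates
    stored = []             # stands in for the item pipeline's database
    while queue:
        url = queue.popleft()          # (1) scheduler takes a URL
        html = download(url)           # (2)-(3) spider hands the URL to the downloader
        items, links = parse(html)     # (4) extract items
        stored.extend(items)           # item pipeline stores them
        for link in links:             # parsed links go back into the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored
```

Calling `crawl("http://example.test/")` walks both stub pages and returns the two extracted headings in crawl order.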

At present, one internet data acquisition method is based on workflow configuration. This method defines data collection rules by simulating the process of a person clicking through a browser, for example: first click a section and its pagination to obtain the list of data to collect, then enter each detail page, and finally extract the required fields.

This method uses a browser kernel to simulate a browser downloading the page and executing its JS code. Although it can collect all data that is visible, information not displayed on the page cannot be collected. Its biggest problem is that data can be collected only after the browser has parsed the JS code and rendered the data, which is slow (3 to 10 seconds) and also requires a lot of bandwidth. It often cannot meet demand when the volume of data to be collected is large.

Another internet data acquisition method uses a management page for configuring collection rules. The user enters the collection rules (section list, data extraction information, task execution information, and so on), and the acquisition task is published locally or to a server. The server performs data acquisition according to the configured collection rules. After acquisition completes, the user can download the data file, or the server pushes the data to the user.

This rule-configuration method can collect data from most simple websites. For more complex websites, for example those that require login, POST requests, special Header information, or a secondary AJAX request for part of the data, handling becomes rather cumbersome. It cannot handle cases that require frequent switching of dynamic IP proxies. Nor can it handle data that requires special processing to be recognized, such as data parsed from pictures or from audio.

Summary of the invention

In view of the shortcomings of the above methods, the present invention proposes a template-based, plug-in-based, distributed internet data acquisition method. According to the specific situation of each website, the method flexibly configures suitable acquisition templates and acquisition plug-ins, so that a unified method can be used to collect data from the wide variety of websites on the internet.

The present invention provides a template-based, plug-in-based, distributed internet data acquisition method and system. The system mainly comprises acquisition template customization, acquisition plug-in customization, and distributed data acquisition. Acquisition template customization covers task information and rule information. Acquisition plug-in customization provides custom plug-ins for site-specific collection, such as a data download plug-in, a proxy IP selection plug-in, a document parsing plug-in, a picture data parsing plug-in, and an audio parsing plug-in. Distributed data acquisition coordinates the crawler resources in an acquisition cluster and starts a certain number of crawlers to crawl data according to the needs of the acquisition task.

The present invention is a distributed internet data acquisition method: receive a user's request to create a data acquisition task and create the task, distribute the task created by the user to multiple crawler threads, and start the crawler threads;

after each crawler thread has received the data acquisition task and been started, obtain URLs from the queue of URLs to be crawled, download web pages from the website specified by the task, and execute the data processing plug-ins specified by the task to extract data, where the data processing plug-ins specified by the task are plug-ins with different functions selected according to the type of the specified website;

store the extracted target data in a specified database for subsequent processing, and push the extracted URLs to be crawled to the queue to be crawled.

Preferably, before an extracted URL to be crawled is pushed to the queue to be crawled, a deduplication check is performed on the URL to judge whether data has already been collected from it. If so, the URL is discarded; if not, the URL is pushed to the queue to be crawled.

Preferably, the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, and a URL parsing plug-in.

The login plug-in is used for websites that require login before data can be extracted.

The data parsing plug-in extracts data according to the extraction expression specified for each data field.

The metadata plug-in is used for websites where data must be extracted from both the list page and the detail page.

The paging parsing plug-in is used for websites where the paging information of a web page must be obtained.

The URL parsing plug-in is used to extract URLs.

Preferably, the data processing plug-ins further include a picture recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.

Preferably, the information that must be configured for the login plug-in includes a user list, the IDs of the HTML elements in the website's login page into which the login user name and password are entered, and the ID of the HTML element for the verification code. The user list contains login user names and passwords.

The present invention is also a distributed internet data acquisition system, comprising:

a control center, which receives a user's request to create a data acquisition task, creates the task, distributes the task created by the user to multiple crawler threads, and starts the crawler threads;

an acquisition center, in which each crawler thread, after receiving the data acquisition task and being started, obtains URLs from the queue of URLs to be crawled, downloads web pages from the website specified by the task, executes the data processing plug-ins specified by the task to extract data, and parses the extracted data, where the data processing plug-ins specified by the task are plug-ins with different functions selected according to the type of the specified website;

a data center, which stores the extracted target data in a specified database for subsequent processing;

a URL push center, which pushes the extracted URLs to be crawled to the queue to be crawled.

Preferably, the system further includes a deduplication center, which performs a deduplication check on each extracted URL before it is pushed to the queue to be crawled, judging whether data has already been collected from the URL. If so, the URL is discarded; if not, the URL is pushed to the queue to be crawled.

The paging parsing plug-in is used for websites where the paging information of a web page must be obtained.

The URL parsing plug-in is used to extract URLs.

The beneficial technical effects of the present invention are: (1) The distributed internet data acquisition method enables large-volume data crawling of complex websites. (2) Based on the data processing plug-in mechanism, plug-ins with the corresponding functions are selected from the plug-in list according to the needs of different types of websites, enabling multi-type, large-volume data crawling of complex websites. (3) By providing a deduplication center, the invention deduplicates the URLs produced by the distributed crawlers and avoids repeated crawling. (4) The invention proposes a message-queue-based method for coordinating the queue of URLs to be crawled that the distributed crawlers produce.

Brief description of the drawings

Fig. 1: Schematic diagram of the prior-art crawler operation flow;

Fig. 2: Schematic diagram of the distributed internet data acquisition method of Embodiment 1;

Fig. 3: Schematic diagram of the distributed internet data acquisition system of Embodiment 2;

Fig. 4: Acquisition task management in Embodiment 3;

Fig. 5: Acquisition task configuration page of Embodiment 3;

Fig. 6: URL rule configuration page of Embodiment 3;

Fig. 7: Embodiment 3, specifying the acquisition list for the website;

Fig. 8: Embodiment 3, page for specifying the scope of URL link extraction;

Fig. 9: Embodiment 3, page for specifying the metadata extraction expressions;

Fig. 10: Embodiment 3, page for specifying the expressions of the detailed-data extraction fields;

Fig. 11: Embodiment 3, specifying the acquisition servers for the website;

Fig. 12: Source file management page of Embodiment 3;

Fig. 13: Embodiment 3, the control center starting and stopping the acquisition center, viewing data, viewing seeds, and monitoring task running status;

Fig. 14: Specific data collected by the task in Embodiment 3;

Fig. 15: Section information to be collected by the task in Embodiment 3.

Detailed description

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments illustrate the invention but do not limit its scope.

Embodiment 1

As shown in Fig. 2, the present invention is a distributed internet data acquisition method comprising the following steps:

S101: Receive a user's request to create a data acquisition task and create the task, distribute the task created by the user to multiple crawler threads, and start the crawler threads.

First, the user's request to create a data acquisition task is received and data acquisition tasks are managed. Management includes creating, editing, and deleting acquisition tasks, and specifying a task's name, URL deduplication rule, interval between two successive URL crawls, number of worker threads, number of retries, and so on. Acquisition tasks are managed by category in a tree structure.

The acquisition task is defined according to the user's requirements: the list of sections to collect, the IP proxy usage strategy, the web page download mode (downloading via a browser kernel and via HttpClient are both supported), the data processing plug-ins, the URL extraction scope, the regular expression for list-page URLs, the regular expression for content-page URLs, the metadata information, the XPATH/CSS/REGEX expressions for extracting detailed data items, and so on.
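As a rough illustration of how the task settings listed above might be held together, the following Python sketch groups them in a single record. The class name, field names, defaults, and validation rules are all assumptions made for this example; the source does not specify a data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AcquisitionTask:
    """Hypothetical container for the task settings described in the text."""
    name: str
    seed_urls: List[str]                 # list of sections to collect
    download_mode: str = "httpclient"    # or "browser" (browser-kernel download)
    proxy_strategy: str = "none"
    worker_threads: int = 4
    retries: int = 3
    crawl_interval_s: float = 1.0        # interval between two URL fetches
    dedup_key: str = ""                  # business key used for URL deduplication
    plugins: List[str] = field(default_factory=list)
    list_url_regex: str = ""
    content_url_regex: str = ""
    field_expressions: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the task is usable."""
        problems = []
        if not self.seed_urls:
            problems.append("no seed URLs")
        if self.download_mode not in ("httpclient", "browser"):
            problems.append("unknown download mode: " + self.download_mode)
        if self.worker_threads < 1:
            problems.append("worker_threads must be >= 1")
        return problems
```

Keeping validation next to the record lets a control component reject an incomplete task before any crawler is started.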

After the task is created, the data acquisition task is distributed to the acquisition crawlers, which are started and begin collecting data. When the data acquisition task ends, data acquisition stops. Meanwhile, the running status of the acquisition crawlers, the CPU, network, memory, and disk status of the acquisition servers, and the state of the collected data are monitored; early warnings and alarms are raised on whether the collection rules can still collect data and whether the acquisition crawlers are running normally.

S102: After each crawler thread has received the data acquisition task and been started, it obtains URLs from the queue of URLs to be crawled, downloads web pages from the website specified by the task, and executes the data processing plug-ins specified by the task to extract data. The data processing plug-ins specified by the task are plug-ins with different functions selected according to the type of the specified website.

After an acquisition task has been created, the control center publishes it to the acquisition center, and the acquisition center extracts data according to the task sent by the control center. Data acquisition by the acquisition crawlers is the key part of the distributed internet data acquisition method in which data is downloaded and extracted. During distributed acquisition, tens to hundreds of crawler threads form a cluster, and one acquisition task is processed cooperatively by one or more crawler threads. When an acquisition task starts, the crawler threads taking part in it receive the start instruction at the same time. Each crawler actively obtains URLs to be crawled from the queue to be crawled, starts downloading web pages from the specified website, and executes all the data processing plug-ins specified by the task. The invention provides a plug-in list containing data processing plug-ins such as the login plug-in, data parsing plug-in, metadata parsing plug-in, paging parsing plug-in, and URL parsing plug-in, as well as special plug-ins such as the picture recognition plug-in, voice recognition plug-in, and QR code recognition plug-in. When collecting data, the acquisition crawler selects from the plug-in list the plug-ins with the functions that the particular type of website requires, enabling multi-type, large-volume data crawling of complex websites.
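One way to picture the plug-in list is a registry keyed by plug-in name, from which each task runs only the entries it names. The Python sketch below assumes exactly that; the registry, the decorator, the shared `page` dictionary, and the regular expressions are hypothetical, and only the names DataParser and LinkParser come from the source.

```python
import re

# Hypothetical plug-in registry: name -> callable(page) -> page.
# Each plug-in reads and enriches a shared "page" dict.
PLUGINS = {}

def plugin(name):
    """Decorator that adds a function to the plug-in list under `name`."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register

@plugin("DataParser")
def data_parser(page):
    match = re.search(r"<title>(.*?)</title>", page["html"])
    page["fields"]["title"] = match.group(1) if match else None
    return page

@plugin("LinkParser")
def link_parser(page):
    page["links"] = re.findall(r'href="([^"]+)"', page["html"])
    return page

def run_plugins(task_plugins, html):
    """Run only the plug-ins a task names, in the order it names them."""
    page = {"html": html, "fields": {}, "links": []}
    for name in task_plugins:
        page = PLUGINS[name](page)
    return page
```

A task for a simple site might name only `LinkParser`, while a task for a site with structured content names both; the crawler code itself does not change.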

The login plug-in is used for websites that require login before data can be extracted. Some websites only allow data to be collected after login, so the system provides a uniformly configured login plug-in. The information that must be configured for the plug-in includes the user list (containing login user names and passwords), the IDs of the HTML elements in the website's login page into which the login user name and password are entered, and the ID of the HTML element for the verification code. The system randomly picks a login user name and password from the user list and attempts to log in to the website. If a verification code appears, the system sends the verification-code image to a decoding server to obtain the code, enters it into the login page, and performs a simulated login. After a successful login the cookie information is saved so that data collection can continue. Each thread in the acquisition system logs in with one user's information, so collection with multiple accounts logged in simultaneously is supported. The login plug-in supports switching accounts by elapsed time, for example logging in with a different account every half hour, and also supports switching accounts after a certain number of web pages have been collected.
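The two account-switching policies (by elapsed time and by number of pages collected) can be sketched as follows. This is an assumption-laden illustration: the class and parameter names are hypothetical, accounts are cycled in order rather than picked at random, and the actual login request and cookie handling are omitted. Time is passed in explicitly so the behavior is deterministic.

```python
import itertools

class AccountRotator:
    """Hypothetical helper: hand out one account and switch to the next
    after max_age_s seconds or max_pages pages, whichever comes first."""

    def __init__(self, accounts, max_age_s=1800, max_pages=500):
        self._cycle = itertools.cycle(accounts)
        self.max_age_s = max_age_s
        self.max_pages = max_pages
        self._switch(now=0.0)

    def _switch(self, now):
        self.current = next(self._cycle)
        self._since = now     # time at which the current account was selected
        self._pages = 0       # pages fetched with the current account

    def account_for_next_page(self, now):
        """Return the account to use for a page fetched at time `now`."""
        too_old = now - self._since >= self.max_age_s
        too_many = self._pages >= self.max_pages
        if too_old or too_many:
            self._switch(now)   # a real crawler would log in again here
        self._pages += 1
        return self.current
```

With `max_age_s=1800` the rotator changes account every half hour, matching the example in the text; `max_pages` covers the page-count policy.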

The data parsing plug-in extracts data according to the extraction expression specified for each data field. The main function of this plug-in is data extraction: an extraction expression is specified for each data field, the supported expression types being XPath expressions, CSS expressions, and regular expressions, and the system uses these expressions to extract data from the HTML source of the web page.
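A minimal sketch of per-field extraction, restricted to regular expressions so that it needs only the Python standard library (XPath and CSS expressions would need an HTML parsing library and are not shown). The function name, the convention that the first capture group is the field value, and the sample page are assumptions for illustration.

```python
import re

def extract_fields(html, expressions):
    """Apply one regular expression per field; the first capture group is the value.

    Fields whose expression does not match are returned as None.
    """
    record = {}
    for name, pattern in expressions.items():
        match = re.search(pattern, html, flags=re.S)
        record[name] = match.group(1).strip() if match else None
    return record

# Hypothetical page and field expressions.
SAMPLE_HTML = '<h1> Tender notice </h1><span class="date">2017-11-10</span>'
SAMPLE_EXPRESSIONS = {
    "title": r"<h1>(.*?)</h1>",
    "date": r'class="date">(.*?)<',
    "price": r'class="price">(.*?)<',
}
```

Calling `extract_fields(SAMPLE_HTML, SAMPLE_EXPRESSIONS)` yields the title and date, and `None` for the price field that the sample page does not contain.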

The metadata plug-in is used for websites where data must be extracted from both the list page and the detail page. This plug-in mainly addresses cases where data is extracted in stages: part of the information (called metadata) is extracted on the list page, and the metadata is saved to the queue to be crawled together with the URL of the data's detail page. Later, when the detail page is collected, this part of the data can be taken from the metadata rather than extracted from the detail page. A typical example is collecting e-commerce data: information such as a product's price and sales volume is hard to obtain on the product detail page but easy to obtain in the product list. The price and sales volume are therefore collected first as metadata when the product list is collected, and when the product detail page is collected they are taken from the metadata.
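The staged extraction described above amounts to queueing each detail-page URL together with the metadata found on the list page. The Python sketch below illustrates this with the e-commerce example; the HTML layout, the regular expressions, and the function names are made up for the illustration.

```python
import re
from collections import deque

def parse_list_page(html):
    """Yield (detail_url, metadata) for every product row on a list page."""
    row = r'<li><a href="([^"]+)">.*?</a><b>(.*?)</b><i>(.*?)</i></li>'
    for url, price, sales in re.findall(row, html):
        yield url, {"price": price, "sales": sales}

def parse_detail_page(html, metadata):
    """Extract detail-only fields, then merge in the metadata from the list page."""
    title = re.search(r"<h1>(.*?)</h1>", html)
    record = {"title": title.group(1) if title else None}
    record.update(metadata)   # price and sales come from the list page
    return record

# Hypothetical list page: one product with price and sales volume.
LIST_HTML = '<ul><li><a href="/p/1">Pen</a><b>3.50</b><i>120</i></li></ul>'
to_crawl = deque(parse_list_page(LIST_HTML))   # entries are (url, metadata) pairs
```

When a crawler later pops `("/p/1", {...})` from the queue and downloads the detail page, `parse_detail_page` completes the record without having to find the price on that page.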

The paging parsing plug-in is used for websites where the paging information of a web page must be obtained. In the Web 2.0 era, paging information in a web page, such as next page, last page, and the current page number, often cannot be obtained directly from the page source; instead it is fetched from the website's server via AJAX and generated with JavaScript. The system provides a customization interface that lets data acquisition staff generate paging URLs according to the website's paging JavaScript logic. The URL parsing plug-in is used to extract URLs.
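As a hedged example of what such a customization might look like, the sketch below rebuilds paging URLs from a URL template, a total item count, and a page size. It assumes the site's JavaScript computes pages this way; the function name, the `{page}` placeholder, and the example URL are hypothetical.

```python
import math

def paging_urls(template, total_items, page_size, first_page=1):
    """Rebuild the paging links that a site's JavaScript would otherwise generate.

    `template` must contain a "{page}" placeholder for the page number.
    """
    pages = math.ceil(total_items / page_size)
    return [template.format(page=n) for n in range(first_page, first_page + pages)]
```

For 45 items at 20 per page, `paging_urls("http://example.test/list?page={page}", 45, 20)` produces three URLs, for pages 1 to 3.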

S103: Store the extracted target data in a specified database for subsequent processing, and push the extracted URLs to be crawled to the queue to be crawled.

After deduplication, denoising, and labeling, the extracted data is stored in the specified database for subsequent processing. The database uses a Hadoop distributed cluster to store the basic data, while frequently used data is stored in an ES index library for fast retrieval.

The extracted URLs to be crawled are pushed to the queue to be crawled; after a crawler has finished processing its data, it obtains the next URL to be crawled from the queue and continues collecting. The queue to be crawled supports distributed acquisition: each acquisition crawler obtains the URL addresses it is about to crawl from the queue, and the URL lists that crawlers parse out are all sent to the queue for unified scheduling. The acquisition crawlers' data acquisition logs and health status are reported to the control center in real time.
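The coordination pattern, in which every crawler pulls from and pushes to one shared queue, can be sketched in a single process with Python threads. This is a simplification under stated assumptions: `queue.Queue` stands in for the message queue a distributed deployment would use, the `LINKS` dictionary stands in for downloaded pages, and the in-memory `seen` set stands in for the deduplication step.

```python
import queue
import threading

# Hypothetical link graph standing in for downloaded pages: URL -> outgoing URLs.
LINKS = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": ["/"]}

def run_crawlers(seed, workers=3):
    to_crawl = queue.Queue()                 # shared queue of URLs to be crawled
    seen, lock = {seed}, threading.Lock()
    crawled = []

    def worker():
        while True:
            url = to_crawl.get()
            if url is None:                  # sentinel: stop this worker
                to_crawl.task_done()
                return
            for link in LINKS.get(url, []):  # "download" and parse the page
                with lock:
                    is_new = link not in seen
                    seen.add(link)
                if is_new:
                    to_crawl.put(link)       # push new URLs back to the queue
            with lock:
                crawled.append(url)
            to_crawl.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    to_crawl.put(seed)
    to_crawl.join()                          # wait until every queued URL is processed
    for _ in threads:
        to_crawl.put(None)
    for t in threads:
        t.join()
    return sorted(crawled)
```

New URLs are queued before the parent URL is marked done, so `join()` cannot return while work is still pending. The order in which workers pick up URLs is not deterministic, which is why the result is sorted.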

S104: Before an extracted URL to be crawled is pushed to the queue to be crawled, a deduplication check is performed on the URL to judge whether data has already been collected from it. If so, the URL is discarded; if not, the URL is pushed to the queue to be crawled.

A website's URL links appear on multiple pages. The system deduplicates using a UUID formed by taking the MD5 of the URL address plus a key (which can be defined according to business needs), ensuring that each URL address is collected only once. There are two modes of URL deduplication. Mode one deduplicates through a MongoDB database: each UUID is written to the database; if the write fails, the record already exists, otherwise it did not exist. The advantage of this mode is that it supports an unlimited number of crawlers; the disadvantage is that it consumes disk I/O. Mode two deduplicates with a Bloom filter. A Bloom filter is a highly space-efficient probabilistic data structure that uses a bit array to represent a set very compactly and can judge whether an element belongs to the set. This efficiency comes at a cost: when judging whether an element belongs to a set, it may mistakenly report an element that is not in the set as belonging to it (a false positive). Bloom filters are therefore unsuitable for applications that require zero errors. In applications that can tolerate a low error rate, a Bloom filter trades a small number of errors for a large saving in storage space.
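Both deduplication modes can be illustrated with the standard library alone. In this sketch, which is not the patented implementation, `fingerprint` computes the MD5 of URL plus key, `ExactDedup` uses an in-memory set where the text uses a database write that fails on duplicates, and `BloomDedup` is a small Bloom filter whose size `m_bits`, hash count `k`, and hashing scheme are arbitrary choices made for the example.

```python
import hashlib

def fingerprint(url, key=""):
    """MD5 of URL address + business key, as a 32-character hex string."""
    return hashlib.md5((url + key).encode("utf-8")).hexdigest()

class ExactDedup:
    """Mode one, sketched with a set in place of a database with a unique key."""
    def __init__(self):
        self._seen = set()

    def first_time(self, fp):
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

class BloomDedup:
    """Mode two: k bit positions per fingerprint in an m-bit array."""
    def __init__(self, m_bits=1 << 20, k=5):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, fp):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{fp}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def first_time(self, fp):
        """Return False if fp was (probably) seen before; record it either way."""
        positions = list(self._positions(fp))
        present = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return not present
```

`BloomDedup.first_time` can return `False` for a URL that was never seen (a false positive), in which case that URL is skipped; it never returns `True` for a URL that was already recorded. The exact mode has no such errors but its storage grows with every fingerprint.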

Embodiment 2

An acquisition center, in which each crawler thread, after receiving the data acquisition task and being started, obtains URLs from the queue of URLs to be crawled, downloads web pages from the website specified by the task, executes the data processing plug-ins specified by the task to extract data, and parses the extracted data. The data processing plug-ins specified by the task are plug-ins with different functions selected according to the type of the specified website.

A data center, which stores the extracted target data in a specified database for subsequent processing. The database uses a Hadoop distributed cluster to store the basic data, while frequently used data is stored in an ES index library for fast retrieval. The acquisition system collects data and sends it to the specified database (a Kafka message queue); a dedicated storage program in the data center is responsible for loading, storing, and querying it.

A deduplication center, which performs a deduplication check on each extracted URL before it is pushed to the queue to be crawled, judging whether data has already been collected from the URL. If so, the URL is discarded; if not, the URL is pushed to the queue to be crawled. The deduplication center of the invention manages itself: each acquisition task has a corresponding deduplication queue, which is created automatically when the acquisition task is created and deleted automatically when the acquisition task is deleted.

The data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, a URL parsing plug-in, a picture recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.

The login plug-in is used for websites that require login before data can be extracted. The information that must be configured for it includes a user list, the IDs of the HTML elements in the website's login page into which the login user name and password are entered, and the ID of the HTML element for the verification code; the user list contains login user names and passwords. The data parsing plug-in extracts data according to the extraction expression specified for each data field. The metadata plug-in is used for websites where data must be extracted from both the list page and the detail page. The paging parsing plug-in is used for websites where the paging information of a web page must be obtained. The URL parsing plug-in is used to extract URLs.

Embodiment 3

This embodiment takes collecting tender data from the Shandong government procurement website as an example. After receiving the user's instruction, the control center starts to create the data acquisition task. Because hundreds of acquisition tasks may run in the computer system, tasks are managed by category in a tree structure for ease of management. As shown in Fig. 4, the left side is the acquisition task category tree and the right side is the list of acquisition tasks under the selected category. The specific configuration of an acquisition task can be set in the page or imported directly from an XML configuration file. The task name list in Fig. 4 contains the Shandong government procurement website entry; this specifies the name of the acquisition task, that is, the website this task targets is the Shandong government procurement website.

Fig. 5 is the configuration page for the run rules of a data acquisition task. The task is configured on this page, mainly the task ID, task name, thread count, request timeout, number of retries, proxy pool, data type, and so on. The data type is the category of the collected data, including news, forum, real estate, e-commerce, recruitment, and so on.

Fig. 6 is the URL rule configuration page. As shown in Fig. 6, the task specifies the URL regular expression for content pages, the URL regular expression for list pages, and the URL blacklist regular expression. These URL rules can also be supplied by a Java class that implements a specified interface, with the Java class generating the specific regular expressions.
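A short sketch of how the three URL rules might be applied to a candidate URL. The function, the return labels, the example patterns, and in particular the choice to check the blacklist first are assumptions made for illustration; the source only states that the three regular expressions are configured.

```python
import re

def classify_url(url, content_re, list_re, blacklist_re):
    """Return 'blocked', 'content', 'list', or 'other' for a URL."""
    if blacklist_re and re.search(blacklist_re, url):
        return "blocked"      # assumption: the blacklist wins over the other rules
    if re.search(content_re, url):
        return "content"
    if re.search(list_re, url):
        return "list"
    return "other"

# Hypothetical rules for an example site.
CONTENT_RE = r"/detail/\d+\.html$"
LIST_RE = r"/list(_\d+)?\.html$"
BLACKLIST_RE = r"/login|\.(jpg|png)$"
```

A crawler could then send `content` URLs to the data parsing plug-in, `list` URLs to the metadata and URL parsing plug-ins, and drop everything else.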

The specified data processing plug-ins are listed in the parser list. The distributed internet data acquisition system includes multiple data processing plug-ins with different functions, such as the login plug-in, data parsing plug-in, metadata plug-in, paging parsing plug-in, URL parsing plug-in, picture recognition plug-in, voice recognition plug-in, and QR code recognition plug-in, so that plug-ins with different functions can be used according to the type of website being collected. During data extraction, the acquisition center executes the specified data processing plug-ins according to the acquisition task sent by the control center and extracts the data.

In this embodiment, according to the characteristics of the Shandong government procurement website and of the data to be collected, the data processing plug-ins specified for this acquisition task are: LoginParser, the login plug-in; DataParser, the data parsing plug-in; MetaParser, the metadata parsing plug-in; and LinkParser, the URL parsing plug-in. When a crawler extracts data, it executes these specified plug-ins.

When creating an acquisition task, the control center also specifies the URL seeds, URL extraction scope, metadata extraction scope, detailed data extraction, servers, source file, and so on. As shown in Fig. 7, the acquisition list for the website is specified. As shown in Fig. 8, the scope for extracting URL links is specified. As shown in Fig. 9, the metadata extraction expressions are specified; in this embodiment the metadata extracted are the title, publication time, and detail page URL. As shown in Fig. 10, the expressions for the detailed-data extraction fields are specified; in this embodiment extraction expressions are specified for content, text category, subcategory, province, and picture address. As shown in Fig. 11, the acquisition servers for the website are specified; there can be one or several. Each crawler server contains multiple crawler threads, achieving distributed data extraction. As shown in Fig. 12, an XML acquisition task configuration file can be uploaded as the source file; the system parses the configuration from the XML file so that nothing needs to be configured in the page, which is faster and more convenient for experienced data acquisition staff.
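The source does not give the layout of the XML task file, so the following Python sketch invents a small one purely to show the idea of loading a task from XML instead of filling in the page. Every element and attribute name here (`task`, `seed`, `plugin`, `field`, `threads`, `retries`) is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical task file; the real schema is not described in the source.
TASK_XML = """
<task name="demo-task" threads="8" retries="3">
  <seed>http://example.test/list.html</seed>
  <plugin>LoginParser</plugin>
  <plugin>DataParser</plugin>
  <field name="title" type="regex">&lt;h1&gt;(.*?)&lt;/h1&gt;</field>
</task>
"""

def load_task(xml_text):
    """Parse the hypothetical task XML into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.get("name"),
        "threads": int(root.get("threads", "1")),
        "retries": int(root.get("retries", "0")),
        "seeds": [e.text.strip() for e in root.findall("seed")],
        "plugins": [e.text.strip() for e in root.findall("plugin")],
        "fields": {e.get("name"): (e.get("type"), e.text) for e in root.findall("field")},
    }
```

Because extraction expressions often contain angle brackets, they are written with XML entities in the file; the parser returns them unescaped.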

After an acquisition task has been created, the control center publishes it to the acquisition center, and the acquisition center extracts data according to the task sent by the control center. As shown in Fig. 13, the control center controls starting and stopping the acquisition center, viewing data, viewing seeds, and monitoring the running status of tasks. When the acquisition center is started, the system notifies the selected acquisition servers via sockets to take part in the task's data acquisition. When it is stopped, the system notifies the selected acquisition servers via sockets to stop the task's data acquisition. As shown in Fig. 14, the control center can view the specific data collected by the task. As shown in Fig. 15, the section information to be collected by the task, that is, the URL links, can be viewed.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the technical principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A distributed Internet data acquisition method, characterized in that:

a request from a user to create a data acquisition task is received and the data acquisition task is created; the data acquisition task created by the user is distributed to multiple crawler threads, and the multiple crawler threads are started;

the extracted target data are stored in a specified database for subsequent processing, and the extracted URLs to be crawled are pushed to the to-be-crawled queue.

2. The distributed Internet data acquisition method according to claim 1, characterized in that: before an extracted URL to be crawled is pushed to the to-be-crawled queue, a deduplication check is performed on the URL to judge whether data have already been collected from it; if so, the URL is discarded; if not, the URL is pushed to the to-be-crawled queue.

3. The distributed Internet data acquisition method according to claim 1, characterized in that: the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, and a URL parsing plug-in;

the URL parsing plug-in is used to extract URLs.

4. The distributed Internet data acquisition method according to claim 1, characterized in that: the data processing plug-ins further include a picture recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.

5. The distributed Internet data acquisition method according to claim 3, characterized in that: the information that must be configured for the login plug-in includes a user list, the IDs of the HTML elements in the website login page into which the login user name and password are entered, and the ID of the HTML element for the verification code; the user list includes login user names and passwords.

6. A distributed Internet data acquisition system, characterized by comprising:

a collection centre, configured so that each crawler thread, after receiving a data acquisition task and being started, obtains URLs from the to-be-crawled URL queue, downloads web pages from the website specified by the data acquisition task, executes the data processing plug-ins specified by the data acquisition task to extract data, and parses the extracted data; the data processing plug-ins specified by the data acquisition task are plug-ins with different functions selected according to the type of the specified website;

a URL push centre, configured to push the extracted URLs to be crawled to the to-be-crawled queue.

7. The distributed Internet data acquisition system according to claim 6, characterized by further comprising a deduplication centre, configured to perform a deduplication check on an extracted URL to be crawled before it is pushed to the to-be-crawled queue, judging whether data have already been collected from the URL; if so, the URL is discarded; if not, the URL is pushed to the to-be-crawled queue.

8. The distributed Internet data acquisition system according to claim 6, characterized in that: the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, and a URL parsing plug-in;

the URL parsing plug-in is used to extract URLs.

9. The distributed Internet data acquisition system according to claim 6, characterized in that: the data processing plug-ins further include a picture recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.

10. The distributed Internet data acquisition system according to claim 8, characterized in that: the information that must be configured for the login plug-in includes a user list, the IDs of the HTML elements in the website login page into which the login user name and password are entered, and the ID of the HTML element for the verification code; the user list includes login user names and passwords.