CN107895009A - Distributed Internet data acquisition method and system - Google Patents

Distributed Internet data acquisition method and system

Info

Publication number
CN107895009A
Authority
CN
China
Prior art keywords
data
plug
unit
url
data acquisition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711105738.4A
Other languages
Chinese (zh)
Other versions
CN107895009B (en)
Inventor
廖尚围
刘遥
周庚新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhang Wei
Original Assignee
Beijing State Xin Hong Number Science And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing State Xin Hong Number Science And Technology Co Ltd filed Critical Beijing State Xin Hong Number Science And Technology Co Ltd
Priority to CN201711105738.4A
Publication of CN107895009A
Application granted
Publication of CN107895009B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The present invention is a distributed Internet data acquisition method. A request from a user to create a data acquisition task is received and the data acquisition task is created; the data acquisition task created by the user is distributed to multiple crawler threads, and the multiple crawler threads are started. After each crawler thread receives the data acquisition task and is started, it obtains a URL from the to-be-crawled URL queue, downloads the web page from the website specified by the data acquisition task, and executes the data processing plug-ins specified by the data acquisition task to extract data. The data processing plug-ins specified by the data acquisition task are plug-ins with different functions selected according to the type of the specified website. The extracted target data is stored in a specified database for subsequent processing, and the extracted URLs to be crawled are pushed to the to-be-crawled queue. The method enables crawling of multiple types of data, in large volumes, from complex websites.

Description

Distributed Internet data acquisition method and system
Technical field
The present invention relates to the technical field of Internet data acquisition, and more particularly to a distributed Internet data acquisition method and system.
Background art
With the rapid development of networks, the Internet has become a carrier of vast amounts of information. This information, including public opinion, social events, responses to policy, information on various industries, and the talent market, is the data foundation for big-data public opinion analysis systems and macroeconomic analysis systems. How to extract and use this information efficiently has become a major challenge. The web crawler is a very important part of a data analysis system: it collects web pages from the Internet and gathers information, and this web page information is used to build indexes that support search and analysis. The crawler determines whether the content of the whole data analysis system is rich and whether its information is timely, so its performance directly affects the effectiveness of data analysis.
As shown in Fig. 1, the operating flow of a typical crawler is roughly as follows:
(1) The scheduler (Scheduler) takes a link (URL) out of the queue of links (URLs) to be downloaded.
(2) The scheduler starts the acquisition module (Spiders).
(3) The acquisition module passes the URL to the downloader (Downloader), which downloads the resource.
(4) The target data is extracted as a target object (Item) and handed to the item pipeline (Item Pipeline) for further processing, such as storage in a database or a file.
If what is parsed is a link (URL), the link is inserted into the to-be-crawled queue.
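The flow in steps (1) to (4) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the in-memory FAKE_WEB table stands in for real downloads, the function names are hypothetical, and the `seen` set is an addition that keeps the sketch from looping; it is not part of the flow described above.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (extracted item, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", []),
}


def download(url):
    """Stand-in for the downloader; returns None for unknown URLs."""
    return FAKE_WEB.get(url)


def crawl(seed_url):
    queue = deque([seed_url])  # queue of URLs to be downloaded
    seen = {seed_url}          # assumption: remember queued URLs to avoid loops
    items = []                 # stand-in for the item pipeline (database, file)
    while queue:
        url = queue.popleft()        # (1) the scheduler takes one URL
        resource = download(url)     # (2)-(3) the spider hands the URL to the downloader
        if resource is None:
            continue
        item, links = resource       # (4) extract the target data and the links
        items.append((url, item))
        for link in links:           # parsed links go back into the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return items
```

Calling `crawl("http://example.com/")` visits each of the three pages once, in breadth-first order.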
At present, one Internet data acquisition method is acquisition based on process configuration. This method defines data collection rules by simulating the process of a person clicking in a browser, for example: first clicking a section and paging to obtain the list of data to be collected, then entering the page with the specific information, and finally extracting the required fields as needed.
This method uses a browser kernel: it simulates a browser to download the web page and execute the JS code. Although it can collect all data that is visible, information that is not displayed on the web page cannot actually be collected. Its biggest problem is that data can only be collected after the browser has parsed the JS code and rendered the data, which is slow (3 to 10 seconds) and requires considerable bandwidth. It often cannot meet demand when a large volume of data needs to be collected.
Another Internet data acquisition method provides a management page for configuring collection rules. The user inputs the collection rules (section list, data extraction information, task execution information, etc.), the acquisition task is published locally or to a server, and the server performs data acquisition according to the configured collection rules. After data acquisition is complete, the user can download the data file, or the server pushes the data to the user.
The above method of configuring collection rules can collect data from most simple websites. For more complex websites, however, for example those that require login, POST requests, special Header information, or a second AJAX request for part of the data, processing is cumbersome. It cannot handle situations that require frequent switching of dynamic IP proxies, nor can it handle data that needs special processing before it can be recognized, such as data parsed from pictures or from audio.
Summary of the invention
In view of the shortcomings of the above methods, the present invention proposes an Internet data acquisition method based on templates, plug-ins, and distribution. According to the specific circumstances of each website, the method flexibly configures a suitable acquisition template and acquisition plug-ins, so that a unified method can be used to collect data from the wide variety of websites on the Internet.
The present invention provides an Internet data acquisition method and system based on templates, plug-ins, and distribution. The system mainly comprises acquisition template customization, acquisition plug-in customization, and distributed data acquisition. Acquisition template customization covers task information and rule information. Acquisition plug-in customization provides customized plug-ins for site-specific collection, such as a data download plug-in, a proxy IP selection plug-in, a document parsing plug-in, a picture data parsing plug-in, and an audio parsing plug-in. Distributed data acquisition coordinates the crawler resources in an acquisition cluster and starts a certain number of crawlers, as the acquisition task requires, to crawl data.
The present invention is a distributed Internet data acquisition method: a request from a user to create a data acquisition task is received and the data acquisition task is created; the data acquisition task created by the user is distributed to multiple crawler threads, and the multiple crawler threads are started;
After each crawler thread receives the data acquisition task and is started, it obtains a URL from the to-be-crawled URL queue, downloads the web page from the website specified by the data acquisition task, and executes the data processing plug-ins specified by the data acquisition task to extract data; the data processing plug-ins specified by the data acquisition task are plug-ins with different functions selected according to the type of the specified website;
The extracted target data is stored in a specified database for subsequent processing, and the extracted URLs to be crawled are pushed to the to-be-crawled queue.
Preferably, before an extracted URL to be crawled is pushed to the to-be-crawled queue, a deduplication check is performed on the URL to determine whether data has already been collected from it; if so, the URL is discarded; if not, the URL is pushed to the to-be-crawled queue.
Preferably, the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, and a URL parsing plug-in;
The login plug-in is used for websites that require login before data can be extracted;
The data parsing plug-in is used to extract data according to the extraction expression specified for each data field;
The metadata plug-in is used for websites where data must be extracted separately from the list page and the detail page;
The paging parsing plug-in is used for websites where the paging information of the web page must be obtained;
The URL parsing plug-in is used to extract URLs.
Preferably, the data processing plug-ins further include a picture recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.
Preferably, the information to be configured for the login plug-in includes a user list, the IDs of the html elements on the website's login page into which the login user name and password are entered, and the ID of the html element for the verification code; the user list contains login user names and passwords.
The present invention is a distributed Internet data acquisition system, comprising:
A control center, which receives a request from a user to create a data acquisition task and creates the data acquisition task, distributes the data acquisition task created by the user to multiple crawler threads, and starts the multiple crawler threads;
An acquisition center, in which each crawler thread, after receiving the data acquisition task and being started, obtains a URL from the to-be-crawled URL queue, downloads the web page from the website specified by the data acquisition task, executes the data processing plug-ins specified by the data acquisition task to extract data, and parses the extracted data; the data processing plug-ins specified by the data acquisition task are plug-ins with different functions selected according to the type of the specified website;
A data center, for storing the extracted target data in a specified database for subsequent processing;
A URL push center, for pushing the extracted URLs to be crawled to the to-be-crawled queue.
Preferably, the system further comprises a deduplication center which, before an extracted URL to be crawled is pushed to the to-be-crawled queue, performs a deduplication check on the URL to determine whether data has already been collected from it; if so, the URL is discarded; if not, the URL is pushed to the to-be-crawled queue.
Preferably, the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, and a URL parsing plug-in;
The login plug-in is used for websites that require login before data can be extracted;
The data parsing plug-in is used to extract data according to the extraction expression specified for each data field;
The metadata plug-in is used for websites where data must be extracted separately from the list page and the detail page;
The paging parsing plug-in is used for websites where the paging information of the web page must be obtained;
The URL parsing plug-in is used to extract URLs.
Preferably, the data processing plug-ins further include a picture recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.
Preferably, the information to be configured for the login plug-in includes a user list, the IDs of the html elements on the website's login page into which the login user name and password are entered, and the ID of the html element for the verification code; the user list contains login user names and passwords.
The beneficial technical effects of the present invention are: (1) The distributed Internet data acquisition method enables large-volume data crawling from complex websites. (2) With the data processing plug-in mechanism, plug-ins with the corresponding functions are selected from the plug-in list according to the needs of different types of websites, enabling crawling of multiple types of data, in large volumes, from complex websites. (3) By providing a deduplication center, the present invention deduplicates the URLs produced by the distributed crawlers and avoids repeated crawling. (4) The present invention proposes a message-queue-based method for coordinating the to-be-crawled URL queue produced by the distributed crawlers.
Brief description of the drawings
Fig. 1: Schematic diagram of the operating flow of a prior-art crawler;
Fig. 2: Schematic diagram of the distributed Internet data acquisition method of Embodiment 1 of the present invention;
Fig. 3: Schematic diagram of the distributed Internet data acquisition system of Embodiment 2 of the present invention;
Fig. 4: Schematic diagram of acquisition task management in Embodiment 3;
Fig. 5: Acquisition task configuration page of Embodiment 3;
Fig. 6: URL rule configuration page of Embodiment 3;
Fig. 7: Specifying the acquisition list for collecting the website in Embodiment 3;
Fig. 8: Page for specifying the scope for extracting URL links in Embodiment 3;
Fig. 9: Page for specifying the metadata extraction expressions in Embodiment 3;
Fig. 10: Page for specifying the expressions of the detailed data extraction fields in Embodiment 3;
Fig. 11: Specifying the acquisition servers for collecting the website in Embodiment 3;
Fig. 12: Source file management page of Embodiment 3;
Fig. 13: The control center of Embodiment 3 starting and stopping the acquisition center, viewing data, viewing seeds, and monitoring task running status;
Fig. 14: Specific data collected by the task in Embodiment 3;
Fig. 15: Section information to be collected by the task in Embodiment 3.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the drawings and embodiments. The following embodiments are intended to illustrate the present invention, not to limit its scope.
Embodiment 1
As shown in Fig. 2, the present invention is a distributed Internet data acquisition method comprising the following steps:
S101: A request from a user to create a data acquisition task is received and the data acquisition task is created; the data acquisition task created by the user is distributed to multiple crawler threads, and the multiple crawler threads are started.
First, the request from the user to create a data acquisition task is received, and data acquisition tasks are managed. Management includes creating, editing, and deleting acquisition tasks and specifying the acquisition task's name, URL deduplication rule, interval between two consecutive URL crawls, number of worker threads, number of retries, and so on. Acquisition tasks are managed by category in a tree structure.
The acquisition task is defined according to the user's instructions, which specify the list of sections to collect, the IP proxy usage policy, the web page download mode (download based on a browser kernel and download based on HttpClient are both supported), the data processing plug-ins, the URL extraction scope, the regular expression for list-page URLs, the regular expression for content-page URLs, the metadata information, the XPATH/CSS/REGEX expressions for extracting detailed data items, and so on.
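To make the task settings listed above concrete, they can be pictured as a single configuration object. The following Python sketch is only an illustration: the class name, every field name, and the default values are assumptions chosen for readability, not the configuration format of the system described here.

```python
from dataclasses import dataclass, field

# Assumed labels for the two download modes mentioned in the text.
DOWNLOAD_MODES = ("browser", "httpclient")


@dataclass
class AcquisitionTask:
    """Hypothetical container for the settings of one acquisition task."""
    name: str
    sections: list                          # list of sections to collect
    worker_threads: int = 4
    retries: int = 3
    crawl_interval_seconds: float = 1.0     # interval between two URL crawls
    download_mode: str = "httpclient"
    plugins: list = field(default_factory=list)
    list_url_regex: str = ""
    content_url_regex: str = ""
    field_expressions: dict = field(default_factory=dict)  # field -> XPATH/CSS/REGEX

    def __post_init__(self):
        # Reject settings that the rest of the sketch could not act on.
        if self.download_mode not in DOWNLOAD_MODES:
            raise ValueError(f"unsupported download mode: {self.download_mode}")
        if self.worker_threads < 1:
            raise ValueError("worker_threads must be at least 1")
```

Validating the settings when the object is created means a misconfigured task fails before any crawler thread is started.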
After the task is created, the data acquisition task is distributed to the acquisition crawlers, which are started and begin collecting data. When the data acquisition task ends, data acquisition stops. Meanwhile, the running status of the acquisition crawlers, the CPU, network, memory, and disk status of the acquisition servers, and the data collection status are monitored; early warnings and alarms are provided on whether the collection rules can still collect data and whether the acquisition crawlers are running normally.
S102: After each crawler thread receives the data acquisition task and is started, it obtains a URL from the to-be-crawled URL queue, downloads the web page from the website specified by the data acquisition task, and executes the data processing plug-ins specified by the data acquisition task to extract data; the data processing plug-ins specified by the data acquisition task are plug-ins with different functions selected according to the type of the specified website.
After the acquisition task is created, the control center publishes it to the acquisition center, and the acquisition center extracts data according to the acquisition task sent by the control center. Data acquisition by the acquisition crawlers is the key part of the distributed Internet data acquisition method, where data download and data extraction are carried out. During distributed Internet data acquisition, tens to hundreds of crawler threads form a cluster at the same time, and one acquisition task is processed cooperatively by one or more crawler threads. When an acquisition task starts, the crawler threads participating in it receive the start instruction at the same time; each crawler actively obtains a URL to be crawled from the to-be-crawled queue, starts downloading web pages from the specified website, and executes all the data processing plug-ins specified by the data acquisition task. The present invention provides a plug-in list that includes the login plug-in, data parsing plug-in, metadata parsing plug-in, paging parsing plug-in, and URL parsing plug-in, as well as special plug-ins such as the picture recognition plug-in, voice recognition plug-in, and QR code recognition plug-in. When the acquisition crawlers collect data, plug-ins with the corresponding functions are selected from the plug-in list according to the needs of different types of websites, enabling crawling of multiple types of data, in large volumes, from complex websites.
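One way to picture the plug-in list is as a registry from which each task selects plug-ins by name. The Python sketch below is a minimal illustration under assumptions: the registry, the decorator, and the dictionary-shaped page are hypothetical, and only the plug-in names DataParser and LinkParser are taken from the text of Embodiment 3.

```python
# Hypothetical plug-in list: name -> callable(page, result).
PLUGINS = {}


def register(name):
    """Decorator that adds a plug-in to the plug-in list under a name."""
    def decorator(func):
        PLUGINS[name] = func
        return func
    return decorator


@register("DataParser")
def data_parser(page, result):
    # Assumption: the downloaded page is already a dict with a title.
    result["title"] = page.get("title", "")


@register("LinkParser")
def link_parser(page, result):
    result["links"] = list(page.get("links", []))


def run_plugins(plugin_names, page):
    """Execute, in order, the plug-ins that the acquisition task specifies."""
    result = {}
    for name in plugin_names:
        PLUGINS[name](page, result)
    return result
```

A task for a simple site might list only `["DataParser"]`, while a task for a site with nested links might list both plug-ins; the crawler code stays the same in either case.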
The login plug-in is used for websites that require login before data can be extracted. Because some websites only allow data to be collected after login, the system provides a uniformly configured login plug-in. The information to be configured for the plug-in includes a user list (containing login user names and passwords), the IDs of the html elements on the website's login page into which the login user name and password are entered, and the ID of the html element for the verification code. The system selects a login user name and password from the user list and attempts to log in to the website. If a verification code appears, the system sends the verification code picture to a decoding server to obtain the code and enters it into the login page to simulate login. After a successful login, the cookie information is saved so that data collection can continue. Each thread in the acquisition system logs in with one user's information, so multiple accounts can log in and collect at the same time. The login plug-in supports switching accounts by elapsed time, for example changing to another account every half hour, and also supports switching accounts after a certain number of web pages have been collected.
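The two account-switching policies (by elapsed time and by number of pages collected) can be sketched as follows. This is a minimal Python illustration, assuming a round-robin order over the user list and an injectable clock for testing; the class and parameter names are hypothetical, and the actual login and cookie handling is reduced to a comment.

```python
import itertools
import time


class AccountRotator:
    """Cycles through (user, password) pairs by elapsed time or page count."""

    def __init__(self, accounts, max_seconds=1800.0, max_pages=None,
                 clock=time.monotonic):
        if not accounts:
            raise ValueError("at least one account is required")
        self._cycle = itertools.cycle(accounts)  # assumption: round-robin order
        self._max_seconds = max_seconds          # e.g. half an hour
        self._max_pages = max_pages              # None disables the page limit
        self._clock = clock
        self._switch()

    def _switch(self):
        self.current = next(self._cycle)
        self._since = self._clock()
        self._pages = 0

    def account_for_next_page(self):
        expired = self._clock() - self._since >= self._max_seconds
        exhausted = self._max_pages is not None and self._pages >= self._max_pages
        if expired or exhausted:
            # A real plug-in would log in again here and store the new cookies.
            self._switch()
        self._pages += 1
        return self.current
```

Giving each crawler thread its own rotator matches the description that each thread logs in with one user's information at a time.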
The data parsing plug-in is used to extract data according to the extraction expression specified for each data field. The main function of this plug-in is data extraction: an extraction expression is specified for each data field, and the supported expression types are XPath expressions, CSS expressions, and regular expressions. The system extracts data from the html source of the web page according to these expressions.
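Per-field extraction expressions can be illustrated with the Python standard library alone. The sketch below makes several simplifying assumptions: only regular expressions and the limited XPath subset of `xml.etree.ElementTree` are handled, CSS expressions are omitted, the markup must be well formed, and each regular expression must contain one capture group. The function name and the `(kind, expression)` format are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def extract_fields(html, expressions):
    """expressions maps a field name to (kind, expression).

    kind is 'regex' (expression needs one capture group) or 'xpath'
    (limited ElementTree XPath on well-formed markup).
    """
    root = None
    record = {}
    for name, (kind, expr) in expressions.items():
        if kind == "regex":
            match = re.search(expr, html, re.S)
            record[name] = match.group(1).strip() if match else None
        elif kind == "xpath":
            if root is None:
                root = ET.fromstring(html)  # parsed once, only when needed
            node = root.find(expr)
            has_text = node is not None and node.text
            record[name] = node.text.strip() if has_text else None
        else:
            raise ValueError(f"unsupported expression kind: {kind}")
    return record
```

Fields whose expression finds nothing are returned as `None`, so a monitoring step can notice when a collection rule has stopped matching.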
The metadata plug-in is used for websites where data must be extracted separately from the list page and the detail page. This plug-in mainly addresses cases where data is extracted in stages: part of the information (called metadata) is extracted on the list page, and the metadata is saved to the to-be-crawled queue together with the URL of the data's detail page. Later, when the detail page is collected, this part of the data can be obtained from the metadata rather than extracted from the detail page. A typical example is the collection of e-commerce data: information such as a product's price and sales volume is hard to obtain on the product detail page but easy to obtain from the product list. The price and sales volume are therefore collected first (as metadata) when the product list is collected, and can then be obtained from the metadata when the product detail page is collected.
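The staged extraction described above amounts to letting each queue entry carry its metadata alongside the detail-page URL. The following Python sketch illustrates the idea with the e-commerce example; the function names and the dictionary layout of rows and queue entries are assumptions made for the illustration.

```python
def parse_list_page(rows):
    """Turn list-page rows into queue entries that carry metadata with the URL."""
    return [
        {
            "url": row["detail_url"],
            "meta": {"price": row["price"], "sales": row["sales"]},
        }
        for row in rows
    ]


def parse_detail_page(entry, detail_fields):
    """Merge fields extracted from the detail page with the carried metadata."""
    record = dict(detail_fields)   # fields found on the detail page itself
    record.update(entry["meta"])   # price and sales come from the list page
    record["url"] = entry["url"]
    return record
```

Because the metadata travels with the URL, any crawler thread that later picks up the detail page can complete the record without revisiting the list page.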
The paging parsing plug-in is used for websites where the paging information of the web page must be obtained. In the Web 2.0 era, paging information in a web page, such as next page, last page, and page number, often cannot be obtained directly from the page source; instead it is fetched from the website's server via AJAX and generated with javascript. The system provides a customization interface with which data acquisition personnel generate paging URL links according to the website's paging javascript logic. The URL parsing plug-in is used to extract URLs.
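As a small illustration of generating paging links from a site's paging logic, the Python sketch below assumes the simplest case: the site's javascript builds each page URL from a page number, and the total record count and page size are known. The function name, the URL template, and the `{page}` placeholder are hypothetical; real sites may need a different rule.

```python
def paging_urls(template, total_records, page_size, first_page=1):
    """Build paging URLs from a template containing a {page} placeholder."""
    total_pages = -(-total_records // page_size)  # ceiling division
    return [
        template.format(page=page)
        for page in range(first_page, first_page + total_pages)
    ]
```

For example, 45 records at 20 per page yield three URLs, for pages 1, 2, and 3.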
S103: The extracted target data is stored in a specified database for subsequent processing, and the extracted URLs to be crawled are pushed to the to-be-crawled queue.
The extracted data is stored in the specified database after deduplication, denoising, and tagging, for use in subsequent processing. The database uses a Hadoop distributed cluster to store the basic data, and commonly used data is stored in an ES index library for fast retrieval.
The extracted URLs to be crawled are pushed to the to-be-crawled queue, and after a crawler finishes processing its data it obtains the next URL to be crawled from the queue and continues collecting. The to-be-crawled queue supports distributed acquisition: each acquisition crawler obtains the URL addresses it will crawl from the queue, and the URL lists parsed by the crawlers are all sent to the to-be-crawled queue for unified scheduling. The data acquisition logs and health status of the acquisition crawlers are reported to the control center in real time.
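The cooperation of several crawlers around one shared to-be-crawled queue can be sketched with threads in a single process. This is only an illustration under assumptions: `queue.Queue` stands in for the shared message queue, the `seen` set stands in for the deduplication check, `fetch_links` is a hypothetical callable that returns the links found on a page and is assumed not to raise, and all names are invented for the sketch.

```python
import queue
import threading


def run_crawlers(seed_urls, fetch_links, worker_count=4):
    """Crawl from seed_urls with several workers sharing one URL queue."""
    todo = queue.Queue()            # stand-in for the shared to-be-crawled queue
    seen = set(seed_urls)           # stand-in for the deduplication check
    seen_lock = threading.Lock()
    results = []
    results_lock = threading.Lock()
    for url in seed_urls:
        todo.put(url)

    def worker():
        while True:
            url = todo.get()
            if url is None:         # sentinel: no more work for this worker
                todo.task_done()
                return
            try:
                for link in fetch_links(url):  # assumed not to raise
                    with seen_lock:
                        is_new = link not in seen
                        seen.add(link)
                    if is_new:
                        todo.put(link)         # parsed URLs go back to the queue
                with results_lock:
                    results.append(url)
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    todo.join()                     # wait until every queued URL is processed
    for _ in threads:
        todo.put(None)              # one sentinel per worker
    for thread in threads:
        thread.join()
    return sorted(results)
```

New links are put on the queue before the current URL is marked done, so `todo.join()` returns only when the whole reachable set has been processed.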
S104: Before an extracted URL to be crawled is pushed to the to-be-crawled queue, a deduplication check is performed on the URL to determine whether data has already been collected from it; if so, the URL is discarded; if not, the URL is pushed to the to-be-crawled queue.
The same URL link may appear on multiple pages of a website. The system deduplicates using a UUID generated as the MD5 of the URL address plus a key (which can be defined according to business needs), ensuring that data is collected only once for each URL address. The deduplication check has two modes. Mode one deduplicates through a MongoDB database: each UUID is written to the database, and a failed write means the record already exists, while a successful write means it did not. The advantage of this mode is that it can support an effectively unlimited volume of crawled records; the drawback is that it consumes disk I/O. Mode two deduplicates with a Bloom filter. A Bloom filter is a highly space-efficient probabilistic data structure that uses a bit array to represent a set compactly and can test whether an element belongs to the set. This efficiency comes at a cost: when testing whether an element belongs to a set, an element that is not in the set may be mistakenly reported as belonging to it (a false positive). A Bloom filter is therefore unsuitable for applications that require zero errors, but in applications that can tolerate a low error rate it trades a small number of errors for a large saving in storage space.
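Both ingredients of the deduplication check, the MD5 fingerprint of URL plus key and the Bloom filter, can be sketched in Python. This is a minimal illustration rather than the system's implementation: the bit-array size, the number of hash functions, and the use of SHA-256 double hashing to derive bit positions are assumptions, as are all names.

```python
import hashlib


def url_fingerprint(url, key=""):
    """MD5 of URL + key, as a 32-character hex string."""
    return hashlib.md5((url + key).encode("utf-8")).hexdigest()


class BloomFilter:
    """Bit-array set with possible false positives and no false negatives."""

    def __init__(self, size_bits=1 << 20, hash_count=5):
        self.size = size_bits
        self.hash_count = hash_count
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive hash_count positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item):
        """Add item; return True if it was possibly present already."""
        present = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A crawler would discard a URL when `add(url_fingerprint(url, key))` returns True and push it to the queue otherwise, accepting that a false positive occasionally causes a new URL to be skipped.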
Embodiment 2:
The present invention is a distributed Internet data acquisition system, comprising:
A control center, which receives a request from a user to create a data acquisition task and creates the data acquisition task, distributes the data acquisition task created by the user to multiple crawler threads, and starts the multiple crawler threads;
An acquisition center, in which each crawler thread, after receiving the data acquisition task and being started, obtains a URL from the to-be-crawled URL queue, downloads the web page from the website specified by the data acquisition task, executes the data processing plug-ins specified by the data acquisition task to extract data, and parses the extracted data; the data processing plug-ins specified by the data acquisition task are plug-ins with different functions selected according to the type of the specified website.
A data center, for storing the extracted target data in a specified database for subsequent processing. The database uses a Hadoop distributed cluster to store the basic data, and commonly used data is stored in an ES index library for fast retrieval. The acquisition system collects data and sends it to the specified database (Kafka message queue); a dedicated storage program in the data center is responsible for loading, storage, and query operations.
A URL push center, for pushing the extracted URLs to be crawled to the to-be-crawled queue.
A deduplication center which, before an extracted URL to be crawled is pushed to the to-be-crawled queue, performs a deduplication check on the URL to determine whether data has already been collected from it; if so, the URL is discarded; if not, the URL is pushed to the to-be-crawled queue. The deduplication center of the present invention is self-managing: each acquisition task corresponds to one deduplication queue, which is created automatically when the acquisition task is created and deleted automatically when the acquisition task is deleted.
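The self-managing behaviour of the deduplication center, one deduplication structure per task with a matching lifecycle, can be sketched as follows. In this Python illustration an in-memory set per task stands in for the deduplication queue; the class and method names are hypothetical.

```python
class DedupCenter:
    """Keeps one set of seen fingerprints per acquisition task."""

    def __init__(self):
        self._seen_by_task = {}

    def create_task(self, task_id):
        # Called when an acquisition task is created.
        self._seen_by_task.setdefault(task_id, set())

    def delete_task(self, task_id):
        # Called when an acquisition task is deleted.
        self._seen_by_task.pop(task_id, None)

    def should_push(self, task_id, fingerprint):
        """Return True the first time a fingerprint is seen for a task."""
        seen = self._seen_by_task[task_id]
        if fingerprint in seen:
            return False    # already collected: discard the URL
        seen.add(fingerprint)
        return True         # new: push the URL to the to-be-crawled queue
```

Keeping the structures separate per task means the same URL can be collected by two different tasks without one task's history suppressing the other.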
The data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, a URL parsing plug-in, a picture recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.
The login plug-in is used for websites that require login before data can be extracted. The information to be configured for the login plug-in includes a user list, the IDs of the html elements on the website's login page into which the login user name and password are entered, and the ID of the html element for the verification code; the user list contains login user names and passwords. The data parsing plug-in is used to extract data according to the extraction expression specified for each data field. The metadata plug-in is used for websites where data must be extracted separately from the list page and the detail page. The paging parsing plug-in is used for websites where the paging information of the web page must be obtained. The URL parsing plug-in is used to extract URLs.
Embodiment 3:
This embodiment takes the collection of bidding data from the Shandong Government Procurement website as an example. After the control center receives the user's instruction, it starts to create the data acquisition task. Because hundreds of acquisition tasks may run in the computer system, acquisition tasks are managed by category in a tree structure for ease of management. As shown in Fig. 4, the left side shows the acquisition task category tree and the right side shows the list of acquisition tasks under the selected category. The specific configuration of an acquisition task can be set in the page or imported directly from an xml configuration file. The task name list in Fig. 4 contains "Shandong Government Procurement website", which is the specified name of the acquisition task; that is, the website specified by this acquisition task is the Shandong Government Procurement website.
Fig. 5 is the page for configuring the running rules of the data acquisition task. The data acquisition task is configured on this page, mainly the task ID, task name, number of threads, request timeout, number of retries, proxy pool, data type, and so on. The data type refers to the category of the collected data, including news, forum, real estate, e-commerce, recruitment, and so on.
Fig. 6 is the URL rule configuration page. As shown in Fig. 6, the task specifies the URL regular expression for content pages, the URL regular expression for list pages, and the URL blacklist regular expression. These URL rules can also be provided by java classes that implement a specified interface, with the specific regular expressions generated by the java classes.
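The three URL rules (content page, list page, blacklist) can be illustrated as a small classifier. The Python sketch below is an assumption-laden stand-in for the rule configuration: the class name, the order in which the rules are checked, and the example patterns in the usage note are all hypothetical and are not taken from the configuration shown in Fig. 6.

```python
import re


class UrlRules:
    """Classifies URLs using content-page, list-page, and blacklist patterns."""

    def __init__(self, content_pattern, list_pattern, blacklist_pattern=None):
        self.content = re.compile(content_pattern)
        self.list = re.compile(list_pattern)
        self.blacklist = (
            re.compile(blacklist_pattern) if blacklist_pattern else None
        )

    def classify(self, url):
        # Assumption: the blacklist takes priority over the other rules.
        if self.blacklist and self.blacklist.search(url):
            return "blocked"
        if self.content.search(url):
            return "content"
        if self.list.search(url):
            return "list"
        return "other"
```

With the hypothetical patterns `r"/notice/\d+\.html$"` for content pages, `r"/notice/list\?page=\d+$"` for list pages, and `r"/login"` for the blacklist, a crawler could route content URLs to data extraction, list URLs to link extraction, and drop blocked URLs.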
The resolver list enumerates the specified data processing plug-ins. The distributed internet data acquisition system includes multiple data processing plug-ins with different functions, such as the login plug-in, data parsing plug-in, metadata plug-in, paging parsing plug-in, URL parsing plug-in, picture recognition plug-in, voice recognition plug-in, and QR code recognition plug-in, so that plug-ins with different functions can be used according to the type of website from which data is collected. During data extraction, the collection center executes the data processing plug-ins specified in the acquisition task sent by the control centre and extracts the data.
In this embodiment, based on the characteristics of the Shandong government procurement net and of the data to be collected, the data processing plug-ins specified for this acquisition task are: LoginParser, the login plug-in; DataParser, the data parsing plug-in; MetaParser, the metadata parsing plug-in; and LinkParser, the URL parsing plug-in. When the crawler performs data extraction, it executes these specified data processing plug-ins to extract the data.
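One possible way to execute the plug-ins named by a task is sketched below in Python. Only the plug-in names LoginParser, DataParser, MetaParser and LinkParser come from the text; the registry, the shared context dictionary, and the stub behaviour (each stub merely records that it ran) are assumptions for illustration.

```python
from typing import Callable, Dict, List

Context = Dict[str, List[str]]
Plugin = Callable[[Context], Context]


def _stub(name: str) -> Plugin:
    """Build a placeholder plug-in that only records its own name."""
    def plugin(ctx: Context) -> Context:
        return {**ctx, "steps": ctx.get("steps", []) + [name]}
    return plugin


# Hypothetical registry mapping plug-in names to implementations.
REGISTRY: Dict[str, Plugin] = {
    name: _stub(name)
    for name in ("LoginParser", "DataParser", "MetaParser", "LinkParser")
}


def run_plugins(names: List[str], ctx: Context) -> Context:
    """Run the plug-ins a task specifies, in the order given."""
    for name in names:
        ctx = REGISTRY[name](ctx)  # KeyError if the task names an unknown plug-in
    return ctx


result = run_plugins(["LoginParser", "MetaParser", "DataParser", "LinkParser"], {})
```

Keeping the plug-ins behind a name lookup is what lets a task select a different combination for each type of website without changing the crawler itself.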
When creating an acquisition task, the control centre also specifies the URL seeds, the URL extraction scope, the metadata extraction scope, the detailed data extraction, the servers, the source file, and so on. As shown in Fig. 7, the acquisition list for collecting the website is specified. As shown in Fig. 8, the scope from which URL links are extracted is specified. As shown in Fig. 9, the metadata extraction expressions are specified; in this embodiment the extracted metadata are the title, issuing time, and detail page url. As shown in Fig. 10, the expressions for the detailed data extraction fields are specified; in this embodiment extraction expressions are specified for the content, text classification, subclassification, province, and picture address. As shown in Fig. 11, the acquisition servers that collect the website are specified; there may be one or several. Each crawler server runs multiple crawler threads, which realizes distributed data extraction. As shown in Fig. 12, an xml acquisition task configuration file can be uploaded as the source file; the system then parses the configuration information from the xml document, so nothing needs to be configured on the page, which is quicker and more convenient for experienced data acquisition personnel.
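The xml import path might look like the following Python sketch. The original does not disclose the layout of the xml configuration file, so the element names (`task`, `seed`, `meta`, `detail`, `server`), the attribute names, and the XPath-style expressions are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical task file; the schema is invented for this example.
SAMPLE = """
<task name="demo-task">
  <seed>http://example.com/notice/list.html</seed>
  <meta field="title" expr="//a/text()"/>
  <meta field="issue_time" expr="//span[@class='date']/text()"/>
  <meta field="detail_url" expr="//a/@href"/>
  <detail field="content" expr="//div[@id='content']"/>
  <detail field="province" expr="//span[@id='province']/text()"/>
  <server>crawler-01</server>
  <server>crawler-02</server>
</task>
"""


def load_task(xml_text: str) -> Dict[str, object]:
    """Parse seeds, extraction expressions and servers from a task file."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.get("name"),
        "seeds": [e.text.strip() for e in root.findall("seed")],
        "meta": {e.get("field"): e.get("expr") for e in root.findall("meta")},
        "detail": {e.get("field"): e.get("expr") for e in root.findall("detail")},
        "servers": [e.text.strip() for e in root.findall("server")],
    }


task = load_task(SAMPLE)
```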
After the acquisition task has been created, the control centre publishes it to the collection center, and the collection center extracts data according to the acquisition task sent by the control centre. As shown in Fig. 13, the control centre controls the collection center: it can start and stop it, view the data, view the seeds, and monitor the running status of the task. After the collection center is started, the system notifies the selected acquisition servers by socket to take part in the data acquisition of the task. After it is stopped, the system notifies the selected acquisition servers by socket to stop the data acquisition of the task. As shown in Fig. 14, the control centre can view the specific data that the task has collected. As shown in Fig. 15, the section information that the task is to collect, i.e. the URL links, can be viewed.
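The start and stop notifications sent by socket could, for example, be exchanged as in the Python sketch below. The text states only that sockets are used; the newline-terminated JSON message format, the field names `action` and `task_id`, and the function names are assumptions of this sketch. A connected socket pair stands in for the link between the control centre and one acquisition server.

```python
import json
import socket


def encode_command(action: str, task_id: str) -> bytes:
    """Serialize a start/stop notification as one JSON line (assumed format)."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return (json.dumps({"action": action, "task_id": task_id}) + "\n").encode("utf-8")


def notify(sock: socket.socket, action: str, task_id: str) -> None:
    """Control-centre side: send one notification."""
    sock.sendall(encode_command(action, task_id))


def receive(sock: socket.socket) -> dict:
    """Acquisition-server side: read exactly one newline-terminated message."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("socket closed before a full message arrived")
        buf += chunk
    return json.loads(buf.decode("utf-8"))


# In-process demonstration using a connected pair of sockets.
control_side, server_side = socket.socketpair()
notify(control_side, "start", "task-1")
message = receive(server_side)
control_side.close()
server_side.close()
```

Reading one byte at a time keeps the sketch short; a real receiver would normally buffer its reads.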
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the technical principles of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A distributed internet data acquisition method, characterized in that:
A request from a user to create a data acquisition session is received and the data acquisition session is created; the data acquisition session created by the user is distributed to multiple crawler threads, and the multiple crawler threads are started;
After each crawler thread has received the data acquisition session and been started, it obtains a URL from the queue of URLs to be crawled, downloads the web page from the website specified by the data acquisition session, and executes the data processing plug-ins specified by the data acquisition session to extract data; the data processing plug-ins specified by the data acquisition session are data processing plug-ins with different functions selected according to the type of the specified website;
The extracted target data is stored in a specified database for subsequent processing, and the extracted URLs to be crawled are pushed to the queue to be crawled.
2. The distributed internet data acquisition method according to claim 1, characterized in that: before an extracted URL to be crawled is pushed to the queue to be crawled, a duplicate-check (re-scheduling) confirmation is performed on the URL by judging whether data has already been collected from the URL; if the result is yes, the URL is discarded; if the result is no, the URL is pushed to the queue to be crawled.
3. The distributed internet data acquisition method according to claim 1, characterized in that: the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, and a URL parsing plug-in;
The login plug-in is used for websites that require login before data can be extracted;
The data parsing plug-in is used to extract data according to the extraction expression specified for each data field;
The metadata plug-in is used for websites where data must be extracted from both the list page and the detail page;
The paging parsing plug-in is used for websites where the paging information of a web page is obtained;
The URL parsing plug-in is used to extract URLs.
4. The distributed internet data acquisition method according to claim 1, characterized in that: the data processing plug-ins further include a picture recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.
5. The distributed internet data acquisition method according to claim 3, characterized in that: the information that must be configured for the login plug-in includes a user list, the IDs of the html elements in the website's login page into which the login user name and password are entered, and the ID of the html element for the verification code; the user list contains the login user names and passwords.
6. A distributed internet data acquisition system, characterized in that it comprises:
A control centre, which receives a request from a user to create a data acquisition session, creates the data acquisition session, distributes the data acquisition session created by the user to multiple crawler threads, and starts the multiple crawler threads;
A collection center, in which each crawler thread, after receiving the data acquisition session and being started, obtains a URL from the queue of URLs to be crawled, downloads the web page from the website specified by the data acquisition session, executes the data processing plug-ins specified by the data acquisition session to extract data, and parses the extracted data; the data processing plug-ins specified by the data acquisition session are data processing plug-ins with different functions selected according to the type of the specified website;
A data center, for storing the extracted target data in a specified database for subsequent processing;
A URL push center, for pushing the extracted URLs to be crawled to the queue to be crawled.
7. The distributed internet data acquisition system according to claim 6, characterized in that: it further comprises a re-scheduling (duplicate-check) center, which performs a duplicate-check confirmation on an extracted URL to be crawled before the URL is pushed to the queue to be crawled by judging whether data has already been collected from the URL; if the result is yes, the URL is discarded; if the result is no, the URL is pushed to the queue to be crawled.
8. The distributed internet data acquisition system according to claim 6, characterized in that: the data processing plug-ins include a login plug-in, a data parsing plug-in, a metadata plug-in, a paging parsing plug-in, and a URL parsing plug-in;
The login plug-in is used for websites that require login before data can be extracted;
The data parsing plug-in is used to extract data according to the extraction expression specified for each data field;
The metadata plug-in is used for websites where data must be extracted from both the list page and the detail page;
The paging parsing plug-in is used for websites where the paging information of a web page must be obtained;
The URL parsing plug-in is used to extract URLs.
9. The distributed internet data acquisition method according to claim 6, characterized in that: the data processing plug-ins further include a picture recognition plug-in, a voice recognition plug-in, and a QR code recognition plug-in.
10. The distributed internet data acquisition system according to claim 8, characterized in that: the information that must be configured for the login plug-in includes a user list, the IDs of the html elements in the website's login page into which the login user name and password are entered, and the ID of the html element for the verification code; the user list contains the login user names and passwords.
CN201711105738.4A 2017-11-10 2017-11-10 Distributed internet data acquisition method and system Active CN107895009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711105738.4A CN107895009B (en) 2017-11-10 2017-11-10 Distributed internet data acquisition method and system

Publications (2)

Publication Number Publication Date
CN107895009A true CN107895009A (en) 2018-04-10
CN107895009B CN107895009B (en) 2021-09-03

Family

ID=61804941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711105738.4A Active CN107895009B (en) 2017-11-10 2017-11-10 Distributed internet data acquisition method and system

Country Status (1)

Country Link
CN (1) CN107895009B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102685220A (en) * 2012-04-28 2012-09-19 苏州阔地网络科技有限公司 Method and system for data interaction based on WEB page
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN105956175A (en) * 2016-05-24 2016-09-21 考拉征信服务有限公司 Webpage content crawling method and device
US20170270206A1 (en) * 2014-11-05 2017-09-21 Facebook, Inc. Social-Based Optimization of Web Crawling for Online Social Networks

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109088908A (en) * 2018-06-06 2018-12-25 武汉酷犬数据科技有限公司 A kind of the distributed general collecting method and system of network-oriented
CN109285046A (en) * 2018-08-10 2019-01-29 浙江工业大学 A kind of electric business big data acquisition system based on business plug-in unit
CN109145233A (en) * 2018-08-27 2019-01-04 山东浪潮商用系统有限公司 internet information acquisition system
CN109445870A (en) * 2018-09-28 2019-03-08 浙江乾冠信息安全研究院有限公司 A kind of data processing method, electronic equipment and storage medium
CN109542903A (en) * 2018-11-14 2019-03-29 珠海格力智能装备有限公司 Data processing method and device
CN110334139A (en) * 2018-12-18 2019-10-15 济南百航信息技术有限公司 A method of third party system data are docked by simulated operation
CN109753421B (en) * 2018-12-29 2022-06-10 广东益萃网络科技有限公司 Service system optimization method and device, computer equipment and storage medium
CN109753421A (en) * 2018-12-29 2019-05-14 广东益萃网络科技有限公司 Optimization method, device, computer equipment and the storage medium of service system
CN109783714A (en) * 2019-01-08 2019-05-21 上海因致信息科技有限公司 Interface data acquisition methods and system
CN109948079A (en) * 2019-03-11 2019-06-28 湖南衍金征信数据服务有限公司 A kind of method that distributed capture discloses page data
CN109933706A (en) * 2019-03-29 2019-06-25 北京达佳互联信息技术有限公司 A kind of data capture method, device, electronic equipment and storage medium
CN110147475A (en) * 2019-03-29 2019-08-20 汇通达网络股份有限公司 A kind of network data acquisition system of distributed deployment
CN110147475B (en) * 2019-03-29 2023-07-21 汇通达网络股份有限公司 Distributed deployment network data acquisition system
CN110147362A (en) * 2019-04-04 2019-08-20 中电科大数据研究院有限公司 One kind is based on the acquisition of event driven DOC DATA and processing system and its method
CN110175862A (en) * 2019-04-16 2019-08-27 苏宁易购集团股份有限公司 Commodity price calculation method and system are sold for electric business platform
CN110188258A (en) * 2019-04-19 2019-08-30 平安科技(深圳)有限公司 The method and device of external data is obtained using crawler
CN110134854A (en) * 2019-05-28 2019-08-16 江苏快页信息技术有限公司 A kind of crawler acquisition method based on user's incentive mechanism
CN110347899A (en) * 2019-07-04 2019-10-18 北京熵简科技有限公司 Distributed interconnection data collection system and method based on event-based model
CN110347899B (en) * 2019-07-04 2021-06-22 北京熵简科技有限公司 Distributed internet data acquisition system and method based on event-driven model
CN110457312A (en) * 2019-07-05 2019-11-15 中国平安财产保险股份有限公司 Acquisition method, device, equipment and the readable storage medium storing program for executing of diversiform data
CN110457312B (en) * 2019-07-05 2023-07-07 中国平安财产保险股份有限公司 Method, device, equipment and readable storage medium for collecting multi-type data
CN110365810A (en) * 2019-07-23 2019-10-22 中南民族大学 Domain name caching method, device, equipment and storage medium based on web crawlers
CN110737647A (en) * 2019-08-20 2020-01-31 广州宏数科技有限公司 Internet big data cleaning method
CN110737647B (en) * 2019-08-20 2023-07-25 广州宏数科技有限公司 Internet big data cleaning method
CN110737813A (en) * 2019-09-26 2020-01-31 苏州浪潮智能科技有限公司 method, equipment and medium for improving efficiency of reptile
CN110737813B (en) * 2019-09-26 2022-07-29 苏州浪潮智能科技有限公司 Method, equipment and medium for improving efficiency of reptiles
WO2021088350A1 (en) * 2019-11-07 2021-05-14 南京莱斯网信技术研究院有限公司 Script-based web service paging data acquisition system
CN111428107B (en) * 2020-03-23 2023-09-01 新华智云科技有限公司 Multi-center comprehensive web crawler system
CN111428107A (en) * 2020-03-23 2020-07-17 新华智云科技有限公司 Multi-center comprehensive web crawler system
CN111488508A (en) * 2020-04-10 2020-08-04 长春博立电子科技有限公司 Internet information acquisition system and method supporting multi-protocol distributed high concurrency
CN111597421A (en) * 2020-04-30 2020-08-28 武汉思普崚技术有限公司 Method, device, equipment and storage medium for realizing website picture crawler
CN111597421B (en) * 2020-04-30 2022-08-30 武汉思普崚技术有限公司 Method, device, equipment and storage medium for realizing website picture crawler
CN111680203A (en) * 2020-05-07 2020-09-18 支付宝(杭州)信息技术有限公司 Data acquisition method and device and electronic equipment
CN111680203B (en) * 2020-05-07 2023-04-18 支付宝(杭州)信息技术有限公司 Data acquisition method and device and electronic equipment
CN111722980A (en) * 2020-06-11 2020-09-29 咪咕文化科技有限公司 Data acquisition system and method
CN111722980B (en) * 2020-06-11 2023-10-20 咪咕文化科技有限公司 Data acquisition system and method
CN111881335A (en) * 2020-07-28 2020-11-03 芯薇(上海)智能科技有限公司 Crawler technology-based multitasking system and method
CN111953766A (en) * 2020-08-07 2020-11-17 福建省天奕网络科技有限公司 Method and system for collecting network data
CN112100061A (en) * 2020-08-28 2020-12-18 广州探迹科技有限公司 Visual crawler code compiling and debugging method
CN111949852A (en) * 2020-08-31 2020-11-17 东华理工大学 Macroscopic economy analysis method and system based on internet big data
CN112667873A (en) * 2020-12-16 2021-04-16 北京华如慧云数据科技有限公司 Crawler system and method suitable for general data acquisition of most websites
CN112579862B (en) * 2020-12-22 2022-06-14 福建江夏学院 Xpath automatic extraction method based on MD5 value comparison
CN112579862A (en) * 2020-12-22 2021-03-30 福建江夏学院 Xpath automatic extraction method based on MD5 value comparison
CN112650570A (en) * 2020-12-29 2021-04-13 百果园技术(新加坡)有限公司 Dynamically expandable distributed crawler system, data processing method and device
CN112597373A (en) * 2020-12-29 2021-04-02 科技谷(厦门)信息技术有限公司 Data acquisition method based on distributed crawler engine
CN112597373B (en) * 2020-12-29 2023-09-15 科技谷(厦门)信息技术有限公司 Data acquisition method based on distributed crawler engine

Also Published As

Publication number Publication date
CN107895009B (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN107895009A (en) One kind is based on distributed internet data acquisition method and system
CN101222349B (en) Method and system for collecting web user action and performance data
US9519561B2 (en) Method and system for configuration-controlled instrumentation of application programs
CN107071009A (en) A kind of distributed big data crawler system of load balancing
CN107025296B (en) Based on science service information intelligent grasping system method of data capture
CN106897357A (en) A kind of method for crawling the network information for band checking distributed intelligence
CN107885777A (en) A kind of control method and system of the crawl web data based on collaborative reptile
CN102184184B (en) Method for acquiring webpage dynamic information
CN105912587A (en) Data acquisition method and system
CN110020062B (en) Customizable web crawler method and system
CN109933701A (en) A kind of microblog data acquisition methods based on more strategy fusions
Bollen et al. An architecture for the aggregation and analysis of scholarly usage data
CN106603296A (en) Log processing method and device
WO2014055579A1 (en) Pagination of data based on recorded url requests
CN109729044A (en) A kind of general internet data acquisition is counter to climb system and method
CN106209863B (en) A kind of web portal security monitoring method based on whole station scanning
CN110851681A (en) Crawler processing method and device, server and computer readable storage medium
CN105721578A (en) User behavior data collection method and system
JP6286559B2 (en) Method and device for adding sign icons in interactive applications
CN106897313B (en) Mass user service preference evaluation method and device
CN107480189A (en) A kind of various dimensions real-time analyzer and method
CN105245394A (en) Method and equipment for analyzing network access log based on layered approach
Intapong et al. Collecting data of SNS user behavior to detect symptoms of excessive usage: Development of data collection application
CN108205548A (en) A kind of Web Spider structure and its method of work based on agriculture webpage information acquisition
Sah A New Architecture for Managing Enterprise Log Data.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231012

Address after: Address: Building 44, Building 241, Shuaifuyuan Community, Tongzhou District, Beijing, 101100

Patentee after: Zhang Wei

Address before: Floor 9, block D, Beifa building, No. 15, Xueyuan South Road, Haidian District, Beijing 100080

Patentee before: BEIJING GUOXIN HONGSHU TECHNOLOGY CO.,LTD.