CN106126747A

CN106126747A - Crawler-based data acquisition method and device

Info

Publication number: CN106126747A
Application number: CN201610556254.0A
Authority: CN
Inventors: 陈剑
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2016-07-14
Filing date: 2016-07-14
Publication date: 2016-11-16

Abstract

The present invention provides a crawler-based data acquisition method and device. The method includes: obtaining first data and at least one jump link on a first page to be crawled, where the at least one jump link is a jump address on the first page to be crawled that leads to a second page to be crawled; entering, according to the at least one jump link, the second page to be crawled corresponding to each jump link, and obtaining second data on the second page to be crawled; and storing the first data and the second data in a preset database. Data is crawled by obtaining the data in a page. By obtaining the jump links in the page and jumping to the pages they point to, the method simulates a person operating a browser and thereby navigates between highly interactive dynamic pages, which solves the problem that a traditional crawler cannot obtain all the data of a page when crawling dynamic web pages.

Description

Crawler-based data acquisition method and device

Technical field

The present invention relates to the field of Internet technology, and in particular to a crawler-based data acquisition method and device.

Background art

With the rapid development of mainstream Web technology, Internet platforms are changing constantly. The Internet has moved on from static web pages and yellow-page information: users now obtain information and post comments through social network platforms such as blogs, microblogs, bulletin board systems (BBS), social network sites (SNS), and news commentary.

Information about hot topics on these networks is widely analyzed and followed, so obtaining user data, comment data, and similar data from the network quickly and accurately has become increasingly important. At present there are mainly two ways to collect network data. One is to use the application programming interface (API) provided by the network itself, which generally cannot meet the needs of multi-dimensional data analysis. The other is to use a traditional crawler program to obtain the relevant data, which requires analyzing and parsing complex web page elements and filtering out the desired data. For example: obtain at least one seed that includes a uniform resource locator (URL), a site number, and a type; take the URL of the seed as the current URL, the site number of the seed as the current site number, and the type of the seed as the current type; obtain at least one strategy and determine at least one crawling parameter according to the strategy; obtain the rule corresponding to the current type; crawl web page data from the current URL according to the crawling parameter; and parse the web page data according to the rule to obtain parsed data. Both of the above approaches use traditional crawler programs to obtain network data.

A traditional crawler program obtains the data in a static page by obtaining the URL of that page. For today's highly interactive dynamic pages and their complex jump mechanisms, a traditional crawler program cannot obtain all the data.

Summary of the invention

The present invention provides a crawler-based data acquisition method and device for crawling data from web pages with rich dynamic interaction, and for improving the speed and stability of data crawling on such pages.

A first aspect of the present invention provides a data crawling method for a web crawler, including:

obtaining first data and at least one jump link on a first page to be crawled, where the at least one jump link is a jump address on the first page to be crawled that leads to a second page to be crawled;

entering, according to the at least one jump link, the second page to be crawled corresponding to each jump link, and obtaining second data on the second page to be crawled; and

storing the first data and the second data in a preset database.

Further, obtaining the first data and the at least one jump link on the first page to be crawled includes:

parsing the layout of the first page to be crawled, and locating the position of the first data and the position of the at least one jump link on the first page to be crawled; and

obtaining, by crawling, the first data at the first data position on the first page to be crawled, and obtaining the at least one jump link at the position of the at least one jump link.

Optionally, parsing the layout of the page to be crawled and locating the position of the first data and the position of the at least one jump link on the first page to be crawled includes:

parsing the positions and layout of the page to be crawled using the XML Path Language (XPath) to obtain the position of the first data and the position of the at least one jump link.

Further, before obtaining the first data and the at least one jump link on the first page to be crawled, the method also includes:

selecting first account information from at least one piece of preset account information, logging in to the website hosting the page to be crawled according to the first account information, and entering the first page to be crawled;

where each piece of account information includes a login account and a login password.

Further, the method also includes:

detecting whether the first account information has become invalid;

if the first account information has become invalid, marking the first account information and selecting second account information from the at least one piece of account information; and

logging in to the website according to the second account information and entering the first page to be crawled.

Further, the method also includes detecting the number of data crawls and/or the crawl time of the first account information;

when the number of data crawls exceeds a preset crawl count threshold, selecting third account information from the at least one piece of account information, logging in to the website according to the third account information, and entering the first page to be crawled; and/or, when the crawl time exceeds a preset crawl time threshold, selecting third account information from the at least one piece of account information, logging in to the website according to the third account information, and entering the first page to be crawled.

A second aspect of the present invention provides a data crawling device for a web crawler, including:

a data acquisition module, configured to obtain first data and at least one jump link on a first page to be crawled, where the at least one jump link is a jump address on the first page to be crawled that leads to a second page to be crawled;

a processing module, configured to enter, according to the at least one jump link, the second page to be crawled corresponding to each jump link;

where the data acquisition module is also configured to obtain second data on the second page to be crawled; and

a storage module, configured to store the first data and the second data in a preset database.

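Further, the data acquisition module is specifically configured to: parse the layout of the first page to be crawled, and locate the position of the first data and the position of the at least one jump link on the first page to be crawled; and obtain, by crawling, the first data at the first data position and the at least one jump link at the position of the at least one jump link.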

Optionally, the data acquisition module is specifically configured to parse the positions and layout of the page to be crawled using the XML Path Language (XPath) to obtain the position of the first data and the position of the at least one jump link.

Further, the processing module is also configured to select first account information from at least one piece of preset account information, log in to the website hosting the page to be crawled according to the first account information, and enter the first page to be crawled; where each piece of account information includes a login account and a login password.

Further, the processing module is also configured to detect whether the first account information has become invalid;

if the first account information has become invalid, to mark the first account information and select second account information from the at least one piece of account information; and to log in to the website according to the second account information and enter the first page to be crawled.

Further, the processing module is also configured to detect the number of data crawls and/or the crawl time of the first account information.

With the data crawling method and device for a web crawler provided by the present invention, the first data and at least one jump link on the first page to be crawled are obtained; according to the at least one jump link, the second page to be crawled corresponding to each jump link is entered and the second data on it is obtained; and the crawled data is stored in a preset database. The invention crawls data by obtaining the data in a page, and navigates between pages by obtaining the jump links in the page and jumping to the pages they point to. By simulating a person operating a browser in this way, it can navigate between highly interactive dynamic pages. Even when a jump link is randomly generated, once the link has been obtained and the corresponding page entered, the data of that page can be crawled. This solves the problem that a traditional crawler cannot obtain all the data of a page when crawling dynamic web pages.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. The drawings described below are clearly only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of embodiment one of the crawler-based data acquisition method provided by an embodiment of the present invention;

Fig. 2 is a flowchart of embodiment two of the crawler-based data acquisition method provided by an embodiment of the present invention;

Fig. 3 is a flowchart of embodiment three of the crawler-based data acquisition method provided by an embodiment of the present invention;

Fig. 4 is a flowchart of embodiment four of the crawler-based data acquisition method provided by an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an embodiment of the crawler-based data acquisition device provided by an embodiment of the present invention.

Detailed description of the invention

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are clearly only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

The application scenario of this embodiment is crawling data from highly interactive dynamic pages for statistics and analysis. Examples include analyzing user comments on a hot topic on a microblog and compiling statistics on the categories of participating users and their comments, or counting the proportion of users whose comments on a news item express support, opposition, or a neutral attitude. The data in these highly interactive dynamic pages therefore needs to be obtained quickly and accurately.

Fig. 1 is a flowchart of embodiment one of the crawler-based data acquisition method provided by an embodiment of the present invention. As shown in Fig. 1, the execution subject of this embodiment is a terminal device capable of data crawling, such as a computer, mobile phone, or tablet. The crawler-based data acquisition method specifically includes the following steps:

S101: Obtain first data and at least one jump link on a first page to be crawled, where the at least one jump link is a jump address on the first page to be crawled that leads to a second page to be crawled.

In this embodiment, the crawler simulates a person operating a browser: it opens the first page to be crawled and crawls it. The page may be opened with an existing browser or with any other client application capable of opening the page. First, the first data and at least one jump link on the first page to be crawled are obtained. A jump link may be the next-page link in a static or dynamic page, that is, the link behind a next-page button or a pull-down load; it may also be a link to the personal home page or public profile of a user who posted a comment, and so on. Even for pages with rich dynamic interaction and pages whose next-page address is randomly generated, such as topic pages on social platforms like microblogs, the jump link is obtained from the page itself. That jump link is the jump address leading to the second page to be crawled. By simulating a person operating a browser, the second page to be crawled is opened and the data and jump links on it are obtained. In this way, navigation and data acquisition work for pages with rich dynamic interaction and for pages with randomly generated next-page addresses, which solves the prior-art problem that a traditional crawler cannot navigate when crawling dynamic pages. The data type of the first data and the type of jump link can be configured according to the data acquisition requirements. In this embodiment, simulating a person operating a browser can also effectively avoid anti-crawler verification mechanisms.

S102: According to the at least one jump link, enter the second page to be crawled corresponding to each jump link, and obtain second data on the second page to be crawled.

In this embodiment, by simulating a person operating a browser, the crawler enters the second page to be crawled corresponding to each jump link according to the at least one jump link, and crawls the data on the second page to be crawled. When the second page to be crawled also contains jump links, those links are obtained and the pages they point to are crawled as well. This embodiment uses a depth-first approach: it follows the jump links of a page until it can go no deeper, then returns to the previous page and continues crawling data and jump links.
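The depth-first traversal described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the in-memory `FAKE_SITE` dictionary and the `open_page` function are hypothetical stand-ins for opening a page in a browser, and the `visited` set is an added assumption to keep the sketch from looping on pages that link back to each other.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical in-memory "site": URL -> (data on the page, jump links on the page).
# A real crawler would drive a browser here; this stand-in keeps the sketch runnable.
FAKE_SITE: Dict[str, Tuple[str, List[str]]] = {
    "/topic": ("topic text", ["/user/a", "/topic?page=2"]),
    "/user/a": ("profile of a", []),
    "/topic?page=2": ("more comments", ["/user/b"]),
    "/user/b": ("profile of b", ["/topic"]),  # links back to an already visited page
}


def open_page(url: str) -> Tuple[str, List[str]]:
    """Hypothetical page opener: returns the page data and its jump links."""
    return FAKE_SITE[url]


def crawl_depth_first(
    start_url: str,
    fetch: Callable[[str], Tuple[str, List[str]]] = open_page,
) -> List[Tuple[str, str]]:
    """Follow jump links depth-first and collect (url, data) pairs."""
    visited: Set[str] = set()  # assumption: skip pages that were already crawled
    results: List[Tuple[str, str]] = []

    def visit(url: str) -> None:
        if url in visited:
            return
        visited.add(url)
        data, links = fetch(url)
        results.append((url, data))
        for link in links:
            visit(link)  # go as deep as possible along this link first
        # Returning from visit() corresponds to going back to the previous page.

    visit(start_url)
    return results
```

Calling `crawl_depth_first("/topic")` visits the first user profile before moving on to the second topic page, which is the order a depth-first strategy implies.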

S103: Store the first data and the second data in a preset database.

In this embodiment, the data crawled in steps S101 and S102 is stored in a preset database. The storage process places the data in the corresponding location in the preset database to facilitate statistics and analysis.
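As an illustration of step S103, the sketch below stores crawled records in a database. The source does not name a database engine or schema, so the use of SQLite and the `crawled_data` table with its `url`, `kind`, and `content` columns are assumptions made only for this example.

```python
import sqlite3
from typing import Iterable, Tuple


def store_records(conn: sqlite3.Connection,
                  records: Iterable[Tuple[str, str, str]]) -> None:
    """Store (url, kind, content) records; the table layout is hypothetical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawled_data "
        "(url TEXT, kind TEXT, content TEXT)"
    )
    # Parameterised inserts avoid building SQL from crawled text.
    conn.executemany(
        "INSERT INTO crawled_data (url, kind, content) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()


# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
store_records(conn, [
    ("/topic", "first", "topic text"),       # first data from the first page
    ("/user/a", "second", "profile of a"),   # second data from a second page
])
rows = conn.execute(
    "SELECT url, kind, content FROM crawled_data ORDER BY url"
).fetchall()
```

Keeping the source URL and the kind of data alongside the content is one way to put each record in a predictable place for later statistics.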

A traditional crawler does not access pages the way a browser does. It only downloads the HTML source of a web page and crawls page data by URL, without loading the JavaScript, CSS, images, and other resources included in the page. Highly interactive dynamic pages have complex jump mechanisms, and such pages are not actually stored on the network but are generated dynamically, in real time, according to the user's request. Specifically, dynamically generated pages are written by script programs, and their content can only be obtained after the script has run. A traditional crawler therefore cannot obtain all the data in a dynamic page. The crawler-based data acquisition method of this embodiment instead behaves like a browser and simulates manual operation to follow the randomly generated next-hop URLs of a dynamic web page, enters the page corresponding to each next-hop URL, and then crawls the data, which makes it possible to crawl data from dynamic web pages.

With the crawler-based data acquisition method provided by this embodiment, the crawler simulates a person operating a browser to obtain the first data and at least one jump link on the first page to be crawled, enters the second page to be crawled corresponding to each jump link, obtains the second data on it, and stores the crawled data in a preset database. Data is crawled by obtaining the data in a page, and navigation between pages is achieved by obtaining the jump links in the page and jumping to the pages they point to. This makes it possible to navigate between highly interactive dynamic pages. Even when a jump link is randomly generated, once the link has been obtained and the corresponding page entered, the data of that page can be crawled. This solves the problem that a traditional crawler cannot obtain all the data of a page when crawling dynamic web pages.

Fig. 2 is a flowchart of embodiment two of the crawler-based data acquisition method provided by an embodiment of the present invention. As shown in Fig. 2, this embodiment further refines embodiment one. A specific implementation of obtaining the first data and the at least one jump link on the first page to be crawled in step S101 is:

S201: Parse the layout of the first page to be crawled, and locate the position of the first data and the position of the at least one jump link on the first page to be crawled.

In this embodiment, parsing the page layout of the first page to be crawled makes it possible to quickly locate where the data and the jump links are distributed on the web page. During crawling, data retrieval and crawling and jump link acquisition can then target those positions. This removes the cumbersome parsing of the entire page that traditional crawling requires: there is no need to traverse the whole page from start to finish, and web page noise (content unrelated to the data to be crawled, such as advertising banners, navigation bars, and copyright bars) can be excluded quickly. This raises both the crawling speed and the speed of jump link acquisition more effectively and offers greater flexibility. In addition, pages under the same website usually share largely the same page layout. For example, the top of a web page is often the site title, logo image, and navigation bar; the bottom is copyright information; the left or right sidebar holds related links or advertisements; and the middle holds the main content. After the page layout of the first page to be crawled has been parsed and the positions of its data and jump links located, the positions obtained from that first parse can be reused when crawling data and jump links on other pages of the same website, which further increases the data acquisition speed.
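Reusing the positions located on the first page for other pages of the same website could be sketched as a small per-site cache. The names `positions_for`, `analyse_layout`, and the cache keyed by host name are hypothetical; the source only states that the positions can be reused, not how they are stored.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

# Hypothetical cache: website host -> located positions (for example XPath strings).
_layout_cache: Dict[str, Dict[str, str]] = {}


def positions_for(url: str,
                  analyse_layout: Callable[[str], Dict[str, str]]) -> Dict[str, str]:
    """Return the data and link positions for the site that hosts `url`.

    The layout is analysed only for the first page seen on each site;
    later pages on the same site reuse the cached positions.
    """
    site = urlsplit(url).netloc
    if site not in _layout_cache:
        _layout_cache[site] = analyse_layout(url)
    return _layout_cache[site]
```

With this arrangement the comparatively expensive layout analysis runs once per website rather than once per page.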

S202: Obtain, by crawling, the first data at the first data position on the first page to be crawled, and obtain the at least one jump link at the position of the at least one jump link.

In this embodiment, after the first data position and the position of the at least one jump link on the first page to be crawled have been obtained by parsing, the crawler extracts and filters the data and jump links at those positions to obtain the data and the jump links.

With the crawler-based data acquisition method provided by this embodiment, parsing the page layout of the page to be crawled makes it possible to quickly locate where the data and the jump links are distributed on the web page. During crawling, data retrieval and crawling and jump link acquisition target those positions. This removes the cumbersome parsing of the entire page that traditional crawling requires and excludes web page noise (content unrelated to the data to be crawled, such as advertising banners, navigation bars, and copyright bars), which raises both the crawling speed and the speed of jump link acquisition more effectively and offers greater flexibility.

Fig. 3 is a flowchart of embodiment three of the crawler-based data acquisition method provided by an embodiment of the present invention. On the basis of any of the above embodiments, as shown in Fig. 3, one specific implementation of step S201, parsing the layout of the first page to be crawled and locating the position of the first data and the position of the at least one jump link, is:

S301: Parse the positions and layout of the page to be crawled using the XML Path Language to obtain the position of the first data and the position of the at least one jump link.

In this embodiment, the XML Path Language is used to parse the positions and layout of the page to be crawled. The XML Path Language (XPath) is a language for addressing parts of a document written in the Extensible Markup Language (XML, a subset of the Standard Generalized Markup Language). XPath is based on the tree structure of XML; it provides the ability to find nodes in the document's tree and navigates through the elements, attributes, text, and so on of an XML document.

Web pages generally use a DIV layout, which builds the site structure (framework) by defining divisions or sections of the document. This embodiment uses XPath to parse the DIV layout of the page and thereby quickly locate where the data and the jump links are distributed on the web page. During crawling, data retrieval and crawling and jump link acquisition can then target those positions. This removes the cumbersome parsing of the entire page that traditional crawling requires and excludes web page noise (content unrelated to the data to be crawled, such as advertising banners, navigation bars, and copyright bars), which raises both the crawling speed and the speed of jump link acquisition more effectively and offers greater flexibility. Moreover, XPath can directly locate the nodes in an XML document that contain the information, so the positions of the data and jump links on the page can be found quickly from the DIV layout of the page. This significantly increases the speed of parsing the page layout and enables fast data acquisition.
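A minimal sketch of locating data and jump links by XPath within a DIV layout is shown below. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup. The sample page, the `id` and `class` values, and the two XPath expressions are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical, well-formed page with a typical DIV layout:
# navigation on top, main content in the middle, copyright at the bottom.
PAGE = """
<html><body>
  <div id="nav"><a href="/home">Home</a></div>
  <div id="main">
    <p class="comment">I support this.</p>
    <p class="comment">I disagree.</p>
    <a class="next" href="/topic?page=2">Next page</a>
  </div>
  <div id="footer">Copyright notice</div>
</body></html>
"""

# Assumed positions of the first data and of the jump link inside the main DIV.
DATA_XPATH = ".//div[@id='main']/p[@class='comment']"
LINK_XPATH = ".//div[@id='main']/a[@class='next']"


def extract(page_source: str) -> Tuple[List[str], List[str]]:
    """Return (first data, jump links) found at the configured positions."""
    root = ET.fromstring(page_source.strip())
    first_data = [p.text or "" for p in root.findall(DATA_XPATH)]
    jump_links = [a.get("href", "") for a in root.findall(LINK_XPATH)]
    return first_data, jump_links
```

Because both expressions start at the main DIV, the navigation link and the copyright text are never touched, which is the noise exclusion the paragraph above describes. Real-world HTML is often not well-formed XML, so a production crawler would likely need a more tolerant parser.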

S302: Obtain, by crawling, the first data at the first data position on the first page to be crawled, and obtain the at least one jump link at the position of the at least one jump link.

Fig. 4 is a flowchart of embodiment four of the crawler-based data acquisition method provided by an embodiment of the present invention. As shown in Fig. 4, on the basis of the crawler-based data acquisition method provided by embodiments one to three, this embodiment provides a specific implementation of the method. The steps include:

S401: Select first account information from at least one piece of preset account information, log in to the website hosting the page to be crawled according to the first account information, and enter the first page to be crawled.

Each piece of account information includes a login account and a login password.

In this embodiment, the account login step S401 is added mainly because social network platforms such as microblogs limit the data that can be browsed while not logged in and only open all data after login. After step S401, all data on the first page to be crawled can be crawled, which ensures the completeness of the crawl. Specifically, XPath can be used to identify the account login form in the page, and the login account and login password corresponding to the selected first account information are filled into the corresponding fields of the login form to complete the login.

In addition, the login process may require a verification code (CAPTCHA) before login can proceed. As a further improvement of this embodiment, the account login step S401 may therefore also include: determining by page parsing whether verification-code verification is required; if so, capturing the verification code image and uploading it to the server of a verification code recognition platform; after the server recognizes it, receiving the returned text of the verification code, filling it into the corresponding verification code form, and continuing with the login operation.

S402: Obtain first data and at least one jump link on the first page to be crawled.

The at least one jump link is a jump address on the first page to be crawled that leads to a second page to be crawled.

S403: According to the at least one jump link, enter the second page to be crawled corresponding to each jump link, and obtain second data on the second page to be crawled.

S404: Store the first data and the second data in a preset database.

In this embodiment, the crawler simulates a person operating a browser to obtain the first data and at least one jump link on the first page to be crawled, enters the second page to be crawled corresponding to each jump link, obtains the second data on it, and stores the crawled data in a preset database. Data is crawled by obtaining the data in a page, and navigation between pages is achieved by obtaining the jump links in the page and jumping to the pages they point to. This makes it possible to navigate between highly interactive dynamic pages. Even when a jump link is randomly generated, once the link has been obtained and the corresponding page entered, the data of that page can be crawled. This solves the problem that a traditional crawler cannot obtain all the data of a page when crawling dynamic web pages.

As a further improvement of this embodiment, between steps S401 and S402 the first account information may also be checked to determine whether it can continue to be used. The specific detection method is:

First, detect whether the first account information has become invalid. If it has not, log in directly and enter the first page to be crawled, then crawl all the data on the page according to the process described above. If the first account information has become invalid, mark the first account information, select second account information from the at least one piece of account information, log in to the website according to the second account information, and enter the first page to be crawled.

In this scheme, detecting whether the first account information has become invalid includes detection after logging in to the application to be crawled and detection of the login state of the first account information during crawling. Specifically, the login information of an account is generally displayed in the page to be crawled. For example, a logged-in account area may be shown in the page navigation bar, displaying information such as the avatar, nickname, or account ID of the logged-in account. When no account is logged in or the account has become invalid, this area is empty and shows no avatar, nickname, or account ID. XPath is used to locate this logged-in account area and obtain its content in order to judge whether the login state of the first account is normal: if the avatar, nickname, or account ID corresponding to the first account is present there, the login state of the first account is deemed normal; if not, the login state of the first account is deemed invalid.
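The login-state check described above might look like the following sketch. The markup, the XPath expression for the logged-in account area, and the choice to compare a nickname are assumptions for illustration; a real site would have its own structure.

```python
import xml.etree.ElementTree as ET

# Assumed location of the logged-in account area inside the navigation bar.
ACCOUNT_AREA_XPATH = ".//div[@id='nav']/span[@class='account']"


def is_logged_in(page_source: str, expected_nickname: str) -> bool:
    """Judge the login state from the account area of a well-formed page."""
    root = ET.fromstring(page_source.strip())
    area = root.find(ACCOUNT_AREA_XPATH)
    if area is None:
        return False  # no account area at all: treat as not logged in
    shown = (area.text or "").strip()
    # An empty area, or one showing someone else, means the login has lapsed.
    return shown == expected_nickname


# Hypothetical pages: one with the nickname shown, one with an empty area.
LOGGED_IN_PAGE = (
    '<html><body><div id="nav"><span class="account">alice</span></div>'
    "</body></html>"
)
LOGGED_OUT_PAGE = (
    '<html><body><div id="nav"><span class="account"></span></div>'
    "</body></html>"
)
```

The same check can be run right after login and again at intervals during crawling, matching the two detection points mentioned in the text.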

After the first account information is judged invalid, it is marked. One way to mark it is to add an account status code to the first account information that records that this account information has become invalid. Each time account information is chosen from the at least one piece of account information for login, the crawler first checks whether the account information contains an account status code, and thereby judges whether it is invalid account information; if the chosen account information is invalid, another piece of account information is chosen instead. In this embodiment, after the first account information becomes invalid, the crawler switches to the second account information to log in, so that data crawling can continue.
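Marking invalid accounts with a status code and skipping them during selection can be sketched as below. The `Account` class, the `STATUS_INVALID` value, and the first-available selection order are hypothetical; the source only requires that a status code records invalidity and that invalid accounts are passed over.

```python
from dataclasses import dataclass
from typing import List, Optional

STATUS_INVALID = "invalid"  # hypothetical value of the account status code


@dataclass
class Account:
    username: str
    password: str
    status_code: Optional[str] = None  # absent until the account is marked


def mark_invalid(account: Account) -> None:
    """Add the status code that records the account as invalid."""
    account.status_code = STATUS_INVALID


def select_account(accounts: List[Account]) -> Optional[Account]:
    """Return the first account without a status code, or None if all are invalid."""
    for account in accounts:
        if account.status_code is None:
            return account
    return None
```

A crawler would call `mark_invalid` when the login-state check fails and then `select_account` again to obtain the second account information.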

Of course, whether the first account information has become invalid can also be detected during data crawling, to ensure that the account stays logged in normally throughout the crawl.

In a specific implementation, besides checking the validity of the account information to decide whether to keep using it for crawling, the number of data crawls and the crawl time of the account information can also be monitored to decide whether to switch to other account information for login. The specific implementation is as follows:

Detect the number of data crawls and/or the crawl time of the first account information. When the number of data crawls exceeds a preset crawl count threshold, select third account information from the at least one piece of account information, log in to the website according to the third account information, and enter the first page to be crawled; and/or, when the crawl time exceeds a preset crawl time threshold, select third account information from the at least one piece of account information, log in to the website according to the third account information, and enter the first page to be crawled.

In this solution, to cope with the stronger anti-crawler authentication mechanisms of some network platforms, this embodiment logs in with multiple accounts in rotation, so that no valid account captures data for so long that it is banned. Specifically, the number of data captures or the capture time can be limited. For example, when the number of data captures exceeds the preset capture count threshold, third account information is selected from the at least one account information, the application to be captured is logged in to with the third account information, and the page to be captured is entered. Here the number of data captures may be the number of times data to be captured has been obtained, or the number of times data has been requested from the web server of the page to be captured, and so on. As another example, when the capture time exceeds the preset capture time threshold, third account information is selected from the at least one account information, the application to be captured is logged in to with the third account information, and the page to be captured is entered. The number of data captures and the capture time can of course be limited at the same time. This scheme ensures that the number of times each account requests data from the network platform's server stays below the account request threshold of the platform's anti-crawler authentication mechanism, which keeps the account valid and allows data capture to proceed smoothly and continuously. The number of data captures or the capture time can be checked after each page to be captured has been crawled, or checked in real time.
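The rotation rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `AccountRotator`, its parameters, and the round-robin order are hypothetical, and the actual login step is left to the caller.

```python
import time
from itertools import cycle
from typing import Callable, Iterator, Sequence, Tuple

Account = Tuple[str, str]  # (login account, login password)


class AccountRotator:
    """Switch accounts when a capture-count or capture-time threshold is exceeded."""

    def __init__(self, accounts: Sequence[Account], max_requests: int,
                 max_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not accounts:
            raise ValueError("at least one account is required")
        self._accounts: Iterator[Account] = cycle(accounts)  # assumed order
        self._max_requests = max_requests
        self._max_seconds = max_seconds
        self._clock = clock
        self.current = next(self._accounts)
        self._requests = 0
        self._started = clock()

    def record_request(self) -> Account:
        """Count one data request and return the account to use next."""
        self._requests += 1
        elapsed = self._clock() - self._started
        if self._requests > self._max_requests or elapsed > self._max_seconds:
            # Threshold exceeded: move on; the caller must log in again.
            self.current = next(self._accounts)
            self._requests = 0
            self._started = self._clock()
        return self.current
```

Calling `record_request()` after every fetch corresponds to real-time detection; calling it once per crawled page corresponds to the per-page variant mentioned in the text.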

In addition, some social networks implement their anti-crawler mechanism as a verification code: when data capture reaches the trigger condition of the anti-crawler mechanism, a window for entering a verification code pops up. As a further improvement of this embodiment, one of the following two approaches can be used to handle this: first, fill in the verification code and continue crawling; second, switch accounts, log in again, and continue capturing data. In the first approach, when page parsing finds a form on the page to be captured that requires a verification code, the verification code picture is captured and uploaded to the server of a verification code recognition platform. After the server recognizes it, the text of the verification code is returned and filled into the corresponding form field, which completes the verification, and data capture continues. In the second approach, another account information is selected for logging in again, and data capture continues.
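The two approaches can be written as one small decision function. The sketch below is an assumption-laden illustration: `handle_captcha` and its callable parameters are hypothetical stand-ins for the recognition platform, the form submission, and the re-login step, and falling back to re-login when recognition fails is an added design choice that the text does not state.

```python
from typing import Callable, Optional


def handle_captcha(
    captcha_image: Optional[bytes],
    recognize: Callable[[bytes], str],
    submit_code: Callable[[str], bool],
    relogin: Callable[[], None],
    prefer_recognition: bool = True,
) -> str:
    """Return the action taken: 'none', 'filled', or 'relogged'."""
    if captcha_image is None:
        return "none"  # page parsing found no verification form
    if prefer_recognition:
        text = recognize(captcha_image)  # upload the picture, receive its text
        if submit_code(text):            # fill the form field and verify
            return "filled"
    relogin()  # second approach; also used here as an assumed fallback
    return "relogged"
```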

The crawler-based data capture method provided by the above embodiment can also be used for distributed data acquisition: multiple data acquisition tasks are executed simultaneously, capturing data from multiple pages in parallel, which improves data capture efficiency.

As a further improvement of this embodiment, after a first data acquisition task obtains the first data and the at least one jump link on the first page to be captured, a second data acquisition task can be started. The second data acquisition task jumps to the jump address corresponding to the jump link, while the first data acquisition task continues to acquire the data and jump links on the page to be captured.
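A minimal sketch of this division of labor, using a thread and a queue from the Python standard library, is shown below. The in-memory `SITE` table, the function names, and the `None` sentinel are all hypothetical; a real crawler would fetch pages over the network.

```python
import queue
import threading
from typing import Dict, List, Tuple

# Hypothetical in-memory site: address -> (data on the page, jump links).
SITE: Dict[str, Tuple[str, List[str]]] = {
    "/list": ("list data", ["/item/1", "/item/2"]),
    "/item/1": ("item 1 data", []),
    "/item/2": ("item 2 data", []),
}


def first_task(start: str, links: queue.Queue, out: Dict[str, str]) -> None:
    """Acquire data and jump links on the first page, handing links onward."""
    data, jump_links = SITE[start]
    out[start] = data
    for link in jump_links:
        links.put(link)  # the second task can start on this link right away
    links.put(None)      # sentinel: no more links


def second_task(links: queue.Queue, out: Dict[str, str]) -> None:
    """Jump to each received address and acquire the data found there."""
    while True:
        link = links.get()
        if link is None:
            break
        out[link] = SITE[link][0]


def crawl(start: str) -> Dict[str, str]:
    links: queue.Queue = queue.Queue()
    first_out: Dict[str, str] = {}
    second_out: Dict[str, str] = {}
    worker = threading.Thread(target=second_task, args=(links, second_out))
    worker.start()
    first_task(start, links, first_out)
    worker.join()
    return {**first_out, **second_out}  # merge only after both tasks finish
```

Each task writes to its own dictionary, so the two tasks never modify the same object concurrently.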

In addition, the preset database that stores the captured data may be a MySQL database. During distributed data capture, within a certain data volume, the isolation property of MySQL itself protects the integrity of reads and writes to each piece of data, so no read/write conflicts occur.

Fig. 5 is a schematic structural diagram of an embodiment of the crawler-based data acquisition device provided by an embodiment of the present invention. As shown in Fig. 5, the crawler-based data acquisition device 10 provided by this embodiment includes:

A data acquisition module 11, configured to obtain first data and at least one jump link on a first page to be captured, where the at least one jump link is a jump address on the first page to be captured from which a second page to be captured can be reached;

A processing module 12, configured to enter, according to the at least one jump link, the second page to be captured corresponding to each jump link;

The data acquisition module 11 is further configured to obtain second data on the second page to be captured;

A storage module 13, configured to store the first data and the second data in a preset database.

In this embodiment, the data acquisition module 11 simulates the behavior of a person operating a browser to obtain the first data and the at least one jump link on the first page to be captured. The processing module 12, likewise simulating manual browser operation, enters the second page to be captured corresponding to each jump link according to the at least one jump link, and the data acquisition module 11 obtains the second data and the jump links on the second page to be captured. This enables page jumps and data acquisition on pages with rich dynamic interaction and on pages whose next-page address is generated randomly, solving the prior-art problem that a traditional crawler cannot follow jumps when crawling dynamic pages. The storage module 13 stores the captured data in the preset database; a stored procedure places the data at the corresponding location in the preset database to facilitate statistics and analysis.
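The cooperation of the three modules can be sketched as follows. This is an illustrative sketch, not the patented device: the class names mirror the module names, the `fetch` callable is a hypothetical stand-in for browser simulation, and an in-memory SQLite database from the Python standard library is used only as a stand-in for the preset database.

```python
import sqlite3
from typing import Callable, List, Tuple

Page = Tuple[str, List[str]]  # (data on the page, jump links on the page)


class DataAcquisitionModule:
    def __init__(self, fetch: Callable[[str], Page]) -> None:
        self._fetch = fetch  # hypothetical stand-in for browser simulation

    def acquire(self, address: str) -> Page:
        return self._fetch(address)


class ProcessingModule:
    def enter(self, jump_links: List[str]) -> List[str]:
        # A real implementation would drive a browser to each address;
        # here the addresses are simply passed on in order.
        return list(jump_links)


class StorageModule:
    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")  # stand-in for the preset database
        self._db.execute(
            "CREATE TABLE captured (address TEXT PRIMARY KEY, data TEXT)")

    def store(self, address: str, data: str) -> None:
        self._db.execute(
            "INSERT OR REPLACE INTO captured VALUES (?, ?)", (address, data))
        self._db.commit()

    def all(self) -> List[Tuple[str, str]]:
        return self._db.execute(
            "SELECT address, data FROM captured ORDER BY address").fetchall()


def run(start: str, acquisition: DataAcquisitionModule,
        processing: ProcessingModule, storage: StorageModule) -> None:
    first_data, links = acquisition.acquire(start)
    storage.store(start, first_data)
    for address in processing.enter(links):
        second_data, _ = acquisition.acquire(address)
        storage.store(address, second_data)
```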

As a further improvement of this embodiment, the data acquisition module 11 is specifically configured to:

Parse the layout of the first page to be captured, and locate the position of the first data and the position of the at least one jump link on the first page to be captured;

Obtain, by crawling, the first data corresponding to the position of the first data on the first page to be captured, and obtain the at least one jump link corresponding to the position of the at least one jump link.

The data acquisition module 11 is specifically configured to parse the page layout of the first page to be captured and quickly locate where the data and the jump links are distributed on the page. During capture, data retrieval and capture and jump link acquisition can then be performed at the corresponding positions, which avoids the cumbersome parsing of the whole page that a traditional crawler performs: the page need not be traversed from beginning to end, and page noise (content unrelated to the data to be captured, such as advertisement bars, navigation bars, and copyright bars) can be excluded quickly. This more effectively improves the speed of crawler capture and of jump link acquisition, and offers greater flexibility. After parsing yields the position of the first data and the position of the at least one jump link on the first page to be captured, the data acquisition module 11 uses the crawler to extract and filter the data and jump links at those positions, thereby acquiring the data and the jump links.

Use the XML Path Language (XPath) to parse the position and layout of the page to be captured, obtaining the position of the first data and the position of the at least one jump link.

In this embodiment, the data acquisition module 11 uses XPath to parse the DIV layout of the page and thereby quickly locates where the data and the jump links are distributed on the page. During capture, data retrieval and capture and jump link acquisition can be performed at the corresponding positions, which avoids the cumbersome parsing of the whole page that a traditional crawler performs and excludes page noise (content unrelated to the data to be captured, such as advertisement bars, navigation bars, and copyright bars). This more effectively improves the speed of crawler capture and of jump link acquisition, and offers greater flexibility. Moreover, XPath can directly locate the nodes that contain information in an XML document, so the positions of the data and of the jump links can be located quickly from the DIV layout of the page, which significantly increases the speed of parsing the page layout and enables fast data acquisition.
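As a small illustration of locating data and jump links through the DIV layout, the sketch below uses the limited XPath support of Python's standard `xml.etree.ElementTree`. The page markup, class names, and addresses are invented for the example. Real pages are rarely well-formed XML, so an HTML-tolerant parser would be needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page: a navigation bar, a content area, an ad bar.
PAGE = """
<html><body>
  <div class="nav"><a href="/home">Home</a></div>
  <div class="content">
    <p class="post">First post</p>
    <p class="post">Second post</p>
    <a class="next" href="/page/2">Next</a>
  </div>
  <div class="ad">Buy now</div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Go straight to the DIV that holds the data; nav and ad DIVs are never visited.
content = root.find(".//div[@class='content']")

# Data position: the post paragraphs inside the content DIV.
data = [p.text for p in content.findall("./p[@class='post']")]

# Jump link position: the "next" anchor inside the same DIV.
links = [a.get("href") for a in content.findall("./a[@class='next']")]
```

Here `data` holds the two post texts and `links` holds the single next-page address; the navigation link `/home` and the advertisement text are excluded because the expressions are anchored to the content DIV.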

As a further improvement of this embodiment, the processing module 12 is further configured to select first account information from at least one preset account information, log in to the website where the page to be captured is located with the first account information, and enter the first page to be captured, where each account information includes a login account and a login password.

In this embodiment, adding this function to the processing module 12 makes it possible to capture all the data on pages of websites that expose their full data only after login, such as social network platforms like microblogs, which ensures the completeness of data capture. Specifically, the data acquisition module 11 can use XPath to identify the account login form in the page, and the login account and login password of the selected first account information are filled into the corresponding fields of the login form, completing the account login process.

In addition, during account login the processing module 12 may encounter a situation in which login requires verification with a verification code. Accordingly, as a further improvement of this embodiment, the processing module 12 may also be specifically configured to judge through page parsing whether verification code verification is required. If a verification code is required, the verification code picture is captured and uploaded to the server of a verification code recognition platform; after the server recognizes it, the text of the verification code is returned and filled into the corresponding form field, and the login operation continues.

As a further improvement of this embodiment, in the device provided by this embodiment the processing module 12 is further configured to detect whether the first account information has become invalid;

If the first account information has become invalid, mark the first account information and select second account information from the at least one account information;

Log in to the website with the second account information and enter the first page to be captured.

In this embodiment, the processing module 12 is further configured to detect whether the first account information has become invalid, which includes detection after logging in to the application to be captured and detection of the login status of the first account information during capture. Specifically, the login information of an account is usually displayed on the page to be captured. For example, a login account information bar may be displayed in the page navigation bar, showing information such as the avatar, nickname, or account ID of the logged-in account; when no account is logged in or the account has become invalid, this bar is empty and shows no avatar, nickname, or account ID. The processing module 12 uses XPath to locate the login account information bar and obtains its content to judge whether the login status of the first account is normal: if the avatar, nickname, or account ID corresponding to the first account is present there, the login status of the first account is deemed normal; if not, the login status of the first account is deemed invalid. After judging that the first account information has become invalid, the processing module 12 marks it. One way to mark it is to add an account status code to the first account information; the status code records that this account information has become invalid. Each time the processing module 12 chooses account information from the at least one account information for logging in, it first checks whether the account information contains an account status code to determine whether it is invalid account information; if the chosen account information is invalid, another account information is chosen. In this embodiment, after the first account information becomes invalid, the processing module 12 switches to the second account information to log in, so that data capture can continue.
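The login status check can be illustrated with a short Python sketch that looks for the account ID in the navigation bar. The markup, the `id` and `class` values, and the function name `is_logged_in` are hypothetical; the sketch assumes a well-formed page and uses the XPath subset supported by the standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def is_logged_in(page_xml: str, expected_account_id: str) -> bool:
    """Return True if the login account information bar shows the expected ID."""
    root = ET.fromstring(page_xml)
    # Locate the account information inside the navigation bar.
    node = root.find(".//div[@id='nav']/span[@class='account-id']")
    if node is None:
        return False  # bar is empty: not logged in, or the account is invalid
    return (node.text or "").strip() == expected_account_id


LOGGED_IN = ("<html><body><div id='nav'>"
             "<span class='account-id'>user_a</span></div></body></html>")
LOGGED_OUT = "<html><body><div id='nav'></div></body></html>"
```

A result of `False` for the account currently in use would be the trigger for marking that account and switching to another one.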

As a further improvement of this embodiment, the processing module 12 of this embodiment is further configured to detect the number of data captures and/or the capture time of the first account information. When the number of data captures exceeds a preset capture count threshold, it selects third account information from the at least one account information, logs in to the website with the third account information, and enters the first page to be captured; and/or, when the capture time exceeds a preset capture time threshold, it selects third account information from the at least one account information, logs in to the website with the third account information, and enters the first page to be captured.

In this embodiment, to cope with the stronger anti-crawler authentication mechanisms of some network platforms, multiple accounts are used to log in in rotation, so that no valid account captures data for so long that it is banned. Specifically, the processing module 12 can switch accounts at a preset capture count threshold or capture time threshold. For example, when the number of data captures exceeds the preset capture count threshold, the processing module 12 selects third account information from the at least one account information, logs in to the application to be captured with the third account information, and enters the page to be captured. Here the number of data captures may be the number of times data to be captured has been obtained, or the number of times data has been requested from the web server of the page to be captured, and so on. As another example, when the capture time exceeds the preset capture time threshold, the processing module 12 selects third account information from the at least one account information, logs in to the application to be captured with the third account information, and enters the page to be captured. The processing module 12 can of course limit the number of data captures and the capture time at the same time. With the processing module 12 performing this scheme, the number of times each account requests data from the network platform's server stays below the account request threshold of the platform's anti-crawler authentication mechanism, which keeps the account valid and allows data capture to proceed smoothly and continuously. The number of data captures or the capture time can be checked after each page to be captured has been crawled, or checked in real time.

The computer program product for carrying out the crawler-based data capture method provided by the embodiments of the present invention includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the method described in the foregoing method embodiments; for the specific implementation, refer to the method embodiments, which are not repeated here. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the device described above can be found in the corresponding process in the foregoing method embodiments and is not repeated here.

The flowcharts and block diagrams in the accompanying drawings show the possible architecture, functions, and operations of methods and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and each combination of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A crawler-based data capture method, characterized by comprising:

Obtaining first data and at least one jump link on a first page to be captured, wherein the at least one jump link is a jump address on the first page to be captured from which a second page to be captured can be reached;

Entering, according to the at least one jump link, the second page to be captured corresponding to each jump link, and obtaining second data on the second page to be captured;

Storing the first data and the second data in a preset database.

2. The method according to claim 1, characterized in that obtaining the first data and the at least one jump link on the first page to be captured comprises:

Parsing the layout of the first page to be captured, and locating the position of the first data and the position of the at least one jump link on the first page to be captured;

Obtaining, by crawling, the first data corresponding to the position of the first data on the first page to be captured, and obtaining the at least one jump link corresponding to the position of the at least one jump link.

3. The method according to claim 2, characterized in that parsing the layout of the page to be captured and locating the position of the first data and the position of the at least one jump link on the first page to be captured comprises:

Using the XML Path Language (XPath) to parse the position and layout of the page to be captured, obtaining the position of the first data and the position of the at least one jump link.

4. The method according to any one of claims 1 to 3, characterized in that, before obtaining the first data and the at least one jump link on the first page to be captured, the method further comprises:

Selecting first account information from at least one preset account information, logging in to the website where the page to be captured is located with the first account information, and entering the first page to be captured;

Wherein each account information includes a login account and a login password.

5. The method according to claim 4, characterized in that the method further comprises:

Detecting whether the first account information has become invalid;

If the first account information has become invalid, marking the first account information and selecting second account information from the at least one account information;

Detecting the number of data captures and/or the capture time of the first account information;

When the number of data captures exceeds a preset capture count threshold, selecting third account information from the at least one account information, logging in to the website with the third account information, and entering the first page to be captured; and/or, when the capture time exceeds a preset capture time threshold, selecting third account information from the at least one account information, logging in to the website with the third account information, and entering the first page to be captured.

7. A crawler-based data acquisition device, characterized by comprising:

A data acquisition module, configured to obtain first data and at least one jump link on a first page to be captured, wherein the at least one jump link is a jump address on the first page to be captured from which a second page to be captured can be reached;

A processing module, configured to enter, according to the at least one jump link, the second page to be captured corresponding to each jump link;

8. The device according to claim 7, characterized in that the data acquisition module is specifically configured to:

Parse the layout of the first page to be captured, and locate the position of the first data and the position of the at least one jump link on the first page to be captured;

9. The device according to claim 8, characterized in that the data acquisition module is specifically configured to:

10. The device according to any one of claims 7 to 9, characterized in that

The processing module is further configured to select first account information from at least one preset account information, log in to the website where the page to be captured is located with the first account information, and enter the first page to be captured;

Wherein each account information includes a login account and a login password.