CN106126747A - Crawler-based data capture method and device
- Publication number
- CN106126747A
- Application number
- CN201610556254.0A
- Authority
- CN
- China
- Prior art keywords
- page
- data
- captured
- link
- account information
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9558—Details of hyperlinks; Management of linked annotations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention provides a crawler-based data capture method and device. The method includes: obtaining first data and at least one jump link on a first page to be captured, where the at least one jump link is a jump address on the first page to be captured that leads to a second page to be captured; following the at least one jump link to enter the second page to be captured corresponding to each jump link, and obtaining second data on the second page to be captured; and storing the first data and the second data in a preset database. Data are captured by obtaining the data in a page. By obtaining the jump links in the page and jumping to the pages they lead to, the method simulates a person operating a browser and thereby achieves page jumps on interaction-rich dynamic pages. This solves the problem that a traditional crawler cannot obtain all of a page's data when capturing dynamic web pages.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a crawler-based data capture method and device.
Background art
With the rapid development of mainstream Web technology, Internet platforms are changing constantly. Today the Internet has moved from static web pages and yellow-page listings to users obtaining and commenting on information through social network platforms such as blogs, microblogs, bulletin board systems (BBS), social network sites (SNS), and news comment sections.
Analysis based on information about hot topics on these networks attracts wide attention, so obtaining user data, comment data, and similar data from the network quickly and accurately has become important. At present there are two main methods of network data acquisition. One uses the application programming interface (API) provided by the network itself, which generally cannot meet the needs of multi-dimensional data analysis. The other uses a traditional crawler program to obtain the relevant data, which requires analyzing and parsing complex web page elements and filtering out the desired data. For example: obtain at least one seed comprising a uniform resource locator (URL), a website number, and a type; take the seed's URL as the current URL, the seed's website number as the current website number, and the seed's type as the current type; obtain at least one strategy and determine at least one crawler capture parameter according to the strategy; obtain the rule corresponding to the current type; capture web page data from the current URL according to the crawler capture parameter, and parse the web page data according to the rule to obtain parsed data. Both of the above approaches obtain network data in the manner of a traditional crawler program.
A traditional crawler program obtains network data by taking the URL of a page and retrieving the data in the static page. For today's interaction-rich dynamic pages and their complex jump mechanisms, a traditional crawler program cannot obtain all of the data.
Summary of the invention
The present invention provides a crawler-based data capture method and device for capturing data from web pages with rich dynamic interaction, improving the speed and stability of data capture on such pages.
A first aspect of the present invention provides a data capture method for a web crawler, including:
obtaining first data and at least one jump link on a first page to be captured, where the at least one jump link is a jump address on the first page to be captured from which a second page to be captured can be reached;
following the at least one jump link to enter the second page to be captured corresponding to each jump link, and obtaining second data on the second page to be captured;
storing the first data and the second data in a preset database.
Further, obtaining the first data and the at least one jump link on the first page to be captured includes:
parsing the layout of the first page to be captured, and locating the position of the first data and the position of the at least one jump link on the first page to be captured;
using a crawler to obtain the first data at the position of the first data on the first page to be captured, and to obtain the at least one jump link at the position of the at least one jump link.
Optionally, parsing the layout of the first page to be captured and locating the position of the first data and the position of the at least one jump link includes:
using XML Path Language to parse the position and layout of the page to be captured, obtaining the position of the first data and the position of the at least one jump link.
Further, before obtaining the first data and the at least one jump link on the first page to be captured, the method also includes:
selecting first account information from at least one piece of preset account information, logging in to the website hosting the page to be captured using the first account information, and entering the first page to be captured;
where each piece of account information includes a login account and a login password.
Further, the method also includes:
detecting whether the first account information has become invalid;
if the first account information has become invalid, marking the first account information and selecting second account information from the at least one piece of account information;
logging in to the website using the second account information, and entering the first page to be captured.
Further, the method also includes detecting the number of data captures and/or the capture time of the first account information;
when the number of data captures exceeds a preset capture-count threshold, selecting third account information from the at least one piece of account information, logging in to the website using the third account information, and entering the first page to be captured; and/or, when the capture time exceeds a preset capture-time threshold, selecting third account information from the at least one piece of account information, logging in to the website using the third account information, and entering the first page to be captured.
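The count and time thresholds above can be sketched as a small predicate. This is only an illustration under stated assumptions: the names `UsageLimits` and `should_switch_account` are hypothetical, the "and/or" condition is read here as "either threshold exceeded", and the threshold values in the usage example are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class UsageLimits:
    max_captures: int    # preset capture-count threshold (value chosen by the operator)
    max_seconds: float   # preset capture-time threshold, in seconds


def should_switch_account(captures: int, elapsed_seconds: float,
                          limits: UsageLimits) -> bool:
    """True when the capture count or the capture time exceeds its threshold."""
    return captures > limits.max_captures or elapsed_seconds > limits.max_seconds


# Usage: an account that has made 101 captures against a limit of 100 is rotated out.
limits = UsageLimits(max_captures=100, max_seconds=600.0)
switch_now = should_switch_account(101, 10.0, limits)
```

When the predicate returns true, the crawler would select third account information and log in again, as the paragraph above describes.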
A second aspect of the present invention provides a data capture device for a web crawler, including:
a data acquisition module, configured to obtain first data and at least one jump link on a first page to be captured, where the at least one jump link is a jump address on the first page to be captured that leads to a second page to be captured;
a processing module, configured to follow the at least one jump link and enter the second page to be captured corresponding to each jump link;
the data acquisition module being further configured to obtain second data on the second page to be captured;
a storage module, configured to store the first data and the second data in a preset database.
Further, the data acquisition module is specifically configured to:
parse the layout of the first page to be captured, and locate the position of the first data and the position of the at least one jump link on the first page to be captured;
use a crawler to obtain the first data at the position of the first data on the first page to be captured, and to obtain the at least one jump link at the position of the at least one jump link.
Optionally, the data acquisition module is specifically configured to use XML Path Language to parse the position and layout of the page to be captured, obtaining the position of the first data and the position of the at least one jump link.
Further, the processing module is also configured to select first account information from at least one piece of preset account information, log in to the website hosting the page to be captured using the first account information, and enter the first page to be captured, where each piece of account information includes a login account and a login password.
Further, the processing module is also configured to detect whether the first account information has become invalid;
if the first account information has become invalid, to mark the first account information and select second account information from the at least one piece of account information; and to log in to the website using the second account information and enter the first page to be captured.
Further, the processing module is also configured to detect the number of data captures and/or the capture time of the first account information;
when the number of data captures exceeds a preset capture-count threshold, to select third account information from the at least one piece of account information, log in to the website using the third account information, and enter the first page to be captured; and/or, when the capture time exceeds a preset capture-time threshold, to select third account information from the at least one piece of account information, log in to the website using the third account information, and enter the first page to be captured.
The data capture method and device for a web crawler provided by the present invention obtain first data and at least one jump link on a first page to be captured, follow the at least one jump link to enter the second page to be captured corresponding to each jump link, obtain second data on the second page to be captured, and store the captured data in a preset database. The invention captures data by obtaining the data in a page, and achieves page jumps by obtaining the jump links in the page and jumping to the pages they lead to. By simulating a person operating a browser, it achieves page jumps on interaction-rich dynamic pages. Even when a jump link is randomly generated, the data of the target page can be captured once the link has been obtained and followed. This solves the problem that a traditional crawler cannot obtain all of a page's data when capturing dynamic web pages.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of Embodiment 1 of the crawler-based data capture method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of Embodiment 2 of the crawler-based data capture method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of Embodiment 3 of the crawler-based data capture method provided by an embodiment of the present invention;
Fig. 4 is a flowchart of Embodiment 4 of the crawler-based data capture method provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an embodiment of the crawler-based data capture device provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The application scenario of this embodiment is capturing data from interaction-rich dynamic pages for statistics and analysis. Examples include analyzing microblog users' comments on a hot topic and producing classifications of the participating users and statistics on their comments, or computing the proportions of users whose comments on a news item express support, opposition, or a neutral attitude. The data in these interaction-rich dynamic pages therefore need to be obtained quickly and accurately.
Fig. 1 is a flowchart of Embodiment 1 of the crawler-based data capture method provided by an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment is executed by a terminal device capable of data capture, such as a computer, mobile phone, or tablet. The crawler-based data capture method includes the following steps:
S101: Obtain first data and at least one jump link on a first page to be captured, where the at least one jump link is a jump address on the first page to be captured from which a second page to be captured can be reached.
In this embodiment, the crawler simulates a person operating a browser to open and capture the first page to be captured. The page may be opened with an existing browser or with any other client application capable of opening pages. First, the first data and at least one jump link on the first page to be captured are obtained. A jump link may be the next-page link in a static or dynamic page, that is, the link behind a next-page button or a pull-down load; it may also be a link to the personal homepage or public profile of a user who posted a comment, and so on. Even for a page with rich dynamic interaction or a page whose next-page address is randomly generated, such as a topic page on a social platform like a microblog, the jump link is obtained from the page itself; it is the jump address that leads to the second page to be captured. The crawler then simulates a person operating a browser to open the second page to be captured and obtains the data and jump links on it. In this way, jumps and data acquisition are achieved for pages with rich dynamic interaction and for pages whose next-page address is randomly generated, which solves the prior-art problem that a traditional crawler cannot follow jumps when crawling dynamic pages. The type of the first data to be captured and the type of jump link can be configured according to the data acquisition requirements. Because this embodiment simulates a person operating a browser, it can effectively avoid anti-crawler verification mechanisms.
S102: Follow the at least one jump link to enter the second page to be captured corresponding to each jump link, and obtain second data on the second page to be captured.
In this embodiment, the crawler simulates a person operating a browser: it follows the at least one jump link, enters the second page to be captured corresponding to each jump link, and captures the data on that page. When the second page to be captured itself contains jump links, those links are obtained and the pages they lead to are captured as well. This embodiment works depth-first: it follows the jump links of a page until it can go no deeper, then returns to the previous page and continues capturing data and jump links.
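The depth-first traversal described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the `PAGES` dictionary stands in for a live website, the function name `crawl_depth_first` and the URL strings are hypothetical, and the `visited` set that prevents loops is an added assumption the text does not mention.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory "site": URL -> (data on the page, jump links on the page).
PAGES: Dict[str, Tuple[str, List[str]]] = {
    "/topic": ("topic text", ["/user/1", "/topic?page=2"]),
    "/user/1": ("profile of user 1", []),
    "/topic?page=2": ("more comments", ["/user/2"]),
    "/user/2": ("profile of user 2", ["/topic"]),  # links back to the start page
}


def crawl_depth_first(start: str) -> List[Tuple[str, str]]:
    """Follow jump links depth-first and collect (url, data) pairs."""
    visited: Set[str] = set()
    results: List[Tuple[str, str]] = []

    def visit(url: str) -> None:
        if url in visited or url not in PAGES:
            return  # already captured, or not a known page
        visited.add(url)
        data, jump_links = PAGES[url]
        results.append((url, data))
        for link in jump_links:
            visit(link)  # go as deep as possible before returning to this page

    visit(start)
    return results


captured = crawl_depth_first("/topic")
```

In a real crawler, the dictionary lookup would be replaced by opening the page in a browser-like client and extracting its data and jump links.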
S103: Store the first data and the second data in a preset database.
In this embodiment, the data captured in steps S101 and S102 are stored in the preset database. During storage, each item is written to its corresponding position in the preset database to facilitate later statistics and analysis.
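As a rough illustration of the storage step, the sketch below writes captured items to a table and runs one simple statistic over them. The patent does not name a database; SQLite, the table name `captured`, its columns, and the sample rows are all assumptions made for this example.

```python
import sqlite3
from typing import Iterable, List, Tuple


def store_captured(conn: sqlite3.Connection,
                   rows: Iterable[Tuple[str, str, str]]) -> None:
    """Store (page_url, kind, content) rows in a preset table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS captured ("
        "page_url TEXT NOT NULL, kind TEXT NOT NULL, content TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO captured (page_url, kind, content) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def count_by_kind(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Example statistic: number of stored items per kind."""
    cur = conn.execute(
        "SELECT kind, COUNT(*) FROM captured GROUP BY kind ORDER BY kind"
    )
    return cur.fetchall()


# Usage with an in-memory database and made-up rows.
conn = sqlite3.connect(":memory:")
store_captured(conn, [
    ("/topic", "comment", "first data"),
    ("/user/1", "profile", "second data"),
    ("/user/2", "profile", "second data"),
])
```

Keeping a `kind` column is one way to put each item in a predictable place so that later grouping and counting stay simple.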
A traditional crawler does not access pages the way a browser does. It only downloads the HTML source of a web page, captures page data by URL, and does not load the JavaScript, CSS, images, and other resources included in the page. Interaction-rich dynamic pages have complex jump mechanisms, and they are not actually stored on the network; they are generated dynamically in response to the user's request. Specifically, a dynamically generated page is written as a script, and its content can be obtained only after the script has run. A traditional crawler therefore cannot obtain all of the data in a dynamic page. The crawler-based data capture method of this embodiment instead behaves like a browser and simulates manual operation: it follows the randomly generated next-hop address of a dynamic page, enters the page at that address, and then captures its data, which makes it possible to capture the data in dynamic web pages.
The crawler-based data capture method of this embodiment simulates a person operating a browser to obtain first data and at least one jump link on a first page to be captured, follows the at least one jump link to enter the second page to be captured corresponding to each jump link, obtains second data on the second page to be captured, and stores the captured data in a preset database. The invention captures data by obtaining the data in a page, and achieves page jumps by obtaining the jump links in the page and jumping to the pages they lead to. By simulating a person operating a browser, it achieves page jumps on interaction-rich dynamic pages. Even when a jump link is randomly generated, the data of the target page can be captured once the link has been obtained and followed. This solves the problem that a traditional crawler cannot obtain all of a page's data when capturing dynamic web pages.
Fig. 2 is a flowchart of Embodiment 2 of the crawler-based data capture method provided by an embodiment of the present invention. As shown in Fig. 2, this embodiment refines Embodiment 1. Obtaining the first data and the at least one jump link on the first page to be captured in step S101 is implemented as follows:
S201: Parse the layout of the first page to be captured, and locate the position of the first data and the position of the at least one jump link on the first page to be captured.
In this embodiment, parsing the page layout of the first page to be captured makes it possible to quickly locate where the data and the jump links are distributed in the page. During capture, data retrieval and capture and jump-link acquisition can then target those positions. This removes the cumbersome full-page parsing of traditional crawler data capture: there is no need to traverse the whole page from beginning to end, and web page noise (advertisement banners, navigation bars, copyright bars, and other content unrelated to the data to be captured) can be excluded quickly. The speed of crawler capture and of jump-link acquisition is improved more effectively, with greater flexibility. In addition, pages under the same website usually share most of their page layout. For example, the top of a page is typically the site title, logo image, and navigation bar; the bottom is copyright information; the left or right sidebar holds related links or advertisements; and the middle holds the main information. After the layout of the first page to be captured has been parsed and the positions of its data and jump links located, the same positions can be used when capturing data and jump links on other pages of the same website, which further increases data acquisition speed.
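Reusing the located positions for other pages of the same website amounts to a per-site cache. The sketch below shows one possible shape for it; the class `LayoutCache`, the stand-in `fake_analyse` function, the location strings, and the `example.com` URLs are all hypothetical.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

# Hypothetical record of where the data and the jump links sit in a site's layout.
Layout = Dict[str, str]


class LayoutCache:
    """Analyse the layout once per website and reuse it for later pages."""

    def __init__(self, analyse: Callable[[str], Layout]) -> None:
        self._analyse = analyse            # the (assumed expensive) layout analysis
        self._by_site: Dict[str, Layout] = {}
        self.analyses = 0                  # how many times analysis actually ran

    def layout_for(self, url: str) -> Layout:
        site = urlsplit(url).netloc        # pages are grouped by host name
        if site not in self._by_site:
            self._by_site[site] = self._analyse(url)
            self.analyses += 1
        return self._by_site[site]


def fake_analyse(url: str) -> Layout:
    # Stand-in for real layout parsing; the location strings are made up.
    return {"data": "div#main", "jump_links": "div#main a.next"}


cache = LayoutCache(fake_analyse)
first = cache.layout_for("https://example.com/topic/1")
second = cache.layout_for("https://example.com/topic/2")
```

Grouping by host name is a simplification; a site whose sections use different layouts would need a finer cache key.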
S202: Use a crawler to obtain the first data at the position of the first data on the first page to be captured, and to obtain the at least one jump link at the position of the at least one jump link.
In this embodiment, after parsing has yielded the position of the first data and the position of the at least one jump link on the first page to be captured, the crawler extracts and filters the data and the jump links at those positions, thereby obtaining the data and the jump links.
By parsing the page layout of the page to be captured, the crawler-based data capture method of this embodiment can quickly locate where the data and the jump links are distributed in the page, and during capture it retrieves and captures data and obtains jump links at those positions. This removes the cumbersome full-page parsing of traditional crawler data capture and excludes web page noise (advertisement banners, navigation bars, copyright bars, and other content unrelated to the data to be captured). The speed of crawler capture and of jump-link acquisition is improved more effectively, with greater flexibility.
Fig. 3 is a flowchart of Embodiment 3 of the crawler-based data capture method provided by an embodiment of the present invention. Building on any of the above embodiments, as shown in Fig. 3, one specific implementation of step S201, that is, parsing the layout of the first page to be captured and locating the position of the first data and the position of the at least one jump link, is:
S301: Use XML Path Language to parse the position and layout of the page to be captured, obtaining the position of the first data and the position of the at least one jump link.
In this embodiment, XML Path Language is used to parse the position and layout of the page to be captured. XML Path Language (XPath) is a language for addressing parts of a document written in Extensible Markup Language (XML, a subset of Standard Generalized Markup Language). XPath is based on the tree structure of XML: it provides the ability to find nodes in the document's data-structure tree and navigates an XML document by elements, attributes, text, and so on.
Web pages commonly use a DIV layout, in which the structure (framework) of a site is built by defining divisions or sections of the document. This embodiment uses XPath to parse the DIV layout of the page and thereby quickly locates where the data and the jump links are distributed. During capture, data retrieval and capture and jump-link acquisition can target those positions, which removes the cumbersome full-page parsing of traditional crawler data capture and excludes web page noise (advertisement banners, navigation bars, copyright bars, and other content unrelated to the data to be captured). The speed of crawler capture and of jump-link acquisition is improved more effectively, with greater flexibility. Moreover, XPath can directly address the nodes that contain the information in an XML document, so the positions of the data and the jump links can be located quickly from each DIV block in the page. This greatly increases the speed of parsing the page layout and enables fast data acquisition.
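A minimal sketch of locating data and jump links by XPath inside a DIV layout is shown below. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed XML; real-world HTML is often not well-formed, so a production crawler would likely need an HTML-aware parser. The page markup, the `id` and `class` names, and the function name `extract` are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical, well-formed page: header and footer are noise, "main" holds the data.
PAGE = """
<html><body>
  <div id="header"><a href="/home">Home</a></div>
  <div id="main">
    <div class="comment">Great post</div>
    <div class="comment">I disagree</div>
    <a class="next" href="/topic?page=2">Next page</a>
  </div>
  <div id="footer"><a href="/copyright">Copyright</a></div>
</body></html>
"""


def extract(page: str) -> Tuple[List[str], List[str]]:
    """Return (comments, jump links) found inside the main content block only."""
    root = ET.fromstring(page)
    main = root.find(".//div[@id='main']")   # jump straight to the data region
    if main is None:
        return [], []
    comments = [d.text or "" for d in main.findall("./div[@class='comment']")]
    links = [a.get("href", "") for a in main.findall(".//a[@class='next']")]
    return comments, links


comments, links = extract(PAGE)
```

Because the queries start from the `main` block, the links in the header and footer are never visited, which is the noise-exclusion effect the paragraph above describes.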
S302: Use a crawler to obtain the first data at the position of the first data on the first page to be captured, and to obtain the at least one jump link at the position of the at least one jump link.
In this embodiment, after parsing has yielded the position of the first data and the position of the at least one jump link on the first page to be captured, the crawler extracts and filters the data and the jump links at those positions, thereby obtaining the data and the jump links.
Fig. 4 is a flowchart of Embodiment 4 of the crawler-based data capture method provided by an embodiment of the present invention. As shown in Fig. 4, building on the crawler-based data capture method of Embodiments 1 to 3, this embodiment provides a specific implementation of the method. The steps are:
S401: Select first account information from at least one piece of preset account information, log in to the website hosting the page to be captured using the first account information, and enter the first page to be captured.
Each piece of account information includes a login account and a login password.
This embodiment adds the account login step S401 mainly because social network platforms such as microblogs expose only limited data to visitors who are not logged in and open the full data only after login. After step S401, all of the data on the first page to be captured can be captured, ensuring that data capture is complete. Specifically, XPath can be used to identify the account login form in the page; the login account and login password of the selected first account information are then filled into the corresponding fields of the form to complete the login.
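Locating the login form and filling in the account fields might look like the sketch below. It is an assumption-laden illustration: the form markup, the field names, the `token` field, and the function `build_login_fields` are hypothetical, and the standard-library `xml.etree.ElementTree` is used with its limited XPath subset on a well-formed snippet.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical login form; real field names differ from site to site.
LOGIN_PAGE = """
<html><body>
  <form id="login" action="/login">
    <input type="text" name="username"/>
    <input type="password" name="password"/>
    <input type="hidden" name="token" value="abc123"/>
  </form>
</body></html>
"""


def build_login_fields(page: str, account: str, password: str) -> Dict[str, str]:
    """Locate the login form and fill in the account and password fields."""
    root = ET.fromstring(page)
    form = root.find(".//form[@id='login']")
    if form is None:
        raise ValueError("no login form found on the page")
    fields: Dict[str, str] = {}
    for inp in form.findall(".//input"):
        name = inp.get("name")
        if name is None:
            continue                      # unnamed inputs are not submitted
        kind = inp.get("type", "text")
        if kind == "password":
            fields[name] = password
        elif kind == "text":
            fields[name] = account
        else:
            fields[name] = inp.get("value", "")  # keep preset values such as tokens
    return fields


fields = build_login_fields(LOGIN_PAGE, "user_a", "pw_a")
```

The returned dictionary is what a browser-like client would then submit to the form's action address.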
In addition, the login process may require a verification code before login can proceed. As a further improvement of this embodiment, the account login step S401 may therefore also include: determining by page parsing whether a verification code is required; if so, capturing the verification code image and uploading it to the server of a verification code recognition platform; and, after the server has recognized it, receiving the returned text of the verification code, filling it into the corresponding form field, and continuing the login operation.
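The verification code branch can be sketched as an optional step in assembling the login fields. The recognition platform is represented here by a caller-supplied function; `complete_login_fields`, `fake_recognise`, the `captcha` field name, and the returned text are all hypothetical placeholders, not a real service API.

```python
from typing import Callable, Dict, Optional


def complete_login_fields(
    fields: Dict[str, str],
    code_image: Optional[bytes],
    recognise: Callable[[bytes], str],
) -> Dict[str, str]:
    """Add the recognised verification code when the page requires one."""
    filled = dict(fields)                 # leave the caller's dictionary untouched
    if code_image is not None:            # the page asked for a verification code
        filled["captcha"] = recognise(code_image)  # e.g. upload to a recognition server
    return filled


def fake_recognise(image: bytes) -> str:
    # Stand-in for the recognition platform's server; always returns the same text.
    return "7K3Q"


base = {"username": "user_a", "password": "pw_a"}
without_code = complete_login_fields(base, None, fake_recognise)
with_code = complete_login_fields(base, b"\x89PNG", fake_recognise)
```

Passing the recognizer in as a function keeps the login logic independent of whichever recognition service is actually used.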
S402: Obtain first data and at least one jump link on the first page to be captured.
The at least one jump link is a jump address on the first page to be captured that leads to a second page to be captured.
S403: Follow the at least one jump link to enter the second page to be captured corresponding to each jump link, and obtain second data on the second page to be captured.
S404: Store the first data and the second data in a preset database.
In this embodiment, the crawler simulates a person operating a browser to obtain first data and at least one jump link on the first page to be captured, follows the at least one jump link to enter the second page to be captured corresponding to each jump link, obtains second data on the second page to be captured, and stores the captured data in a preset database. Data are captured by obtaining the data in a page, and page jumps are achieved by obtaining the jump links in the page and jumping to the pages they lead to. By simulating a person operating a browser, the method achieves page jumps on interaction-rich dynamic pages. Even when a jump link is randomly generated, the data of the target page can be captured once the link has been obtained and followed. This solves the problem that a traditional crawler cannot obtain all of a page's data when capturing dynamic web pages.
As a further improvement of this embodiment, the first account information may also be checked between steps S401 and S402 to determine whether it can continue to be used. The check proceeds as follows:
First, detect whether the first account information has become invalid. If it has not, log in directly and enter the first page to be captured, then capture all of the data on the page as described above. If the first account information has become invalid, mark the first account information, select second account information from the at least one piece of account information, log in to the website using the second account information, and enter the first page to be captured.
In this scheme, detecting whether the first account information has become invalid includes checking the login state of the first account both after logging in to the application to be captured and during the capture process. Specifically, the login information of an account is usually displayed on the page to be captured. For example, a login account information bar may be shown in the page navigation bar, displaying the avatar, nickname, or account ID of the logged-in account; when no account is logged in or the account has become invalid, this bar is empty and shows no avatar, nickname, or account ID. XPath is used to locate the login account information bar and read its content in order to judge whether the login state of the first account is normal. If the avatar, nickname, or account ID corresponding to the first account is present there, the login state of the first account is deemed normal; if not, the login state of the first account is deemed invalid.
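The login-state check can be illustrated with a short sketch. It assumes, purely for illustration, that the page is well-formed XHTML and that the login account information bar is a `div` with class `account-bar` inside a `div` with class `nav`; these class names and the function `is_logged_in` are hypothetical and do not come from the embodiment.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath to the account ID inside the login account information bar.
ACCOUNT_ID_XPATH = ".//div[@class='nav']/div[@class='account-bar']/span[@class='account-id']"

def is_logged_in(page_xml, account_id):
    """Return True if the information bar shows the expected account ID."""
    page = ET.fromstring(page_xml)
    node = page.find(ACCOUNT_ID_XPATH)
    # An empty or missing bar means not logged in or the account is invalid.
    return node is not None and (node.text or "").strip() == account_id
```

The same idea applies to the avatar or nickname: locate the node, read its content, and compare it with what is expected for the first account.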
After the first account information is judged to be invalid, it is marked. One way to mark it is to add an account status code to the first account information, which records that this account information is invalid. Each time account information is chosen from the at least one piece of account information for logging in, the chosen account information is first checked for an account status code to determine whether it is invalid; if it is, another piece of account information is chosen instead. In this embodiment, once the first account information becomes invalid, the second account information is used to log in, so that data capture can continue.
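A minimal sketch of the marking and selection logic follows. The `Account` class, the `FAILED` status code value, and the function names are assumptions introduced for illustration; the embodiment only states that a status code is added to invalid account information and checked at selection time.

```python
from dataclasses import dataclass
from typing import Optional

FAILED = "FAILED"  # hypothetical value of the account status code

@dataclass
class Account:
    username: str
    password: str
    status_code: Optional[str] = None  # absent until the account is marked

def mark_failed(account):
    """Record in the account information that this account is invalid."""
    account.status_code = FAILED

def choose_account(accounts):
    """Return the first account without a status code, or None if all are marked."""
    for account in accounts:
        if account.status_code is None:
            return account
    return None
```

For example, after `mark_failed(accounts[0])`, `choose_account(accounts)` skips the first account and returns the next unmarked one.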
Of course, whether the first account information has become invalid may also be checked during data capture, to ensure that the account remains logged in normally throughout the capture process.
In a specific implementation, besides checking the validity of the account information to decide whether to keep using it for data capture, the number of data captures and the capture time of the account information may also be checked to decide whether to switch to other account information for logging in. The specific implementation is as follows:
Detect the number of data captures and/or the capture time of the first account information. When the number of data captures exceeds a preset capture count threshold, select third account information from the at least one piece of account information, log in to the website according to the third account information, and enter the first page to be captured; and/or, when the capture time exceeds a preset capture time threshold, select third account information from the at least one piece of account information, log in to the website according to the third account information, and enter the first page to be captured.
In this scheme, to cope with the strong anti-crawler verification mechanisms of some network platforms, this embodiment logs in with multiple accounts in rotation, so that no single valid account captures data for so long that it gets banned. Specifically, the number of data captures or the capture time can be limited. For example, when the number of data captures exceeds the preset capture count threshold, third account information is selected from the at least one piece of account information and used to log in to the application to be captured and enter the page to be captured. Here, the number of data captures may be the number of times data to be captured has been obtained, or the number of times data has been requested from the web server of the page to be captured, and so on. As another example, when the capture time exceeds the preset capture time threshold, third account information is selected from the at least one piece of account information and used to log in to the application to be captured and enter the page to be captured. Of course, the number of data captures and the capture time can also be limited at the same time. This scheme ensures that the number of times each account requests data from the network platform's server stays below the account request threshold of the platform's anti-crawler verification mechanism, which keeps the account valid and lets data capture proceed smoothly and continuously. The number of data captures or the capture time may be checked after each page to be captured has been captured, or checked in real time.
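The rotation rule can be sketched as a small helper that counts captures and elapsed time per account. The class name `AccountRotator`, its parameters, and the round-robin choice of the next account are assumptions made for illustration; the embodiment only requires switching to other account information once a preset threshold is exceeded.

```python
import time

class AccountRotator:
    """Switch accounts when the capture count or capture time exceeds a threshold."""

    def __init__(self, accounts, max_captures, max_seconds, clock=time.monotonic):
        self.accounts = list(accounts)
        self.max_captures = max_captures   # preset capture count threshold
        self.max_seconds = max_seconds     # preset capture time threshold
        self.clock = clock                 # injectable so the logic can be tested
        self.index = 0
        self._reset()

    def _reset(self):
        self.captures = 0
        self.started = self.clock()

    @property
    def current(self):
        return self.accounts[self.index]

    def record_capture(self):
        """Count one capture; return True if the account was switched."""
        self.captures += 1
        too_many = self.captures > self.max_captures
        too_long = self.clock() - self.started > self.max_seconds
        if too_many or too_long:
            # Round-robin selection of the next account (an assumption).
            self.index = (self.index + 1) % len(self.accounts)
            self._reset()
            return True
        return False
```

Calling `record_capture()` after each captured page corresponds to checking after every page; calling it on every request approximates real-time checking. In practice the preset thresholds would be chosen below the platform's own limits.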
In addition, some social networks set up an anti-crawler mechanism that requires a verification code to be filled in: when the data capture process reaches the trigger condition of the anti-crawler mechanism, a window for entering a verification code pops up. As a further improvement of this embodiment, one of the following two approaches can be used to handle this: first, fill in the verification code and continue capturing data; second, switch accounts, log in again, and continue capturing data. In the first approach, when page parsing finds a form on the page to be captured that requires a verification code, the verification code image is captured and uploaded to the server of a verification code recognition platform; after recognition, the server returns the text of the verification code, which is filled into the corresponding form field to complete verification, and data capture continues. In the second approach, another piece of account information is selected for logging in again, and data capture continues.
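The two approaches can be sketched together as follows. This is a sketch under assumptions: the form id `captcha-form`, the field name `captcha`, and the callables `recognize` (standing in for the recognition platform's server) and `switch_account` are hypothetical. Using the second approach as a fallback when recognition returns nothing is a design choice of this sketch; the embodiment presents the two approaches as alternatives.

```python
import xml.etree.ElementTree as ET

def handle_verification(page_xml, recognize, switch_account):
    """Decide how to proceed when a page may contain a verification code form."""
    page = ET.fromstring(page_xml)
    form = page.find(".//form[@id='captcha-form']")  # hypothetical form id
    if form is None:
        return {"action": "continue"}  # no verification code required
    image = form.find(".//img[@src]")
    # First approach: send the image to a recognition service and fill in the text.
    text = recognize(image.get("src")) if image is not None else None
    if text:
        return {"action": "submit", "fields": {"captcha": text}}
    # Second approach: switch to another account and log in again.
    switch_account()
    return {"action": "relogin"}
```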
The crawler-based data capture method provided by the above embodiment can also be used for distributed data acquisition; that is, multiple data acquisition tasks are executed simultaneously and data is captured from multiple pages in a distributed, parallel manner, which improves data capture efficiency.
As a further improvement of this embodiment, after a first data acquisition task has obtained the first data and at least one redirect link on the first page to be captured, a second data acquisition task may be started. The second data acquisition task jumps to the jump address corresponding to the redirect link, while the first data acquisition task continues to obtain data and redirect links on the page to be captured.
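One way to picture this division of work is with a thread pool, where each redirect link found by the first task is handed to a separate second task. The sketch below uses Python's standard `concurrent.futures` module; the function name `capture_in_parallel` and the callable `capture_second_page` are hypothetical, and threads on one machine are only a stand-in for a truly distributed deployment.

```python
from concurrent.futures import ThreadPoolExecutor

def capture_in_parallel(first_data, links, capture_second_page, max_workers=4):
    """Start one second task per redirect link while the first task keeps producing links.

    `links` may be any iterable, including a generator that yields links
    as the first task discovers them.
    """
    results = {"first": first_data}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each submit() starts a second data acquisition task for one link.
        pending = {link: pool.submit(capture_second_page, link) for link in links}
        for link, future in pending.items():
            results[link] = future.result()
    return results
```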
In addition, the preset database that stores the captured data may be a MySQL database. During distributed data capture, up to a certain data volume, the isolation property of the MySQL database itself protects the integrity of reads and writes of each piece of data, so read/write conflicts do not occur.
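The role of the database in protecting each write can be illustrated with a transactional insert. Note the assumptions: this sketch uses the standard-library `sqlite3` module instead of MySQL so that it is self-contained, the table layout is hypothetical, and SQLite's transaction behavior is not identical to MySQL's isolation levels. It only shows the general idea that a batch is either stored completely or not at all.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the database and create a hypothetical table keyed by page address."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS captured (url TEXT PRIMARY KEY, data TEXT)")
    return db

def store_batch(db, rows):
    """Insert all rows in one transaction; return False if the batch was rolled back."""
    try:
        with db:  # commits on success, rolls back if an exception is raised
            db.executemany("INSERT INTO captured (url, data) VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False
```

If two tasks try to store data for the same page address, the second batch is rejected as a whole and the rows already stored remain intact.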
Fig. 5 is a schematic structural diagram of an embodiment of the crawler-based data acquisition device provided by an embodiment of the present invention. As shown in Fig. 5, the crawler-based data acquisition device 10 provided by this embodiment includes:
a data acquisition module 11, configured to obtain the first data and at least one redirect link on the first page to be captured, wherein the at least one redirect link is a jump address on the first page to be captured from which a second page to be captured can be reached;
a processing module 12, configured to enter, according to the at least one redirect link, the second page to be captured corresponding to each redirect link;
the data acquisition module 11 being further configured to obtain the second data on the second page to be captured; and
a storage module 13, configured to store the first data and the second data in a preset database.
In this embodiment, the data acquisition module 11 simulates the behavior of a person operating a browser to obtain the first data and at least one redirect link on the first page to be captured. The processing module 12, likewise simulating manual browser operation, follows the at least one redirect link to enter the second page to be captured corresponding to each redirect link, and the data acquisition module 11 obtains the second data and redirect links on the second page to be captured. This makes page jumps and data acquisition possible both for pages with rich dynamic interaction and for pages whose next-page address is randomly generated, which solves the prior-art problem that a traditional crawler cannot jump between pages when crawling dynamic pages. The storage module 13 stores the captured data in the preset database; the storage process places the data at the corresponding location in the preset database to facilitate statistics and analysis of the data.
As a further improvement of this embodiment, the data acquisition module 11 is specifically configured to:
parse the layout of the first page to be captured, and locate the position of the first data and the position of the at least one redirect link on the first page to be captured; and
obtain, in crawler mode, the first data corresponding to the position of the first data on the first page to be captured, and obtain the at least one redirect link corresponding to the position of the at least one redirect link.
Specifically, the data acquisition module 11 parses the page layout of the first page to be captured and quickly locates where data and redirect links are distributed on the web page, so that during capture, data retrieval, data capture, and redirect link acquisition can target those positions. This removes the tedious parsing of the whole page required in traditional crawler data capture: the page does not need to be traversed from beginning to end, and web page noise (content unrelated to the data to be captured, such as advertising banners, navigation bars, and copyright bars) can be quickly excluded. As a result, the speed of crawling and of obtaining redirect links is improved more effectively, and flexibility is higher. After parsing yields the position of the first data and the position of the at least one redirect link on the first page to be captured, the data acquisition module 11 uses crawler mode to extract and filter the data and redirect links at those positions, thereby obtaining the data and the redirect links.
As a further improvement of this embodiment, the data acquisition module 11 is specifically configured to:
use the XML Path Language (XPath) to parse the position and layout of the page to be captured, and obtain the position of the first data and the position of the at least one redirect link.
In this embodiment, the data acquisition module 11 uses XPath to parse the DIV layout of the page and thereby quickly locates where data and redirect links are distributed on the page, so that during capture, data retrieval, data capture, and redirect link acquisition can target those positions. This removes the tedious parsing of the whole page required in traditional crawler data capture and excludes web page noise (content unrelated to the data to be captured, such as advertising banners, navigation bars, and copyright bars), which improves the speed of crawling and of obtaining redirect links more effectively and gives higher flexibility. Moreover, XPath can directly locate the nodes that contain information in an XML document, so the distribution positions of data and redirect links on the page can be found quickly from the DIV layout of the page. This significantly increases the speed of parsing the page layout and enables fast data acquisition.
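As an illustration of locating data and redirect links through the DIV layout, the sketch below applies two XPath expressions to a small page. The page markup, the `id` and `class` values, and the function `locate` are hypothetical. Python's standard `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, so real HTML pages would normally need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: navigation, advertising, and footer blocks are noise.
PAGE = """<html><body>
<div id="nav"><a href="/home">Home</a></div>
<div id="ads"><a href="/buy">Ad</a></div>
<div id="main">
  <div class="post"><p>post text A</p><a href="/post/1">more</a></div>
  <div class="post"><p>post text B</p><a href="/post/2">more</a></div>
</div>
<div id="footer"><p>copyright</p></div>
</body></html>"""

# XPath expressions that address only the DIV blocks holding useful content.
DATA_XPATH = ".//div[@id='main']/div[@class='post']/p"
LINK_XPATH = ".//div[@id='main']/div[@class='post']/a[@href]"

def locate(page_xml):
    """Return the data texts and redirect links found at the located positions."""
    root = ET.fromstring(page_xml)
    data = [node.text for node in root.findall(DATA_XPATH)]
    links = [node.get("href") for node in root.findall(LINK_XPATH)]
    return data, links
```

Because the expressions address the `main` block directly, the links in the navigation and advertising blocks and the text in the footer are never returned.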
As a further improvement of this embodiment, the processing module 12 is further configured to select first account information from at least one piece of preset account information, log in to the website where the page to be captured is located according to the first account information, and enter the first page to be captured;
wherein each piece of account information includes a login account and a login password.
In this embodiment, adding the processing module 12 makes it possible to capture all of the data on pages to be captured of websites that open their full data only after an account has logged in, for example social network platforms such as microblogs, which ensures the completeness of data capture. Specifically, the data acquisition module 11 may use XPath to identify the account login form in the page, and fill the login account and login password corresponding to the selected first account information into the appropriate fields of the login form to complete the account login process.
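Identifying and filling the login form can be sketched as follows. The form id `login`, the field types, and the function `build_login_payload` are assumptions for illustration; the sketch only prepares the values to submit and does not perform any network request.

```python
import xml.etree.ElementTree as ET

def build_login_payload(page_xml, username, password):
    """Locate the login form with XPath and fill in the account and password.

    Returns (form action, field values), or None if no login form is found.
    """
    root = ET.fromstring(page_xml)
    form = root.find(".//form[@id='login']")  # hypothetical form id
    if form is None:
        return None
    payload = {}
    for field in form.findall(".//input[@name]"):
        kind = field.get("type", "text")
        if kind == "password":
            payload[field.get("name")] = password
        elif kind == "text":
            payload[field.get("name")] = username
        else:
            # Keep preset values of other fields, such as hidden tokens.
            payload[field.get("name")] = field.get("value", "")
    return form.get("action"), payload
```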
In addition, during account login the processing module 12 may encounter a situation in which login is possible only after verification-code verification. Accordingly, as a further improvement of this embodiment, the processing module 12 may also be configured to judge through page parsing whether verification-code verification is needed. If a verification code is needed, the verification code image is captured and uploaded to the server of a verification code recognition platform; after recognition, the server returns the text of the verification code, which is filled into the corresponding form field, and the login operation continues.
As a further improvement of this embodiment, in the device provided by this embodiment, the processing module 12 is further configured to:
detect whether the first account information has become invalid;
if the first account information has become invalid, mark the first account information and select second account information from the at least one piece of account information; and
log in to the website according to the second account information and enter the first page to be captured.
In this embodiment, the detection by the processing module 12 of whether the first account information has become invalid includes checking the login state of the first account both after logging in to the application to be captured and during the capture process. Specifically, the login information of an account is usually displayed on the page to be captured. For example, a login account information bar may be shown in the page navigation bar, displaying the avatar, nickname, or account ID of the logged-in account; when no account is logged in or the account has become invalid, this bar is empty and shows no avatar, nickname, or account ID. The processing module 12 uses XPath to locate the login account information bar and read its content in order to judge whether the login state of the first account is normal. If the avatar, nickname, or account ID corresponding to the first account is present there, the login state of the first account is deemed normal; if not, the login state of the first account is deemed invalid. After the first account information is judged to be invalid, the processing module 12 marks it. One way to mark it is to add an account status code to the first account information, which records that this account information is invalid. Each time the processing module 12 chooses account information from the at least one piece of account information for logging in, it first checks whether the chosen account information contains an account status code to determine whether it is invalid; if it is, another piece of account information is chosen instead. In this embodiment, once the first account information becomes invalid, the processing module 12 switches to the second account information to log in, so that data capture can continue.
As a further improvement of this embodiment, the processing module 12 of this embodiment is further configured to detect the number of data captures and/or the capture time of the first account information; when the number of data captures exceeds a preset capture count threshold, select third account information from the at least one piece of account information, log in to the website according to the third account information, and enter the first page to be captured; and/or, when the capture time exceeds a preset capture time threshold, select third account information from the at least one piece of account information, log in to the website according to the third account information, and enter the first page to be captured.
In this embodiment, to cope with the strong anti-crawler verification mechanisms of some network platforms, multiple accounts are used to log in in rotation, so that no single valid account captures data for so long that it gets banned. Specifically, the processing module 12 can switch accounts at a preset capture count threshold or capture time threshold. For example, when the number of data captures exceeds the preset capture count threshold, the processing module 12 selects third account information from the at least one piece of account information and uses it to log in to the application to be captured and enter the page to be captured. Here, the number of data captures may be the number of times data to be captured has been obtained, or the number of times data has been requested from the web server of the page to be captured, and so on. As another example, when the capture time exceeds the preset capture time threshold, the processing module 12 selects third account information from the at least one piece of account information and uses it to log in to the application to be captured and enter the page to be captured. Of course, the processing module 12 can also limit the number of data captures and the capture time at the same time. Having the processing module 12 carry out this scheme ensures that the number of times each account requests data from the network platform's server stays below the account request threshold of the platform's anti-crawler verification mechanism, which keeps the account valid and lets data capture proceed smoothly and continuously. The number of data captures or the capture time may be checked after each page to be captured has been captured, or checked in real time.
The computer program product for carrying out the crawler-based data capture method provided by the embodiments of the present invention includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the method described in the foregoing method embodiments; for the specific implementation, refer to the method embodiments, which are not repeated here. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the device described above may refer to the corresponding process in the foregoing method embodiments, and is not repeated here.
The flowcharts and block diagrams in the accompanying drawings show the architecture, functions, and operations of possible implementations of methods and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A crawler-based data capture method, characterized by comprising:
obtaining first data and at least one redirect link on a first page to be captured, wherein the at least one redirect link is a jump address on the first page to be captured from which a second page to be captured can be reached;
entering, according to the at least one redirect link, the second page to be captured corresponding to each redirect link, and obtaining second data on the second page to be captured; and
storing the first data and the second data in a preset database.
2. The method according to claim 1, characterized in that obtaining the first data and the at least one redirect link on the first page to be captured comprises:
parsing the layout of the first page to be captured, and locating the position of the first data and the position of the at least one redirect link on the first page to be captured; and
obtaining, in crawler mode, the first data corresponding to the position of the first data on the first page to be captured, and obtaining the at least one redirect link corresponding to the position of the at least one redirect link.
3. The method according to claim 2, characterized in that parsing the layout of the page to be captured and locating the position of the first data and the position of the at least one redirect link on the first page to be captured comprises:
using the XML Path Language (XPath) to parse the position and layout of the page to be captured, and obtaining the position of the first data and the position of the at least one redirect link.
4. The method according to any one of claims 1 to 3, characterized in that, before obtaining the first data and the at least one redirect link on the first page to be captured, the method further comprises:
selecting first account information from at least one piece of preset account information, logging in to the website where the page to be captured is located according to the first account information, and entering the first page to be captured;
wherein each piece of account information includes a login account and a login password.
5. The method according to claim 4, characterized in that the method further comprises:
detecting whether the first account information has become invalid;
if the first account information has become invalid, marking the first account information and selecting second account information from the at least one piece of account information; and
logging in to the website according to the second account information and entering the first page to be captured.
6. The method according to claim 4, characterized in that the method further comprises:
detecting the number of data captures and/or the capture time of the first account information; and
when the number of data captures exceeds a preset capture count threshold, selecting third account information from the at least one piece of account information, logging in to the website according to the third account information, and entering the first page to be captured; and/or, when the capture time exceeds a preset capture time threshold, selecting third account information from the at least one piece of account information, logging in to the website according to the third account information, and entering the first page to be captured.
7. A crawler-based data acquisition device, characterized by comprising:
a data acquisition module, configured to obtain first data and at least one redirect link on a first page to be captured, wherein the at least one redirect link is a jump address on the first page to be captured from which a second page to be captured can be reached;
a processing module, configured to enter, according to the at least one redirect link, the second page to be captured corresponding to each redirect link;
the data acquisition module being further configured to obtain second data on the second page to be captured; and
a storage module, configured to store the first data and the second data in a preset database.
8. The device according to claim 7, characterized in that the data acquisition module is specifically configured to:
parse the layout of the first page to be captured, and locate the position of the first data and the position of the at least one redirect link on the first page to be captured; and
obtain, in crawler mode, the first data corresponding to the position of the first data on the first page to be captured, and obtain the at least one redirect link corresponding to the position of the at least one redirect link.
9. The device according to claim 8, characterized in that the data acquisition module is specifically configured to:
use the XML Path Language (XPath) to parse the position and layout of the page to be captured, and obtain the position of the first data and the position of the at least one redirect link.
10. The device according to any one of claims 7 to 9, characterized in that
the processing module is further configured to select first account information from at least one piece of preset account information, log in to the website where the page to be captured is located according to the first account information, and enter the first page to be captured;
wherein each piece of account information includes a login account and a login password.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610556254.0A CN106126747A (en) | 2016-07-14 | 2016-07-14 | Data capture method based on reptile and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610556254.0A CN106126747A (en) | 2016-07-14 | 2016-07-14 | Data capture method based on reptile and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106126747A true CN106126747A (en) | 2016-11-16 |
Family
ID=57283292
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610556254.0A Pending CN106126747A (en) | 2016-07-14 | 2016-07-14 | Data capture method based on reptile and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106126747A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107526833A (en) * | 2017-09-05 | 2017-12-29 | 广东科杰通信息科技有限公司 | A kind of URL management methods, system |
CN108062413A (en) * | 2017-12-30 | 2018-05-22 | 平安科技(深圳)有限公司 | Web data processing method, device, computer equipment and storage medium |
CN108090091A (en) * | 2016-11-23 | 2018-05-29 | 北京国双科技有限公司 | Web page crawl method and apparatus |
CN108536788A (en) * | 2018-03-29 | 2018-09-14 | 合肥俊刚机械科技有限公司 | A kind of data capture method and its system based on distributed reptile |
CN108733663A (en) * | 2017-04-13 | 2018-11-02 | 富士通株式会社 | Webpage capture method and apparatus |
CN108829838A (en) * | 2018-06-19 | 2018-11-16 | 彭建超 | A kind of account information batch processing method and server |
CN110069684A (en) * | 2017-09-30 | 2019-07-30 | 北京国双科技有限公司 | A kind of data crawling method, device, storage medium and processor |
CN111125489A (en) * | 2019-12-25 | 2020-05-08 | 北京锐安科技有限公司 | Data capturing method, device, equipment and storage medium |
CN111597421A (en) * | 2020-04-30 | 2020-08-28 | 武汉思普崚技术有限公司 | Method, device, equipment and storage medium for realizing website picture crawler |
CN112528120A (en) * | 2020-12-21 | 2021-03-19 | 北京中安智达科技有限公司 | Method for web data crawler to use browser to divide body and proxy |
CN113343159A (en) * | 2021-08-06 | 2021-09-03 | 万商云集(成都)科技股份有限公司 | Method and system for rapidly acquiring data from any channel, analyzing and storing data |
CN114039782A (en) * | 2021-11-10 | 2022-02-11 | 深圳安巽科技有限公司 | Method, system and storage medium for monitoring hidden network |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101625692A (en) * | 2009-08-04 | 2010-01-13 | 北京大学 | Method for rapidly collecting dynamic script website data |
CN101996196A (en) * | 2009-08-28 | 2011-03-30 | 中国移动通信集团公司 | Dynamic webpage acquisition method and device |
CN102262635A (en) * | 2010-05-25 | 2011-11-30 | 北京启明星辰信息技术股份有限公司 | Page crawler system and page crawler method |
CN102937989A (en) * | 2012-10-29 | 2013-02-20 | 北京腾逸科技发展有限公司 | Parallel distributed internet data capture method and system |
CN103186670A (en) * | 2013-03-27 | 2013-07-03 | 中金数据系统有限公司 | Method and system for integrally acquiring webpage information |
CN103279507A (en) * | 2013-05-16 | 2013-09-04 | 北京尚友通达信息技术有限公司 | Webpage spider operational method and system |
CN103618649A (en) * | 2013-12-03 | 2014-03-05 | 北京人民在线网络有限公司 | Website data acquisition method and device |
CN105320740A (en) * | 2015-09-22 | 2016-02-10 | 清华大学 | WeChat article and official account acquisition method and acquisition system |
CN105589953A (en) * | 2015-12-21 | 2016-05-18 | 南通大学 | Unexpected public health event internet text extraction method |
Non-Patent Citations (3)
Title |
---|
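Liu Fanfan (刘凡凡): "Research and Implementation of a Fixed-Address Web Crawler System Supporting AJAX" (支持AJAX的定址网络爬虫系统的研究和实现), China Master's Theses Full-text Database, Information Science and Technology series * |
Cheng Guang (程光) et al.: "Botnet Detection Technology" (僵尸网络检测技术), 31 October 2014 * |
Ma Gang (马刚): "Semantics-Based Web Data Mining" (基于语义的web 数据挖掘), 31 January 2014 * |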
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090091A (en) * | 2016-11-23 | 2018-05-29 | 北京国双科技有限公司 | Web page crawl method and apparatus |
CN108733663A (en) * | 2017-04-13 | 2018-11-02 | 富士通株式会社 | Webpage capture method and apparatus |
CN107526833B (en) * | 2017-09-05 | 2020-03-24 | 广东科杰通信息科技有限公司 | URL management method and system |
CN107526833A (en) * | 2017-09-05 | 2017-12-29 | 广东科杰通信息科技有限公司 | URL management method and system |
CN110069684A (en) * | 2017-09-30 | 2019-07-30 | 北京国双科技有限公司 | Data crawling method, device, storage medium and processor |
CN108062413A (en) * | 2017-12-30 | 2018-05-22 | 平安科技(深圳)有限公司 | Web data processing method, device, computer equipment and storage medium |
CN108536788A (en) * | 2018-03-29 | 2018-09-14 | 合肥俊刚机械科技有限公司 | Data capture method and system based on distributed crawler |
CN108829838B (en) * | 2018-06-19 | 2021-11-26 | 彭建超 | Batch processing method of account information and server |
CN108829838A (en) * | 2018-06-19 | 2018-11-16 | 彭建超 | Account information batch processing method and server |
CN111125489A (en) * | 2019-12-25 | 2020-05-08 | 北京锐安科技有限公司 | Data capturing method, device, equipment and storage medium |
CN111597421A (en) * | 2020-04-30 | 2020-08-28 | 武汉思普崚技术有限公司 | Method, device, equipment and storage medium for realizing website picture crawler |
CN111597421B (en) * | 2020-04-30 | 2022-08-30 | 武汉思普崚技术有限公司 | Method, device, equipment and storage medium for realizing website picture crawler |
CN112528120A (en) * | 2020-12-21 | 2021-03-19 | 北京中安智达科技有限公司 | Method for web data crawler to use browser to divide body and proxy |
CN113343159A (en) * | 2021-08-06 | 2021-09-03 | 万商云集(成都)科技股份有限公司 | Method and system for rapidly acquiring data from any channel, analyzing and storing data |
CN113343159B (en) * | 2021-08-06 | 2021-11-12 | 万商云集(成都)科技股份有限公司 | Method and system for rapidly acquiring data from any channel, analyzing and storing data |
CN114039782A (en) * | 2021-11-10 | 2022-02-11 | 深圳安巽科技有限公司 | Method, system and storage medium for monitoring hidden network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106126747A (en) | | Crawler-based data capture method and device |
Iqbal et al. | | Adgraph: A graph-based approach to ad and tracker blocking |
Powell et al. | | Web site engineering: beyond Web page design |
CN105243159A (en) | | Visual script editor-based distributed web crawler system |
CN104956362B (en) | | Analyzing the structure of web applications |
Li et al. | | Here's what I did: Sharing and reusing web activity with ActionShot |
CN110069683B (en) | | Method and device for crawling data based on browser |
CN110442811A (en) | | Page processing method, device, computer equipment and storage medium |
CN106484383A (en) | | Page rendering method, device and equipment |
CN103412890A (en) | | Webpage loading method and device |
CN101443751A (en) | | Method and apparatus for an application crawler |
CN107391775A (en) | | General web crawler model implementation method and system |
CN102833212A (en) | | Webpage visitor identity identification method and system |
CN103577427A (en) | | Browser kernel based web page crawling method and device and browser containing device |
CN107590236B (en) | | Big data acquisition method and system for building construction enterprises |
CN109960491A (en) | | Application program generation method, generation device, electronic equipment and storage medium |
CN110275705A (en) | | Method, apparatus, equipment and storage medium for generating preloaded page code |
CN107807937A (en) | | Website SEO processing method, apparatus and system |
CN102880679B (en) | | Web page information storage method and device |
CN102185830B (en) | | Method and system for security filtering of a network television browser |
KR101287371B1 (en) | | Method and Device for Collecting Web Contents and Computer-readable Recording Medium for the same |
Choudhary et al. | | Solving some modeling challenges when testing rich internet applications for security |
CN103336693B (en) | | Creation method and device for a refer chain, and security detection equipment |
CN103020179A (en) | | Method, device and equipment for extracting webpage contents |
CN105930385A (en) | | Data crawling method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161116 |