CN104252530A - Single-computer crawler grabbing method and system - Google Patents

Publication number
CN104252530A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410458191.6A
Other languages
Chinese (zh)
Other versions
CN104252530B (en)
Inventor
廖耀华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201410458191.6A
Publication of CN104252530A
Application granted
Publication of CN104252530B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention discloses a single-computer crawler grabbing method and system. The method includes: acquiring at least one seed comprising a URL (uniform resource locator), a website number, and a type, and taking the seed's URL as the current URL, its website number as the current website number, and its type as the current type; acquiring at least one strategy and determining at least one crawler grabbing parameter according to the strategy; acquiring the rule corresponding to the current type; and grabbing web page data from the current URL according to the crawler grabbing parameter, then parsing the web page data according to the rule to obtain parsed data. Because the crawler grabbing parameters are determined by strategies, problems that arise during grabbing can be overcome promptly, which improves working efficiency, extends the crawl duration, and suits websites of various types.

Description

A single-computer crawler grabbing method and system
Technical field
The present invention relates to web crawler technology, and in particular to a single-computer crawler grabbing method and system.
Background art
The Internet holds massive amounts of data and information. Converting that data into the form one needs, and then processing and analyzing it, is a difficult task. Web crawlers emerged to address this problem.
Most current crawlers implement only the basic function of crawling web pages. They handle repeated crawling, infinite-loop traps, and countermeasures against anti-crawling mechanisms (to extend the crawl duration) poorly. In addition, current single-computer crawlers have poor compatibility and cannot meet the need to grab multiple websites at the same time.
Summary of the invention
In view of this, to address the technical problems that existing single-computer web crawlers work inefficiently, can sustain a crawl only for a short time, and cannot grab multiple types of websites at the same time, a single-computer crawler grabbing method and system are provided.
A single-computer crawler grabbing method, comprising:
Obtaining at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
Obtaining at least one strategy, and determining at least one crawler grabbing parameter according to the strategy;
Obtaining the rule corresponding to the current type according to the current type;
Grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.
A single-computer crawler grabbing system, comprising:
A seed acquisition module, for obtaining at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
A strategy module, for obtaining at least one strategy and determining at least one crawler grabbing parameter according to the strategy;
A rule module, for obtaining the rule corresponding to the current type according to the current type;
A parsing module, for grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.
The present invention determines the crawler grabbing parameters through strategies so that problems arising during grabbing are overcome promptly, which improves working efficiency, extends the crawl duration, and accommodates multiple types of websites.
Brief description of the drawings
Fig. 1 is the workflow diagram of the single-computer crawler grabbing method of the present invention;
Fig. 2 is the structural block diagram of the single-computer crawler grabbing system of the present invention;
Fig. 3 is the structural block diagram of a preferred embodiment of the single-computer crawler grabbing system of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 shows the workflow diagram of the single-computer crawler grabbing method of the present invention, which comprises:
Step 11, obtaining at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
Step 12, obtaining at least one strategy, and determining at least one crawler grabbing parameter according to the strategy;
Step 13, obtaining the rule corresponding to the current type according to the current type;
Step 14, grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.
The strategy in step 12 determines the crawler grabbing parameters: different strategies yield different parameters, and step 14 grabs web page data using the parameters determined in step 12. Because the parameters are determined by the strategies of step 12, different strategies can be configured to meet different grabbing needs, which improves working efficiency, extends the crawl duration, and accommodates multiple types of websites.
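Steps 11 to 14 can be sketched in Python as follows. This is only an illustrative outline under stated assumptions: the names `Seed`, `run_seed`, the type labels, and the shape of a strategy (a function returning a dictionary of parameters) are hypothetical and are not defined by the patent, and the actual network request is passed in as a `fetch` function so that the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Seed:
    url: str
    site_id: int    # website number
    seed_type: str  # hypothetical labels such as "home", "paging", "detail"


# Assumption: a strategy maps a seed to some grabbing parameters.
Strategy = Callable[[Seed], Dict[str, Any]]


def run_seed(seed: Seed,
             strategies: List[Strategy],
             rules: Dict[str, Callable[[str], Any]],
             fetch: Callable[[str, Dict[str, Any]], str]) -> Any:
    # Step 11: the seed's fields become the "current" values.
    current_url, current_type = seed.url, seed.seed_type

    # Step 12: every strategy contributes to the grabbing parameters.
    params: Dict[str, Any] = {}
    for strategy in strategies:
        params.update(strategy(seed))

    # Step 13: look up the rule that corresponds to the current type.
    rule = rules[current_type]

    # Step 14: grab the page with those parameters, then parse it with the rule.
    page = fetch(current_url, params)
    return rule(page)
```

Passing `fetch` as an argument is a design choice of this sketch: it lets the flow be exercised with a stub instead of a live website.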
In one embodiment, in step 14, if an exception occurs while grabbing the web page data or while parsing the web page data, the exception is saved.
By monitoring for exceptions that occur while grabbing or parsing the web page data, exceptions can be fed back to the user promptly, which avoids wasting resources.
In one embodiment, the strategies include a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy, and/or a proxy IP switching strategy.
Among the strategies of this embodiment, the seed infinite-loop handling strategy prevents repeated crawling and falling into infinite-loop traps, while the browser identifier switching strategy, the cookie dynamic update strategy, and/or the proxy IP switching strategy extend the crawl duration.
In one embodiment:
The seed infinite-loop handling strategy is specifically: the crawler grabbing parameter is to allow or refuse grabbing web page data from the current URL; if the exception is that the current URL is trapped in an infinite loop, the crawler grabbing parameter is set to refuse grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to allow grabbing web page data from the current URL;
The browser identifier switching strategy is specifically: the crawler grabbing parameter is the browser identifier used when grabbing web page data from the current URL; if the exception is a failure to grab web page data from the current URL, or a preset timer expires, the browser identifier used when grabbing web page data from the current URL is updated to another browser identifier; otherwise, the browser identifier used when grabbing web page data from the current URL is not updated;
The cookie dynamic update strategy is specifically: the crawler grabbing parameter is to allow or refuse updating the cookie when grabbing web page data from the current URL; if a preset timer expires, the crawler grabbing parameter is set to allow updating the cookie when grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the cookie when grabbing web page data from the current URL;
The proxy IP switching strategy is specifically: the crawler grabbing parameter is to allow or refuse updating the proxy IP when grabbing web page data from the current URL; if the exception is a failure to grab web page data from the current URL, or a preset timer expires, the crawler grabbing parameter is set to allow updating the proxy IP when grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the proxy IP when grabbing web page data from the current URL.
This embodiment further describes the seed infinite-loop handling strategy, the browser identifier switching strategy, the cookie dynamic update strategy, and the proxy IP switching strategy. Of these, the seed infinite-loop handling strategy, the browser identifier switching strategy, and the proxy IP switching strategy adjust the crawler grabbing parameters according to exceptions, while the cookie dynamic update strategy adjusts them by timed updates.
Specifically, the seed infinite-loop handling strategy mainly addresses infinite-loop traps on websites. After the crawler grabs web page data from a URL, it parses new URLs out of that data and then grabs new web page data from those URLs. Some websites, however, set infinite-loop traps: the new URL parsed from the web page data is a URL that already exists, so the crawler becomes trapped in an infinite loop and grabbing is disrupted. Under the seed infinite-loop handling strategy, when the exception that the current URL is trapped in an infinite loop is detected, the parameter is set to refuse grabbing web page data from the current URL, which avoids the loop.
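The text does not say how the loop is detected. A minimal sketch, assuming a URL counts as trapped when it has already been grabbed once, could keep a set of visited URLs; the class name `LoopTrapStrategy` and its method are hypothetical.

```python
class LoopTrapStrategy:
    """Allow or refuse grabbing a URL, based on whether it was seen before."""

    def __init__(self):
        # Assumption: a plain in-memory set is enough for a single computer.
        self._seen = set()

    def grab_allowed(self, url: str) -> bool:
        # A URL that was already grabbed would send the crawler in a circle,
        # so the grabbing parameter becomes "refuse" (False).
        if url in self._seen:
            return False
        # First visit: remember the URL and allow the grab (True).
        self._seen.add(url)
        return True
```

A real crawler would probably normalize URLs (for example, strip fragments) before the lookup; that step is omitted here.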
Specifically, the browser identifier switching strategy imitates user behavior as closely as possible. Different users may use different browsers, so imitating user behavior requires changing the browser type or version. The type and version of a browser are indicated by a browser identifier (for example, the user-agent). When crawling, the crawler simulates a virtual browser that is distinguished by its user-agent; the user-agent value is determined by the browser's type and version number, so changing the user-agent value is equivalent to switching browsers. Therefore, when a failure to grab web page data from the current URL is detected, or a preset timer expires, the browser identifier is changed to extend the crawler's crawl duration.
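The switching condition (grab failure or timer expiry) can be sketched as below. The class name `UserAgentSwitcher`, the round-robin order, and the injectable `clock` are assumptions made for illustration; the patent only says the identifier is updated to another one.

```python
import itertools
import time


class UserAgentSwitcher:
    """Rotate the user-agent on grab failure or when a preset interval elapses."""

    def __init__(self, user_agents, interval_seconds, clock=time.monotonic):
        self._pool = itertools.cycle(user_agents)  # assumption: round-robin order
        self._interval = interval_seconds
        self._clock = clock
        self.current = next(self._pool)
        self._last_switch = clock()

    def update(self, grab_failed: bool) -> str:
        timer_expired = self._clock() - self._last_switch >= self._interval
        if grab_failed or timer_expired:
            # Changing the user-agent value is what "switching browsers" means here.
            self.current = next(self._pool)
            self._last_switch = self._clock()
        # Otherwise the identifier is left unchanged.
        return self.current
```

The returned string would be sent as the `User-Agent` request header of the next grab.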
Specifically, the cookie dynamic update strategy is implemented mainly by timed updates: when a preset timer expires, the cookie is updated. Updating the cookie is equivalent to establishing a new session with the website being grabbed, which extends the crawl duration.
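One way to picture the timed update is shown below. It is a sketch under the assumption that "updating the cookie" can be approximated by discarding the stored cookies so that the next request opens a fresh session; the class name `CookieRefresher` and the injectable `clock` are hypothetical. `http.cookiejar.CookieJar` is the standard-library cookie container.

```python
import http.cookiejar
import time


class CookieRefresher:
    """Allow a cookie update only when the preset interval has elapsed."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self.jar = http.cookiejar.CookieJar()
        self._interval = interval_seconds
        self._clock = clock
        self._last_refresh = clock()

    def before_grab(self) -> bool:
        """Return True if the cookie update was allowed (and performed)."""
        if self._clock() - self._last_refresh >= self._interval:
            # Assumption: clearing the jar makes the site issue new cookies,
            # which amounts to a new session on the next request.
            self.jar.clear()
            self._last_refresh = self._clock()
            return True
        # Timer not yet expired: the update is refused.
        return False
```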
Specifically, the proxy IP switching strategy mainly addresses websites that block an IP address that grabs web page data for a long time (the IP is a network address, such as an IPv4 or IPv6 address; IPv4 addresses of the form XXX.XXX.XXX.XXX are typical). A single computer generally has only one IP, so a single-computer crawler grabs through proxy IPs. Under the proxy IP switching strategy, when a failure to grab web page data from the current URL is detected, or a preset timer expires, the proxy IP is changed to avoid being blocked.
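The allow/refuse decision and the switch itself can be separated, as in this sketch. The names `ProxyPool` and `proxy_update_param`, the rotation order, and the example addresses are all hypothetical.

```python
from collections import deque


class ProxyPool:
    """Hold a list of proxy addresses and rotate through them."""

    def __init__(self, proxies):
        self._proxies = deque(proxies)

    @property
    def current(self) -> str:
        return self._proxies[0]

    def switch(self) -> str:
        # Move the current proxy to the back and use the next one.
        self._proxies.rotate(-1)
        return self.current


def proxy_update_param(grab_failed: bool, timer_expired: bool) -> str:
    """Return "allow" on grab failure or timer expiry, otherwise "refuse"."""
    return "allow" if (grab_failed or timer_expired) else "refuse"
```

A caller would check `proxy_update_param(...)` before each grab and call `pool.switch()` only when it returns `"allow"`.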
In one embodiment, in step 14, parsing the web page data according to the rule to obtain parsed data specifically comprises:
If the current type is homepage URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data; if the parsed data contains a URL, the URL in the parsed data is stored as a seed whose type is paging URL;
If the current type is paging URL, the web page data is parsed according to the rule corresponding to the paging URL to obtain parsed data; if the parsed data contains a URL, the URL in the parsed data is stored as a seed whose type is detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and the web page content in the parsed data is saved.
This embodiment classifies URLs so that URLs of different types can be parsed with different rules, which yields more accurate parsing results.
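The type progression (homepage, then paging, then detail page) can be sketched as follows. The type labels, the function names, and the use of one regular expression as a stand-in for a real per-type parsing rule are assumptions for illustration.

```python
import re

# Assumption: URLs found on a page of one type become seeds of the next type.
NEXT_TYPE = {"home": "paging", "paging": "detail"}


def extract_links(html: str):
    # Stand-in for a real rule; it simply collects href values.
    return re.findall(r'href="([^"]+)"', html)


def handle_page(seed_type, site_id, html, seed_store, content_store):
    if seed_type in NEXT_TYPE:
        # Homepage or paging page: store every parsed URL as a new typed seed.
        for url in extract_links(html):
            seed_store.append(
                {"url": url, "site_id": site_id, "type": NEXT_TYPE[seed_type]}
            )
    elif seed_type == "detail":
        # Detail page: save the page content itself.
        content_store.append(html)
    else:
        raise ValueError(f"unknown seed type: {seed_type}")
```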
Fig. 2 shows the structural block diagram of the single-computer crawler grabbing system of the present invention, which comprises:
A seed acquisition module 201, for obtaining at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
A strategy module 202, for obtaining at least one strategy and determining at least one crawler grabbing parameter according to the strategy;
A rule module 203, for obtaining the rule corresponding to the current type according to the current type;
A parsing module 204, for grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.
In one embodiment, in the parsing module 204, if an exception occurs while grabbing the web page data or while parsing the web page data, the exception is saved.
In one embodiment, the strategies include a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy, and/or a proxy IP switching strategy.
In one embodiment:
The seed infinite-loop handling strategy is specifically: the crawler grabbing parameter is to allow or refuse grabbing web page data from the current URL; if the exception is that the current URL is trapped in an infinite loop, the crawler grabbing parameter is set to refuse grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to allow grabbing web page data from the current URL;
The browser identifier switching strategy is specifically: the crawler grabbing parameter is the browser identifier used when grabbing web page data from the current URL; if the exception is a failure to grab web page data from the current URL, or a preset timer expires, the browser identifier used when grabbing web page data from the current URL is updated to another browser identifier; otherwise, the browser identifier used when grabbing web page data from the current URL is not updated;
The cookie dynamic update strategy is specifically: the crawler grabbing parameter is to allow or refuse updating the cookie when grabbing web page data from the current URL; if a preset timer expires, the crawler grabbing parameter is set to allow updating the cookie when grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the cookie when grabbing web page data from the current URL;
The proxy IP switching strategy is specifically: the crawler grabbing parameter is to allow or refuse updating the proxy IP when grabbing web page data from the current URL; if the exception is a failure to grab web page data from the current URL, or a preset timer expires, the crawler grabbing parameter is set to allow updating the proxy IP when grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the proxy IP when grabbing web page data from the current URL.
In one embodiment, in the parsing module 204, parsing the web page data according to the rule to obtain parsed data specifically comprises:
If the current type is homepage URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data; if the parsed data contains a URL, the URL in the parsed data is stored as a seed whose type is paging URL;
If the current type is paging URL, the web page data is parsed according to the rule corresponding to the paging URL to obtain parsed data; if the parsed data contains a URL, the URL in the parsed data is stored as a seed whose type is detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and the web page content in the parsed data is saved.
Fig. 3 shows the structural block diagram of a preferred embodiment of the single-computer crawler grabbing system of the present invention, which comprises a seed generation module 310, a grabbing module 320, and a data storage module 330.
The main function of the seed generation module 310 is to provide seeds to the grabbing module. A seed may be the URL of a website or the SKU of a commodity. Seeds can be stored in a text file or a database, and the grabbing module can obtain seeds in batches from the text file or database.
Each seed must have a virtual website number and a type. The virtual website number tells the grabbing module which rule file to use to parse the document, and the type field indicates what type the seed is: detail page URL, paging URL, or homepage URL.
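A seed record and a batch loader for a text file might look like the sketch below. The tab-separated layout, the type labels, and the names `Seed` and `load_seeds` are assumptions; the text only requires that each seed carry a URL (or SKU), a virtual website number, and a type.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical labels for homepage, paging, and detail page seeds.
VALID_TYPES = {"home", "paging", "detail"}


@dataclass(frozen=True)
class Seed:
    url: str
    site_id: int    # virtual website number, selects the rule file
    seed_type: str  # tells the parser what kind of page this is


def load_seeds(text: str):
    """Parse seeds in batch from tab-separated lines: url, site number, type."""
    seeds = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:  # skip blank lines
            continue
        url, site_id, seed_type = row
        if seed_type not in VALID_TYPES:
            raise ValueError(f"unknown seed type: {seed_type}")
        seeds.append(Seed(url, int(site_id), seed_type))
    return seeds
```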
The grabbing module 320 is the core of the whole single-computer crawler. It consists of a rule file management submodule 321, a document parsing submodule 322, a strategy management submodule 323, and an exception reporting submodule 324. The rule file management submodule 321 manages the document parsing rules of the various websites and provides those rules to the document parsing submodule 322. The document parsing submodule 322 obtains the rules of each website from the rule file management submodule 321 and uses them to parse documents and extract the information the user is interested in. The strategy management submodule 323 is the optimization submodule of the grabbing module. It can be composed of a series of strategy chains and, by analyzing the grabbing flow of the grabbing module, can prevent repeated crawling, avoid infinite-loop traps, extend the crawl duration, and so on. The exception reporting submodule 324 reports the problems that the grabbing module 320 encounters during grabbing and feeds them back to the user promptly, which avoids wasting resources.
After the grabbing module 320 obtains a seed, it analyzes the information reported by the exception reporting submodule, calls the corresponding strategy management submodule, and requests the web page. The strategy management submodule 323 contains a series of predefined strategies stored in multiple strategy chains, for example the seed infinite-loop handling strategy, the browser identifier switching strategy, the cookie dynamic update strategy, and the proxy IP switching strategy. These strategies help the grabbing module request web pages more efficiently.
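A strategy chain can be pictured as an ordered list of functions, each reading the reported exception information and contributing one grabbing parameter. The function names and the dictionary keys below are hypothetical; the patent does not specify how a chain is represented.

```python
def loop_trap_strategy(report):
    # Refuse the grab when the report says the current URL is in a loop.
    return {"grab": "refuse" if report.get("loop_detected") else "allow"}


def proxy_switch_strategy(report):
    # Allow a proxy update on grab failure or when the timer has expired.
    trigger = report.get("grab_failed") or report.get("timer_expired")
    return {"update_proxy": "allow" if trigger else "refuse"}


def run_chain(chain, report):
    """Apply each strategy in order and merge the parameters they produce."""
    params = {}
    for strategy in chain:
        params.update(strategy(report))
    return params
```

Keeping the chain as plain data means a new strategy can be added without touching the grabbing loop.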
After the web page information is obtained, the rule file management submodule 321 is called with the seed's virtual number to obtain the corresponding rule file, and the document parsing submodule 322 parses the document. Each seed has a type field that tells the document parsing submodule 322 what to parse: for a homepage URL, paging URLs are generally parsed out; for a paging URL, detail page URLs need to be parsed out; for a detail page URL, the content can be parsed out directly. Parsed results are saved separately. If what is parsed out is still a URL, it is tagged with its type and saved separately for later use by the grabbing module 320; if what is parsed out is content, it can be saved to a database or a text file for direct use by the user.
Exception reporting runs through the whole grabbing flow. Exceptions fall into two kinds: system-level errors and user-level errors. System-level errors are reported to the grabbing module 320; once the grabbing module receives an error of this kind, it calls the corresponding strategy management submodule 323 to optimize the grabbing process. User-level errors are those the system cannot handle and must be fed back to the user, for example when the parsing module fails to parse content.
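The two-way routing can be sketched as follows. The names `ErrorLevel` and `ExceptionReporter`, and the idea of passing the system-level handler in as a callback, are assumptions for illustration.

```python
from enum import Enum


class ErrorLevel(Enum):
    SYSTEM = "system"  # handled automatically by the strategy management side
    USER = "user"      # cannot be handled by the system, goes to the user


class ExceptionReporter:
    def __init__(self, on_system_error):
        # Assumption: the grabbing module registers a callback that
        # triggers the matching strategy for system-level errors.
        self._on_system_error = on_system_error
        self.user_reports = []

    def report(self, level: ErrorLevel, message: str) -> None:
        if level is ErrorLevel.SYSTEM:
            self._on_system_error(message)
        else:
            # Kept for feedback to the user, e.g. a parsing failure.
            self.user_reports.append(message)
```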
The data storage module 330 stores the various kinds of data obtained from the grabbing module. The data can be saved in a database or in documents, and this module can also supply data to the grabbing module 320.
Some data must be saved for reuse by the system, and some data can be used directly by the user.
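As one possible reading of this split, the sketch below keeps seeds (reused by the system) and page contents (used by the user) in two tables. SQLite, the table layout, and the function names are assumptions; the text only says that a database or documents may be used.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store with one table for seeds and one for page contents."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seeds "
        "(url TEXT PRIMARY KEY, site_id INTEGER, type TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contents (url TEXT PRIMARY KEY, body TEXT)"
    )
    return conn


def save_seed(conn, url, site_id, seed_type):
    # INSERT OR IGNORE skips URLs that are already stored.
    conn.execute(
        "INSERT OR IGNORE INTO seeds VALUES (?, ?, ?)", (url, site_id, seed_type)
    )


def pending_seeds(conn, seed_type):
    """Return stored URLs of one type, for later use by the grabbing module."""
    rows = conn.execute(
        "SELECT url FROM seeds WHERE type = ? ORDER BY url", (seed_type,)
    )
    return [row[0] for row in rows]
```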
The present invention divides the grabbing module of the single-computer crawler into submodules, which makes it highly extensible, and adds a strategy management submodule and an exception reporting submodule, which greatly optimize the whole grabbing flow.
The embodiments above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make variations and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent is defined by the appended claims.

Claims (10)

1. A single-computer crawler grabbing method, characterized by comprising:
Step (11), obtaining at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
Step (12), obtaining at least one strategy, and determining at least one crawler grabbing parameter according to the strategy;
Step (13), obtaining the rule corresponding to the current type according to the current type;
Step (14), grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.
2. The single-computer crawler grabbing method according to claim 1, characterized in that, in step (14), if an exception occurs while grabbing the web page data or while parsing the web page data, the exception is saved.
3. The single-computer crawler grabbing method according to claim 2, characterized in that the strategies include a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy, and/or a proxy IP switching strategy.
4. The single-computer crawler grabbing method according to claim 3, characterized in that:
The seed infinite-loop handling strategy is specifically: the crawler grabbing parameter is to allow or refuse grabbing web page data from the current URL; if the exception is that the current URL is trapped in an infinite loop, the crawler grabbing parameter is set to refuse grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to allow grabbing web page data from the current URL;
The browser identifier switching strategy is specifically: the crawler grabbing parameter is the browser identifier used when grabbing web page data from the current URL; if the exception is a failure to grab web page data from the current URL, or a preset timer expires, the browser identifier used when grabbing web page data from the current URL is updated to another browser identifier; otherwise, the browser identifier used when grabbing web page data from the current URL is not updated;
The cookie dynamic update strategy is specifically: the crawler grabbing parameter is to allow or refuse updating the cookie when grabbing web page data from the current URL; if a preset timer expires, the crawler grabbing parameter is set to allow updating the cookie when grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the cookie when grabbing web page data from the current URL;
The proxy IP switching strategy is specifically: the crawler grabbing parameter is to allow or refuse updating the proxy IP when grabbing web page data from the current URL; if the exception is a failure to grab web page data from the current URL, or a preset timer expires, the crawler grabbing parameter is set to allow updating the proxy IP when grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the proxy IP when grabbing web page data from the current URL.
5. The single-computer crawler grabbing method according to claim 1, characterized in that, in step (14), parsing the web page data according to the rule to obtain parsed data specifically comprises:
If the current type is homepage URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data; if the parsed data contains a URL, the URL in the parsed data is stored as a seed whose type is paging URL;
If the current type is paging URL, the web page data is parsed according to the rule corresponding to the paging URL to obtain parsed data; if the parsed data contains a URL, the URL in the parsed data is stored as a seed whose type is detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and the web page content in the parsed data is saved.
6. A single-computer crawler grabbing system, characterized by comprising:
A seed acquisition module, for obtaining at least one seed comprising a URL, a website number, and a type, taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
A strategy module, for obtaining at least one strategy and determining at least one crawler grabbing parameter according to the strategy;
A rule module, for obtaining the rule corresponding to the current type according to the current type;
A parsing module, for grabbing web page data from the current URL according to the crawler grabbing parameter, and parsing the web page data according to the rule to obtain parsed data.
7. unit crawler capturing system according to claim 6, is characterized in that, in described parsing module, if capturing described web data or occurring abnormal conditions in analyzing described web data, then preserve described abnormal conditions.
8. unit crawler capturing system according to claim 7, it is characterized in that, described strategy comprises: seed is absorbed in endless loop processing policy, browser mark switchover policy, cookie dynamically update strategy and/or Agent IP switchover policy.
9. The single-computer crawler grabbing system according to claim 8, characterized in that:
The seed infinite-loop handling strategy is specifically: the crawler grabbing parameter is whether grabbing web page data from the current URL is allowed or refused; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawler grabbing parameter is set to refuse grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to allow grabbing web page data from the current URL;
The browser identifier switching strategy is specifically: the crawler grabbing parameter is the browser identifier used when grabbing web page data from the current URL; if the abnormal condition is a failure to grab web page data from the current URL, or a preset time has been reached, the browser identifier used when grabbing web page data from the current URL is changed to another browser identifier; otherwise, the browser identifier used when grabbing web page data from the current URL is not changed;
The cookie dynamic update strategy is specifically: the crawler grabbing parameter is whether updating the cookie is allowed or refused when grabbing web page data from the current URL; if a preset time has been reached, the crawler grabbing parameter is set to allow updating the cookie when grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the cookie when grabbing web page data from the current URL;
The proxy IP switching strategy is specifically: the crawler grabbing parameter is whether updating the proxy IP is allowed or refused when grabbing web page data from the current URL; if the abnormal condition is a failure to grab web page data from the current URL, or a preset time has been reached, the crawler grabbing parameter is set to allow updating the proxy IP when grabbing web page data from the current URL; otherwise, the crawler grabbing parameter is set to refuse updating the proxy IP when grabbing web page data from the current URL.
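For illustration only, the four strategies of this claim can be read as a decision table that maps the observed state to the crawler grabbing parameters. The sketch below makes several assumptions that are not in the claim: the state labels `"loop"` and `"fetch_failed"`, a single shared timer flag, the placeholder identifier list `USER_AGENTS`, and a round-robin choice of "another" browser identifier.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder browser identifiers (hypothetical values).
USER_AGENTS = ["UA-A", "UA-B", "UA-C"]


@dataclass
class CrawlState:
    abnormal: Optional[str] = None  # "loop", "fetch_failed", or None
    timer_expired: bool = False     # the preset time has been reached


@dataclass
class FetchParams:
    allow_fetch: bool
    user_agent: str
    update_cookie: bool
    update_proxy: bool


def apply_strategies(state: CrawlState, current_ua: str) -> FetchParams:
    """Derive the grabbing parameters from the current state."""
    # Browser identifier and proxy IP share the same trigger condition.
    failed_or_timed = state.abnormal == "fetch_failed" or state.timer_expired

    user_agent = current_ua
    if failed_or_timed:
        # Round-robin is an assumption; the claim only says "another" identifier.
        nxt = (USER_AGENTS.index(current_ua) + 1) % len(USER_AGENTS)
        user_agent = USER_AGENTS[nxt]

    return FetchParams(
        allow_fetch=state.abnormal != "loop",  # seed infinite-loop handling
        user_agent=user_agent,                 # browser identifier switching
        update_cookie=state.timer_expired,     # cookie dynamic update
        update_proxy=failed_or_timed,          # proxy IP switching
    )
```

Note that, as recited, the cookie strategy reacts only to the preset time, while the browser identifier and proxy IP strategies also react to a grabbing failure.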
10. The single-computer crawler grabbing system according to claim 6, characterized in that, in the parsing module, parsing the web page data according to the rule to obtain parsed data specifically comprises:
If the current type is home page URL, parsing the web page data according to the rule corresponding to the home page URL to obtain parsed data, and, if the parsed data contains a URL, storing the URL in the parsed data as a seed whose type is paging URL;
If the current type is paging URL, parsing the web page data according to the rule corresponding to the paging URL to obtain parsed data, and, if the parsed data contains a URL, storing the URL in the parsed data as a seed whose type is detail page URL;
If the current type is detail page URL, parsing the web page data according to the rule corresponding to the detail page URL to obtain parsed data, and saving the web page content in the parsed data.
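For illustration only, the three branches of this claim can be sketched as a dispatch on the URL type, where home page and paging results become seeds of the next type and detail page results are saved as content. The regular expressions in `RULES`, the URL path shapes, the type labels, and the in-memory `seeds` and `saved` lists are all assumptions of this example; the claim does not define what a rule looks like.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical per-type rules: each one extracts what matters for that type.
RULES: Dict[str, Pattern[str]] = {
    "home": re.compile(r'href="(/list/[^"]+)"'),    # links to paging URLs
    "paging": re.compile(r'href="(/item/[^"]+)"'),  # links to detail pages
    "detail": re.compile(r"<h1>(.*?)</h1>"),        # the content to keep
}

# Type assigned to URLs discovered on a page of the given type.
NEXT_TYPE = {"home": "paging", "paging": "detail"}


def handle_page(
    url_type: str,
    page: str,
    seeds: List[Tuple[str, str]],
    saved: List[str],
) -> List[str]:
    """Parse one page with the rule for its type and route the result."""
    parsed = RULES[url_type].findall(page)  # KeyError for an unknown type
    if url_type in NEXT_TYPE:
        # Home page and paging results are stored as seeds of the next type.
        seeds.extend((url, NEXT_TYPE[url_type]) for url in parsed)
    else:
        # Detail page results are saved as web page content.
        saved.extend(parsed)
    return parsed
```

A usage example under the same assumptions:

```python
seeds, saved = [], []
handle_page("home", '<a href="/list/1">page 1</a>', seeds, saved)
handle_page("paging", '<a href="/item/42">item</a>', seeds, saved)
handle_page("detail", "<h1>Widget</h1>", seeds, saved)
```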
CN201410458191.6A 2014-09-10 2014-09-10 Single-computer crawler grabbing method and system Active CN104252530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410458191.6A CN104252530B (en) 2014-09-10 2014-09-10 Single-computer crawler grabbing method and system

Publications (2)

Publication Number Publication Date
CN104252530A (en) 2014-12-31
CN104252530B (en) 2017-09-15

Family

ID=52187420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410458191.6A Active CN104252530B (en) 2014-09-10 2014-09-10 Single-computer crawler grabbing method and system

Country Status (1)

Country Link
CN (1) CN104252530B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178736A (en) * 2007-12-11 2008-05-14 腾讯科技(深圳)有限公司 Web page collecting method and web page collecting server
CN102347930A (en) * 2010-07-26 2012-02-08 中国电信股份有限公司 Method and system for obtaining webpage content
CN102254014A (en) * 2011-07-21 2011-11-23 华中科技大学 Adaptive information extraction method for webpage characteristics
CN103279507A (en) * 2013-05-16 2013-09-04 北京尚友通达信息技术有限公司 Webpage spider operational method and system
CN103942309A (en) * 2014-04-18 2014-07-23 乐得科技有限公司 Network data acquisition device and method and implementation method of acquisition process

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989151B (en) * 2015-03-02 2019-09-06 阿里巴巴集团控股有限公司 Webpage capture method and device
CN105989151A (en) * 2015-03-02 2016-10-05 阿里巴巴集团控股有限公司 Webpage crawling method and apparatus
CN106021257A (en) * 2015-12-31 2016-10-12 广州华多网络科技有限公司 Method, device, and system for crawler to capture data supporting online programming
CN106021257B (en) * 2015-12-31 2019-10-18 广州华多网络科技有限公司 A kind of crawler capturing data method, apparatus and system for supporting online programming
CN107045507B (en) * 2016-02-05 2020-08-21 北京国双科技有限公司 Webpage crawling method and device
CN107045507A (en) * 2016-02-05 2017-08-15 北京国双科技有限公司 Web page crawl method and device
CN105956175A (en) * 2016-05-24 2016-09-21 考拉征信服务有限公司 Webpage content crawling method and device
CN107451046A (en) * 2016-05-30 2017-12-08 腾讯科技(深圳)有限公司 Method and terminal for detecting threads
CN107451046B (en) * 2016-05-30 2020-11-17 腾讯科技(深圳)有限公司 Method and terminal for detecting threads
CN107957939A (en) * 2016-10-14 2018-04-24 北京京东尚科信息技术有限公司 Webpage interactive interface test method and system
CN107957939B (en) * 2016-10-14 2021-02-26 北京京东尚科信息技术有限公司 Webpage interaction interface testing method and system
CN106599270A (en) * 2016-12-23 2017-04-26 浙江省公众信息产业有限公司 Network data capturing method and crawler
CN108536788A (en) * 2018-03-29 2018-09-14 合肥俊刚机械科技有限公司 Data capture method and system based on a distributed crawler
CN109213912A (en) * 2018-08-16 2019-01-15 北京神州泰岳软件股份有限公司 Method for crawling network data and network data crawling scheduling device
CN111881337A (en) * 2020-08-06 2020-11-03 成都信息工程大学 Data acquisition method and system based on Scapy framework and storage medium
CN112528120A (en) * 2020-12-21 2021-03-19 北京中安智达科技有限公司 Method for web data crawler to use browser to divide body and proxy

Also Published As

Publication number Publication date
CN104252530B (en) 2017-09-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant