CN104252530A - Single-computer crawler grabbing method and system - Google Patents
- Publication number
- CN104252530A
- Authority
- CN
- China
- Prior art keywords
- url
- web data
- capturing
- data
- described current
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Abstract
The invention discloses a single-computer crawler grabbing method and system. The method includes: acquiring at least one seed comprising a URL (uniform resource locator), a website number and a type, and taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type; acquiring at least one strategy, and determining at least one crawl parameter according to the strategy; acquiring the rule corresponding to the current type; and fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data. Because the crawl parameters are determined by strategies, problems that arise during crawling can be addressed promptly, which improves working efficiency, extends the time for which the crawler can keep crawling, and makes the method and system suitable for websites of various types.
Description
Technical field
The present invention relates to web crawler technology, and in particular to a single-computer crawler grabbing method and system.
Background
The Internet holds massive amounts of data and information. Converting these data and information into what one actually needs, and then processing and analyzing them, is a rather thorny matter. The emergence of web crawlers addresses these problems.
Most existing crawlers simply implement the function of crawling web pages and do not handle well such aspects as repeated crawling, falling into infinite-loop traps, and countering anti-crawling measures (extending the crawl time). In addition, existing single-computer crawlers have poor website compatibility and cannot meet the need to crawl multiple websites at the same time.
Summary of the invention
In view of this, it is necessary to provide a single-computer crawler grabbing method and system that address the technical problems of existing single-computer web crawler mechanisms in the prior art: low working efficiency, short crawl time, and the inability to crawl multiple types of websites at the same time.
A single-computer crawler grabbing method, comprising:
acquiring at least one seed comprising a URL, a website number and a type, and taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
acquiring at least one strategy, and determining at least one crawl parameter according to the strategy;
acquiring, according to the current type, the rule corresponding to the current type;
fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.
A single-computer crawler grabbing system, comprising:
a seed receiving module, for acquiring at least one seed comprising a URL, a website number and a type, and taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
a strategy module, for acquiring at least one strategy, and determining at least one crawl parameter according to the strategy;
a rule module, for acquiring, according to the current type, the rule corresponding to the current type;
a parsing module, for fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.
The present invention determines the crawl parameters by means of strategies so that problems arising during crawling are overcome promptly, which improves working efficiency, extends the crawl time, and adapts to multiple types of websites.
Brief description of the drawings
Fig. 1 is a workflow diagram of the single-computer crawler grabbing method of the present invention;
Fig. 2 is a structural module diagram of the single-computer crawler grabbing system of the present invention;
Fig. 3 is a structural module diagram of the preferred embodiment of the single-computer crawler grabbing system of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 shows the workflow of the single-computer crawler grabbing method of the present invention, which comprises:
Step 11, acquiring at least one seed comprising a URL, a website number and a type, and taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
Step 12, acquiring at least one strategy, and determining at least one crawl parameter according to the strategy;
Step 13, acquiring, according to the current type, the rule corresponding to the current type;
Step 14, fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.
The strategy in step 12 is used to determine the crawl parameters: different strategies determine different crawl parameters, and in step 14 the web page data is fetched with the crawl parameters determined in step 12. Because the crawl parameters are determined by the strategies of step 12, different strategies can be set to meet different crawling needs, which improves working efficiency, extends the crawl time, and adapts to multiple types of websites.
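Steps 11 to 14 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the names `Seed`, `CrawlParams` and `run_seed`, the representation of strategies and rules as plain callables, and the placeholder user-agent value are all hypothetical, and the fetch function is passed in so that the sketch needs no network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Seed:
    url: str
    site_id: int    # website number
    seed_type: str  # for example "homepage", "pagination" or "detail"


@dataclass
class CrawlParams:
    allow_fetch: bool = True
    user_agent: str = "placeholder-agent"  # hypothetical default value


def run_seed(
    seed: Seed,
    strategies: List[Callable[[Seed, CrawlParams], None]],
    rules: Dict[str, Callable[[str], dict]],
    fetch: Callable[[str, CrawlParams], str],
) -> Optional[dict]:
    # Step 11: the seed supplies the current URL and the current type.
    current_url, current_type = seed.url, seed.seed_type
    # Step 12: each strategy adjusts the crawl parameters in place.
    params = CrawlParams()
    for strategy in strategies:
        strategy(seed, params)
    # Step 13: look up the rule that corresponds to the current type.
    rule = rules[current_type]
    # Step 14: fetch with the chosen parameters, then parse with the rule.
    if not params.allow_fetch:
        return None
    return rule(fetch(current_url, params))
```

Passing the strategies in as a list keeps the flow itself unchanged when a strategy is added or removed, which matches the role the description gives to strategies in step 12.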
In one embodiment, in step 14, if an abnormal condition occurs while fetching the web page data or while parsing the web page data, the abnormal condition is saved.
By monitoring the abnormal conditions that occur while fetching or parsing the web page data, abnormal conditions can be fed back to the user promptly, preventing wasted resources.
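Saving abnormal conditions during fetching and parsing might look like the sketch below. The record type `CrawlException`, the function `fetch_and_parse` and the in-memory list used as the store are assumptions made for illustration; the description does not specify how or where abnormal conditions are saved.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CrawlException:
    url: str
    stage: str    # "fetch" or "parse"
    message: str


def fetch_and_parse(
    url: str,
    fetch: Callable[[str], str],
    parse: Callable[[str], dict],
    log: List[CrawlException],
) -> Optional[dict]:
    # Record a failure while fetching the web page data.
    try:
        page = fetch(url)
    except Exception as exc:
        log.append(CrawlException(url, "fetch", str(exc)))
        return None
    # Record a failure while parsing the web page data.
    try:
        return parse(page)
    except Exception as exc:
        log.append(CrawlException(url, "parse", str(exc)))
        return None
```

Recording the stage alongside the URL lets a later step tell a fetch failure from a parse failure.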
In one embodiment, the strategy includes: a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy and/or a proxy IP switching strategy.
Among the strategies of this embodiment, the seed infinite-loop handling strategy is used to prevent repeated crawling and falling into infinite-loop traps, while the browser identifier switching strategy, the cookie dynamic update strategy and/or the proxy IP switching strategy can extend the crawl time.
In one embodiment:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether to allow or refuse fetching web page data from the current URL; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse fetching web page data from the current URL; otherwise, the crawl parameter is set to allow fetching web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when fetching web page data from the current URL; if the abnormal condition is a failure to fetch web page data from the current URL, or a preset time is reached, the browser identifier used when fetching web page data from the current URL is updated to another browser identifier; otherwise, the browser identifier used when fetching web page data from the current URL is not updated;
The cookie dynamic update strategy is specifically: the crawl parameter is whether to allow or refuse updating the cookie when fetching web page data from the current URL; if a preset time is reached, the crawl parameter is set to allow updating the cookie when fetching web page data from the current URL; otherwise, the crawl parameter is set to refuse updating the cookie when fetching web page data from the current URL;
The proxy IP switching strategy is specifically: the crawl parameter is whether to allow or refuse updating the proxy IP when fetching web page data from the current URL; if the abnormal condition is a failure to fetch web page data from the current URL, or a preset time is reached, the crawl parameter is set to allow updating the proxy IP when fetching web page data from the current URL; otherwise, the crawl parameter is set to refuse updating the proxy IP when fetching web page data from the current URL.
This embodiment further describes the seed infinite-loop handling strategy, the browser identifier switching strategy, the cookie dynamic update strategy and the proxy IP switching strategy. Of these, the seed infinite-loop handling strategy, the browser identifier switching strategy and the proxy IP switching strategy adjust the crawl parameters according to abnormal conditions (the latter two also respond when a preset time is reached), whereas the cookie dynamic update strategy adjusts the crawl parameter purely by timed updates.
Specifically, the seed infinite-loop handling strategy mainly addresses infinite-loop traps on websites. After the crawler fetches web page data from a URL, it parses new URLs out of that web page data and then fetches new web page data from the new URLs. Some websites, however, set infinite-loop traps: the new URL parsed from the web page data is a URL that already exists, so the crawler falls into an infinite loop and crawling is impaired. When the seed infinite-loop handling strategy detects the abnormal condition that the current URL has fallen into an infinite loop, it sets the crawl parameter to refuse fetching web page data from the current URL, thereby avoiding the infinite loop.
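One simple way to detect that a parsed URL is an already existing URL is to remember every URL that has been allowed so far. The sketch below assumes this approach; the class name `LoopGuard`, the in-memory set, and the choice to ignore the fragment part of a URL when comparing are illustrative assumptions rather than details given in the description.

```python
from urllib.parse import urldefrag


class LoopGuard:
    """Refuses a URL that has already been seen (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen = set()

    def allow_fetch(self, url: str) -> bool:
        # Drop the "#fragment" part so that two URLs pointing at the same
        # page are treated as the same URL.
        key = urldefrag(url).url
        if key in self._seen:
            return False  # crawl parameter: refuse fetching
        self._seen.add(key)
        return True       # crawl parameter: allow fetching
```

A set held in memory is enough for a single computer and a moderate number of URLs; a persistent store would be needed if the seen URLs had to survive a restart.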
Specifically, the browser identifier switching strategy is used to imitate user behavior as closely as possible. Different users may use different browsers, so imitating user behavior requires changing the browser type or version. The type and version of a browser are indicated by a browser identifier (for example, user-agent). When crawling, the crawler simulates a virtual browser that is distinguished by its user-agent; the value of user-agent is determined by the browser type and version number, so changing the value of user-agent is equivalent to switching browsers. Therefore, when the abnormal condition of a failure to fetch web page data from the current URL is detected, or a preset time is reached, the browser identifier is changed in order to extend the crawl time of the crawler.
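The switching condition (a fetch failure, or a preset time being reached) can be sketched as below. The class `UserAgentSwitcher`, the round-robin order and the explicit `now` argument are assumptions for illustration; the user-agent strings used with it would be supplied by the operator, and the time is passed in so that the behavior can be checked without waiting.

```python
import itertools
from typing import Sequence


class UserAgentSwitcher:
    """Rotates the browser identifier on failure or on a timer."""

    def __init__(self, user_agents: Sequence[str], interval_seconds: float, now: float) -> None:
        self._cycle = itertools.cycle(user_agents)
        self.current = next(self._cycle)
        self._interval = interval_seconds
        self._last_switch = now

    def update(self, fetch_failed: bool, now: float) -> str:
        # Switch when fetching failed or the preset time has been reached;
        # otherwise keep the current browser identifier.
        if fetch_failed or now - self._last_switch >= self._interval:
            self.current = next(self._cycle)
            self._last_switch = now
        return self.current
```

In a running crawler the `now` value would typically come from `time.monotonic()`, and the returned string would be sent as the `User-Agent` request header.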
Specifically, the cookie dynamic update strategy is implemented mainly by timed updates: when a preset time is reached, the cookie is updated. Updating the cookie is equivalent to establishing a new session with the website whose web page data is being crawled, which extends the crawl time.
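A timed cookie update can be sketched as follows. The class `CookieRefresher` and the `new_session` callable (which stands in for whatever request obtains a fresh cookie from the website) are hypothetical; only the timing rule comes from the description.

```python
from typing import Callable, Dict


class CookieRefresher:
    """Replaces the cookie once the preset time has been reached."""

    def __init__(self, new_session: Callable[[], Dict[str, str]], interval_seconds: float, now: float) -> None:
        self._new_session = new_session  # returns a fresh cookie mapping
        self._interval = interval_seconds
        self._last_refresh = now
        self.cookies = new_session()

    def cookies_for_request(self, now: float) -> Dict[str, str]:
        # Allow the update only when the preset time is reached;
        # otherwise the existing cookie is reused.
        if now - self._last_refresh >= self._interval:
            self.cookies = self._new_session()
            self._last_refresh = now
        return self.cookies
```

Because the refresh is driven only by elapsed time, this strategy needs no information from the abnormal conditions recorded during crawling.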
Specifically, the proxy IP switching strategy mainly addresses the situation in which a website blocks an IP (a network address, for example an IPv4 or IPv6 address; usually an IPv4 address of the form XXX.XXX.XXX.XXX) that has been fetching web page data for a long time. Because a single computer generally has only one IP, single-computer crawling is carried out through proxy IPs. When the proxy IP switching strategy detects the abnormal condition of a failure to fetch web page data from the current URL, or a preset time is reached, it changes the proxy IP to avoid being blocked.
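The allow-or-refuse form of this crawl parameter can be sketched as two small functions. The names `proxy_switch_allowed` and `next_proxy`, the round-robin choice of the next proxy, and the example addresses are assumptions; the description states only when an update of the proxy IP is allowed.

```python
from typing import List, Tuple


def proxy_switch_allowed(fetch_failed: bool, seconds_since_switch: float, interval: float) -> bool:
    # Allow an update when fetching failed or the preset time is reached.
    return fetch_failed or seconds_since_switch >= interval


def next_proxy(proxies: List[str], index: int, allowed: bool) -> Tuple[int, str]:
    # Move to the next proxy in the list only when the update is allowed;
    # wrap around at the end of the list.
    if allowed:
        index = (index + 1) % len(proxies)
    return index, proxies[index]
```

Separating the decision from the selection mirrors the description, in which the strategy sets a parameter and the fetch step acts on it.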
In one embodiment, in step 14, parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is homepage URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and if the parsed data includes a URL, the URL in the parsed data is stored as a seed whose type is pagination URL;
If the current type is pagination URL, the web page data is parsed according to the rule corresponding to the pagination URL to obtain parsed data, and if the parsed data includes a URL, the URL in the parsed data is stored as a seed whose type is detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and the web page content in the parsed data is saved.
This embodiment classifies URLs so that URLs of different types can be parsed with different rules, yielding more accurate parsing results.
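The progression from homepage to pagination to detail page can be sketched as a small dispatch on the seed type. In this sketch the type names, the `NEXT_TYPE` table, the convention that a rule returns a dictionary with a `"urls"` or `"content"` key, and the regular-expression link extractor are all assumptions made for illustration, and rules are simply looked up by the key passed in.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Type assigned to URLs found on a page of the given type (assumed names).
NEXT_TYPE = {"homepage": "pagination", "pagination": "detail"}


def extract_links(page: str) -> List[str]:
    # Deliberately naive link extraction, sufficient for the sketch.
    return re.findall(r'href="([^"]+)"', page)


def parse_by_type(
    seed_type: str,
    page: str,
    rules: Dict[str, Callable[[str], dict]],
) -> Tuple[List[Tuple[str, str]], Optional[str]]:
    parsed = rules[seed_type](page)
    if seed_type in NEXT_TYPE:
        # URLs found here become new seeds tagged with the next type.
        new_seeds = [(url, NEXT_TYPE[seed_type]) for url in parsed.get("urls", [])]
        return new_seeds, None
    # A detail page yields content to be saved, not further seeds.
    return [], parsed.get("content")
```

Returning the new seeds together with their type keeps the type tag attached to each URL from the moment it is discovered.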
Fig. 2 shows the structural module diagram of the single-computer crawler grabbing system of the present invention, which comprises:
a seed receiving module 201, for acquiring at least one seed comprising a URL, a website number and a type, and taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
a strategy module 202, for acquiring at least one strategy, and determining at least one crawl parameter according to the strategy;
a rule module 203, for acquiring, according to the current type, the rule corresponding to the current type;
a parsing module 204, for fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.
In one embodiment, in the parsing module 204, if an abnormal condition occurs while fetching the web page data or while parsing the web page data, the abnormal condition is saved.
In one embodiment, the strategy includes: a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy and/or a proxy IP switching strategy.
In one embodiment:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether to allow or refuse fetching web page data from the current URL; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse fetching web page data from the current URL; otherwise, the crawl parameter is set to allow fetching web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when fetching web page data from the current URL; if the abnormal condition is a failure to fetch web page data from the current URL, or a preset time is reached, the browser identifier used when fetching web page data from the current URL is updated to another browser identifier; otherwise, the browser identifier used when fetching web page data from the current URL is not updated;
The cookie dynamic update strategy is specifically: the crawl parameter is whether to allow or refuse updating the cookie when fetching web page data from the current URL; if a preset time is reached, the crawl parameter is set to allow updating the cookie when fetching web page data from the current URL; otherwise, the crawl parameter is set to refuse updating the cookie when fetching web page data from the current URL;
The proxy IP switching strategy is specifically: the crawl parameter is whether to allow or refuse updating the proxy IP when fetching web page data from the current URL; if the abnormal condition is a failure to fetch web page data from the current URL, or a preset time is reached, the crawl parameter is set to allow updating the proxy IP when fetching web page data from the current URL; otherwise, the crawl parameter is set to refuse updating the proxy IP when fetching web page data from the current URL.
In one embodiment, in the parsing module 204, parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is homepage URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and if the parsed data includes a URL, the URL in the parsed data is stored as a seed whose type is pagination URL;
If the current type is pagination URL, the web page data is parsed according to the rule corresponding to the pagination URL to obtain parsed data, and if the parsed data includes a URL, the URL in the parsed data is stored as a seed whose type is detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and the web page content in the parsed data is saved.
Fig. 3 shows the structural module diagram of the preferred embodiment of the single-computer crawler grabbing system of the present invention, which comprises a seed generation module 310, a crawling module 320 and a data storage module 330.
The main role of the seed generation module 310 is to provide seeds to the crawling module. A seed may be the URL of a website or the SKU of a product. Seeds may be kept in a text file or a database, and the crawling module can obtain seeds in batches from the text file or database.
Each seed must have a virtual website number and a type. The virtual website number tells the crawling module which rule file to call to parse the document, and the type field indicates which type the seed belongs to: detail page URL, pagination URL or homepage URL.
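Reading seeds in a batch from a text file might be sketched as below. The tab-separated layout (URL, website number, type), the `#` comment convention, the type names and the function `load_seeds` are assumptions; the description says only that seeds may be kept in a text file or a database.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO

VALID_TYPES = {"homepage", "pagination", "detail"}  # assumed type names


@dataclass(frozen=True)
class Seed:
    url: str
    site_id: int    # virtual website number
    seed_type: str


def load_seeds(stream: TextIO) -> List[Seed]:
    seeds = []
    for row in csv.reader(stream, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        url, site_id, seed_type = row
        if seed_type not in VALID_TYPES:
            raise ValueError(f"unknown seed type: {seed_type}")
        seeds.append(Seed(url, int(site_id), seed_type))
    return seeds


# Usage with an in-memory stand-in for the seed text file.
sample = io.StringIO(
    "# url\tsite\ttype\n"
    "http://example.com/\t1\thomepage\n"
    "\n"
    "http://example.com/item/7\t1\tdetail\n"
)
seeds = load_seeds(sample)
```

Rejecting an unknown type at load time means a malformed seed is caught before any request is made.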
The crawling module 320 is the core of the whole single-computer crawler. It consists of a rule file management submodule 321, a document parsing submodule 322, a strategy management submodule 323 and an exception reporting submodule 324. The main role of the rule file management submodule 321 is to manage the document parsing rules of the various websites and to provide parsing rules to the document parsing submodule 322. The document parsing submodule 322 obtains the rules of each website from the rule file management submodule 321 and parses documents with these rules to obtain the information the user is interested in. The strategy management submodule 323, as the optimization submodule of the crawling module, may consist of a series of strategy management chains; by analyzing the crawl flow of the crawling module, it can be used to prevent repeated crawling and falling into infinite-loop traps, to extend the crawl time, and so on. The exception reporting submodule 324 reports the various problems that the crawling module 320 encounters during crawling and feeds them back to the user promptly, preventing wasted resources.
After the crawling module 320 obtains a seed, it analyzes the information reported by the exception reporting submodule, calls the corresponding strategy management submodule and requests the web page. The strategy management submodule 323 contains a series of predefined strategies, which are kept in multiple strategy chains: for example, the seed infinite-loop handling strategy, the browser identifier switching strategy, the cookie dynamic update strategy and the proxy IP switching strategy. These strategies make the crawling module more efficient when requesting web pages.
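A strategy chain can be sketched as an ordered list of functions, each of which reads the reported information and returns updated crawl parameters. The dictionary keys (`in_loop`, `fetch_failed`, `timer_elapsed`, `allow_fetch`, `switch_proxy`) and the function names are hypothetical; the description does not define the data format of a chain.

```python
from typing import Callable, Dict, List

Params = Dict[str, object]
Strategy = Callable[[dict, Params], Params]


def loop_strategy(report: dict, params: Params) -> Params:
    # Refuse fetching when the report says the URL is in an infinite loop.
    return {**params, "allow_fetch": not report.get("in_loop", False)}


def proxy_strategy(report: dict, params: Params) -> Params:
    # Allow a proxy switch on fetch failure or when the timer has elapsed.
    switch = report.get("fetch_failed", False) or report.get("timer_elapsed", False)
    return {**params, "switch_proxy": switch}


def apply_chain(chain: List[Strategy], report: dict, params: Params) -> Params:
    # Run every strategy in order; each one sees the previous result.
    for strategy in chain:
        params = strategy(report, params)
    return params
```

Keeping each strategy as an independent function lets several chains share the same strategies in different orders.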
After the web page information is obtained, the rule file management submodule 321 is called with the virtual number of the seed to obtain the corresponding rule file, and the document parsing submodule 322 parses the document. Each seed has a type field that tells the document parsing submodule 322 what to parse: for a homepage URL, pagination URLs are generally parsed out; for a pagination URL, detail page URLs need to be parsed out; for a detail page URL, the content can be parsed out directly. Parsed results are stored separately: if what is parsed out is still a URL, it is tagged with a type and saved separately for subsequent use by the crawling module 320; if what is parsed out is content, it can be saved in a database or text file for direct use by the user.
Exception reporting runs through the whole crawl flow. Errors fall into two kinds: system-level errors and user-level errors. System-level errors should be reported to the crawling module 320; once the crawling module receives an error of this kind, it calls the corresponding strategy management submodule 323 to optimize the crawl process. User-level errors are those that the system cannot handle and that must be fed back to the user, for example an error when the parsing module parses content.
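The split between the two kinds of error can be sketched with two exception classes and a routing function. The class names, the callbacks and the returned labels are assumptions introduced for illustration only.

```python
from typing import Callable


class SystemLevelError(Exception):
    """An error the crawling module can react to, such as a failed fetch."""


class UserLevelError(Exception):
    """An error the system cannot handle, such as a parsing failure."""


def route_error(
    error: Exception,
    apply_strategies: Callable[[Exception], None],
    notify_user: Callable[[Exception], None],
) -> str:
    # System-level errors go to strategy management; everything else
    # is fed back to the user.
    if isinstance(error, SystemLevelError):
        apply_strategies(error)
        return "handled"
    notify_user(error)
    return "reported"
```

Treating any error that is not explicitly system-level as one for the user is a conservative default chosen for this sketch.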
The data storage module 330 stores the various data obtained from the crawling module. These data can be kept in a database or in documents, and this module can also provide data to the crawling module 320.
Some data need to be saved for reuse by the system, while other data can be used directly by the user.
The present invention divides the crawling module of the single-computer crawler into submodules, which gives very good extensibility, and adds the strategy management submodule and the exception reporting submodule, which greatly optimizes the whole crawl flow.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art can make a number of variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.
Claims (10)
1. A single-computer crawler grabbing method, characterized by comprising:
Step (11), acquiring at least one seed comprising a URL, a website number and a type, and taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
Step (12), acquiring at least one strategy, and determining at least one crawl parameter according to the strategy;
Step (13), acquiring, according to the current type, the rule corresponding to the current type;
Step (14), fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.
2. The single-computer crawler grabbing method according to claim 1, characterized in that, in step (14), if an abnormal condition occurs while fetching the web page data or while parsing the web page data, the abnormal condition is saved.
3. The single-computer crawler grabbing method according to claim 2, characterized in that the strategy includes: a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy and/or a proxy IP switching strategy.
4. The single-computer crawler grabbing method according to claim 3, characterized in that:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether to allow or refuse fetching web page data from the current URL; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse fetching web page data from the current URL; otherwise, the crawl parameter is set to allow fetching web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when fetching web page data from the current URL; if the abnormal condition is a failure to fetch web page data from the current URL, or a preset time is reached, the browser identifier used when fetching web page data from the current URL is updated to another browser identifier; otherwise, the browser identifier used when fetching web page data from the current URL is not updated;
The cookie dynamic update strategy is specifically: the crawl parameter is whether to allow or refuse updating the cookie when fetching web page data from the current URL; if a preset time is reached, the crawl parameter is set to allow updating the cookie when fetching web page data from the current URL; otherwise, the crawl parameter is set to refuse updating the cookie when fetching web page data from the current URL;
The proxy IP switching strategy is specifically: the crawl parameter is whether to allow or refuse updating the proxy IP when fetching web page data from the current URL; if the abnormal condition is a failure to fetch web page data from the current URL, or a preset time is reached, the crawl parameter is set to allow updating the proxy IP when fetching web page data from the current URL; otherwise, the crawl parameter is set to refuse updating the proxy IP when fetching web page data from the current URL.
5. The single-computer crawler grabbing method according to claim 1, characterized in that, in step (14), parsing the web page data according to the rule to obtain parsed data specifically includes:
If the current type is homepage URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and if the parsed data includes a URL, the URL in the parsed data is stored as a seed whose type is pagination URL;
If the current type is pagination URL, the web page data is parsed according to the rule corresponding to the pagination URL to obtain parsed data, and if the parsed data includes a URL, the URL in the parsed data is stored as a seed whose type is detail page URL;
If the current type is detail page URL, the web page data is parsed according to the rule corresponding to the homepage URL to obtain parsed data, and the web page content in the parsed data is saved.
6. A single-computer crawler grabbing system, characterized by comprising:
a seed receiving module, for acquiring at least one seed comprising a URL, a website number and a type, and taking the URL of the seed as the current URL, the website number of the seed as the current website number, and the type of the seed as the current type;
a strategy module, for acquiring at least one strategy, and determining at least one crawl parameter according to the strategy;
a rule module, for acquiring, according to the current type, the rule corresponding to the current type;
a parsing module, for fetching web page data from the current URL according to the crawl parameter, and parsing the web page data according to the rule to obtain parsed data.
7. The single-computer crawler grabbing system according to claim 6, characterized in that, in the parsing module, if an abnormal condition occurs while fetching the web page data or while parsing the web page data, the abnormal condition is saved.
8. The single-computer crawler grabbing system according to claim 7, characterized in that the strategy includes: a seed infinite-loop handling strategy, a browser identifier switching strategy, a cookie dynamic update strategy and/or a proxy IP switching strategy.
9. The single-computer crawler grabbing system according to claim 8, characterized in that:
The seed infinite-loop handling strategy is specifically: the crawl parameter is whether to allow or refuse fetching web page data from the current URL; if the abnormal condition is that the current URL has fallen into an infinite loop, the crawl parameter is set to refuse fetching web page data from the current URL; otherwise, the crawl parameter is set to allow fetching web page data from the current URL;
The browser identifier switching strategy is specifically: the crawl parameter is the browser identifier used when fetching web page data from the current URL; if the abnormal condition is a failure to fetch web page data from the current URL, or a preset time is reached, the browser identifier used when fetching web page data from the current URL is updated to another browser identifier; otherwise, the browser identifier used when fetching web page data from the current URL is not updated;
The cookie dynamic update strategy is specifically: the crawl parameter is whether to allow or refuse updating the cookie when fetching web page data from the current URL; if a preset time is reached, the crawl parameter is set to allow updating the cookie when fetching web page data from the current URL; otherwise, the crawl parameter is set to refuse updating the cookie when fetching web page data from the current URL;
The proxy IP switching strategy is specifically: the crawl parameter is whether to allow or refuse updating the proxy IP when fetching web page data from the current URL; if the abnormal condition is a failure to fetch web page data from the current URL, or a preset time is reached, the crawl parameter is set to allow updating the proxy IP when fetching web page data from the current URL; otherwise, the crawl parameter is set to refuse updating the proxy IP when fetching web page data from the current URL.
10. The single-computer crawler grabbing system according to claim 6, characterized in that, in the parsing module, parsing the web page data according to the rule to obtain parsed data specifically comprises:
If the current type is homepage URL, parsing the web page data according to the rule corresponding to homepage URLs to obtain parsed data, and, if the parsed data contains a URL, storing the URL in the parsed data as a seed of type paging URL;
If the current type is paging URL, parsing the web page data according to the rule corresponding to paging URLs to obtain parsed data, and, if the parsed data contains a URL, storing the URL in the parsed data as a seed of type details page URL;
If the current type is details page URL, parsing the web page data according to the rule corresponding to homepage URLs to obtain parsed data, and saving the web page content in the parsed data.
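The type-dependent parsing in claim 10 amounts to a dispatch on the URL type: homepage and paging pages yield new seeds of the next type, and details pages yield content to save. The Python sketch below illustrates that flow under stated assumptions: `process_page`, `link_rule`, `detail_rule`, `RULES`, `seed_store` and `content_store` are hypothetical names, the rules are toy functions (a regular expression over `href` attributes and a whitespace strip), and one rule is assumed to be registered per URL type.

```python
import re
from typing import Callable, Dict, List, Tuple

HOME, PAGING, DETAIL = "homepage", "paging", "detail"

# URLs found on a page of one type become seeds of the next type.
NEXT_TYPE = {HOME: PAGING, PAGING: DETAIL}


def link_rule(web_data: str) -> dict:
    """Toy rule: collect href targets as the parsed data."""
    return {"urls": re.findall(r'href="([^"]+)"', web_data)}


def detail_rule(web_data: str) -> dict:
    """Toy rule: treat the stripped page text as the web page content."""
    return {"content": web_data.strip()}


# Hypothetical rule table, one parsing rule per URL type.
RULES: Dict[str, Callable[[str], dict]] = {
    HOME: link_rule,
    PAGING: link_rule,
    DETAIL: detail_rule,
}


def process_page(current_type: str, web_data: str,
                 rules: Dict[str, Callable[[str], dict]],
                 seed_store: List[Tuple[str, str]],
                 content_store: List[str]) -> dict:
    """Parse one page by its type, then store new seeds or save content."""
    parsed = rules[current_type](web_data)
    if current_type in NEXT_TYPE:
        # Homepage -> paging seeds, paging -> details page seeds.
        for url in parsed.get("urls", []):
            seed_store.append((url, NEXT_TYPE[current_type]))
    else:
        # Details page: keep the extracted content, create no new seeds.
        content_store.append(parsed["content"])
    return parsed
```

A possible usage: calling `process_page(HOME, html, RULES, seeds, saved)` on a homepage appends `(url, "paging")` tuples to `seeds`, which a scheduler could later fetch and pass back in with type `PAGING`.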
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410458191.6A CN104252530B (en) | 2014-09-10 | 2014-09-10 | Single-computer crawler grabbing method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410458191.6A CN104252530B (en) | 2014-09-10 | 2014-09-10 | Single-computer crawler grabbing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104252530A (en) | 2014-12-31 |
CN104252530B (en) | 2017-09-15 |
Family
ID=52187420
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410458191.6A (Active) CN104252530B (en) | 2014-09-10 | 2014-09-10 | Single-computer crawler grabbing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104252530B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105956175A (en) * | 2016-05-24 | 2016-09-21 | 考拉征信服务有限公司 | Webpage content crawling method and device |
CN105989151A (en) * | 2015-03-02 | 2016-10-05 | 阿里巴巴集团控股有限公司 | Webpage crawling method and apparatus |
CN106021257A (en) * | 2015-12-31 | 2016-10-12 | 广州华多网络科技有限公司 | Method, device, and system for crawler to capture data supporting online programming |
CN106599270A (en) * | 2016-12-23 | 2017-04-26 | 浙江省公众信息产业有限公司 | Network data capturing method and crawler |
CN107045507A (en) * | 2016-02-05 | 2017-08-15 | 北京国双科技有限公司 | Web page crawl method and device |
CN107451046A (en) * | 2016-05-30 | 2017-12-08 | 腾讯科技(深圳)有限公司 | A kind of method and terminal for detecting thread |
CN107957939A (en) * | 2016-10-14 | 2018-04-24 | 北京京东尚科信息技术有限公司 | Webpage interactive interface test method and system |
CN108536788A (en) * | 2018-03-29 | 2018-09-14 | 合肥俊刚机械科技有限公司 | A kind of data capture method and its system based on distributed reptile |
CN109213912A (en) * | 2018-08-16 | 2019-01-15 | 北京神州泰岳软件股份有限公司 | A kind of method and network data crawl dispatching device of crawl network data |
CN111881337A (en) * | 2020-08-06 | 2020-11-03 | 成都信息工程大学 | Data acquisition method and system based on Scapy framework and storage medium |
CN112528120A (en) * | 2020-12-21 | 2021-03-19 | 北京中安智达科技有限公司 | Method for web data crawler to use browser to divide body and proxy |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178736A (en) * | 2007-12-11 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Web page collecting method and web page collecting server |
CN102254014A (en) * | 2011-07-21 | 2011-11-23 | 华中科技大学 | Adaptive information extraction method for webpage characteristics |
CN102347930A (en) * | 2010-07-26 | 2012-02-08 | 中国电信股份有限公司 | Method and system for obtaining webpage content |
CN103279507A (en) * | 2013-05-16 | 2013-09-04 | 北京尚友通达信息技术有限公司 | Webpage spider operational method and system |
CN103942309A (en) * | 2014-04-18 | 2014-07-23 | 乐得科技有限公司 | Network data acquisition device and method and implementation method of acquisition process |
- 2014-09-10 CN CN201410458191.6A, CN104252530B (en), Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178736A (en) * | 2007-12-11 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Web page collecting method and web page collecting server |
CN102347930A (en) * | 2010-07-26 | 2012-02-08 | 中国电信股份有限公司 | Method and system for obtaining webpage content |
CN102254014A (en) * | 2011-07-21 | 2011-11-23 | 华中科技大学 | Adaptive information extraction method for webpage characteristics |
CN103279507A (en) * | 2013-05-16 | 2013-09-04 | 北京尚友通达信息技术有限公司 | Webpage spider operational method and system |
CN103942309A (en) * | 2014-04-18 | 2014-07-23 | 乐得科技有限公司 | Network data acquisition device and method and implementation method of acquisition process |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105989151B (en) * | 2015-03-02 | 2019-09-06 | 阿里巴巴集团控股有限公司 | Webpage capture method and device |
CN105989151A (en) * | 2015-03-02 | 2016-10-05 | 阿里巴巴集团控股有限公司 | Webpage crawling method and apparatus |
CN106021257A (en) * | 2015-12-31 | 2016-10-12 | 广州华多网络科技有限公司 | Method, device, and system for crawler to capture data supporting online programming |
CN106021257B (en) * | 2015-12-31 | 2019-10-18 | 广州华多网络科技有限公司 | A kind of crawler capturing data method, apparatus and system for supporting online programming |
CN107045507B (en) * | 2016-02-05 | 2020-08-21 | 北京国双科技有限公司 | Webpage crawling method and device |
CN107045507A (en) * | 2016-02-05 | 2017-08-15 | 北京国双科技有限公司 | Web page crawl method and device |
CN105956175A (en) * | 2016-05-24 | 2016-09-21 | 考拉征信服务有限公司 | Webpage content crawling method and device |
CN107451046A (en) * | 2016-05-30 | 2017-12-08 | 腾讯科技(深圳)有限公司 | A kind of method and terminal for detecting thread |
CN107451046B (en) * | 2016-05-30 | 2020-11-17 | 腾讯科技(深圳)有限公司 | Method and terminal for detecting threads |
CN107957939A (en) * | 2016-10-14 | 2018-04-24 | 北京京东尚科信息技术有限公司 | Webpage interactive interface test method and system |
CN107957939B (en) * | 2016-10-14 | 2021-02-26 | 北京京东尚科信息技术有限公司 | Webpage interaction interface testing method and system |
CN106599270A (en) * | 2016-12-23 | 2017-04-26 | 浙江省公众信息产业有限公司 | Network data capturing method and crawler |
CN108536788A (en) * | 2018-03-29 | 2018-09-14 | 合肥俊刚机械科技有限公司 | A kind of data capture method and its system based on distributed reptile |
CN109213912A (en) * | 2018-08-16 | 2019-01-15 | 北京神州泰岳软件股份有限公司 | A kind of method and network data crawl dispatching device of crawl network data |
CN111881337A (en) * | 2020-08-06 | 2020-11-03 | 成都信息工程大学 | Data acquisition method and system based on Scapy framework and storage medium |
CN112528120A (en) * | 2020-12-21 | 2021-03-19 | 北京中安智达科技有限公司 | Method for web data crawler to use browser to divide body and proxy |
Also Published As
Publication number | Publication date |
---|---|
CN104252530B (en) | 2017-09-15 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN104252530A (en) | Single-computer crawler grabbing method and system | |
CN105243159B (en) | A kind of distributed network crawler system based on visualization script editing machine | |
US10084815B2 (en) | Remediating computer security threats using distributed sensor computers | |
US9124622B1 (en) | Detecting computer security threats in electronic documents based on structure | |
CN103179132B (en) | A kind of method and device detecting and defend CC attack | |
CN103023905B (en) | A kind of equipment, method and system for detection of malicious link | |
CN103023906B (en) | Method and system aiming at remote procedure calling conventions to perform status tracking | |
CN105956175A (en) | Webpage content crawling method and device | |
US20140095427A1 (en) | Seo results analysis based on first order data | |
CN103279507B (en) | Webpage spider operational method and system | |
US9167021B2 (en) | Measuring web browsing quality of experience in real-time at an intermediate network node | |
CN105610993B (en) | A kind of domain name analytic method, apparatus and system | |
CN103326947B (en) | The learning method of PMTU, the sending method of data message and the network equipment | |
CN102870118B (en) | Access method, device and system to user behavior | |
EP1713010A3 (en) | Using attribute inheritance to identify crawl paths | |
CN107580052B (en) | Self-evolution network self-adaptive crawler method and system | |
CN105302815A (en) | Web page uniform resource locator URL filtering method and apparatus | |
CN108206769A (en) | Method, apparatus, equipment and the medium of screen quality alarm | |
US20120047248A1 (en) | Method and System for Monitoring Flows in Network Traffic | |
CN104462242A (en) | Webpage reflow quantity counting method and device | |
US20140137250A1 (en) | System and method for detecting final distribution site and landing site of malicious code | |
CN106657422A (en) | Method, apparatus and system for crawling website page | |
CN105516114B (en) | Method and device for scanning vulnerability based on webpage hash value and electronic equipment | |
CN103117892B (en) | Add method and the device of website visiting record | |
US10013262B2 (en) | Method and device for adding indicative icon in interactive application |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |