CN105740384A - Crawler agent automatic switching method and device - Google Patents


Info

Publication number
CN105740384A
Authority
CN
China
Prior art keywords
target proxy
described target
candidate queue
web page
download
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610056419.8A
Other languages
Chinese (zh)
Inventor
毛立花
王传超
孙海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Group Co Ltd
Original Assignee
Inspur Software Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Group Co Ltd
Priority to CN201610056419.8A
Publication of CN105740384A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention provides a crawler agent (proxy) automatic switching method and a crawler agent automatic switching device. The method comprises the following steps: S1: selecting a target agent from a candidate queue, wherein at least one agent in a candidate state is stored in the candidate queue; S2: sending a crawler instruction to the target agent so that the target agent downloads a webpage according to the crawler instruction; S3: judging whether the target agent has successfully downloaded the current webpage, executing S4 when the download succeeded and executing S5 when the download failed; S4: continuing to use the target agent to download the next webpage, taking the next webpage as the current webpage, and executing S3; S5: placing the target agent into the candidate queue and executing S1. With this scheme, the time spent switching proxies can be saved and the efficiency of webpage downloading is improved.

Description

Crawler proxy automatic switching method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a crawler proxy automatic switching method and device.
Background art
With the development of Internet technology, collecting Internet data has become an important part of how enterprises build their big data business. Enterprises generally use crawler technology to collect Internet data. However, to guard against the system load that crawlers impose, many websites adopt anti-crawler techniques and do not allow crawlers to collect data at high frequency.
At present, to cope with anti-crawler techniques, a process can send its request to a proxy, and the proxy downloads the webpage. The website then cannot detect the real machine that is collecting the webpage.
In the prior art, the proxy is switched once after every webpage download, in order to reduce how often any one proxy performs downloads. However, each proxy switch takes a relatively long time, which reduces the efficiency of Internet data collection.
Summary of the invention
Embodiments of the present invention provide a crawler proxy automatic switching method and device, so that the proxy is switched automatically when the current proxy fails to download a webpage.
An embodiment of the present invention provides a crawler proxy automatic switching method, including:
S1: selecting a target proxy from a candidate queue, wherein the candidate queue stores at least one proxy in a candidate state;
S2: sending a crawler instruction to the target proxy, so that the target proxy downloads a webpage according to the crawler instruction;
S3: judging whether the target proxy has successfully downloaded the current webpage; when the judgment result indicates a successful download, performing S4; when the judgment result indicates a failed download, performing S5;
S4: continuing to use the target proxy to download the next webpage, taking the next webpage as the current webpage, and performing S3;
S5: placing the target proxy into the candidate queue, and performing S1.
Wherein,
selecting the target proxy from the candidate queue includes: taking the proxy located at the head/tail of the candidate queue as the selected target proxy;
placing the target proxy into the candidate queue includes: placing the target proxy at the tail/head of the candidate queue.
Wherein, before the target proxy is placed into the candidate queue, the method further includes: decrementing the current attempt count of the target proxy by 1, and judging whether the current attempt count after the decrement is 0; if it is 0, setting the target proxy to an invalid state and discarding it; if it is not 0, placing the decremented target proxy into the candidate queue.
Wherein, before continuing to use the target proxy to download the next webpage, the method further includes: restoring the current attempt count of the target proxy to the maximum attempt count set in advance for the target proxy.
Wherein, after the judgment result indicates a failed download, the method further includes: judging whether the number of failed downloads of the current webpage has reached the maximum failure count set in advance for the current webpage, and if so, abandoning the download of the current webpage.
An embodiment of the present invention further provides a crawler proxy automatic switching device, including:
a selection unit, configured to select a target proxy from a candidate queue, wherein the candidate queue stores at least one proxy in a candidate state;
a sending unit, configured to send a crawler instruction to the target proxy, so that the target proxy downloads a webpage according to the crawler instruction;
a first judging unit, configured to judge whether the target proxy has successfully downloaded the current webpage; when the judgment result indicates a successful download, to trigger the first processing unit to perform its operation; when the judgment result indicates a failed download, to trigger the placement unit to perform its operation;
a first processing unit, configured to continue using the target proxy to download the next webpage, take the next webpage as the current webpage, and trigger the first judging unit to perform its operation;
a placement unit, configured to place the target proxy into the candidate queue and trigger the selection unit to perform its operation.
Wherein,
the selection unit is specifically configured to take the proxy located at the head/tail of the candidate queue as the selected target proxy;
the placement unit is specifically configured to place the target proxy at the tail/head of the candidate queue.
Wherein, the device further includes: a second processing unit, configured to decrement the current attempt count of the target proxy by 1, and to judge whether the current attempt count after the decrement is 0; if it is 0, to set the target proxy to an invalid state and discard it; if it is not 0, to place the decremented target proxy into the candidate queue.
Wherein, the device further includes: a recovery unit, configured to restore the current attempt count of the target proxy to the maximum attempt count set in advance for the target proxy.
Wherein, the device further includes: a second judging unit, configured to judge whether the number of failed downloads of the current webpage has reached the maximum failure count set in advance for the current webpage, and if so, to abandon the download of the current webpage.
Embodiments of the present invention provide a crawler proxy automatic switching method and device. The proxy is switched automatically only when the target proxy fails to download the current webpage. When the target proxy downloads the current webpage successfully, the same target proxy continues to be used and no switch is needed, which saves the time spent on proxy switching and improves the efficiency of webpage downloading.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method provided by an embodiment of the present invention;
Fig. 3 is a hardware architecture diagram of the equipment in which the device provided by an embodiment of the present invention is located;
Fig. 4 is a schematic structural diagram of a device provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a crawler proxy automatic switching method, which may include the following steps:
Step 101: select a target proxy from a candidate queue, wherein the candidate queue stores at least one proxy in a candidate state;
Step 102: send a crawler instruction to the target proxy, so that the target proxy downloads a webpage according to the crawler instruction;
Step 103: judge whether the target proxy has successfully downloaded the current webpage; when the judgment result indicates a successful download, perform step 104; when the judgment result indicates a failed download, perform step 105;
Step 104: continue using the target proxy to download the next webpage, take the next webpage as the current webpage, and perform step 103;
Step 105: place the target proxy into the candidate queue, and perform step 101.
As can be seen from the above embodiment, the proxy is switched automatically only when the target proxy fails to download the current webpage. When the target proxy downloads the current webpage successfully, the same target proxy continues to be used and no switch is needed, which saves the time spent on proxy switching and improves the efficiency of webpage downloading.
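The loop of steps 101 to 105 can be sketched roughly as follows. This is only an illustrative sketch, not the patent's implementation: the function name `crawl`, the `fetch(proxy, url)` callable and the proxy names are assumptions introduced here. Note that this basic loop does not bound retries; the refinements described later (attempt counts and per-page failure counts) address that.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable


def crawl(urls: Iterable[str],
          candidates: Deque[str],
          fetch: Callable[[str, str], bool]) -> Dict[str, str]:
    """Download each URL, switching proxy only after a failed download.

    `fetch(proxy, url)` is a hypothetical callable that returns True when
    the proxy downloaded the page. Returns a mapping url -> proxy used.
    """
    downloaded: Dict[str, str] = {}
    pending = deque(urls)
    proxy = candidates.popleft()          # step 101: select the target proxy
    while pending:
        url = pending[0]                  # the current webpage
        if fetch(proxy, url):             # steps 102-103: download and judge
            downloaded[url] = proxy
            pending.popleft()             # step 104: keep the proxy, next page
        else:
            candidates.append(proxy)      # step 105: back into the queue
            proxy = candidates.popleft()  # step 101 again: select a proxy
    return downloaded
```

With a stub `fetch` that fails for one proxy and succeeds for another, only the first page triggers a switch; every later page reuses the working proxy.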
In general, a process uses only one proxy at a time to download webpages. When the proxy in use fails to download the current webpage, it needs to be placed back into the candidate queue, and the process only needs to select another proxy from the candidate queue. A proxy that has been placed back into the candidate queue can be selected again by this process or by other processes.
The candidate queue may contain a large number of proxies in the candidate state. By the nature of a queue, a proxy can only be selected from the head or the tail of the queue, and can only be placed back at the head or the tail. Suppose the target proxy is placed back at the head of the candidate queue. When another process then selects a proxy, it may also select from the head of the queue, so the same target proxy is selected again. The target proxy would then perform webpage downloads at high frequency, which can easily trigger the website's anti-crawler mechanism. Therefore, in an embodiment of the present invention, the following may be specified:
selecting the target proxy from the candidate queue includes: taking the proxy located at the head/tail of the candidate queue as the selected target proxy;
placing the target proxy into the candidate queue includes: placing the target proxy at the tail/head of the candidate queue.
In this way, the position at which the target proxy is placed into the candidate queue is opposite to the position from which proxies are selected, which reduces the cases in which the same proxy is selected several times in a row.
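As a small illustration of selecting at one end and placing at the other, the following sketch uses Python's `collections.deque`. The helper names `select_target` and `put_back` and the proxy names are hypothetical, chosen only for this example.

```python
from collections import deque

# Hypothetical candidate queue holding three proxy identifiers.
candidates = deque(["proxy-a", "proxy-b", "proxy-c"])


def select_target(queue):
    """Select the proxy at the head of the candidate queue."""
    return queue.popleft()


def put_back(queue, proxy):
    """Place a proxy at the tail, the end opposite to selection."""
    queue.append(proxy)


first = select_target(candidates)   # takes "proxy-a" from the head
put_back(candidates, first)         # "proxy-a" goes to the tail
second = select_target(candidates)  # the next selection is a different proxy
```

Because the returned proxy lands behind every other candidate, it is selected again only after all of the others have had a turn.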
In an embodiment of the present invention, if the target proxy is selected repeatedly by the same process or by different processes, and the number of consecutive failed downloads reaches the maximum attempt count set for this target proxy, it is determined that this target proxy cannot download webpages, and it is set to an invalid state and discarded. The scheme therefore includes: before the target proxy is placed into the candidate queue, decrementing the current attempt count of the target proxy by 1, and judging whether the current attempt count after the decrement is 0; if it is 0, setting the target proxy to an invalid state and discarding it; if it is not 0, placing the decremented target proxy into the candidate queue.
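The decrement-and-discard rule might look like the sketch below. The `Proxy` dataclass, its field names and the state strings are assumptions made for illustration; the source only specifies the counting behaviour. The check uses `<= 0` as a defensive choice, whereas the source tests for exactly 0.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Proxy:
    address: str
    max_attempts: int
    attempts_left: int = -1        # -1 means "initialise from max_attempts"
    state: str = "candidate"

    def __post_init__(self):
        if self.attempts_left < 0:
            self.attempts_left = self.max_attempts


def on_download_failure(proxy: Proxy, candidates: deque) -> bool:
    """Decrement the attempt count; re-queue the proxy or discard it.

    Returns True if the proxy went back into the candidate queue and
    False if it was marked invalid and dropped.
    """
    proxy.attempts_left -= 1
    if proxy.attempts_left <= 0:
        proxy.state = "invalid"    # discarded: not placed back in the queue
        return False
    proxy.state = "candidate"
    candidates.append(proxy)
    return True
```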
In an embodiment of the present invention, when the target proxy downloads the current webpage successfully and its current attempt count is less than the maximum attempt count set for it, this shows that the target proxy is able to download webpages, so its current attempt count can be restored to the maximum attempt count. The scheme may therefore include: before continuing to use the target proxy to download the next webpage, restoring the current attempt count of the target proxy to the maximum attempt count set in advance for the target proxy.
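A minimal sketch of the restore-on-success rule follows; the `Proxy` dataclass and the name `on_download_success` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Proxy:
    address: str
    max_attempts: int
    attempts_left: int


def on_download_success(proxy: Proxy) -> None:
    """Restore the attempt count after a successful download.

    Earlier failures are forgiven, so only consecutive failures can
    drive the count down to zero.
    """
    if proxy.attempts_left < proxy.max_attempts:
        proxy.attempts_left = proxy.max_attempts
```

Resetting on success is what makes the attempt count measure consecutive failures rather than the total number of failures over the proxy's lifetime.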
In an embodiment of the present invention, if the number of failed downloads of the current webpage reaches the maximum failure count set for this webpage, this shows that the webpage cannot be downloaded, so the download of the current webpage can be abandoned. Once the process performing the download has selected a new proxy, it can use that proxy to download the next webpage.
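The per-page failure limit can be sketched with a simple counter. The constant `MAX_PAGE_FAILURES`, its value of 3, and the function name `should_abandon` are assumptions; the source does not give a concrete limit.

```python
from collections import Counter

MAX_PAGE_FAILURES = 3      # hypothetical limit, not specified by the source

page_failures = Counter()  # url -> number of failed download attempts


def should_abandon(url: str, max_failures: int = MAX_PAGE_FAILURES) -> bool:
    """Record one failed download and report whether to give up on the page."""
    page_failures[url] += 1
    return page_failures[url] >= max_failures
```

Without such a limit, a page that no proxy can fetch (for example a dead link) would keep consuming attempts and could invalidate otherwise healthy proxies.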
To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and a specific embodiment.
As shown in Fig. 2, an embodiment of the present invention provides a crawler proxy automatic switching method, which may include the following steps:
Step 201: filter the proxies that are in the initial state, to obtain the proxies that can be used to download webpages.
A proxy is a server that can be used to download webpages. Proxies can be loaded into memory from a configuration file or a database, and a proxy that has been loaded into memory enters the initial state.
In the initial state, the system needs to check the initial availability of each proxy. The check may include whether the proxy's address responds to ping and whether the proxy can download a designated webpage, for example the Baidu homepage. Invalid addresses are filtered out.
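The initial filtering can be sketched as below. The two checks are passed in as callables (`is_reachable` and `can_download_test_page`, both hypothetical names) so that the sketch stays independent of any particular ping or HTTP library; a real check would ping the address and fetch the designated page through the proxy.

```python
from typing import Callable, Iterable, List


def filter_proxies(addresses: Iterable[str],
                   is_reachable: Callable[[str], bool],
                   can_download_test_page: Callable[[str], bool]) -> List[str]:
    """Keep only the proxies that pass both initial availability checks."""
    return [
        address
        for address in addresses
        # The reachability check runs first; the download check is
        # skipped for addresses that are not reachable at all.
        if is_reachable(address) and can_download_test_page(address)
    ]
```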
Step 202: put the proxies obtained after filtering into the candidate queue, wherein each proxy carries the maximum attempt count set for it.
The maximum attempt count is the number of consecutive failed webpage downloads allowed for the proxy. The maximum attempt count for each proxy in the candidate queue may be set separately for different types of proxy based on experience, or the same maximum attempt count may be set for all proxies; this embodiment does not limit this.
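One way to express the two configuration options (per-type limits or a single shared limit) is a lookup with a default. The proxy type names and the numeric values below are purely hypothetical; the source gives no concrete figures.

```python
DEFAULT_MAX_ATTEMPTS = 3    # hypothetical shared limit for all proxies

# Hypothetical per-type overrides chosen "based on experience".
MAX_ATTEMPTS_BY_TYPE = {
    "http": 3,
    "socks5": 5,
}


def max_attempts_for(proxy_type: str) -> int:
    """Return the maximum attempt count for a proxy of the given type."""
    return MAX_ATTEMPTS_BY_TYPE.get(proxy_type, DEFAULT_MAX_ATTEMPTS)
```

Leaving the override table empty reduces this to the second option, where every proxy gets the same limit.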
Step 203: when the current process needs to crawl, select a target proxy from the candidate queue and send a crawler instruction to this target proxy.
When selecting the target proxy from the candidate queue, the proxy located at the head of the candidate queue may be selected as the target proxy.
In an embodiment of the present invention, once a proxy has been acquired by a collection process, it enters the in-use state. A proxy in the in-use state can only be used by one process. This prevents the proxy's load per unit time from increasing because several processes use it at the same time, keeps the proxy from becoming unstable, and reduces the probability that the proxy is banned.
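Exclusive use falls out naturally if acquiring a proxy removes it from a shared, thread-safe queue. The sketch below uses the standard library's `queue.Queue` with worker threads in mind; sharing between operating-system processes would need something like `multiprocessing.Queue` instead. The names `acquire` and `release` and the proxy identifiers are assumptions.

```python
import queue

# Shared candidate queue; queue.Queue is FIFO and safe to use from
# several threads at once.
candidates = queue.Queue()
for address in ("proxy-a", "proxy-b"):
    candidates.put(address)


def acquire() -> str:
    """Take a proxy out of the candidate queue (it is now 'in use').

    While a worker holds the proxy, no other worker can obtain it.
    Raises queue.Empty when no candidate is available.
    """
    return candidates.get_nowait()


def release(proxy: str) -> None:
    """Return a proxy to the tail of the candidate queue."""
    candidates.put(proxy)
```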
Step 204: the target proxy downloads the webpage according to the crawler instruction.
Step 205: judge whether the target proxy has successfully downloaded the current webpage; if so, perform step 206; otherwise, perform step 208.
In an embodiment of the present invention, after the download of the current webpage fails, the number of failed downloads of this webpage can also be judged. Therefore, if the target proxy fails to download the current webpage, step 212 may also be performed.
Step 206: check the current attempt count of the target proxy. If the current attempt count is less than the maximum attempt count set for this target proxy, restore the current attempt count to the maximum attempt count and perform step 207; if the current attempt count equals the maximum attempt count, perform step 207 directly.
Step 207: continue using this target proxy to download the next webpage, take the next webpage as the current webpage, and perform step 205.
Step 208: decrement the current attempt count of this target proxy by 1, and perform step 209.
Step 209: judge whether the current attempt count of this target proxy after the decrement is 0; if it is 0, perform step 210; otherwise, perform step 211.
Step 210: set this target proxy to an invalid state and discard it; the handling of this proxy ends.
Step 211: place the decremented target proxy into the candidate queue, and perform step 203.
When the target proxy is placed into the candidate queue, it may be placed at the tail of the candidate queue.
Step 212: judge whether the number of failed downloads of the current webpage has reached the maximum failure count set for this webpage; if so, perform step 213; otherwise, return to step 203 to select a target proxy and continue downloading the current webpage.
Step 213: abandon the download of the current webpage, and return to step 203 to select a target proxy for downloading the next webpage.
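Steps 203 to 213 can be combined into a single loop, sketched below under several assumptions: the `Proxy` dataclass, the `fetch(address, url)` callable, the state strings and the default limits are all hypothetical. The sketch also assumes that after a proxy is discarded in step 210 the process goes on to select another proxy, and that it stops if the candidate queue runs empty, leaving any remaining pages unprocessed.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Iterable, List, Tuple


@dataclass
class Proxy:
    address: str
    max_attempts: int = 3
    attempts_left: int = 3
    state: str = "candidate"


def crawl(urls: Iterable[str],
          candidates: Deque[Proxy],
          fetch: Callable[[str, str], bool],
          max_page_failures: int = 3) -> Tuple[Dict[str, str], List[str]]:
    """Return (downloaded url -> proxy address, abandoned urls)."""
    downloaded: Dict[str, str] = {}
    abandoned: List[str] = []
    pending = deque(urls)
    page_failures = 0                    # failures of the current webpage
    proxy = None
    while pending:
        if proxy is None:
            if not candidates:
                break                    # assumption: stop when no proxy is left
            proxy = candidates.popleft() # step 203: select from the head
            proxy.state = "in_use"
        url = pending[0]
        if fetch(proxy.address, url):    # steps 204-205
            proxy.attempts_left = proxy.max_attempts  # step 206: restore
            downloaded[url] = proxy.address
            pending.popleft()            # step 207: same proxy, next page
            page_failures = 0
            continue
        proxy.attempts_left -= 1         # step 208
        if proxy.attempts_left <= 0:     # step 209
            proxy.state = "invalid"      # step 210: discard the proxy
        else:
            proxy.state = "candidate"
            candidates.append(proxy)     # step 211: place at the tail
        proxy = None                     # force a new selection (step 203)
        page_failures += 1               # step 212
        if page_failures >= max_page_failures:
            abandoned.append(url)        # step 213: give up on this page
            pending.popleft()
            page_failures = 0
    return downloaded, abandoned
```

In a run with one failing proxy, one working proxy and one page that no proxy can fetch, the failing proxy ends up invalid, the unreachable page is abandoned, and the working proxy keeps its full attempt count because each success restores it.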
As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides a crawler proxy automatic switching device. The device embodiment may be implemented in software, in hardware, or in a combination of software and hardware. At the hardware level, Fig. 3 is a hardware structure diagram of the equipment in which the crawler proxy automatic switching device provided by the embodiment is located. In addition to the processor, memory, network interface and non-volatile storage shown in Fig. 3, the equipment in which the device is located may generally also include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in Fig. 4, the device in the logical sense is formed by the CPU of the equipment in which it is located reading the corresponding computer program instructions from non-volatile storage into memory and running them. The crawler proxy automatic switching device provided by this embodiment includes:
a selection unit 401, configured to select a target proxy from a candidate queue, wherein the candidate queue stores at least one proxy in a candidate state;
a sending unit 402, configured to send a crawler instruction to the target proxy, so that the target proxy downloads a webpage according to the crawler instruction;
a first judging unit 403, configured to judge whether the target proxy has successfully downloaded the current webpage; when the judgment result indicates a successful download, to trigger the first processing unit 404 to perform its operation; when the judgment result indicates a failed download, to trigger the placement unit 405 to perform its operation;
a first processing unit 404, configured to continue using the target proxy to download the next webpage, take the next webpage as the current webpage, and trigger the first judging unit 403 to perform its operation;
a placement unit 405, configured to place the target proxy into the candidate queue and trigger the selection unit 401 to perform its operation.
In an embodiment of the present invention, the selection unit 401 is specifically configured to take the proxy located at the head/tail of the candidate queue as the selected target proxy;
the placement unit 405 is specifically configured to place the target proxy at the tail/head of the candidate queue.
In an embodiment of the present invention, referring to Fig. 5, the crawler proxy automatic switching device may further include: a second processing unit 501, configured to decrement the current attempt count of the target proxy by 1, and to judge whether the current attempt count after the decrement is 0; if it is 0, to set the target proxy to an invalid state and discard it; if it is not 0, to place the decremented target proxy into the candidate queue.
In an embodiment of the present invention, referring to Fig. 5, the crawler proxy automatic switching device may further include: a recovery unit 502, configured to restore the current attempt count of the target proxy to the maximum attempt count set in advance for the target proxy.
In an embodiment of the present invention, referring to Fig. 5, the crawler proxy automatic switching device may further include: a second judging unit 503, configured to judge whether the number of failed downloads of the current webpage has reached the maximum failure count set in advance for the current webpage, and if so, to abandon the download of the current webpage.
In summary, the embodiments of the present invention can achieve at least the following beneficial effects:
1. In the embodiments of the present invention, the proxy is switched automatically only when the target proxy fails to download the current webpage. When the target proxy downloads the current webpage successfully, the same target proxy continues to be used and no switch is needed, which saves the time spent on proxy switching and improves the efficiency of webpage downloading.
2. In the embodiments of the present invention, the proxy located at the head/tail of the candidate queue is taken as the selected target proxy, and the target proxy is placed at the tail/head of the candidate queue. Because the selection position and the placement position differ, a proxy that has just been placed back into the candidate queue is not selected again immediately, which reduces the cases in which the same proxy is selected several times in a row.
3. In the embodiments of the present invention, when the target proxy is selected repeatedly by the same process or by different processes and the number of consecutive failed downloads reaches the maximum attempt count set for this target proxy, it is determined that this target proxy cannot download webpages, and it is set to an invalid state and discarded. This helps ensure that the proxies selected from the candidate queue are able to download webpages.
4. In the embodiments of the present invention, when the number of failed downloads of the current webpage reaches the maximum failure count set for this webpage, this shows that the webpage cannot be downloaded, so the download of the current webpage can be abandoned. Once the process performing the download has selected a new proxy, it can use that proxy to download the next webpage, which improves the efficiency of webpage downloading.
The information exchange between the units of the above device, their execution processes and similar details are based on the same concept as the method embodiments of the present invention. For specifics, refer to the description of the method embodiments; they are not repeated here.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to be non-exclusive, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the statement "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical discs.
Finally, it should be noted that the above are only preferred embodiments of the present invention, intended only to illustrate its technical solutions and not to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (10)

1. A crawler proxy automatic switching method, characterized by including:
S1: selecting a target proxy from a candidate queue, wherein the candidate queue stores at least one proxy in a candidate state;
S2: sending a crawler instruction to the target proxy, so that the target proxy downloads a webpage according to the crawler instruction;
S3: judging whether the target proxy has successfully downloaded the current webpage; when the judgment result indicates a successful download, performing S4; when the judgment result indicates a failed download, performing S5;
S4: continuing to use the target proxy to download the next webpage, taking the next webpage as the current webpage, and performing S3;
S5: placing the target proxy into the candidate queue, and performing S1.
2. The method according to claim 1, characterized in that
selecting the target proxy from the candidate queue includes: taking the proxy located at the head/tail of the candidate queue as the selected target proxy;
placing the target proxy into the candidate queue includes: placing the target proxy at the tail/head of the candidate queue.
3. method according to claim 1, it is characterized in that, before described target proxy is placed in described candidate queue, farther include: the current number of attempt of described target proxy is performed to subtract 1 operation, and whether the current number of attempt of the described target proxy after the operation that judges to perform to subtract 1 is 0, if 0, then described target proxy it is set to disarmed state and abandons;If not 0, then perform the described target proxy after performing the operation that subtracts 1 described to be placed in described candidate queue by described target proxy.
4. The method according to claim 1, characterized in that, before continuing to use the target proxy to download the next web page, the method further comprises: restoring the current attempt count of the target proxy to the maximum attempt count set in advance for the target proxy.
5. The method according to any one of claims 1-4, characterized in that, after the download is determined to have failed, the method further comprises: determining whether the number of failed downloads of the current web page has reached the maximum failure count set in advance for the current web page, and if so, abandoning the download of the current web page.
6. A device for automatically switching crawler proxies, characterized by comprising:
a selection unit, configured to select a target proxy from a candidate queue, wherein the candidate queue stores at least one proxy in a candidate state;
a sending unit, configured to send a crawl instruction to the target proxy, so that the target proxy downloads web pages according to the crawl instruction;
a first judging unit, configured to determine whether the target proxy has successfully downloaded the current web page; if the download succeeded, to trigger the first processing unit to perform its corresponding operation; if the download failed, to trigger the placement unit to perform its corresponding operation;
a first processing unit, configured to continue using the target proxy to download the next web page, take the next web page as the current web page, and trigger the first judging unit to perform its corresponding operation;
a placement unit, configured to place the target proxy into the candidate queue and trigger the selection unit to perform its corresponding operation.
7. The device for automatically switching crawler proxies according to claim 6, characterized in that
the selection unit is specifically configured to take the proxy located at the head/tail of the candidate queue as the selected target proxy;
the placement unit is specifically configured to place the target proxy at the tail/head of the candidate queue.
8. The device for automatically switching crawler proxies according to claim 6, characterized by further comprising: a second processing unit, configured to decrement the current attempt count of the target proxy by 1 and determine whether the decremented attempt count is 0; if it is 0, to set the target proxy to an invalid state and discard it; if it is not 0, to have the target proxy, with its decremented attempt count, placed into the candidate queue.
9. The device for automatically switching crawler proxies according to claim 6, characterized by further comprising: a recovery unit, configured to restore the current attempt count of the target proxy to the maximum attempt count set in advance for the target proxy.
10. The device for automatically switching crawler proxies according to any one of claims 6-9, characterized by further comprising: a second judging unit, configured to determine whether the number of failed downloads of the current web page has reached the maximum failure count set in advance for the current web page, and if so, to abandon the download of the current web page.
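The method of claims 1-5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation. The names Proxy, crawl, and download are hypothetical, and so are the limits MAX_PROXY_ATTEMPTS = 3 and MAX_PAGE_FAILURES = 5; the claims do not fix any values. The sketch takes the head/tail option of claim 2 as "select from the head, return to the tail", and the caller supplies download, which stands in for sending the crawl instruction and judging the result.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Iterable, List, Optional, Tuple

MAX_PROXY_ATTEMPTS = 3  # assumed value; claim 4 only says "set in advance"
MAX_PAGE_FAILURES = 5   # assumed value; claim 5 only says "set in advance"


@dataclass
class Proxy:
    address: str
    attempts_left: int = MAX_PROXY_ATTEMPTS


def crawl(
    pages: Iterable[str],
    proxies: Iterable[str],
    download: Callable[[str, str], bool],
) -> Tuple[List[str], List[str], List[str]]:
    """Return (downloaded, abandoned, never_attempted_or_unfinished) page lists.

    download(proxy_address, url) is a hypothetical callback that returns
    True when the page was fetched successfully through that proxy.
    """
    candidates: Deque[Proxy] = deque(Proxy(a) for a in proxies)
    pending: Deque[str] = deque(pages)
    page_failures: Dict[str, int] = {}
    downloaded: List[str] = []
    abandoned: List[str] = []
    target: Optional[Proxy] = None

    while pending:
        if target is None:
            if not candidates:
                break  # every proxy has been discarded as invalid
            target = candidates.popleft()  # S1: take the proxy at the head

        url = pending[0]
        if download(target.address, url):  # S2 and S3
            # Claim 4: a success restores the proxy's attempt count.
            target.attempts_left = MAX_PROXY_ATTEMPTS
            downloaded.append(url)
            pending.popleft()  # S4: keep the same proxy for the next page
            continue

        # Claim 5: give up on a page that has failed too many times.
        page_failures[url] = page_failures.get(url, 0) + 1
        if page_failures[url] >= MAX_PAGE_FAILURES:
            abandoned.append(url)
            pending.popleft()

        # Claim 3: decrement, then either discard or requeue the proxy.
        target.attempts_left -= 1
        if target.attempts_left > 0:
            candidates.append(target)  # S5: return the proxy to the tail
        target = None  # force a new selection (back to S1)

    return downloaded, abandoned, list(pending)
```

Because a failed proxy goes to the tail while selection happens at the head, every other candidate is tried before the failed one is selected again, and a proxy that keeps failing eventually leaves the queue once its attempt count reaches 0.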

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610056419.8A CN105740384A (en) 2016-01-27 2016-01-27 Crawler agent automatic switching method and device

Publications (1)

Publication Number Publication Date
CN105740384A true CN105740384A (en) 2016-07-06

Family

ID=56247252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610056419.8A Pending CN105740384A (en) 2016-01-27 2016-01-27 Crawler agent automatic switching method and device

Country Status (1)

Country Link
CN (1) CN105740384A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169006A (en) * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 Method and device for managing crawler proxies
CN107832355A (en) * 2017-10-23 2018-03-23 北京金堤科技有限公司 Method and device for obtaining proxies for a crawler program
CN107957999A (en) * 2016-10-14 2018-04-24 北京国双科技有限公司 Method and device for a web crawler to obtain website data
CN110062025A (en) * 2019-03-14 2019-07-26 深圳绿米联创科技有限公司 Method, apparatus, server and the storage medium of data acquisition
CN111641664A (en) * 2019-03-01 2020-09-08 北京京东尚科信息技术有限公司 Crawler equipment service request method, device and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003005240A1 (en) * 2001-07-03 2003-01-16 Wide Computing As Apparatus for searching on internet
CN103902386A (en) * 2014-04-11 2014-07-02 复旦大学 Multi-thread network crawler processing method based on connection proxy optimal management
CN103914568A (en) * 2014-04-24 2014-07-09 厦门市美亚柏科信息股份有限公司 Method and device for dispatching HTTP proxy

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003005240A1 (en) * 2001-07-03 2003-01-16 Wide Computing As Apparatus for searching on internet
CN103902386A (en) * 2014-04-11 2014-07-02 复旦大学 Multi-thread network crawler processing method based on connection proxy optimal management
CN103914568A (en) * 2014-04-24 2014-07-09 厦门市美亚柏科信息股份有限公司 Method and device for dispatching HTTP proxy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
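付华峥 et al.: 《分布式大数据采集关键技术研究与实现》 (Research and Implementation of Key Technologies for Distributed Big Data Collection), 《广东通信技术》 (Guangdong Communication Technology) *
王星 et al.: 《基于移动代理(Agent)的智能爬虫系统的设计与实现》 (Design and Implementation of an Intelligent Crawler System Based on Mobile Agents), 《科技资讯》 (Science & Technology Information) *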

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107957999A (en) * 2016-10-14 2018-04-24 北京国双科技有限公司 Method and device for a web crawler to obtain website data
CN107169006A (en) * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 Method and device for managing crawler proxies
CN107832355A (en) * 2017-10-23 2018-03-23 北京金堤科技有限公司 Method and device for obtaining proxies for a crawler program
CN107832355B (en) * 2017-10-23 2019-03-26 北京金堤科技有限公司 Method and device for obtaining proxies for a crawler program
CN111641664A (en) * 2019-03-01 2020-09-08 北京京东尚科信息技术有限公司 Crawler equipment service request method, device and system
CN111641664B (en) * 2019-03-01 2023-12-05 北京京东尚科信息技术有限公司 Crawler equipment service request method, device and system and storage medium
CN110062025A (en) * 2019-03-14 2019-07-26 深圳绿米联创科技有限公司 Method, apparatus, server and the storage medium of data acquisition

Similar Documents

Publication Publication Date Title
CN105740384A (en) Crawler agent automatic switching method and device
CN106933733B (en) Method and device for determining memory leak position
CN104219316A (en) Method and device for processing call request in distributed system
CN103034575B (en) Collapse analytical approach and device
CN105607986A (en) Acquisition method and device of user behavior log data
CN103049373B (en) A kind of localization method of collapse and device
CN107480260B (en) Big data real-time analysis method and device, computing equipment and computer storage medium
EP3179370A1 (en) Webpage automatic test method and apparatus
WO2020232887A1 (en) Configuration modification method and apparatus for container application, and computer device and storage medium
CN110502366A (en) Case executes method, apparatus, equipment and computer readable storage medium
CN111538883A (en) Data crawling method, system and equipment
CN111756573A (en) CTDB double-network-card fault monitoring method in distributed cluster and related equipment
CN104750536A (en) Virtual machine introspection (VMI) implementation method and device
WO2023055405A1 (en) Static and dynamic non-deterministic finite automata tree structure application apparatus and method
US9686310B2 (en) Method and apparatus for repairing a file
CN113918438A (en) Method and device for detecting server abnormality, server and storage medium
CN111444412B (en) Method and device for scheduling web crawler tasks
CN111026947B (en) Crawler method and embedded crawler implementation method based on browser
CN110764711B (en) IO data classification deleting method and device and computer readable storage medium
CN110784364B (en) Data monitoring method and device, storage medium and terminal
CN102917053B (en) A kind of method, apparatus and system for judging webpage urlrewriting
KR20210132545A (en) Apparatus and method for detecting abnormal behavior and system having the same
CN105740028A (en) Access control method and device
CN110597573A (en) Warehouse entry request data processing method and device
CN113850664A (en) Data anomaly detection method and data reporting service

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160706