CN105740384A - Crawler agent automatic switching method and device - Google Patents
- Publication number
- CN105740384A (application CN201610056419.8A)
- Authority
- CN
- China
- Prior art keywords
- target proxy
- candidate queue
- web page
- download
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention provides a method and a device for automatically switching crawler proxies. The method comprises the following steps: S1: selecting a target proxy from a candidate queue, where the candidate queue stores at least one proxy in the candidate state; S2: sending a crawl instruction to the target proxy so that the target proxy downloads a web page according to the crawl instruction; S3: determining whether the target proxy has successfully downloaded the current web page, and performing S4 if the download succeeded or S5 if the download failed; S4: continuing to use the target proxy to download the next web page, taking the next web page as the current web page, and performing S3; S5: placing the target proxy back into the candidate queue and performing S1. This scheme saves the time spent switching proxies and improves the efficiency of web page downloading.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and a device for automatically switching crawler proxies.
Background art
With the development of Internet technology, collecting Internet data has become an important part of how enterprises develop their big data business. Enterprises generally use crawler technology to collect Internet data. However, to limit the system load that crawlers impose, many websites adopt anti-crawler techniques that prevent crawlers from collecting data at high frequency.
At present, to cope with anti-crawler techniques, a process can send its requests to a proxy, and the proxy downloads the web page. The website then cannot detect the real machine that is collecting the pages.
In the prior art, the proxy is switched every time a web page is downloaded, in order to reduce the frequency at which any single proxy downloads pages. However, each proxy switch takes a relatively long time, which reduces the efficiency of Internet data collection.
Summary of the invention
Embodiments of the present invention provide a method and a device for automatically switching crawler proxies, so that the proxy is switched automatically when the current proxy fails to download a web page.
An embodiment of the present invention provides a method for automatically switching crawler proxies, including:
S1: selecting a target proxy from a candidate queue, where the candidate queue stores at least one proxy in the candidate state;
S2: sending a crawl instruction to the target proxy so that the target proxy downloads a web page according to the crawl instruction;
S3: determining whether the target proxy has successfully downloaded the current web page; if the download succeeded, performing S4; if the download failed, performing S5;
S4: continuing to use the target proxy to download the next web page, taking the next web page as the current web page, and performing S3;
S5: placing the target proxy into the candidate queue and performing S1.
Here,
selecting the target proxy from the candidate queue includes: taking the proxy at the head/tail of the candidate queue as the selected target proxy;
placing the target proxy into the candidate queue includes: placing the target proxy at the tail/head of the candidate queue.
Before the target proxy is placed into the candidate queue, the method further includes: decrementing the current attempt count of the target proxy by 1 and determining whether the decremented count is 0; if it is 0, setting the target proxy to the invalid state and discarding it; if it is not 0, placing the target proxy, with its decremented count, into the candidate queue.
Before the target proxy continues to be used to download the next web page, the method further includes: restoring the current attempt count of the target proxy to the maximum attempt count set in advance for the target proxy.
After the determination result indicates a download failure, the method further includes: determining whether the number of failed downloads of the current web page has reached the maximum failure count set in advance for the current web page, and if so, abandoning the download of the current web page.
An embodiment of the present invention further provides a device for automatically switching crawler proxies, including:
a selection unit, configured to select a target proxy from a candidate queue, where the candidate queue stores at least one proxy in the candidate state;
a sending unit, configured to send a crawl instruction to the target proxy so that the target proxy downloads a web page according to the crawl instruction;
a first judging unit, configured to determine whether the target proxy has successfully downloaded the current web page, to trigger the first processing unit if the download succeeded, and to trigger the placement unit if the download failed;
a first processing unit, configured to continue using the target proxy to download the next web page and to trigger the first judging unit with the next web page as the current web page;
a placement unit, configured to place the target proxy into the candidate queue and to trigger the selection unit.
Here,
the selection unit is specifically configured to take the proxy at the head/tail of the candidate queue as the selected target proxy;
the placement unit is specifically configured to place the target proxy at the tail/head of the candidate queue.
The device further includes: a second processing unit, configured to decrement the current attempt count of the target proxy by 1 and determine whether the decremented count is 0; if it is 0, to set the target proxy to the invalid state and discard it; if it is not 0, to have the target proxy, with its decremented count, placed into the candidate queue.
The device further includes: a recovery unit, configured to restore the current attempt count of the target proxy to the maximum attempt count set in advance for the target proxy.
The device further includes: a second judging unit, configured to determine whether the number of failed downloads of the current web page has reached the maximum failure count set in advance for the current web page, and if so, to abandon the download of the current web page.
With the method and device for automatically switching crawler proxies provided by the embodiments of the present invention, the proxy is switched automatically only when the target proxy fails to download the current web page. When the target proxy downloads the current web page successfully, the same target proxy continues to be used and no switch is performed, which saves the time spent switching proxies and improves the efficiency of web page downloading.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method provided by an embodiment of the present invention;
Fig. 3 is a hardware architecture diagram of the equipment in which the device provided by an embodiment of the present invention is located;
Fig. 4 is a schematic structural diagram of a device provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another device provided by an embodiment of the present invention.
Detailed description of the invention
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a method for automatically switching crawler proxies. The method may include the following steps:
Step 101: Select a target proxy from a candidate queue, where the candidate queue stores at least one proxy in the candidate state;
Step 102: Send a crawl instruction to the target proxy so that the target proxy downloads a web page according to the crawl instruction;
Step 103: Determine whether the target proxy has successfully downloaded the current web page; if the download succeeded, perform step 104; if the download failed, perform step 105;
Step 104: Continue to use the target proxy to download the next web page, take the next web page as the current web page, and perform step 103;
Step 105: Place the target proxy into the candidate queue and perform step 101.
As the above embodiment shows, the proxy is switched automatically only when the target proxy fails to download the current web page. When the target proxy downloads the current web page successfully, the same target proxy continues to be used and no switch is performed, which saves the time spent switching proxies and improves the efficiency of web page downloading.
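For illustration only, the loop of steps 101 to 105 might be sketched in Python as follows. The function name `crawl_with_switching`, the `download` callback (which stands in for sending the crawl instruction and reporting success or failure), and the `max_switches` safety bound are assumptions made for this sketch and do not appear in the original text.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable


def crawl_with_switching(
    pages: Iterable[str],
    candidates: Deque[str],
    download: Callable[[str, str], bool],
    max_switches: int = 100,  # safety bound for the sketch, not in the source
) -> Dict[str, str]:
    """Download pages, switching proxy only after a failed download."""
    fetched: Dict[str, str] = {}
    switches = 0
    proxy = candidates.popleft()            # step 101: select a target proxy
    for page in pages:
        # Steps 102-103: send the crawl instruction and check the result.
        while not download(proxy, page):
            candidates.append(proxy)        # step 105: put the proxy back
            switches += 1
            if switches > max_switches:
                raise RuntimeError("too many proxy switches")
            proxy = candidates.popleft()    # step 101 again
        fetched[page] = proxy               # step 104: keep the same proxy
    return fetched


# Hypothetical data: proxy "p1" cannot fetch page "b"; everything else works.
failing = {("p1", "b")}
queue = deque(["p1", "p2"])
result = crawl_with_switching(
    ["a", "b", "c"], queue, lambda proxy, page: (proxy, page) not in failing
)
```

In this hypothetical run, `p1` serves page `a`, fails on `b`, and is returned to the queue; `p2` then serves `b` and `c` without any further switch.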
In general, a process uses only one proxy at a time to download web pages. When that proxy fails to download the current web page, it must be placed back into the candidate queue, and the process simply selects another proxy from the candidate queue. A proxy that has been placed back into the candidate queue can be selected again by this process or by other processes.
The candidate queue may contain a large number of proxies in the candidate state, and because of how a queue works, a proxy can only be taken from the head or the tail of the queue and can only be placed at the head or the tail. Suppose, for example, that the target proxy is placed at the head of the candidate queue. Another process selecting a proxy may also take the proxy at the head, so the same target proxy is selected again. This makes the target proxy download web pages at high frequency, which easily triggers the website's anti-crawler mechanism. Therefore, an embodiment of the present invention may be defined as follows:
selecting the target proxy from the candidate queue includes: taking the proxy at the head/tail of the candidate queue as the selected target proxy;
placing the target proxy into the candidate queue includes: placing the target proxy at the tail/head of the candidate queue.
In this way, the target proxy is placed at the end of the candidate queue opposite to the end from which proxies are selected, which reduces the chance that the same proxy is selected many times in a row.
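The effect of selecting and placing at opposite ends can be seen in a small sketch. It assumes the candidate queue is held in a Python `collections.deque`; the proxy names and helper functions are hypothetical.

```python
from collections import deque


def select_from_head(queue: deque) -> str:
    """Take the proxy at the head of the candidate queue."""
    return queue.popleft()


def return_to_tail(queue: deque, proxy: str) -> None:
    """Place a proxy at the tail, the end opposite to selection."""
    queue.append(proxy)


# Opposite ends: the proxy that just failed is not the next one selected.
queue = deque(["proxy-a", "proxy-b", "proxy-c"])
first = select_from_head(queue)
return_to_tail(queue, first)
second = select_from_head(queue)

# Same end (the situation the text warns about): the proxy comes straight back.
same_end = deque(["proxy-a", "proxy-b", "proxy-c"])
picked = same_end.popleft()
same_end.appendleft(picked)
picked_again = same_end.popleft()
```

With opposite ends, `second` differs from `first` whenever the queue holds more than one proxy; with the same end, `picked_again` is always the proxy that was just returned.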
In an embodiment of the present invention, if the target proxy is selected consecutively by the same or different processes and the number of consecutive failed downloads reaches the maximum attempt count set for the target proxy, it is determined that the target proxy cannot download web pages, and the target proxy is set to the invalid state and discarded. The scheme therefore includes: before the target proxy is placed into the candidate queue, decrementing the current attempt count of the target proxy by 1 and determining whether the decremented count is 0; if it is 0, setting the target proxy to the invalid state and discarding it; if it is not 0, placing the target proxy, with its decremented count, into the candidate queue.
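A minimal sketch of this decrement-and-discard rule follows. The `Proxy` dataclass, its field names, the state strings, and the `handle_failure` helper are all assumptions introduced for illustration; the original text only specifies the counting behavior.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Proxy:
    address: str
    max_attempts: int
    attempts_left: int          # the "current attempt count" of the text
    state: str = "candidate"    # hypothetical labels: candidate / invalid


def handle_failure(proxy: Proxy, queue: deque) -> bool:
    """Apply the rule after a failed download; return True if requeued."""
    proxy.attempts_left -= 1                 # decrement by 1
    if proxy.attempts_left == 0:
        proxy.state = "invalid"              # invalid state, then discarded
        return False
    queue.append(proxy)                      # otherwise back into the queue
    return True


queue: deque = deque()
proxy = Proxy(address="10.0.0.1:8080", max_attempts=2, attempts_left=2)
requeued_first = handle_failure(proxy, queue)    # 2 -> 1, requeued
queue.popleft()                                  # selected again later
requeued_second = handle_failure(proxy, queue)   # 1 -> 0, discarded
```

Discarding here simply means the proxy is not put back into the queue, so no process can select it again.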
In an embodiment of the present invention, when the target proxy downloads the current web page successfully and its current attempt count is less than the maximum attempt count set for it, the success shows that the target proxy is able to download web pages, so its current attempt count can be restored to the maximum attempt count. The scheme may therefore include: before the target proxy continues to be used to download the next web page, restoring the current attempt count of the target proxy to the maximum attempt count set in advance for the target proxy.
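The restore-on-success rule means the counter tracks consecutive failures rather than total failures. A short sketch, using a hypothetical `ProxyCounter` class and `handle_success` helper that are not part of the original text:

```python
from dataclasses import dataclass


@dataclass
class ProxyCounter:
    max_attempts: int
    attempts_left: int


def handle_success(counter: ProxyCounter) -> None:
    """Restore the current attempt count after a successful download."""
    if counter.attempts_left < counter.max_attempts:
        counter.attempts_left = counter.max_attempts


counter = ProxyCounter(max_attempts=3, attempts_left=3)
counter.attempts_left -= 1      # first failed download
counter.attempts_left -= 1      # second failed download
before_success = counter.attempts_left
handle_success(counter)         # a success wipes out the earlier failures
after_success = counter.attempts_left
```

Because a single success resets the count, a proxy is discarded only after failing the maximum number of times in a row.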
In an embodiment of the present invention, if the number of failed downloads of the current web page reaches the maximum failure count set for the current web page, this shows that the current web page cannot be downloaded, so the download of the current web page can be abandoned. The process that was downloading the current web page can then, after selecting a new proxy, use the newly selected proxy to download the next web page.
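One possible way to keep the per-page failure count is sketched below. The `PageFailureTracker` class is hypothetical, and for brevity it applies a single limit to every page, whereas the text allows the limit to be set for each page individually.

```python
from collections import defaultdict


class PageFailureTracker:
    """Count failed downloads per page and report when to give up."""

    def __init__(self, max_failures: int) -> None:
        self.max_failures = max_failures      # same limit for all pages (assumption)
        self._failures = defaultdict(int)

    def record_failure(self, url: str) -> bool:
        """Record one failure; return True if the page should be abandoned."""
        self._failures[url] += 1
        return self._failures[url] >= self.max_failures


tracker = PageFailureTracker(max_failures=2)
first = tracker.record_failure("http://example.com/a")   # 1 of 2
second = tracker.record_failure("http://example.com/a")  # limit reached
other = tracker.record_failure("http://example.com/b")   # counted separately
```

Pages are counted independently, so one unreachable page does not affect the retry budget of the others.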
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 2, an embodiment of the present invention provides a method for automatically switching crawler proxies. The method may include the following steps:
Step 201: Filter the proxies in the initial state to obtain the proxies that can be used to download web pages.
A proxy is a server that can be used to download web pages. Proxies can be loaded into memory from a configuration file or a database, and a proxy that has been loaded into memory enters the initial state.
In the initial state, the system checks the initial availability of each proxy. The check may include whether the proxy's address responds to ping and whether the proxy can download a designated web page, for example the Baidu homepage. Invalid addresses are filtered out.
Step 202: Put the proxies that remain after filtering into the candidate queue, where each proxy carries the maximum attempt count set for it.
The maximum attempt count is the number of consecutive web page download failures permitted for the proxy. The maximum attempt count of each proxy in the candidate queue can be set by proxy type based on experience, or the same maximum attempt count can be set for all proxies; this embodiment does not limit this.
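Steps 201 and 202 might be sketched as follows. To keep the example self-contained and free of real network calls, the ping check and the designated-page check are passed in as callbacks; `build_candidate_queue`, `is_reachable`, `can_download`, the record layout, and the sample addresses are all assumptions for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Union

ProxyRecord = Dict[str, Union[str, int]]


def build_candidate_queue(
    addresses: Iterable[str],
    is_reachable: Callable[[str], bool],   # stands in for the ping check
    can_download: Callable[[str], bool],   # stands in for the designated-page check
    max_attempts: int = 3,                 # same value for all proxies (one option in the text)
) -> Deque[ProxyRecord]:
    """Filter proxies in the initial state and enqueue the usable ones."""
    queue: Deque[ProxyRecord] = deque()
    for address in addresses:
        # Step 201: drop proxies that fail either availability check.
        if not is_reachable(address) or not can_download(address):
            continue
        # Step 202: enqueue with the maximum attempt count attached.
        queue.append({
            "address": address,
            "max_attempts": max_attempts,
            "attempts_left": max_attempts,
            "state": "candidate",
        })
    return queue


loaded = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
queue = build_candidate_queue(
    loaded,
    is_reachable=lambda address: address != "10.0.0.2:8080",
    can_download=lambda address: address != "10.0.0.3:8080",
)
```

In a real deployment the two callbacks would perform the actual ping and test download; per-type maximum attempt counts could be supplied by replacing the single `max_attempts` argument with a lookup.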
Step 203: When the current process needs to crawl, select a target proxy from the candidate queue and send a crawl instruction to the target proxy.
When selecting the target proxy from the candidate queue, the proxy at the head of the candidate queue can be selected as the target proxy.
In an embodiment of the present invention, after a proxy has been acquired by a collection process, the proxy enters the in-use state. A proxy in the in-use state can be used by only one process. This prevents the proxy's load per unit time from increasing because several processes use the same proxy simultaneously, keeps the proxy from becoming unstable, and reduces the probability that the proxy is blocked.
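The exclusive in-use state could be approximated as shown below. This sketch uses threads inside one Python process and a `threading.Lock`; the text speaks of processes, for which the queue would have to live in shared storage, and that part is not shown. `ProxyPool`, `acquire`, and `release` are hypothetical names.

```python
import threading
from collections import deque
from typing import Iterable, Optional


class ProxyPool:
    """Candidate queue in which an acquired proxy is held by one worker only."""

    def __init__(self, addresses: Iterable[str]) -> None:
        self._candidates = deque(addresses)
        self._in_use = set()
        self._lock = threading.Lock()

    def acquire(self) -> Optional[str]:
        """Move the head proxy to the in-use state, or return None if empty."""
        with self._lock:
            if not self._candidates:
                return None
            proxy = self._candidates.popleft()
            self._in_use.add(proxy)
            return proxy

    def release(self, proxy: str) -> None:
        """Return a proxy to the tail of the candidate queue."""
        with self._lock:
            self._in_use.discard(proxy)
            self._candidates.append(proxy)


pool = ProxyPool(["proxy-a", "proxy-b"])
held_by_worker_1 = pool.acquire()
held_by_worker_2 = pool.acquire()
nothing_left = pool.acquire()        # both proxies are in use
pool.release(held_by_worker_1)
reacquired = pool.acquire()
```

Because a proxy leaves the candidate queue while it is in use, two workers can never hold the same proxy at the same time.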
Step 204: The target proxy downloads the web page according to the crawl instruction.
Step 205: Determine whether the target proxy has successfully downloaded the current web page. If so, perform step 206; otherwise, perform step 208.
In an embodiment of the present invention, after the current web page fails to download, the number of failed downloads of the current web page can also be checked. Therefore, if the target proxy fails to download the current web page, step 212 can also be performed.
Step 206: Check the current attempt count of the target proxy. If the current attempt count is less than the maximum attempt count set for the target proxy, restore the current attempt count to the maximum attempt count and perform step 207. If the current attempt count equals the maximum attempt count, perform step 207 directly.
Step 207: Continue to use the target proxy to download the next web page, take the next web page as the current web page, and perform step 205.
Step 208: Decrement the current attempt count of the target proxy by 1, and perform step 209.
Step 209: Determine whether the decremented current attempt count of the target proxy is 0. If it is 0, perform step 210; otherwise, perform step 211.
Step 210: Set the target proxy to the invalid state and discard it. The flow for this proxy ends.
Step 211: Place the target proxy, with its decremented count, into the candidate queue, and perform step 203.
When the target proxy is placed into the candidate queue, it can be placed at the tail of the candidate queue.
Step 212: Determine whether the number of failed downloads of the current web page has reached the maximum failure count set for the current web page. If so, perform step 213; otherwise, return to step 203 to select a target proxy and continue downloading the current web page.
Step 213: Abandon the download of the current web page, and return to step 203 to select a target proxy for downloading the next web page.
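Steps 203 to 213 can be combined into one loop, sketched below under several assumptions that the text leaves open: on a failed download both the proxy accounting (steps 208 to 211) and the page accounting (steps 212 and 213) are applied; after a proxy is discarded in step 210 the process selects another proxy if one remains; and a single `max_page_failures` limit applies to all pages. The names `Proxy`, `run_crawler`, and `download` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Iterable, List, Tuple


@dataclass
class Proxy:
    address: str
    max_attempts: int = 3
    attempts_left: int = 3


def run_crawler(
    pages: Iterable[str],
    queue: Deque[Proxy],
    download: Callable[[str, str], bool],
    max_page_failures: int = 2,
) -> Tuple[List[str], List[str]]:
    """Return (downloaded pages, abandoned pages)."""
    downloaded: List[str] = []
    abandoned: List[str] = []
    page_failures: Dict[str, int] = {}
    pending = deque(pages)
    proxy = queue.popleft() if queue else None            # step 203
    while pending and proxy is not None:
        page = pending[0]
        if download(proxy.address, page):                 # steps 204-205
            proxy.attempts_left = proxy.max_attempts      # step 206 (no-op if already at max)
            downloaded.append(pending.popleft())          # step 207
            continue
        proxy.attempts_left -= 1                          # step 208
        if proxy.attempts_left > 0:                       # step 209
            queue.append(proxy)                           # step 211: back to the tail
        # Otherwise step 210: the proxy is invalid and simply not requeued.
        page_failures[page] = page_failures.get(page, 0) + 1   # step 212
        if page_failures[page] >= max_page_failures:
            abandoned.append(pending.popleft())           # step 213
        proxy = queue.popleft() if queue else None        # back to step 203
    return downloaded, abandoned


# Hypothetical scenario: "p1" never works; "p2" works except for page "x".
def fake_download(address: str, page: str) -> bool:
    return address == "p2" and page != "x"


p1 = Proxy("p1", max_attempts=1, attempts_left=1)
p2 = Proxy("p2")
done, given_up = run_crawler(["a", "x", "b"], deque([p1, p2]), fake_download)
```

In this scenario `p1` is discarded after its single permitted failure, page `x` is abandoned after two failed downloads, and the later success on page `b` restores the attempt count of `p2` to its maximum.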
As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides a device for automatically switching crawler proxies. The device embodiment can be implemented in software, in hardware, or in a combination of the two. At the hardware level, Fig. 3 is a hardware architecture diagram of the equipment in which the device provided by the embodiment is located. In addition to the processor, memory, network interface, and non-volatile storage shown in Fig. 3, the equipment in which the device is located may also include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in Fig. 4, the device in the logical sense is formed when the CPU of the equipment reads the corresponding computer program instructions from non-volatile storage into memory and runs them. The device for automatically switching crawler proxies provided by this embodiment includes:
a selection unit 401, configured to select a target proxy from a candidate queue, where the candidate queue stores at least one proxy in the candidate state;
a sending unit 402, configured to send a crawl instruction to the target proxy so that the target proxy downloads a web page according to the crawl instruction;
a first judging unit 403, configured to determine whether the target proxy has successfully downloaded the current web page, to trigger the first processing unit 404 if the download succeeded, and to trigger the placement unit 405 if the download failed;
a first processing unit 404, configured to continue using the target proxy to download the next web page and to trigger the first judging unit 403 with the next web page as the current web page;
a placement unit 405, configured to place the target proxy into the candidate queue and to trigger the selection unit 401.
In an embodiment of the present invention, the selection unit 401 is specifically configured to take the proxy at the head/tail of the candidate queue as the selected target proxy;
the placement unit 405 is specifically configured to place the target proxy at the tail/head of the candidate queue.
In an embodiment of the present invention, referring to Fig. 5, the device may further include: a second processing unit 501, configured to decrement the current attempt count of the target proxy by 1 and determine whether the decremented count is 0; if it is 0, to set the target proxy to the invalid state and discard it; if it is not 0, to have the target proxy, with its decremented count, placed into the candidate queue.
In an embodiment of the present invention, referring to Fig. 5, the device may further include: a recovery unit 502, configured to restore the current attempt count of the target proxy to the maximum attempt count set in advance for the target proxy.
In an embodiment of the present invention, referring to Fig. 5, the device may further include: a second judging unit 503, configured to determine whether the number of failed downloads of the current web page has reached the maximum failure count set in advance for the current web page, and if so, to abandon the download of the current web page.
In summary, the embodiments of the present invention can achieve at least the following beneficial effects:
1. In the embodiments of the present invention, the proxy is switched automatically only when the target proxy fails to download the current web page. When the target proxy downloads the current web page successfully, the same target proxy continues to be used and no switch is performed, which saves the time spent switching proxies and improves the efficiency of web page downloading.
2. In the embodiments of the present invention, the proxy at the head/tail of the candidate queue is taken as the selected target proxy, and the target proxy is placed at the tail/head of the candidate queue. Because selection and placement use different ends, a proxy that has just been placed back into the candidate queue is not selected again immediately, which reduces the chance that the same proxy is selected many times in a row.
3. In the embodiments of the present invention, if the target proxy is selected consecutively by the same or different processes and the number of consecutive failed downloads reaches the maximum attempt count set for the target proxy, it is determined that the target proxy cannot download web pages, and the target proxy is set to the invalid state and discarded. This helps ensure that the proxies selected from the candidate queue are able to download web pages.
4. In the embodiments of the present invention, when the number of failed downloads of the current web page reaches the maximum failure count set for the current web page, this shows that the current web page cannot be downloaded, so its download can be abandoned. The process that was downloading the current web page can then, after selecting a new proxy, use the newly selected proxy to download the next web page, which improves the efficiency of web page downloading.
Because the information exchange between the units of the above device and their execution processes are based on the same concept as the method embodiments of the present invention, refer to the description of the method embodiments for details, which are not repeated here.
It should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to be non-exclusive, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element introduced by "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be completed by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above are only preferred embodiments of the present invention. They are intended only to illustrate the technical solutions of the present invention and not to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.
Claims (10)
1. a reptile acts on behalf of automatic switching method, it is characterised in that including:
S1: select target proxy from candidate queue, wherein, in described candidate queue, storage has at least one to be in the agency of candidate state;
S2: send reptile instruction to described target proxy, so that described target proxy carries out page download according to described reptile instruction;
S3: judge whether described target proxy is downloaded successfully to current web page, when judged result includes downloading successful, then performs S4;When judged result includes failed download, then perform S5;
S4: be continuing with described target proxy and perform the download to next webpage, next one webpage as current web page and is performed S3;
S5: described target proxy is placed in described candidate queue, and performs S1.
2. method according to claim 1, it is characterised in that
Described selection target proxy from candidate queue, including: will be located in the agency of described candidate queue header/trailer as the described target proxy selected;
Described described target proxy is placed in described candidate queue, including: described target proxy is placed into the afterbody/head of described candidate queue.
3. method according to claim 1, it is characterized in that, before described target proxy is placed in described candidate queue, farther include: the current number of attempt of described target proxy is performed to subtract 1 operation, and whether the current number of attempt of the described target proxy after the operation that judges to perform to subtract 1 is 0, if 0, then described target proxy it is set to disarmed state and abandons;If not 0, then perform the described target proxy after performing the operation that subtracts 1 described to be placed in described candidate queue by described target proxy.
4. method according to claim 1, it is characterized in that, described be continuing with the download that described target proxy performs next webpage before, farther include: the current number of attempt of described target proxy is returned to the maximum attempts arranged for described target proxy in advance.
5. The method according to any one of claims 1-4, characterized in that, after the determination result indicates a failed download, the method further comprises: determining whether the number of failed downloads of the current web page has reached the maximum number of failures set in advance for the current web page, and if so, abandoning the download of the current web page.
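The per-page failure limit of claim 5 guards against pages that no proxy can fetch. A minimal sketch, assuming a hypothetical helper `should_abandon` and a `Counter` keyed by URL:

```python
from collections import Counter


def should_abandon(failures: Counter, url: str, max_page_failures: int) -> bool:
    """Record one failed download of `url`; report whether to give up on it."""
    failures[url] += 1
    return failures[url] >= max_page_failures


page_failures = Counter()
first = should_abandon(page_failures, "http://example.com/x", 2)
second = should_abandon(page_failures, "http://example.com/x", 2)
```

With a limit of 2, the first failure keeps the page in play and the second one abandons it, regardless of which proxies were involved.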
6. A crawler proxy automatic switching device, characterized by comprising:
a selecting unit, configured to select a target proxy from a candidate queue, wherein the candidate queue stores at least one proxy in a candidate state;
a sending unit, configured to send a crawler instruction to the target proxy, so that the target proxy downloads web pages according to the crawler instruction;
a first judging unit, configured to determine whether the target proxy downloaded the current web page successfully; if the result indicates a successful download, to trigger the first processing unit to perform its operation; if the result indicates a failed download, to trigger the placing unit to perform its operation;
a first processing unit, configured to continue using the target proxy to download the next web page, take the next web page as the current web page, and trigger the first judging unit to perform its operation;
a placing unit, configured to place the target proxy into the candidate queue and trigger the selecting unit to perform its operation.
7. The crawler proxy automatic switching device according to claim 6, characterized in that
the selecting unit is specifically configured to take the proxy located at the head/tail of the candidate queue as the selected target proxy;
the placing unit is specifically configured to place the target proxy at the tail/head of the candidate queue.
8. The crawler proxy automatic switching device according to claim 6, characterized by further comprising: a second processing unit, configured to decrement the current attempt count of the target proxy by 1, and to determine whether the decremented attempt count is 0; if it is 0, to set the target proxy to an invalid state and discard it; if it is not 0, to place the target proxy, with its decremented attempt count, into the candidate queue.
9. The crawler proxy automatic switching device according to claim 6, characterized by further comprising: a recovery unit, configured to restore the current attempt count of the target proxy to the maximum number of attempts set in advance for the target proxy.
10. The crawler proxy automatic switching device according to any one of claims 6-9, characterized by further comprising: a second judging unit, configured to determine whether the number of failed downloads of the current web page has reached the maximum number of failures set in advance for the current web page, and if so, to abandon the download of the current web page.
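One way to picture how the units of claims 6 to 10 cooperate is to map each unit to a method of a single class. The sketch below is only an illustration under assumptions: the class `ProxySwitcher`, its method names, the `download(address, url)` callback, and the decision to mark pages as `None` when they are abandoned or when no proxy remains are all hypothetical.

```python
from collections import Counter, deque
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass
class Proxy:
    address: str
    max_attempts: int = 3
    attempts_left: int = 0
    state: str = "candidate"

    def __post_init__(self) -> None:
        self.attempts_left = self.max_attempts


class ProxySwitcher:
    def __init__(self, proxies: Iterable[Proxy],
                 download: Callable[[str, str], bool],
                 max_page_failures: int = 3) -> None:
        self.queue = deque(proxies)
        self.download = download            # hypothetical fetch callback
        self.max_page_failures = max_page_failures
        self.page_failures: Counter = Counter()

    def select(self) -> Optional[Proxy]:    # selecting unit
        return self.queue.popleft() if self.queue else None

    def place(self, proxy: Proxy) -> None:  # second processing + placing units
        proxy.attempts_left -= 1
        if proxy.attempts_left == 0:
            proxy.state = "invalid"         # discarded
        else:
            self.queue.append(proxy)

    def restore(self, proxy: Proxy) -> None:   # recovery unit
        proxy.attempts_left = proxy.max_attempts

    def abandon_page(self, url: str) -> bool:  # second judging unit
        self.page_failures[url] += 1
        return self.page_failures[url] >= self.max_page_failures

    def run(self, urls: Iterable[str]) -> Dict[str, Optional[str]]:
        """Map each URL to the proxy address that fetched it, or None."""
        results: Dict[str, Optional[str]] = {}
        pending = deque(urls)
        target = self.select()
        while pending and target is not None:
            url = pending[0]
            # sending unit + first judging unit
            if self.download(target.address, url):
                self.restore(target)        # first processing unit path
                results[url] = target.address
                pending.popleft()
            else:
                if self.abandon_page(url):
                    results[url] = None
                    pending.popleft()
                self.place(target)
                target = self.select()
        for url in pending:                 # no usable proxy left
            results.setdefault(url, None)
        return results
```

Unlike the bare S1 to S5 loop, this version always terminates: each failure either consumes part of a proxy's attempt budget or pushes a page toward its failure limit.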
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610056419.8A CN105740384A (en) | 2016-01-27 | 2016-01-27 | Crawler agent automatic switching method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105740384A (en) | 2016-07-06 |
Family
ID=56247252
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610056419.8A Pending CN105740384A (en) | 2016-01-27 | 2016-01-27 | Crawler agent automatic switching method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105740384A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107169006A (en) * | 2017-03-31 | 2017-09-15 | 北京奇艺世纪科技有限公司 | Method and device for managing crawler proxies |
CN107832355A (en) * | 2017-10-23 | 2018-03-23 | 北京金堤科技有限公司 | Method and device for obtaining proxies for a crawler program |
CN107957999A (en) * | 2016-10-14 | 2018-04-24 | 北京国双科技有限公司 | Method and device for a web crawler to obtain website data |
CN110062025A (en) * | 2019-03-14 | 2019-07-26 | 深圳绿米联创科技有限公司 | Data acquisition method, apparatus, server, and storage medium |
CN111641664A (en) * | 2019-03-01 | 2020-09-08 | 北京京东尚科信息技术有限公司 | Crawler equipment service request method, device and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2003005240A1 (en) * | 2001-07-03 | 2003-01-16 | Wide Computing As | Apparatus for searching on internet |
CN103902386A (en) * | 2014-04-11 | 2014-07-02 | 复旦大学 | Multi-thread network crawler processing method based on connection proxy optimal management |
CN103914568A (en) * | 2014-04-24 | 2014-07-09 | 厦门市美亚柏科信息股份有限公司 | Method and device for dispatching HTTP proxy |
Non-Patent Citations (2)
Title |
---|
付华峥等: "《分布式大数据采集关键技术研究与实现》", 《广东通信技术》 * |
王星等: "《基于移动代理(Agent)的智能爬虫系统的设计与实现》", 《科技资讯》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107957999A (en) * | 2016-10-14 | 2018-04-24 | 北京国双科技有限公司 | Method and device for a web crawler to obtain website data |
CN107169006A (en) * | 2017-03-31 | 2017-09-15 | 北京奇艺世纪科技有限公司 | Method and device for managing crawler proxies |
CN107832355A (en) * | 2017-10-23 | 2018-03-23 | 北京金堤科技有限公司 | Method and device for obtaining proxies for a crawler program |
CN107832355B (en) * | 2017-10-23 | 2019-03-26 | 北京金堤科技有限公司 | Method and device for obtaining proxies for a crawler program |
CN111641664A (en) * | 2019-03-01 | 2020-09-08 | 北京京东尚科信息技术有限公司 | Crawler equipment service request method, device and system |
CN111641664B (en) * | 2019-03-01 | 2023-12-05 | 北京京东尚科信息技术有限公司 | Crawler equipment service request method, device and system and storage medium |
CN110062025A (en) * | 2019-03-14 | 2019-07-26 | 深圳绿米联创科技有限公司 | Data acquisition method, apparatus, server, and storage medium |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN105740384A (en) | Crawler agent automatic switching method and device | |
CN106933733B (en) | Method and device for determining memory leak position | |
CN104219316A (en) | Method and device for processing call request in distributed system | |
CN103034575B (en) | Collapse analytical approach and device | |
CN105607986A (en) | Acquisition method and device of user behavior log data | |
CN103049373B (en) | A kind of localization method of collapse and device | |
CN107480260B (en) | Big data real-time analysis method and device, computing equipment and computer storage medium | |
EP3179370A1 (en) | Webpage automatic test method and apparatus | |
WO2020232887A1 (en) | Configuration modification method and apparatus for container application, and computer device and storage medium | |
CN110502366A (en) | Case executes method, apparatus, equipment and computer readable storage medium | |
CN111538883A (en) | Data crawling method, system and equipment | |
CN111756573A (en) | CTDB double-network-card fault monitoring method in distributed cluster and related equipment | |
CN104750536A (en) | Virtual machine introspection (VMI) implementation method and device | |
WO2023055405A1 (en) | Static and dynamic non-deterministic finite automata tree structure application apparatus and method | |
US9686310B2 (en) | Method and apparatus for repairing a file | |
CN113918438A (en) | Method and device for detecting server abnormality, server and storage medium | |
CN111444412B (en) | Method and device for scheduling web crawler tasks | |
CN111026947B (en) | Crawler method and embedded crawler implementation method based on browser | |
CN110764711B (en) | IO data classification deleting method and device and computer readable storage medium | |
CN110784364B (en) | Data monitoring method and device, storage medium and terminal | |
CN102917053B (en) | A kind of method, apparatus and system for judging webpage urlrewriting | |
KR20210132545A (en) | Apparatus and method for detecting abnormal behavior and system having the same | |
CN105740028A (en) | Access control method and device | |
CN110597573A (en) | Warehouse entry request data processing method and device | |
CN113850664A (en) | Data anomaly detection method and data reporting service |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160706 |