CN109508422A

CN109508422A - Highly anonymous crawler system with multithreaded intelligent scheduling

Info

Publication number: CN109508422A
Application number: CN201811481201.2A
Authority: CN
Inventors: 汪云霄; 朱弘扬; 徐惟康; 刘峥
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2018-12-05
Filing date: 2018-12-05
Publication date: 2019-03-22

Abstract

The present invention provides a highly anonymous crawler system with multithreaded intelligent scheduling. It mainly comprises six modules: a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multithreaded crawler module, a task queue generation module, and a background management module. The modules are interconnected and cooperate with one another, which improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently and a large retrieval library can be built.

Description

Highly anonymous crawler system with multithreaded intelligent scheduling

Technical field

The present invention relates to a highly anonymous crawler system with multithreaded intelligent scheduling and belongs to the technical field of computer networks.

Background technique

Human society has entered the era of big data. With the rapid development of the Internet, the mobile Internet, social networks, and the like, big data that is huge in volume, diverse in type, and generated and updated anytime and anywhere contains unprecedented social and commercial value. The collection, processing, and analysis of big data, and intelligent applications built on it, have become key to improving the future competitiveness of enterprises.

A web crawler is an efficient tool for information collection; with it, the various data resources we want can be collected quickly and accurately. A traditional web crawler is easily blocked when a website has some "anti-crawling" strategy. In particular, when crawling websites that require login, such as GitHub or Weibo, some pages and some interfaces can be accessed without logging in, but crawling without logging in has several drawbacks: first, pages that require login permission cannot be crawled normally; second, frequent requests made without logging in are easily rate-limited by the website, or the IP is blocked outright; third, frequent or regularly patterned requests from a single account can be identified by the website as a crawler script, causing the account to be banned. Therefore, to collect the required data accurately and efficiently, targeted countermeasures are needed.

In view of this, it is necessary to provide a highly anonymous crawler system with multithreaded intelligent scheduling to solve the above problems.

Summary of the invention

The object of the present invention is to provide a highly anonymous crawler system with multithreaded intelligent scheduling that can crawl efficiently and improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information is aggregated quickly and efficiently and a large retrieval library is built.

To achieve the above object, the present invention provides a highly anonymous crawler system with multithreaded intelligent scheduling, comprising:

a proxy IP pool module, for obtaining valid proxy IPs;

a Cookies pool module, for obtaining valid Cookies information;

a resource scheduling module, connected to the proxy IP pool module and the Cookies pool module respectively, for controlling in real time the intelligent scheduling of the proxy IP pool module and the Cookies pool module, so as to schedule the proxy IP with the highest priority value and the newest Cookie information;

a multithreaded crawler module, connected to the resource scheduling module, for sending resource requests to the resource scheduling module and receiving the valid proxy IP and Cookie information issued by the resource scheduling module;

a task queue generation module, connected to the multithreaded crawler module, so that the multithreaded crawler module randomly selects a valid URL from the URL task queue to crawl the target website;

a background management module, comprising a database update module and a background control module, the background control module being connected to the multithreaded crawler module to monitor whether the crawler threads exist and whether the number of crawler threads reaches a threshold number.

As a further improvement of the present invention, the construction of the proxy IP pool module mainly comprises the following steps:

S1, obtaining a number of proxy IPs to form a proxy IP source queue;

S2, testing each proxy IP in the proxy IP source queue, and identifying and storing the valid proxy IPs;

S3, calculating the priority value of each valid proxy IP according to its access success rate and access response time when accessing the target website;

S4, encapsulating a web application interface and choosing the proxy IP with the highest priority value as the request IP;

S5, dynamically updating the proxy IP pool module at a preset time interval.

As a further improvement of the present invention, step S1 is specifically: periodically crawling a number of proxy IPs from multiple proxy IP websites, or purchasing multiple proxy IPs from a third-party service provider, to form the proxy IP source queue.

As a further improvement of the present invention, the priority value of a valid proxy IP in step S3 is calculated as $\mathrm{Priority}_i = 0.7\,x_{i1}^{*} + 0.3\,x_{i2}^{*}$, $i = 1, 2, \ldots, n$, with $x_{i1}^{*} = (x_{i1} - \min x_1)/(\max x_1 - \min x_1)$ and $x_{i2}^{*} = (x_{i2} - \min x_2)/(\max x_2 - \min x_2)$, where $n$ is the number of proxy IPs; $x_{i1}$ and $x_{i2}$ are, respectively, the access success rate and the reciprocal of the access response time of the $i$-th proxy IP, and $x_{i1}^{*}$ and $x_{i2}^{*}$ are their normalized values; $\max x_1$ and $\max x_2$ are, respectively, the maximum access success rate and the maximum reciprocal of the access response time in the current proxy IP pool module; and $\min x_1$ and $\min x_2$ are the corresponding minimum values.

As a further improvement of the present invention, the construction of the Cookies pool module mainly comprises the following steps:

A, establishing an account pool for the target website;

B, at a preset time interval, randomly selecting a valid account, simulating login to the target website, and using a deep learning method to recognize the login verification code (CAPTCHA), so as to obtain Cookies;

C, storing the valid Cookies in the Cookies pool module;

D, periodically checking the validity of each Cookie in the Cookies pool module and deleting invalid Cookies from the Cookies pool module.

As a further improvement of the present invention, in step A the account pool includes a valid pool for storing valid account data and an expired pool for storing expired account data. The valid pool is provided with a threshold marking its minimum capacity; when the amount of account data in the valid pool falls below the threshold, account data is automatically obtained from the expired pool, so that the amount of account data in the valid pool is kept at no less than twice the threshold.

As a further improvement of the present invention, the URL task queue is constructed by the task queue generation module according to information such as the page structure, rendering mode, and page content of the target website.

As a further improvement of the present invention, the multithreaded crawler module starts multiple crawler threads simultaneously to crawl. The multithreaded crawler module includes a message queue module, which consists of the exception messages returned by the crawler threads while crawling; the message queue module sends the exception messages to the resource scheduling module and sends resource requests to the resource scheduling module.

As a further improvement of the present invention, a message queue listening thread is provided in the resource scheduling module to monitor in real time the messages sent by the message queue module and to schedule resources according to the message content; the resource scheduling module provides valid proxy IPs and Cookie information to the multithreaded crawler module on a first-in, first-out basis.

As a further improvement of the present invention, the database update module receives the normal information returned by the crawler threads while crawling and stores it in a relational database.

The beneficial effects of the present invention are as follows: the highly anonymous crawler system with multithreaded intelligent scheduling is provided with six modules, namely a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multithreaded crawler module, a task queue generation module, and a background management module, and the modules are interconnected and cooperate with one another. This improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information is aggregated quickly and efficiently and a large retrieval library is built.

Brief description of the drawings

Fig. 1 is a structural and functional diagram of the highly anonymous crawler system with multithreaded intelligent scheduling of the present invention.

Fig. 2 is a flow chart of the construction of the proxy IP pool module in Fig. 1.

Fig. 3 is a flow chart of the construction of the Cookies pool module in Fig. 1.

Fig. 4 is a structural schematic diagram of the resource scheduling module in Fig. 1.

Specific embodiment

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.

Referring to Fig. 1, the present invention discloses a highly anonymous crawler system with multithreaded intelligent scheduling. It is used to crawl efficiently when the target website has some "anti-crawling" strategy, improving the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information is aggregated quickly and efficiently and a large retrieval library is built.

The highly anonymous crawler system with multithreaded intelligent scheduling mainly comprises the following six modules: a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multithreaded crawler module, a task queue generation module, and a background management module.

The proxy IP pool module obtains valid proxy IPs. The Cookies pool module obtains valid Cookies information. The resource scheduling module is connected to the proxy IP pool module and the Cookies pool module respectively and controls their intelligent scheduling in real time, so as to schedule the proxy IP with the highest priority value and the newest Cookie information.

The multithreaded crawler module is connected to the resource scheduling module; it sends resource requests to the resource scheduling module and receives the valid proxy IP and Cookie information issued by the resource scheduling module.

The task queue generation module is connected to the multithreaded crawler module, so that the multithreaded crawler module randomly selects a valid URL from the URL task queue to crawl the target website. The URL task queue is constructed by the task queue generation module according to information such as the page structure, rendering mode, and page content of the target website.

The background management module comprises a database update module and a background control module. The background control module is connected to the multithreaded crawler module to monitor whether the crawler threads exist and whether the number of crawler threads reaches a threshold number.

The six modules are described in detail below.

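Referring to Fig. 2 in conjunction with Fig. 1, the construction of the proxy IP pool module mainly comprises the following five steps:

S1, obtaining a number of proxy IPs to form a proxy IP source queue;

S2, testing each proxy IP in the proxy IP source queue, and identifying and storing the valid proxy IPs;

S3, calculating the priority value of each valid proxy IP according to its access success rate and access response time when accessing the target website;

S4, encapsulating a web application interface and choosing the proxy IP with the highest priority value as the request IP;

S5, dynamically updating the proxy IP pool module at a preset time interval.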

Step S1 is specifically: periodically crawling a number of proxy IPs from multiple proxy IP websites, or purchasing multiple proxy IPs from a third-party service provider, to form the proxy IP source queue. Preferably, to ensure the validity of the proxy IPs, they should be obtained from different sources, high-anonymity proxies should be crawled wherever possible, and the proxy IPs obtained are added to the proxy IP source queue.

Step S2 is specifically: the maximum capacity of the proxy IP pool module is preset to T; the proxy IP source queue is traversed and each proxy IP in the queue is tested to determine whether it is valid, and the valid proxy IPs are stored in the proxy IP pool module. Specifically, each proxy IP is used to access the target website. If the access succeeds, the proxy IP is determined to be valid and is put into the database of the proxy IP pool module; if the access fails, the proxy IP is determined to be invalid and is not put into the proxy IP pool module.
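By way of non-limiting illustration, the traversal in step S2 can be sketched in Python as follows. The function name `build_pool`, the injected `is_valid` check (which in practice would attempt an access to the target website through the proxy), and the choice to stop once the capacity T is reached are assumptions made for this sketch and are not specified in the disclosure.

```python
from typing import Callable, Iterable, List


def build_pool(source_queue: Iterable[str],
               is_valid: Callable[[str], bool],
               capacity: int) -> List[str]:
    """Traverse the source queue and keep the proxies that pass the check.

    `is_valid` stands in for "access the target website through this proxy";
    it is injected so that the sketch does not depend on any network library.
    Stopping at `capacity` (the preset maximum T) is an assumption.
    """
    pool: List[str] = []
    for proxy in source_queue:
        if len(pool) >= capacity:
            break
        if is_valid(proxy):
            pool.append(proxy)
    return pool
```

Injecting the check keeps the pool-building logic separate from the HTTP client, so the same traversal can be reused for the periodic re-detection described in the later steps.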

In step S3, to maintain the validity of the proxy IP pool module, the proxy IPs in the database are tested at a preset interval, with a link to the target website used as the test link. So that the most effective proxy IP, rather than a randomly selected one, is used each time during crawling, the priority value of each valid proxy IP is calculated in every test round from its access success rate and access response time when accessing the target website. Here, the access success rate is the ratio of successful accesses to total accesses within a preset time; the access response time is the time from issuing a request to receiving the response from the target website; and the priority value reflects how likely each proxy IP is to be scheduled.

The priority value is calculated as follows:

1. Making the indicators consistent in direction. The access success rate is a positive indicator of the priority value: the higher the access success rate, the more effective the proxy IP and the larger the priority value. The access response time is a negative indicator: the shorter the access response time, the more effective the proxy IP and the larger the priority value. The reciprocal of the access response time is therefore taken as a new indicator in place of the access response time, so that both indicators are positive indicators of the proxy IP priority value. They are denoted $x_1$ and $x_2$ respectively.

2. Normalizing the indicator data. To eliminate the effect of differing units on the priority value, 0-1 (min-max) normalization is used to map each of the two indicators into the interval [0, 1], that is:

$$x_{i1}^{*} = \frac{x_{i1} - \min x_1}{\max x_1 - \min x_1}, \qquad x_{i2}^{*} = \frac{x_{i2} - \min x_2}{\max x_2 - \min x_2}$$

where $i = 1, 2, \ldots, n$ and $n$ is the number of proxy IPs; $x_{i1}$ and $x_{i2}$ are, respectively, the access success rate and the reciprocal of the access response time of the $i$-th proxy IP; $\max x_1$ and $\max x_2$ are, respectively, the maximum access success rate and the maximum reciprocal of the access response time in the current proxy IP pool module; and $\min x_1$ and $\min x_2$ are the corresponding minimum values.

3. Weighting. Since the access success rate is more important than the access response time, the weights of the two indicators are taken as 0.7 and 0.3 respectively. The priority value of each valid proxy IP in the current proxy IP pool module can therefore be defined as $\mathrm{Priority}_i = 0.7\,x_{i1}^{*} + 0.3\,x_{i2}^{*}$, $i = 1, 2, \ldots, n$, where $n$ is the number of proxy IPs.
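The three calculation steps above can be sketched in Python as follows. This is a minimal sketch: the function names, the input layout (a mapping from proxy to success rate and response time in seconds), and the handling of the degenerate case in which all proxies share the same indicator value (where the formula would divide by zero) are assumptions made here, not part of the disclosure.

```python
from typing import Dict, List, Tuple


def min_max(values: List[float]) -> List[float]:
    """0-1 normalization: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: when all values are equal the formula is undefined,
        # so every proxy gets 0.0 for this indicator.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priorities(stats: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """`stats` maps proxy -> (access success rate, access response time)."""
    proxies = list(stats)
    x1 = [stats[p][0] for p in proxies]        # positive indicator as-is
    x2 = [1.0 / stats[p][1] for p in proxies]  # reciprocal of response time
    n1, n2 = min_max(x1), min_max(x2)
    return {p: 0.7 * a + 0.3 * b for p, a, b in zip(proxies, n1, n2)}
```

With such a mapping, the request IP chosen in step S4 would be `max(result, key=result.get)`, that is, the proxy with the highest priority value.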

In step S4, obtaining proxy IP data by connecting directly to the database requires configuring connection information, which can easily expose the database connection details. Valid proxy IPs are therefore obtained through an encapsulated web application interface (Web API) for accessing the proxy IP pool module. In addition, to ensure that the selected proxy IP is highly effective, the proxy IP with the highest priority value is chosen from the proxy IP pool module each time as the request IP (that is, the crawler's proxy IP).

In step S5, to ensure that the proxy IPs in the proxy IP pool module remain highly effective, the proxy IP pool module is updated once every preset time T₀. Specifically, every preset time T₀ a proxy IP source queue is obtained, consisting of proxy IPs crawled from multiple proxy IP websites, proxy IPs purchased from a third-party service provider, and the proxy IPs already in the proxy IP pool. Each proxy IP in the queue is tested and invalid proxy IPs are rejected; for each valid proxy IP, the priority value is calculated according to step S3. If the current proxy IP pool module is filled to less than 2/3 of its capacity, valid proxy IPs are added; otherwise, proxy IPs with lower priority values are replaced, so as to build an efficient proxy IP pool module.
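One possible reading of the update rule in step S5 is sketched below in Python. The disclosure does not state how many proxies are added or which ones are replaced, so the details here are assumptions: `scored` is taken to hold every proxy in the new source queue that passed validation together with its freshly calculated priority value, newcomers are considered from highest to lowest priority, and a newcomer replaces the lowest-priority member only if its own priority value is higher.

```python
from typing import Dict


def refresh_pool(pool: Dict[str, float],
                 scored: Dict[str, float],
                 capacity: int) -> Dict[str, float]:
    """Return the updated pool (proxy -> priority value).

    pool:     current pool members with their previous priority values
    scored:   proxies that passed re-validation, with new priority values
    capacity: preset maximum capacity T of the pool
    """
    # Members that failed re-validation are rejected; survivors are re-scored.
    new_pool = {p: scored[p] for p in pool if p in scored}
    newcomers = sorted((p for p in scored if p not in new_pool),
                       key=scored.get, reverse=True)
    for proxy in newcomers:
        if len(new_pool) < capacity * 2 / 3:
            new_pool[proxy] = scored[proxy]      # below 2/3 full: add
        elif new_pool:
            worst = min(new_pool, key=new_pool.get)
            if scored[proxy] > new_pool[worst]:  # otherwise: replace the worst
                del new_pool[worst]
                new_pool[proxy] = scored[proxy]
    return new_pool
```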

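Referring to Fig. 3 in conjunction with Fig. 1, the construction of the Cookies pool module mainly comprises the following four steps:

A, establishing an account pool for the target website;

B, at a preset time interval, randomly selecting a valid account, simulating login to the target website, and using a deep learning method to recognize the login verification code (CAPTCHA), so as to obtain Cookies;

C, storing the valid Cookies in the Cookies pool module;

D, periodically checking the validity of each Cookie in the Cookies pool module and deleting invalid Cookies from the Cookies pool module.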

In step A, the account pool includes a valid pool for storing valid account data and an expired pool for storing expired account data. The valid pool is provided with a threshold marking its minimum capacity; when the amount of account data in the valid pool falls below the threshold, account data is automatically obtained from the expired pool, so that the amount of account data in the valid pool is kept at no less than twice the threshold. In addition, to ensure the validity of the account data in the account pool and the dynamic updating of the Cookies pool module, the account data in the account pool is replaced once every certain period.
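The two-pool structure of step A can be illustrated with the following Python sketch. The class and method names are hypothetical, accounts are reduced to a username-to-password mapping, and the expiry times mentioned for the expired pool in the later steps are deliberately left out; refilling up to twice the threshold follows the text above, but stops early if the expired pool runs out.

```python
import random
from typing import Dict, Optional, Tuple


class AccountPool:
    """Valid pool plus expired pool, with a minimum-capacity threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.valid: Dict[str, str] = {}    # username -> password
        self.expired: Dict[str, str] = {}  # accounts whose login failed

    def add(self, username: str, password: str) -> None:
        self.valid[username] = password

    def mark_failed(self, username: str) -> None:
        """Move an account whose login failed into the expired pool."""
        password = self.valid.pop(username, None)
        if password is not None:
            self.expired[username] = password

    def replenish(self) -> None:
        """Below the threshold, pull accounts back until 2x the threshold."""
        if len(self.valid) >= self.threshold:
            return
        while self.expired and len(self.valid) < 2 * self.threshold:
            username, password = self.expired.popitem()
            self.valid[username] = password

    def pick(self) -> Optional[Tuple[str, str]]:
        """Randomly select a valid account, refilling first if needed."""
        self.replenish()
        if not self.valid:
            return None
        username = random.choice(list(self.valid))
        return username, self.valid[username]
```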

Step B is specifically: at a preset interval, multiple account-fetching threads are started. Each randomly takes one account, including its username and password, from the valid pool and then simulates login to the target website, using a suitably designed deep learning method to recognize the login verification code automatically. If the login succeeds, the Cookies information obtained is returned to the storage module for storage. If the login fails, another account is taken at random from the valid pool to continue the simulated login request, and the account whose login failed is moved into the expired pool with an expiry time set.

Step C is specifically: the valid Cookies information obtained in step B is stored in the Cookies pool module together with the username and password of the account, and accounts in the expired pool that have reached their expiry time are moved back into the valid pool, so that account resources are used effectively. To facilitate operations such as queries and access, some data access interfaces should also be provided.

Step D is specifically: the validity of each piece of Cookie information in the Cookies pool module is checked at a preset interval, and any invalid Cookie is deleted from the Cookies pool module. Specifically, a timed detection module can be added and a corresponding test link set; all Cookies are traversed and used to request the test link. Invalid Cookies are deleted, and valid ones are retained as Cookies available for use.
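The periodic check of step D can be sketched in Python as follows. The layout of the Cookies pool (a mapping from username to a cookie dictionary) and the injected `is_alive` check, which stands in for requesting the test link with the given Cookies, are assumptions made for this sketch.

```python
from typing import Callable, Dict, List


def purge_invalid(cookie_pool: Dict[str, dict],
                  is_alive: Callable[[dict], bool]) -> List[str]:
    """Delete invalid Cookies in place and return the affected usernames.

    `is_alive` represents "request the test link with these Cookies and
    report whether the session is still accepted".
    """
    removed = [user for user, cookies in cookie_pool.items()
               if not is_alive(cookies)]
    for user in removed:
        del cookie_pool[user]
    return removed
```

Collecting the usernames first and deleting afterwards avoids modifying the dictionary while it is being iterated.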

The task queue generation module is responsible for building the URL task queue to be crawled according to information such as the page structure, rendering mode, and page content of the target website. In general, the URLs of a website follow certain rules, and the URLs of the pages to be crawled can be "spliced" together in advance according to these rules. For example, the search results generated by the Baidu search engine span many pages; the URLs of page i and page i+1 are largely identical and differ only in the pn parameter, which controls the page: the pn value of page i is i*10, and the pn value of page i+1 is (i+1)*10. Other kinds of web pages follow similar rules.

Therefore, according to such rules, the task queue generation module can set in advance the URL pattern of the pages to be crawled and the number of pages of the website to be crawled, then generate a URL task queue with a loop, write it to the database, and set the crawl state of the URL tasks to "not crawled".
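As an illustration of generating the URL task queue with a loop, a minimal Python sketch follows. Only the page parameter pn and its value i*10 for page i come from the description; the base URL `https://example.com/s`, the query parameter name `q`, the task dictionary layout, and the status string are hypothetical, and writing the tasks to the database is omitted.

```python
from typing import Dict, List
from urllib.parse import urlencode


def build_url_tasks(base_url: str, keyword: str, pages: int,
                    page_size: int = 10) -> List[Dict[str, str]]:
    """Splice one URL per result page and mark each task as not crawled."""
    tasks = []
    for i in range(1, pages + 1):
        # Page i uses pn = i * 10, following the rule stated in the text.
        query = urlencode({"q": keyword, "pn": i * page_size})
        tasks.append({"url": f"{base_url}?{query}", "status": "not_crawled"})
    return tasks
```

A crawler thread could then choose a task with `random.choice` among the entries whose status is still `not_crawled`, matching the random selection described for the multithreaded crawler module.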

The multithreaded crawler module starts multiple crawler threads simultaneously to crawl the target website and receives the crawl information returned by the crawler threads from the target website. The multithreaded crawler module includes a message queue module, which consists of exception messages formed by encapsulating the abnormal information found in the crawl information returned by the crawler threads. The message queue module sends the exception messages to the resource scheduling module and sends resource requests to the resource scheduling module. The resource scheduling module provides valid proxy IPs and Cookie information to the multithreaded crawler module according to the content of the exception messages and on a first-in, first-out basis.

Specifically, the multithreaded crawler module receives the crawl information returned by the crawler threads and encapsulates the abnormal information in it into exception messages, forming a message queue. The message queue then sends resource requests to the resource scheduling module. On a first-in, first-out (FIFO) basis, the resource scheduling module each time provides a valid proxy IP and Cookie information to the crawler thread at the head of the queue, according to the scheduling strategy and the specific content of the exception message. After obtaining the resources, each crawler thread randomly selects a valid URL from the URL task queue to crawl the target website and performs a self-check on the crawl information returned by the target website. If the returned crawl information is normal, the thread continues to crawl with the same resource information and sends the normal crawl information to the database update module. If the returned crawl information is abnormal, the abnormal information is encapsulated into an exception message and added to the message queue, requesting that the resource scheduling module replace the proxy IP or Cookie information so that crawling can be retried.

The self-check performed by the multithreaded crawler module covers two points. The first is whether the request was a correct network request, which is determined by checking the server's response status code. The second is whether the request was identified by the server as a crawler request; this is mainly detected by checking whether the returned crawl information contains certain keywords, for example "verification code" or "requests too frequent".
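The two self-check points can be sketched in Python as follows. Treating HTTP status 200 as the only "correct" response, the English keyword strings, and the three result labels are assumptions made for this sketch; a real deployment would use whatever markers the target website actually returns.

```python
from typing import Sequence

# Illustrative keywords only; the actual markers depend on the target site.
BLOCK_KEYWORDS = ("verification code", "requests too frequent")


def self_check(status_code: int, body: str,
               keywords: Sequence[str] = BLOCK_KEYWORDS) -> str:
    """Classify one crawl result as normal or as one of two abnormal kinds."""
    if status_code != 200:
        # First check: the response status code of the server.
        return "bad_status"
    if any(keyword in body for keyword in keywords):
        # Second check: the page suggests the request was flagged as a crawler.
        return "flagged_as_crawler"
    return "normal"
```

Any result other than `normal` would be encapsulated into an exception message and added to the message queue.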

It can be seen that the multithreaded crawler module crawls the target website with multiple threads and, combined with the message queue mechanism, improves crawling efficiency and enhances the flexibility and robustness of the crawler.

The resource scheduling module controls in real time the intelligent scheduling of the proxy IP pool module and the Cookies pool module.

Referring to Fig. 4 in conjunction with Fig. 1, a message queue listening thread is provided in the resource scheduling module to monitor in real time the messages sent by the message queue module and to schedule resources according to the message content. Specifically, if the message queue listening thread detects that the abnormal information returned by a crawler thread indicates a failure, it determines from the failure information whether the proxy IP or the Cookie has failed. If the proxy IP has failed, the priority value scheduling algorithm is used to choose the proxy IP with the highest priority value to replace the current crawler IP; if the Cookie has failed, the newest Cookie is chosen as the replacement. The message queue listening thread consists of one master thread and multiple slave threads; when the background listening service is started, a slave thread can take over if the master thread cannot work.
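The dispatch performed by the listening thread can be illustrated with the following Python sketch, which processes exception messages in first-in, first-out order using the standard `queue` module. The message fields (`thread`, `failure`), the cookie record layout (`value`, `obtained_at`), and the function names are hypothetical; the master/slave arrangement of listening threads is not modeled.

```python
import queue
from typing import Dict, List


def schedule(message: dict, proxy_priority: Dict[str, float],
             cookies: List[dict]) -> dict:
    """Choose a replacement resource for one exception message."""
    reply = {"thread": message["thread"]}
    if message["failure"] == "proxy":
        # Proxy failed: take the proxy with the highest priority value.
        reply["proxy"] = max(proxy_priority, key=proxy_priority.get)
    elif message["failure"] == "cookie":
        # Cookie failed: take the most recently obtained Cookie.
        reply["cookie"] = max(cookies, key=lambda c: c["obtained_at"])["value"]
    else:
        raise ValueError(f"unknown failure type: {message['failure']}")
    return reply


def drain(messages: "queue.Queue[dict]", proxy_priority: Dict[str, float],
          cookies: List[dict]) -> List[dict]:
    """Serve every pending message in FIFO order and return the replies."""
    replies = []
    while True:
        try:
            message = messages.get_nowait()
        except queue.Empty:
            return replies
        replies.append(schedule(message, proxy_priority, cookies))
```

In a long-running listener, the non-blocking `get_nowait` loop would typically be replaced by a blocking `messages.get()` inside a dedicated thread.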

The resource scheduling module is further provided with a proxy IP scheduling thread, a Cookies scheduling thread, and a resource transmission thread. The proxy IP scheduling thread retrieves the proxy IP with the highest priority value from the proxy IP pool module; the Cookies scheduling thread retrieves the newest Cookie information from the Cookies pool module; and the resource transmission thread sends the retrieved highest-priority proxy IP and newest Cookie information to the multithreaded crawler module, so that the multithreaded crawler module can crawl with multiple crawler threads.

The background management module comprises a database update module and a background control module. The database update module receives the normal crawl information returned by the crawler threads and stores it in the relational database MySQL.

The function of the background control module is to ensure that the crawler threads run stably in the background. Specifically, the background control module monitors whether the crawler threads exist and whether the number of crawler threads reaches the threshold number. If the number of crawler threads is at or above the threshold number, the module exits normally; otherwise it starts new crawler threads so that the number of crawler threads reaches the threshold number.
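The monitoring behavior of the background control module can be sketched with Python's standard `threading` module as follows. The function name and the use of daemon threads are assumptions; `target` stands for whatever function a crawler thread runs.

```python
import threading
from typing import Callable, List


def ensure_threads(threads: List[threading.Thread], threshold: int,
                   target: Callable[[], object]) -> List[threading.Thread]:
    """Keep at least `threshold` crawler threads alive.

    Threads that have exited are dropped from the returned list, and new
    threads are started until the threshold is reached. If enough threads
    are already alive, nothing is started.
    """
    alive = [t for t in threads if t.is_alive()]
    while len(alive) < threshold:
        worker = threading.Thread(target=target, daemon=True)
        worker.start()
        alive.append(worker)
    return alive
```

Calling such a function from a periodically executed control script gives the "check, then top up" behavior described above.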

To ensure that the background control module runs in the background, the crontab service of the Linux operating system can be used. The crontab command is common in Unix and Unix-like operating systems and is used to set up instructions that are executed periodically. The crontab command reads instructions from the standard input device and stores them in a "crontab" file for later reading and execution. A scheduled task is added to crontab so that a background control script is executed once per minute.

When the highly anonymous crawler system with multithreaded intelligent scheduling of the present invention is used, the proxy IP pool module first obtains valid proxy IPs and the Cookies pool module obtains valid Cookies; the resource scheduling module then schedules the proxy IP with the highest priority value from the proxy IP pool module and, at the same time, the newest Cookie information from the Cookies pool module.

Next, the message queue module is initialized. If there are messages in the message queue module, a message is sent to the resource scheduling module (which can also be understood as sending a resource request). The resource scheduling module schedules resources according to the message content detected by the message queue listening thread, that is, it returns the optimal resource information (the proxy IP with the highest priority value and the newest Cookie information) to the multithreaded crawler module.

Finally, the multithreaded crawler module obtains the URL task queue from the task queue generation module and randomly selects valid URLs to crawl the target website. At this point the multithreaded crawler module starts multiple crawler threads to crawl simultaneously and performs a self-check on the crawl information returned by the crawler threads from the target website. If the returned crawl information is normal, crawling continues with the same resource information and the normal crawl information is sent to the database update module, which stores it in the relational database MySQL. If the returned crawl information is abnormal, the abnormal information is encapsulated into an exception message and added to the message queue module, requesting that the resource scheduling module replace the proxy IP or Cookie so that crawling can be retried.

It should be understood that the proxy IP pool module of the present invention can be managed on the basis of priority values, so as to provide the resource scheduling module with the most effective proxy IP. The Cookies pool module builds an account pool and selects valid accounts to simulate login to the target website, which yields valid, up-to-date Cookie information for the resource scheduling module to use. The resource scheduling module can retrieve the proxy IP with the highest priority value and the newest Cookie information in real time. The multithreaded crawler module not only crawls the target website with multiple threads but can also update the proxy IP and Cookie information in real time through the message queue mechanism, which improves crawling efficiency and enhances the flexibility and robustness of the crawler.

In conclusion, for a target website that requires login and has some "anti-crawling" measures, the highly anonymous crawler system with multithreaded intelligent scheduling of the present invention can build a stable and efficient proxy IP pool and Cookies pool, and it uses the priority value scheduling strategy and the message queue mechanism to update in real time the proxy IP and Cookie information selected in the crawler threads. This improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information is aggregated quickly and efficiently and a large retrieval library is built.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention can be modified or equivalently replaced without departing from its spirit and scope.

Claims

1. A highly concealed crawler system with multithreaded intelligent scheduling, characterized by comprising:

a proxy IP pool module, for obtaining valid proxy IPs;

a Cookies pool module, for obtaining valid Cookies information;

a resource scheduling module, connected to the proxy IP pool module and the Cookies pool module respectively, for controlling in real time the intelligent scheduling of the proxy IP pool module and the Cookies pool module, so as to schedule the proxy IP with the highest priority value and the latest Cookie information;

a multithreaded crawler module, connected to the resource scheduling module, for initiating resource requests to the resource scheduling module and receiving the valid proxy IP and Cookie information issued by the resource scheduling module;

a task queue generation module, connected to the multithreaded crawler module, so that the multithreaded crawler module randomly selects a valid URL from the URL task queue to crawl the target website;

a back-end management module, comprising a database update module and a background control module, the background control module being connected to the multithreaded crawler module to monitor whether the multiple crawler threads exist and whether the number of crawler threads reaches a threshold.
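By way of illustration only, the thread monitoring attributed to the background control module might be sketched as follows in Python. The function names `count_alive` and `top_up`, the placeholder worker, and the decision to start replacement threads when the count falls below the threshold are all assumptions; the claim itself only requires that thread existence and thread count be monitored.

```python
import threading


def count_alive(threads):
    """Return how many of the given threads are still running."""
    return sum(1 for t in threads if t.is_alive())


def top_up(threads, threshold, target):
    """Keep only live threads and start new ones until `threshold` are running.

    Restarting threads is an assumption made for this sketch.
    """
    alive = [t for t in threads if t.is_alive()]
    while len(alive) < threshold:
        t = threading.Thread(target=target, daemon=True)
        t.start()
        alive.append(t)
    return alive


stop = threading.Event()


def crawl_worker():
    """Placeholder for a crawler thread; it simply waits until told to stop."""
    stop.wait()


# Start from an empty thread list and bring it up to a threshold of 3.
crawlers = top_up([], threshold=3, target=crawl_worker)
alive_before = count_alive(crawlers)

# Tell the placeholder workers to finish, then check again.
stop.set()
for t in crawlers:
    t.join(timeout=5)
alive_after = count_alive(crawlers)
```

In a real deployment the check would presumably run periodically, and the worker would perform actual crawling rather than waiting on an event.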
2. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that the proxy IP pool module is established mainly through the following steps:

S1, obtaining a number of proxy IPs to form a proxy IP source queue;

S2, testing each proxy IP in the proxy IP source queue, and identifying and storing the valid proxy IPs;

S3, calculating the priority value of each valid proxy IP according to its access success rate and access response time when accessing the target website;

S4, encapsulating a web application interface that selects the proxy IP with the highest priority value as the request IP;

S5, dynamically updating the proxy IP pool module at a preset time interval.
3. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 2, characterized in that step S1 is specifically: periodically crawling a number of proxy IPs from multiple proxy IP websites, or purchasing multiple proxy IPs from a third-party service provider, to form the proxy IP source queue.
4. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 2, characterized in that the priority value of a valid proxy IP in step S3 is calculated as $\mathrm{Priority}_i = 0.7\,x^*_{i1} + 0.3\,x^*_{i2}$, $i = 1, 2, \ldots, n$, with $x^*_{i1} = (x_{i1} - \min x_1)/(\max x_1 - \min x_1)$ and $x^*_{i2} = (x_{i2} - \min x_2)/(\max x_2 - \min x_2)$, where $n$ is the number of proxy IPs; $x^*_{i1}$ and $x^*_{i2}$ respectively represent the normalized access success rate and the normalized reciprocal of the access response time of the $i$-th proxy IP; $\max x_1$ and $\max x_2$ respectively denote the maximum access success rate and the maximum reciprocal of the access response time in the current proxy IP pool module; and $\min x_1$ and $\min x_2$ respectively denote the minimum access success rate and the minimum reciprocal of the access response time in the current proxy IP pool module.
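As a non-limiting illustration, the priority formula of claim 4 can be sketched in Python. The `ProxyStats` record, its field names, the sample addresses and measurements, and the choice to return zeros when all values in a column are equal are assumptions introduced for this example; only the 0.7/0.3 weights and the min-max normalization come from the claim.

```python
from dataclasses import dataclass


@dataclass
class ProxyStats:
    """Hypothetical record for one proxy IP (field names are assumptions)."""
    address: str
    success_rate: float    # x_i1: access success rate, in [0, 1]
    response_time: float   # seconds; x_i2 is the reciprocal of this value


def min_max(values):
    """Min-max normalize a list of numbers.

    Returns all zeros when every value is equal (an assumed convention,
    since the claim does not cover the zero-denominator case).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priorities(pool):
    """Compute Priority_i = 0.7 * x*_i1 + 0.3 * x*_i2 for each proxy."""
    x1 = [p.success_rate for p in pool]
    x2 = [1.0 / p.response_time for p in pool]
    n1, n2 = min_max(x1), min_max(x2)
    return {p.address: 0.7 * a + 0.3 * b for p, a, b in zip(pool, n1, n2)}


def best_proxy(pool):
    """Return the address of the proxy with the highest priority value."""
    scores = priorities(pool)
    return max(pool, key=lambda p: scores[p.address]).address


# Made-up sample measurements for three proxies.
pool = [
    ProxyStats("10.0.0.1:8080", 0.90, 0.5),
    ProxyStats("10.0.0.2:8080", 0.60, 0.2),
    ProxyStats("10.0.0.3:8080", 0.30, 1.0),
]
scores = priorities(pool)
chosen = best_proxy(pool)
```

With these sample values the first proxy has the best success rate and the second has the fastest response. Because the success rate carries the larger weight (0.7), the first proxy is selected.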
5. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that the Cookies pool module is established mainly through the following steps:

A, establishing an account pool for the target website;

B, at a preset time interval, randomly selecting a valid account, simulating login to the target website, and recognizing the login verification code using a deep learning method, to obtain Cookies;

C, storing the valid Cookies in the Cookies pool module;

D, periodically checking the validity of each Cookie in the Cookies pool module, and deleting invalid Cookies from the Cookies pool module.
6. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 5, characterized in that the account pool in step A comprises a valid pool for storing valid account data and an expired pool for storing expired account data; the valid pool is provided with a threshold marking its minimum capacity, and when the amount of account data in the valid pool falls below the threshold, account data is automatically obtained from the expired pool, so as to keep the amount of account data in the valid pool at no less than twice the threshold.
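The refill behaviour of claim 6 might be sketched as below, for illustration only. The `AccountPool` class, its method names, and the `revalidate` callback (standing in for whatever check, such as a fresh simulated login, decides that an expired account is usable again) are hypothetical; the claim does not specify how expired accounts are restored.

```python
class AccountPool:
    """Hypothetical two-part account pool: a valid pool and an expired pool."""

    def __init__(self, threshold):
        self.threshold = threshold   # minimum capacity marker of the valid pool
        self.valid = []
        self.expired = []

    def refill(self, revalidate):
        """Refill the valid pool from the expired pool when it runs low.

        Triggered only when the valid pool is below `threshold`; it then
        pulls accounts until the valid pool holds 2 * threshold entries or
        the expired pool has been exhausted. `revalidate(account)` is an
        assumed callback returning True if the account is usable again.
        """
        if len(self.valid) >= self.threshold:
            return
        target = 2 * self.threshold
        still_expired = []
        while self.expired and len(self.valid) < target:
            account = self.expired.pop(0)
            if revalidate(account):
                self.valid.append(account)
            else:
                still_expired.append(account)
        # Accounts that failed revalidation go back to the expired pool.
        self.expired.extend(still_expired)


accounts = AccountPool(threshold=2)
accounts.valid = ["a"]
accounts.expired = ["b", "c", "d", "e", "f"]

# In this made-up example, account "c" cannot be revalidated.
accounts.refill(revalidate=lambda account: account != "c")
```

After the call, the valid pool holds four accounts (twice the threshold), and the account that failed revalidation remains in the expired pool along with the one that was never needed.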
7. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that the URL task queue is formed by the task queue generation module according to information such as the web page architecture, rendering mode, and page content of the target website.
8. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that the multithreaded crawler module starts multiple crawler threads simultaneously to crawl, and comprises a message queue module; the message queue module is formed from the exception messages returned by the multiple crawler threads during crawling, sends the exception messages to the resource scheduling module, and initiates resource requests to the resource scheduling module.
9. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 8, characterized in that a message queue listening thread is provided in the resource scheduling module to listen in real time for messages sent by the message queue module and to schedule resources according to the message content; the resource scheduling module provides valid proxy IPs and Cookie information to the multithreaded crawler module on a first-in, first-out basis.
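The interaction described in claims 8 and 9 can be illustrated, under stated assumptions, with Python's standard `queue.Queue` (a FIFO queue) and a listening thread. The message layout (`thread`, `error`, `reply_to`), the `STOP` sentinel, the per-thread reply queues, and the canned proxy/Cookie pairs are all hypothetical. A real system would draw resources from the proxy IP pool and Cookies pool modules.

```python
import queue
import threading

STOP = object()  # sentinel used only to end this demonstration


def scheduler_listener(messages, next_resource, served):
    """Listen on the message queue and answer requests in FIFO order.

    `messages` is a queue.Queue of exception messages from crawler threads,
    `next_resource` is an assumed callable returning a (proxy, cookie) pair,
    and `served` records the order in which requests were handled.
    """
    while True:
        msg = messages.get()          # blocks until a message arrives
        if msg is STOP:
            break
        # Hand a fresh proxy IP and Cookie back to the requesting thread.
        msg["reply_to"].put(next_resource())
        served.append(msg["thread"])


message_queue = queue.Queue()
replies = {name: queue.Queue() for name in ("t1", "t2", "t3")}

# Three crawler threads report an exception, in the order t2, t1, t3.
for name in ("t2", "t1", "t3"):
    message_queue.put(
        {"thread": name, "error": "HTTP 403", "reply_to": replies[name]}
    )
message_queue.put(STOP)

resources = iter([
    ("proxy-1", "cookie-1"),
    ("proxy-2", "cookie-2"),
    ("proxy-3", "cookie-3"),
])
served = []

listener = threading.Thread(
    target=scheduler_listener,
    args=(message_queue, lambda: next(resources), served),
)
listener.start()
listener.join(timeout=5)
```

Because `queue.Queue` is first-in, first-out, the thread that reported its exception first is the first to receive a replacement proxy IP and Cookie.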
10. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 8, characterized in that the database update module is used to receive the normal information returned by the multiple crawler threads during crawling and to store that information in a relational database.