CN109508422A - High-anonymity crawler system with intelligent multithreaded scheduling - Google Patents

Info

Publication number
CN109508422A
Authority
CN
China
Prior art keywords
module
proxy
pool
crawler
multithreading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811481201.2A
Other languages
Chinese (zh)
Inventor
汪云霄
朱弘扬
徐惟康
刘峥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201811481201.2A
Publication of CN109508422A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system

Abstract

The present invention provides a high-anonymity crawler system with intelligent multithreaded scheduling. The system mainly comprises six modules: a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multithreaded crawler module, a task queue generation module and a background management module. The modules are interconnected and cooperate with one another, which improves the crawler's crawling efficiency, its robustness, and its stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently and a large retrieval library can be built.

Description

High-anonymity crawler system with intelligent multithreaded scheduling
Technical field
The present invention relates to a high-anonymity crawler system with intelligent multithreaded scheduling and belongs to the technical field of computer networks.
Background
Human society has entered the era of big data. With the rapid development of the Internet, the mobile Internet and social networks, big data that is vast in volume, diverse in kind, and generated and updated anytime and anywhere holds unprecedented social and commercial value. Acquiring, processing and analyzing big data, and building intelligent applications on it, have become key to the future competitiveness of enterprises.
A web crawler is an efficient tool for collecting information; with it, the data resources we want can be gathered quickly and accurately. Traditional crawling methods, however, are easily banned when a website applies an "anti-crawling" strategy. This is especially true for websites that require login, such as GitHub and Weibo. Some of their pages and interfaces can be accessed without logging in, but crawling them directly has several drawbacks: first, pages that require login permission cannot be crawled normally; second, frequent requests made without logging in are easily rate-limited by the website, or the IP is banned outright; third, when a single account makes frequent or regular requests, the website may identify it as a crawler script and ban the account. Therefore, to collect the required data accurately and efficiently, targeted countermeasures are needed.
In view of this, it is necessary to provide a high-anonymity crawler system with intelligent multithreaded scheduling to solve the above problems.
Summary of the invention
The object of the present invention is to provide a high-anonymity crawler system with intelligent multithreaded scheduling that crawls efficiently and improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently and a large retrieval library can be built.
To achieve the above object, the present invention provides a high-anonymity crawler system with intelligent multithreaded scheduling, comprising:
a proxy IP pool module for obtaining valid proxy IPs;
a Cookies pool module for obtaining valid Cookies information;
a resource scheduling module, connected to the proxy IP pool module and to the Cookies pool module, for controlling in real time the intelligent scheduling of the proxy IP pool module and the Cookies pool module, so as to schedule the proxy IP with the highest priority value and the newest Cookie information;
a multithreaded crawler module, connected to the resource scheduling module, for sending resource requests to the resource scheduling module and receiving the valid proxy IP and Cookie information that the resource scheduling module issues;
a task queue generation module, connected to the multithreaded crawler module, so that the multithreaded crawler module randomly selects a valid URL from the URL task queue to crawl the target website;
a background management module comprising a database update module and a background control module, the background control module being connected to the multithreaded crawler module to monitor whether the crawler threads exist and whether the number of crawler threads reaches a threshold.
As a further improvement of the present invention, building the proxy IP pool module mainly comprises the following steps:
S1. obtain a number of proxy IPs to form a proxy IP source queue;
S2. test each proxy IP in the proxy IP source queue, and identify and store the valid proxy IPs;
S3. calculate the priority value of each valid proxy IP from its access success rate and access response time when accessing the target website;
S4. encapsulate a web application interface and select the proxy IP with the highest priority value as the request IP;
S5. dynamically update the proxy IP pool module at a preset time interval.
As a further improvement of the present invention, step S1 is specifically: periodically crawl a number of proxy IPs from multiple proxy IP websites, or purchase proxy IPs from a third-party service provider, to form the proxy IP source queue.
As a further improvement of the present invention, the priority value of a valid proxy IP in step S3 is calculated as $\mathrm{Priority}_i = 0.7\,x_{i1}^{*} + 0.3\,x_{i2}^{*}$, $i = 1, 2, \ldots, n$, with $x_{i1}^{*} = (x_{i1} - \min x_1)/(\max x_1 - \min x_1)$ and $x_{i2}^{*} = (x_{i2} - \min x_2)/(\max x_2 - \min x_2)$, where $n$ is the number of proxy IPs; $x_{i1}$ and $x_{i2}$ are the access success rate and the reciprocal of the access response time of the i-th proxy IP, and $x_{i1}^{*}$ and $x_{i2}^{*}$ are their normalized values; $\max x_1$ and $\max x_2$ are the maximum access success rate and the maximum reciprocal of the access response time in the current proxy IP pool module; and $\min x_1$ and $\min x_2$ are the corresponding minimum values.
As a further improvement of the present invention, building the Cookies pool module mainly comprises the following steps:
A. establish an account pool for the target website;
B. at a preset time interval, randomly select a valid account, simulate a login to the target website, and recognize the login verification code using a deep learning method, so as to obtain Cookies;
C. store the valid Cookies in the Cookies pool module;
D. periodically check the validity of each Cookie in the Cookies pool module and delete invalid Cookies from the Cookies pool module.
As a further improvement of the present invention, in step A the account pool comprises a valid pool for storing valid account data and an expired pool for storing expired account data. The valid pool has a threshold marking its minimum capacity; when the amount of account data in the valid pool falls below this threshold, account data is automatically fetched from the expired pool so that the amount of account data in the valid pool is kept at no less than twice the threshold.
As a further improvement of the present invention, the URL task queue is built by the task queue generation module from information such as the page structure, rendering mode and page content of the target website.
As a further improvement of the present invention, the multithreaded crawler module starts multiple crawler threads that crawl simultaneously. The multithreaded crawler module includes a message queue module, which is made up of the exception messages returned by the crawler threads while crawling; the message queue module sends the exception messages to the resource scheduling module and sends it resource requests.
As a further improvement of the present invention, the resource scheduling module contains a message queue listening thread that listens in real time for messages sent by the message queue module and schedules resources according to the message content; the resource scheduling module provides valid proxy IPs and Cookie information to the multithreaded crawler module on a first-in, first-out basis.
As a further improvement of the present invention, the database update module receives the normal information returned by the crawler threads while crawling and stores it in a relational database.
The beneficial effects of the present invention are as follows: the high-anonymity crawler system with intelligent multithreaded scheduling has six modules (a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multithreaded crawler module, a task queue generation module and a background management module), and the modules are interconnected and cooperate with one another. This improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently and a large retrieval library can be built.
Brief description of the drawings
Fig. 1 is a structural and functional diagram of the high-anonymity crawler system with intelligent multithreaded scheduling of the present invention.
Fig. 2 is a flow chart for building the proxy IP pool module in Fig. 1.
Fig. 3 is a flow chart for building the Cookies pool module in Fig. 1.
Fig. 4 is a schematic structural diagram of the resource scheduling module in Fig. 1.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, the present invention discloses a high-anonymity crawler system with intelligent multithreaded scheduling that crawls efficiently even when the target website applies an "anti-crawling" strategy. It improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently and a large retrieval library can be built.
The high-anonymity crawler system with intelligent multithreaded scheduling mainly comprises the following six modules: a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multithreaded crawler module, a task queue generation module and a background management module.
The proxy IP pool module obtains valid proxy IPs, and the Cookies pool module obtains valid Cookies information. The resource scheduling module is connected to both the proxy IP pool module and the Cookies pool module and controls their intelligent scheduling in real time, so as to schedule the proxy IP with the highest priority value and the newest Cookie information.
The multithreaded crawler module is connected to the resource scheduling module; it sends resource requests to the resource scheduling module and receives the valid proxy IP and Cookie information that the resource scheduling module issues.
The task queue generation module is connected to the multithreaded crawler module, so that the multithreaded crawler module randomly selects a valid URL from the URL task queue to crawl the target website. The URL task queue is built by the task queue generation module from information such as the page structure, rendering mode and page content of the target website.
The background management module comprises a database update module and a background control module. The background control module is connected to the multithreaded crawler module to monitor whether the crawler threads exist and whether the number of crawler threads reaches a threshold.
Each of the six modules is described in detail below.
Referring to Fig. 2 in conjunction with Fig. 1, building the proxy IP pool module mainly comprises the following five steps:
S1. obtain a number of proxy IPs to form a proxy IP source queue;
S2. test each proxy IP in the proxy IP source queue, and identify and store the valid proxy IPs;
S3. calculate the priority value of each valid proxy IP from its access success rate and access response time when accessing the target website;
S4. encapsulate a web application interface and select the proxy IP with the highest priority value as the request IP;
S5. dynamically update the proxy IP pool module at a preset time interval.
Step S1 is specifically: periodically crawl a number of proxy IPs from multiple proxy IP websites, or purchase proxy IPs from a third-party service provider, to form the proxy IP source queue. Preferably, to ensure the validity of the proxy IPs, they should be obtained from different sources, high-anonymity proxies should be crawled wherever possible, and the obtained proxy IPs are added to the proxy IP source queue.
Step S2 is specifically: the maximum capacity of the proxy IP pool module is preset to T; the proxy IP source queue is traversed and each proxy IP in it is tested to determine whether it is valid, and valid proxy IPs are stored in the proxy IP pool module. Specifically, each proxy IP is used to access the target website: if the access succeeds, the proxy IP is deemed valid and is stored in the database of the proxy IP pool module; if the access fails, the proxy IP is deemed invalid and is not stored.
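The validity check of step S2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_proxy_pool`, the injected `is_valid` callback (which would issue a request to the target website through the proxy) and the choice to stop once the capacity T is reached are all assumptions.

```python
from typing import Callable, Iterable, List


def build_proxy_pool(
    source_queue: Iterable[str],
    is_valid: Callable[[str], bool],
    capacity: int,
) -> List[str]:
    """Keep the proxies that pass the validity test, up to `capacity` (T)."""
    pool: List[str] = []
    for proxy in source_queue:
        if len(pool) >= capacity:
            # Assumption: stop once the preset maximum capacity T is reached.
            break
        if is_valid(proxy):
            # Access through this proxy succeeded, so it is stored.
            pool.append(proxy)
        # Proxies that fail the test are simply not stored.
    return pool
```

Passing the test in as a callback keeps the sketch independent of any particular HTTP library; in practice `is_valid` would request the test link through the proxy and report whether the access succeeded.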
In step S3, to maintain the validity of the proxy IP pool module, the proxy IPs in the database are tested at preset intervals, with a link to the target website set as the test link. So that the most effective proxy IP, rather than a randomly chosen one, is used each time during crawling, the priority value of each valid proxy IP should be calculated in every test round from its access success rate and access response time when accessing the target website. Here, the access success rate is the ratio of successful accesses to total accesses within a preset period, the access response time is the time from sending a request to receiving the target website's response, and the priority value reflects how likely each proxy IP is to be scheduled.
The priority value is calculated as follows:
1. Make the indicators directionally consistent. The access success rate is a positive indicator of the priority value: the higher the access success rate, the more effective the proxy IP and the larger its priority value. The access response time is a negative indicator: the shorter the access response time, the more effective the proxy IP and the larger its priority value. The reciprocal of the access response time is therefore taken as a new indicator in place of the access response time itself, so that both indicators are positive indicators of the proxy IP priority value; they are denoted $x_1$ and $x_2$ respectively.
2. Normalize the indicator data. To eliminate the effect of differing units on the priority value calculation, 0-1 (min-max) normalization is used to map each of the two indicators into the interval [0, 1], that is:
$x_{i1}^{*} = \dfrac{x_{i1} - \min x_1}{\max x_1 - \min x_1}, \qquad x_{i2}^{*} = \dfrac{x_{i2} - \min x_2}{\max x_2 - \min x_2}$
where $i = 1, 2, \ldots, n$ and $n$ is the number of proxy IPs; $x_{i1}$ and $x_{i2}$ are the access success rate and the reciprocal of the access response time of the i-th proxy IP, and $x_{i1}^{*}$ and $x_{i2}^{*}$ are their normalized values; $\max x_1$ and $\max x_2$ are the maximum access success rate and the maximum reciprocal of the access response time in the current proxy IP pool module; and $\min x_1$ and $\min x_2$ are the corresponding minimum values.
3. Because the access success rate matters more than the access response time, the weights of the two indicators are set to 0.7 and 0.3 respectively. The priority value of each valid proxy IP in the current proxy IP pool module can therefore be defined as $\mathrm{Priority}_i = 0.7\,x_{i1}^{*} + 0.3\,x_{i2}^{*}$, $i = 1, 2, \ldots, n$, where $n$ is the number of proxy IPs.
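The three-part calculation above can be sketched in a few lines. The function names are hypothetical, and returning 0 for an indicator whose values are all equal is an assumption, since the description does not say how a zero range (max equal to min) should be handled.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """0-1 (min-max) normalization of one indicator across the pool."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a zero range is not covered by the description.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priority_values(
    success_rates: Sequence[float],
    response_times: Sequence[float],
    w_success: float = 0.7,
    w_speed: float = 0.3,
) -> List[float]:
    """Priority_i = 0.7 * x_i1* + 0.3 * x_i2* for every proxy in the pool."""
    # The reciprocal turns response time into a positive indicator.
    inverse_times = [1.0 / t for t in response_times]
    x1 = min_max(success_rates)
    x2 = min_max(inverse_times)
    return [w_success * a + w_speed * b for a, b in zip(x1, x2)]
```

For example, with success rates 0.9, 0.5 and 0.7 and response times of 0.5 s, 2.0 s and 1.0 s, the first proxy is best on both indicators and the second is worst on both, so their priority values are approximately 1.0 and 0.0, with the third in between.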
In step S4, connecting directly to the database to fetch proxy IP data requires configuring connection information, which can easily expose the database's connection details. Valid proxy IPs can therefore be obtained through an encapsulated web application interface (Web API) for accessing the proxy IP pool module. In addition, to ensure that the selected proxy IP is highly effective, each time a proxy IP is requested the one with the highest priority value is chosen from the proxy IP pool module as the request IP (that is, the crawler's proxy IP).
In step S5, to ensure that the proxy IPs in the proxy IP pool module remain highly effective, the proxy IP pool module is updated once every preset time T0. Specifically, every preset time T0 a proxy IP source queue is obtained, made up of proxy IPs crawled from multiple proxy IP websites, proxy IPs purchased from third-party service providers, and the proxy IPs already in the proxy IP pool. Each proxy IP in the queue is tested and invalid ones are discarded; for each valid proxy IP, the priority value is calculated as in step S3. If the current proxy IP pool module is filled to less than 2/3 of its capacity, the valid proxy IPs are added; otherwise they replace the proxy IPs with lower priority values, so as to build an efficient proxy IP pool module.
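One possible reading of the update rule in step S5 is sketched below. It assumes that "less than 2/3" refers to two-thirds of the maximum capacity T, that priority values have already been computed as in step S3, and that a candidate only displaces the lowest-priority entry when its own priority is higher; the name `update_pool` and the dictionary layout are hypothetical.

```python
from typing import Dict


def update_pool(
    pool: Dict[str, float],
    candidates: Dict[str, float],
    capacity: int,
) -> Dict[str, float]:
    """Merge validated candidates (proxy -> priority value) into the pool."""
    updated = dict(pool)
    # Consider the best candidates first.
    for proxy, priority in sorted(candidates.items(), key=lambda kv: kv[1], reverse=True):
        if proxy in updated:
            updated[proxy] = priority  # refresh the priority of a known proxy
        elif len(updated) < capacity * 2 / 3:
            updated[proxy] = priority  # pool under two-thirds full: just add
        else:
            worst = min(updated, key=updated.get)
            if updated[worst] < priority:
                # Replace the lowest-priority proxy with the better candidate.
                del updated[worst]
                updated[proxy] = priority
    return updated
```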
Referring to Fig. 3 in conjunction with Fig. 1, building the Cookies pool module mainly comprises the following four steps:
A. establish an account pool for the target website;
B. at a preset time interval, randomly select a valid account, simulate a login to the target website, and recognize the login verification code using a deep learning method, so as to obtain Cookies;
C. store the valid Cookies in the Cookies pool module;
D. periodically check the validity of each Cookie in the Cookies pool module and delete invalid Cookies from the Cookies pool module.
In step A, the account pool comprises a valid pool for storing valid account data and an expired pool for storing expired account data. The valid pool has a threshold marking its minimum capacity; when the amount of account data in the valid pool falls below this threshold, account data is automatically fetched from the expired pool so that the amount of account data in the valid pool is kept at no less than twice the threshold. In addition, to keep the account data in the account pool valid and the Cookies pool module dynamically up to date, the account data in the account pool is replaced once every fixed period.
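The two-pool arrangement of step A can be sketched as a small class. The class and method names are hypothetical, and the sketch leaves out the expiration times described in steps B and C: it simply moves accounts back in the order they expired, stopping when the valid pool holds twice the threshold or the expired pool is empty.

```python
from collections import deque
from typing import Iterable


class AccountPool:
    """Valid pool plus expired pool, with a minimum-capacity threshold."""

    def __init__(self, valid: Iterable[str], expired: Iterable[str], threshold: int):
        self.valid = list(valid)
        self.expired = deque(expired)
        self.threshold = threshold

    def expire(self, account: str) -> None:
        """Move an account that failed to log in into the expired pool."""
        self.valid.remove(account)
        self.expired.append(account)
        self.refill()

    def refill(self) -> None:
        """Below the threshold, pull accounts back until twice the threshold."""
        if len(self.valid) >= self.threshold:
            return
        while self.expired and len(self.valid) < 2 * self.threshold:
            self.valid.append(self.expired.popleft())
```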
Step B is specifically: at preset intervals, multiple account-fetching threads are started, each randomly taking one account (username and password) from the valid pool and simulating a login to the target website. A suitable deep learning method is designed to recognize the login verification code automatically. If the login succeeds, the obtained Cookies information is returned to the storage module for storage; if the login fails, another account is chosen at random from the valid pool and the simulated login is retried, while the account that failed to log in is moved to the expired pool with an expiration time set.
Step C is specifically: the valid Cookies information obtained in step B is stored in the Cookies pool module together with the account's username and password; at the same time, accounts in the expired pool whose expiration time has been reached are moved back into the valid pool, so that account resources are used effectively. To make operations such as queries convenient, some data access interfaces should also be provided.
Step D is specifically: the validity of each piece of Cookie information in the Cookies pool module is checked at preset intervals, and invalid Cookies are deleted from the Cookies pool module. Specifically, a timed checking module can be added and a corresponding test link set; all Cookies are traversed and used to request the test link, and those that are invalid are deleted while those that are valid are kept as Cookies available for use.
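The periodic check of step D amounts to a filter over the pool. In this minimal sketch the pool is assumed to be a dictionary keyed by username, and `is_valid` is a hypothetical callback that would request the test link with the given Cookies and report whether the session is still accepted.

```python
from typing import Callable, Dict, List


def purge_invalid_cookies(
    cookie_pool: Dict[str, str],
    is_valid: Callable[[str], bool],
) -> List[str]:
    """Delete invalid Cookies in place and return the affected usernames."""
    removed = [user for user, cookies in cookie_pool.items() if not is_valid(cookies)]
    for user in removed:
        del cookie_pool[user]
    # Whatever remains in cookie_pool is kept as Cookies available for use.
    return removed
```

The usernames are collected first and deleted afterwards so that the dictionary is not modified while it is being iterated.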
The task queue generation module builds the URL task queue to be crawled from information such as the page structure, rendering mode and page content of the target website. In general, a website's URLs follow certain rules, so the URLs to be crawled can be "assembled" in advance according to those rules. For example, the search results generated by the Baidu search engine span many pages; the URLs of page i and page i+1 are mostly identical and differ only in the pn parameter, which controls paging: the pn value of page i is i*10 and that of page i+1 is (i+1)*10. Other similar web pages follow comparable rules.
Following such rules, the task queue generation module can set in advance the URL pattern of the pages to be crawled and the number of pages to crawl, then generate a URL task queue with a loop, write it to the database, and set the crawl status of these URL tasks to "not crawled".
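The loop that generates the URL task queue might look like the sketch below. Only the pn rule (pn = i*10) comes from the description; the base URL, the `wd` query parameter, the status string and starting the page index at 0 are illustrative assumptions, and writing the tasks to the database is left out.

```python
from typing import Dict, List
from urllib.parse import urlencode


def build_url_tasks(base_url: str, query: str, pages: int) -> List[Dict[str, str]]:
    """Assemble one task per result page, all marked as not yet crawled."""
    tasks = []
    for i in range(pages):  # assumption: page index i starts at 0
        params = urlencode({"wd": query, "pn": i * 10})  # pn = i * 10
        tasks.append({"url": f"{base_url}?{params}", "status": "not_crawled"})
    return tasks
```

For instance, `build_url_tasks("https://example.com/s", "crawler", 3)` yields three tasks whose URLs end in `pn=0`, `pn=10` and `pn=20`.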
The multithreaded crawler module starts multiple crawler threads simultaneously to crawl the target website and receives the crawl information returned by the crawler threads/target website. The multithreaded crawler module includes a message queue module, which is made up of exception messages formed by encapsulating the abnormal information in the crawl information that the crawler threads return. The message queue module sends the exception messages to the resource scheduling module and sends it resource requests. The resource scheduling module provides valid proxy IPs and Cookie information to the multithreaded crawler module according to the content of the exception messages and on a first-in, first-out basis.
Specifically, the multithreaded crawler module receives the crawl information returned by the crawler threads and encapsulates any abnormal information in it into exception messages, forming a message queue. The message queue then sends resource requests to the resource scheduling module. On a first-in, first-out (FIFO) basis, the resource scheduling module each time provides a valid proxy IP and Cookie information to the crawler thread at the head of the queue, according to the scheduling strategy and the specific content of the exception message. After obtaining resources, each crawler thread randomly selects a valid URL from the URL task queue, crawls the target website, and performs a self-check on the crawl information returned by the target website. If the returned crawl information is normal, the thread continues crawling with the same resources and sends the normal crawl information to the database update module; if it is abnormal, the abnormal information is encapsulated into an exception message and added to the message queue, requesting the resource scheduling module to replace the proxy IP or Cookie information so that crawling can be retried.
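The first-in, first-out handling of exception messages can be illustrated with Python's thread-safe `queue.Queue`. The `ExceptionMessage` fields and the `serve_fifo` helper are hypothetical; the sketch only shows that requests are served in arrival order and does not perform any real scheduling.

```python
import queue
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExceptionMessage:
    thread_id: int  # crawler thread asking for new resources
    failed: str     # which resource failed: "proxy" or "cookie" (hypothetical labels)


def serve_fifo(messages: queue.Queue) -> List[Tuple[int, str]]:
    """Drain the queue, returning requests in the order they are served."""
    served = []
    while True:
        try:
            msg = messages.get_nowait()  # the head of the queue is served first
        except queue.Empty:
            return served
        served.append((msg.thread_id, msg.failed))
```

Crawler threads would call `messages.put(ExceptionMessage(...))` when their self-check fails, and the listening thread on the scheduling side would consume the messages in that same order.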
The self-check of the multithreaded crawler module covers two things: whether the request was a correct network request, which is determined by checking the server's response status code; and whether the request was identified by the server as a crawler request, which is mainly determined by checking whether the returned crawl information contains certain keywords, such as "verification code" or "request too fast".
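A minimal sketch of this two-part self-check follows. Treating only status code 200 as a correct request is an assumption, and the keyword strings are placeholders for whatever wording the target website actually returns.

```python
from typing import Sequence

# Placeholder keywords; a real deployment would use the target site's wording.
BLOCK_KEYWORDS = ("verification code", "request too fast")


def self_check(
    status_code: int,
    body: str,
    keywords: Sequence[str] = BLOCK_KEYWORDS,
) -> str:
    """Classify one response as "normal" or "abnormal"."""
    if status_code != 200:
        # Assumption: anything other than 200 is not a correct network request.
        return "abnormal"
    if any(keyword in body for keyword in keywords):
        # The server appears to have identified the request as a crawler.
        return "abnormal"
    return "normal"
```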
As it can be seen that multithreading crawler module, which can be realized multithreading, crawls targeted website, and message queue mechanism is combined, mentioned The efficiency that high crawler crawls, enhances flexibility and the robustness of crawler.
Scheduling of resource module is used for the intelligent scheduling of real-time control Agent IP pond module and the pond Cookies module.
Fig. 4 and as shown in connection with fig. 1 is please referred to, message queue listening thread is equipped in scheduling of resource module, to monitor in real time The message of message queue module sending simultaneously carries out scheduling of resource according to message content.Specifically: if message queue listening-in line Journey listen to crawler thread return exception information for failure, then according to the failure information judgement be Agent IP failure or Cookie failure then uses preferred value dispatching algorithm if Agent IP fails, and it is current to choose the highest Agent IP replacement of preferred value Crawler IP;If Cookie fails, then chooses newest Cookie and be replaced.The message queue listening thread possesses one It is main thread, multiple from thread, when service is monitored on starting backstage, if main thread cannot work, it can be used and substituted from thread.
Agent IP scheduling thread, Cookies scheduling thread and resource transmission thread are additionally provided in scheduling of resource module, it is described Agent IP scheduling thread is used to call the highest Agent IP of preferred value, the Cookies scheduling thread from the module of Agent IP pond For calling newest cookie information, the preferred value that the resource transmission thread is used to call from the module of the pond Cookies Highest Agent IP and newest cookie information send multithreading crawler module to, so that the progress of multithreading crawler module is multiple Crawler thread crawls.
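The scheduling decision itself reduces to two lookups, sketched below without any threading. The function name, the `"proxy"`/`"cookie"` labels and the `obtained_at` timestamp used to decide which Cookie is newest are all assumptions made for illustration.

```python
from typing import Dict, List


def pick_replacement(
    failed: str,
    proxy_priorities: Dict[str, float],
    cookies: List[dict],
) -> str:
    """Return the resource that should replace the one that failed."""
    if failed == "proxy":
        # Priority value scheduling: the highest-priority proxy IP wins.
        return max(proxy_priorities, key=proxy_priorities.get)
    if failed == "cookie":
        # The most recently obtained Cookie is treated as the newest.
        return max(cookies, key=lambda c: c["obtained_at"])["value"]
    raise ValueError(f"unknown failure type: {failed}")
```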
The background management module comprises a database update module and a background control module. The database update module receives the normal crawl information returned by the crawler threads and stores it in the relational database MySQL.
The background control module ensures that the crawler threads run stably in the background. Specifically, it monitors whether the crawler threads exist and whether the number of crawler threads reaches the threshold: if the number of crawler threads is at or above the threshold, it exits normally; otherwise it starts new crawler threads until the number of crawler threads reaches the threshold.
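The top-up behaviour of the background control module can be sketched with the standard `threading` module. This is an in-process illustration under assumed names (`top_up_threads`, `crawl`); a control script run from outside the crawler would more likely count and start processes, which the description does not specify.

```python
import threading
from typing import Callable, List


def top_up_threads(
    workers: List[threading.Thread],
    threshold: int,
    crawl: Callable[[], None],
) -> List[threading.Thread]:
    """Return the live crawler threads, starting new ones up to `threshold`."""
    alive = [t for t in workers if t.is_alive()]
    while len(alive) < threshold:
        # Fewer live threads than the threshold: start a replacement.
        t = threading.Thread(target=crawl, daemon=True)
        t.start()
        alive.append(t)
    return alive
```

If the number of live threads is already at or above the threshold, the loop body never runs and the function returns immediately, matching the "exit normally" case.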
To keep the background control module running in the background, the crontab service of the Linux operating system is used. The crontab command is common on Unix and Unix-like operating systems and is used to set up instructions that are executed periodically. It reads instructions from standard input and stores them in a "crontab" file to be read and executed later. A scheduled task is added to crontab so that a background control script is executed once per minute.
When the high-anonymity crawler system with intelligent multithreaded scheduling of the present invention is used, the proxy IP pool module first obtains valid proxy IPs and the Cookies pool module obtains valid Cookies; the resource scheduling module then schedules the proxy IP with the highest priority value from the proxy IP pool module and the newest Cookie information from the Cookies pool module.
Next, the message queue module is initialized. If the message queue module contains messages, it sends them to the resource scheduling module (which can also be understood as sending resource requests). The resource scheduling module schedules resources according to the message content detected by the message queue listening thread, that is, it returns the best resource information (the proxy IP with the highest priority value and the newest Cookie information) to the multithreaded crawler module.
Finally, the multithreaded crawler module obtains the URL task queue from the task queue generation module and randomly selects valid URLs to crawl the target website. The multithreaded crawler module starts multiple crawler threads that crawl simultaneously and performs a self-check on the crawl information returned by the crawler threads/target website. If the returned crawl information is normal, crawling continues with the same resource information and the normal crawl information is sent to the database update module, which stores it in the relational database MySQL; if the returned crawl information is abnormal, the abnormal information is encapsulated into an exception message and added to the message queue module, requesting the resource scheduling module to replace the proxy IP or Cookie so that crawling can be retried.
It should be understood that the Agent IP pond module in the present invention can be based on preferred value management, to be scheduling of resource mould Block provides most effective Agent IP;The pond Cookies module uses and builds account pond, the effective account of selection carries out simulation and logs in target The mode of website can obtain effective, newest cookie information, so as to scheduling of resource module calling;Scheduling of resource module It can be realized the real-time calling to preferred value highest Agent IP and newest cookie information;Multithreading crawler module not only can be real Existing multiple threads crawl targeted website, can also carry out real-time update to Agent IP and cookie information based on message queue mechanism, The efficiency that crawler crawls is improved, flexibility and the robustness of crawler are enhanced.
In summary, the highly concealed crawler system with multithreaded intelligent scheduling of the present invention addresses the case where the target website requires login and applies certain anti-crawling measures. It builds a stable and efficient proxy IP pool and Cookies pool, and uses a priority-value scheduling strategy and a message queue mechanism to update, in real time, the proxy IP and Cookie information selected by each crawler thread. This improves the crawling efficiency, robustness, and stability of the crawler in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently to build a large search library.
The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from its spirit and scope.

Claims (10)

  1. A highly concealed crawler system with multithreaded intelligent scheduling, characterized by comprising:
    a proxy IP pool module, for obtaining valid proxy IPs;
    a Cookies pool module, for obtaining valid Cookies information;
    a resource scheduling module, connected to the proxy IP pool module and the Cookies pool module respectively, for controlling the intelligent scheduling of the proxy IP pool module and the Cookies pool module in real time, so as to schedule the proxy IP with the highest priority value and the newest Cookie information;
    a multithreaded crawler module, connected to the resource scheduling module, for initiating resource requests to the resource scheduling module and receiving the valid proxy IP and Cookie information issued by the resource scheduling module;
    a task queue generation module, connected to the multithreaded crawler module, so that the multithreaded crawler module randomly selects valid URLs from the URL task queue to crawl the target website;
    a background management module, comprising a database update module and a background control module, the background control module being connected to the multithreaded crawler module to monitor whether the multiple crawler threads exist and whether the number of crawler threads reaches a threshold.
  2. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that the establishment of the proxy IP pool module mainly comprises the following steps:
    S1, obtaining a number of proxy IPs to form a proxy IP source queue;
    S2, testing each proxy IP in the proxy IP source queue, and identifying and storing the valid proxy IPs;
    S3, calculating the priority value of each valid proxy IP according to its access success rate and access response time when accessing the target website;
    S4, encapsulating a web application interface that selects the proxy IP with the highest priority value as the request IP;
    S5, dynamically updating the proxy IP pool module at a preset time interval.
  3. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 2, characterized in that step S1 is specifically: periodically crawling a number of proxy IPs from multiple proxy IP websites, or purchasing multiple proxy IPs from a third-party service provider, to form the proxy IP source queue.
  4. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 2, characterized in that: the priority value of a valid proxy IP in step S3 is calculated as $\mathrm{Priority}_i = 0.7\,x_{i1}^{*} + 0.3\,x_{i2}^{*}$, $i = 1, 2, \ldots, n$, with $x_{i1}^{*} = (x_{i1} - \min x_1)/(\max x_1 - \min x_1)$ and $x_{i2}^{*} = (x_{i2} - \min x_2)/(\max x_2 - \min x_2)$, where $n$ is the number of proxy IPs; $x_{i1}^{*}$ and $x_{i2}^{*}$ respectively represent the normalized access success rate value and the normalized reciprocal of the access response time of the $i$-th proxy IP, $x_{i1}$ and $x_{i2}$ being the corresponding raw values; $\max x_1$ and $\max x_2$ respectively denote the maximum access success rate value and the maximum reciprocal of the access response time in the current proxy IP pool module; and $\min x_1$ and $\min x_2$ respectively denote the minimum access success rate value and the minimum reciprocal of the access response time in the current proxy IP pool module.
  5. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that the establishment of the Cookies pool module mainly comprises the following steps:
    A, establishing an account pool for the target website;
    B, at a preset time interval, randomly selecting a valid account, simulating a login to the target website, and recognizing the login verification code with a deep learning method, so as to obtain Cookies;
    C, storing the valid Cookies in the Cookies pool module;
    D, periodically checking the validity of each Cookie in the Cookies pool module, and deleting invalid Cookies from the Cookies pool module.
  6. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 5, characterized in that: the account pool in step A comprises a valid pool for storing valid account data and an expired pool for storing expired account data; the valid pool is provided with a threshold marking its minimum capacity, and when the amount of account data in the valid pool falls below the threshold, account data is automatically obtained from the expired pool, so as to keep the amount of account data in the valid pool at twice the threshold or more.
  7. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that: the URL task queue is constructed by the task queue generation module according to information such as the web page architecture, rendering mode, and web page content of the target website.
  8. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that: the multithreaded crawler module starts multiple crawler threads that crawl simultaneously, and the multithreaded crawler module includes a message queue module; the message queue module is formed from the exception messages returned when the multiple crawler threads crawl, and the message queue module sends the exception messages to the resource scheduling module and initiates a resource request to the resource scheduling module.
  9. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 8, characterized in that: a message queue listening thread is provided in the resource scheduling module to monitor in real time the messages sent by the message queue module and to perform resource scheduling according to the message content, and the resource scheduling module provides valid proxy IP and Cookie information to the multithreaded crawler module on a first-in, first-out basis.
  10. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 8, characterized in that: the database update module is used to receive the normal information returned when the multiple crawler threads crawl, and to store the normal information in a relational database.
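The account pool behaviour recited in claim 6 can be illustrated with the short Python sketch below. It is a hedged example only: the class name `AccountPool`, its methods, and the sample account names are hypothetical, and the sketch assumes that accounts taken from the expired pool become usable again once moved, a step the claims do not detail.

```python
import random


class AccountPool:
    """Hypothetical two-part account pool: a valid pool and an expired pool."""

    def __init__(self, valid, expired, threshold):
        self.valid = list(valid)      # valid account data
        self.expired = list(expired)  # expired account data
        self.threshold = threshold    # minimum capacity mark of the valid pool

    def refill_if_low(self):
        """Move accounts from the expired pool when the valid pool is too small.

        Returns the number of accounts moved.
        """
        if len(self.valid) >= self.threshold:
            return 0
        moved = 0
        # Refill until the valid pool holds at least twice the threshold,
        # or until the expired pool is exhausted.
        while len(self.valid) < 2 * self.threshold and self.expired:
            self.valid.append(self.expired.pop())
            moved += 1
        return moved

    def pick(self, rng=random):
        """Randomly select a valid account for a simulated login."""
        self.refill_if_low()
        return rng.choice(self.valid)


pool = AccountPool(valid=["a"], expired=["b", "c", "d", "e"], threshold=2)
moved = pool.refill_if_low()
```

Refilling to twice the threshold rather than just up to it avoids triggering another refill as soon as a single account expires again.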
CN201811481201.2A 2018-12-05 2018-12-05 The height of multithreading intelligent scheduling is hidden crawler system Pending CN109508422A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811481201.2A CN109508422A (en) 2018-12-05 2018-12-05 The height of multithreading intelligent scheduling is hidden crawler system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811481201.2A CN109508422A (en) 2018-12-05 2018-12-05 The height of multithreading intelligent scheduling is hidden crawler system

Publications (1)

Publication Number Publication Date
CN109508422A true CN109508422A (en) 2019-03-22

Family

ID=65752588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811481201.2A Pending CN109508422A (en) 2018-12-05 2018-12-05 The height of multithreading intelligent scheduling is hidden crawler system

Country Status (1)

Country Link
CN (1) CN109508422A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457556A (en) * 2019-07-04 2019-11-15 重庆金融资产交易所有限责任公司 Distributed reptile system architecture, the method and computer equipment for crawling data
CN110489626A (en) * 2019-08-05 2019-11-22 苏州闻道网络科技股份有限公司 A kind of information collecting method and device
CN111104578A (en) * 2019-12-18 2020-05-05 北京阿尔山区块链联盟科技有限公司 Crawler system, method and server
CN111711617A (en) * 2020-05-29 2020-09-25 北京金山云网络技术有限公司 Method and device for detecting web crawler, electronic equipment and storage medium
CN111741141A (en) * 2020-06-15 2020-10-02 重庆帮企科技集团有限公司 Method and system for realizing efficient IP proxy pool and data acquisition method
CN111881337A (en) * 2020-08-06 2020-11-03 成都信息工程大学 Data acquisition method and system based on Scapy framework and storage medium
CN112416929A (en) * 2020-11-17 2021-02-26 四川长虹电器股份有限公司 Retrieval library management and data retrieval method based on mysql and java
WO2021047004A1 (en) * 2019-09-11 2021-03-18 苏州朗动网络科技有限公司 Ip proxy pool management method and device, and storage medium
CN117633329A (en) * 2024-01-26 2024-03-01 中国人民解放军军事科学院系统工程研究院 Data acquisition method and system for multiple data sources
CN117714537A (en) * 2024-02-06 2024-03-15 湖南四方天箭信息科技有限公司 Access method, device, terminal and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110246531A1 (en) * 2007-12-21 2011-10-06 Mcafee, Inc., A Delaware Corporation System, method, and computer program product for processing a prefix tree file utilizing a selected agent
CN103902386A (en) * 2014-04-11 2014-07-02 复旦大学 Multi-thread network crawler processing method based on connection proxy optimal management
CN105956175A (en) * 2016-05-24 2016-09-21 考拉征信服务有限公司 Webpage content crawling method and device
CN107832355A (en) * 2017-10-23 2018-03-23 北京金堤科技有限公司 The method and device that a kind of agency of crawlers obtains
CN108345642A (en) * 2018-01-12 2018-07-31 深圳壹账通智能科技有限公司 Method, storage medium and the server of website data are crawled using Agent IP

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457556A (en) * 2019-07-04 2019-11-15 重庆金融资产交易所有限责任公司 Distributed reptile system architecture, the method and computer equipment for crawling data
CN110457556B (en) * 2019-07-04 2023-11-14 重庆金融资产交易所有限责任公司 Distributed crawler system architecture, method for crawling data and computer equipment
CN110489626A (en) * 2019-08-05 2019-11-22 苏州闻道网络科技股份有限公司 A kind of information collecting method and device
WO2021047004A1 (en) * 2019-09-11 2021-03-18 苏州朗动网络科技有限公司 Ip proxy pool management method and device, and storage medium
CN111104578A (en) * 2019-12-18 2020-05-05 北京阿尔山区块链联盟科技有限公司 Crawler system, method and server
CN111711617A (en) * 2020-05-29 2020-09-25 北京金山云网络技术有限公司 Method and device for detecting web crawler, electronic equipment and storage medium
CN111741141A (en) * 2020-06-15 2020-10-02 重庆帮企科技集团有限公司 Method and system for realizing efficient IP proxy pool and data acquisition method
CN111881337A (en) * 2020-08-06 2020-11-03 成都信息工程大学 Data acquisition method and system based on Scapy framework and storage medium
CN112416929A (en) * 2020-11-17 2021-02-26 四川长虹电器股份有限公司 Retrieval library management and data retrieval method based on mysql and java
CN117633329A (en) * 2024-01-26 2024-03-01 中国人民解放军军事科学院系统工程研究院 Data acquisition method and system for multiple data sources
CN117714537A (en) * 2024-02-06 2024-03-15 湖南四方天箭信息科技有限公司 Access method, device, terminal and storage medium
CN117714537B (en) * 2024-02-06 2024-04-16 湖南四方天箭信息科技有限公司 Access method, device, terminal and storage medium

Similar Documents

Publication Publication Date Title
CN109508422A (en) The height of multithreading intelligent scheduling is hidden crawler system
US20200218658A1 (en) Invalidation and refresh of multi-tier distributed caches
CN106874487A (en) A kind of distributed reptile management system and its method
US10262271B1 (en) Systems and methods for modeling machine learning and data analytics
CN105243159B (en) A kind of distributed network crawler system based on visualization script editing machine
CN103902386B (en) Multi-thread network crawler processing method based on connection proxy optimal management
US10747670B2 (en) Reducing latency by caching derived data at an edge server
US8954971B2 (en) Data collecting method, data collecting apparatus and network management device
CN108322541B (en) Self-adaptive distributed system architecture
CN106778253A (en) Threat context aware information security Initiative Defense model based on big data
US20110264704A1 (en) Methods and Systems for Deleting Large Amounts of Data From a Multitenant Database
CN106664254A (en) Optimizing network traffic management in a mobile network
TW201237653A (en) Sending product information based on determined preference values
CN104767653B (en) A kind of method and apparatus of network interface monitoring
CN106484713A (en) A kind of based on service-oriented Distributed Request Processing system
CN109933701A (en) A kind of microblog data acquisition methods based on more strategy fusions
US20090204575A1 (en) Modular web crawling policies and metrics
Li et al. SEER-MCache: A prefetchable memory object caching system for IoT real-time data processing
CN107844402A (en) A kind of resource monitoring method, device and terminal based on super fusion storage system
WO2019109798A1 (en) Method, device, terminal and storage medium for loading resource
CN108804679A (en) A kind of operation system user's operation monitoring data method for visualizing
US20120084856A1 (en) Gathering, storing and using reputation information
Aldin et al. Strict timed causal consistency as a hybrid consistency model in the cloud environment
CN107491463A (en) The optimization method and system of data query
CN107958052A (en) A kind of access method and device of large scale network crawlers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190322
