CN109508422A - High-anonymity crawler system with multithreaded intelligent scheduling - Google Patents
- Publication number
- CN109508422A CN201811481201.2A
- Authority
- CN
- China
- Prior art keywords
- module
- proxy
- pool
- crawler
- multithreading
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
Abstract
The present invention provides a high-anonymity crawler system with multithreaded intelligent scheduling. It mainly comprises six modules: a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multithreaded crawler module, a task queue generation module, and a background management module. The modules are interconnected and cooperate with one another, which improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently and a large search library can be built.
Description
Technical field
The present invention relates to a high-anonymity crawler system with multithreaded intelligent scheduling, and belongs to the technical field of computer networks.
Background art
Human society has entered the era of big data. With the rapid development of the Internet, the mobile Internet, social networks, and the like, big data that is enormous in volume, highly varied, and generated and updated anytime and anywhere carries unprecedented social and commercial value. The acquisition, processing, and analysis of big data, and intelligent applications built on it, have become key to improving the future competitiveness of enterprises.
A web crawler is an efficient information collection tool that can acquire the data resources we need quickly and accurately. Traditional web crawling methods are easily blocked when a website applies an "anti-crawling" strategy. In particular, when crawling websites that require login, such as GitHub or Weibo, some pages and some interfaces can be accessed without logging in, but crawling without logging in has drawbacks: first, pages protected by login permissions cannot be crawled normally; second, frequent requests made without logging in are easily rate-limited by the website, or the IP is blocked outright; third, if a single account accesses the site frequently or at regular intervals, the website may identify it as a crawler script and block the account. Therefore, to acquire the required data accurately and efficiently, targeted countermeasures are needed.
In view of this, it is necessary to provide a high-anonymity crawler system with multithreaded intelligent scheduling to solve the above problems.
Summary of the invention
The object of the present invention is to provide a high-anonymity crawler system with multithreaded intelligent scheduling. The system can crawl efficiently and improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently and a large search library can be built.
To achieve the above object, the present invention provides a high-anonymity crawler system with multithreaded intelligent scheduling, comprising:
a proxy IP pool module, for obtaining valid proxy IPs;
a Cookies pool module, for obtaining valid Cookies information;
a resource scheduling module, connected to the proxy IP pool module and the Cookies pool module respectively, for controlling the intelligent scheduling of the proxy IP pool module and the Cookies pool module in real time, so as to dispatch the proxy IP with the highest priority value and the newest Cookie information;
a multithreaded crawler module, connected to the resource scheduling module, for sending resource requests to the resource scheduling module and receiving the valid proxy IP and Cookie information issued by the resource scheduling module;
a task queue generation module, connected to the multithreaded crawler module, so that the multithreaded crawler module randomly selects a valid URL from the URL task queue and crawls the target website;
a background management module, comprising a database update module and a background control module, the background control module being connected to the multithreaded crawler module to monitor whether the crawler threads exist and whether the number of crawler threads reaches a threshold.
As a further improvement of the present invention, establishing the proxy IP pool module mainly comprises the following steps:
S1, obtaining several proxy IPs to form a proxy IP source queue;
S2, testing each proxy IP in the proxy IP source queue, and identifying and storing the valid proxy IPs;
S3, calculating the priority value of each valid proxy IP from its access success rate and access response time when accessing the target website;
S4, encapsulating a web application interface and selecting the proxy IP with the highest priority value as the request IP;
S5, dynamically updating the proxy IP pool module at a preset time interval.
As a further improvement of the present invention, step S1 specifically comprises: periodically crawling several proxy IPs from multiple proxy IP websites, or purchasing multiple proxy IPs from a third-party service provider, to form the proxy IP source queue.
As a further improvement of the present invention, the priority value of a valid proxy IP in step S3 is calculated as $\mathrm{Priority}_i = 0.7\,x_{i1}^{*} + 0.3\,x_{i2}^{*}$, $i = 1, 2, \ldots, n$, with $x_{i1}^{*} = (x_{i1} - \min x_1)/(\max x_1 - \min x_1)$ and $x_{i2}^{*} = (x_{i2} - \min x_2)/(\max x_2 - \min x_2)$, where $n$ is the number of proxy IPs; $x_{i1}$ and $x_{i2}$ are the access success rate of the i-th proxy IP and the reciprocal of its access response time, and $x_{i1}^{*}$ and $x_{i2}^{*}$ are their normalized values; $\max x_1$ and $\max x_2$ are the maximum access success rate and the maximum reciprocal access response time in the current proxy IP pool module; and $\min x_1$ and $\min x_2$ are the corresponding minimum values in the current proxy IP pool module.
As a further improvement of the present invention, establishing the Cookies pool module mainly comprises the following steps:
A, establishing an account pool for the target website;
B, at a preset time interval, randomly selecting a valid account, simulating a login to the target website, and recognizing the login verification code with a deep learning method, so as to obtain Cookies;
C, storing the valid Cookies in the Cookies pool module;
D, periodically checking the validity of each Cookie in the Cookies pool module, and deleting invalid Cookies from the Cookies pool module.
As a further improvement of the present invention, in step A the account pool comprises a valid pool for storing valid account data and an expired pool for storing expired account data. The valid pool has a threshold that marks its minimum capacity; when the amount of account data in the valid pool falls below the threshold, account data is automatically fetched from the expired pool so that the amount of account data in the valid pool is at least twice the threshold.
As a further improvement of the present invention, the URL task queue is built by the task queue generation module from information such as the page structure, rendering mode, and page content of the target website.
As a further improvement of the present invention, the multithreaded crawler module starts multiple crawler threads that crawl simultaneously. The multithreaded crawler module comprises a message queue module, which consists of the exception messages returned by the crawler threads while crawling; the message queue module sends the exception messages to the resource scheduling module and initiates resource requests to the resource scheduling module.
As a further improvement of the present invention, the resource scheduling module contains a message queue listening thread that listens in real time for messages sent by the message queue module and schedules resources according to the message content; the resource scheduling module provides valid proxy IP and Cookie information to the multithreaded crawler module on a first-in, first-out basis.
As a further improvement of the present invention, the database update module receives the normal information returned by the crawler threads while crawling, and stores that information in a relational database.
The beneficial effects of the present invention are as follows: the high-anonymity crawler system with multithreaded intelligent scheduling has six modules (a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multithreaded crawler module, a task queue generation module, and a background management module), and the modules are interconnected and cooperate with one another. This improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently and a large search library can be built.
Brief description of the drawings
Fig. 1 is a structural and functional diagram of the high-anonymity crawler system with multithreaded intelligent scheduling of the present invention.
Fig. 2 is a flow chart of building the proxy IP pool module in Fig. 1.
Fig. 3 is a flow chart of building the Cookies pool module in Fig. 1.
Fig. 4 is a schematic structural diagram of the resource scheduling module in Fig. 1.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, the present invention discloses a high-anonymity crawler system with multithreaded intelligent scheduling. When the target website applies an "anti-crawling" strategy, the system can still crawl efficiently, improving the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently and a large search library can be built.
The high-anonymity crawler system with multithreaded intelligent scheduling mainly comprises the following six modules: a proxy IP pool module, a Cookies pool module, a resource scheduling module, a multithreaded crawler module, a task queue generation module, and a background management module.
The proxy IP pool module obtains valid proxy IPs. The Cookies pool module obtains valid Cookies information. The resource scheduling module is connected to the proxy IP pool module and the Cookies pool module respectively, and controls the intelligent scheduling of both pool modules in real time, so as to dispatch the proxy IP with the highest priority value and the newest Cookie information.
The multithreaded crawler module is connected to the resource scheduling module. It sends resource requests to the resource scheduling module and receives the valid proxy IP and Cookie information that the resource scheduling module issues.
The task queue generation module is connected to the multithreaded crawler module, so that the multithreaded crawler module randomly selects a valid URL from the URL task queue and crawls the target website. The URL task queue is built by the task queue generation module from information such as the page structure, rendering mode, and page content of the target website.
The background management module comprises a database update module and a background control module. The background control module is connected to the multithreaded crawler module to monitor whether the crawler threads exist and whether the number of crawler threads reaches a threshold.
The six modules are described in detail below.
Referring to Fig. 2 in conjunction with Fig. 1, establishing the proxy IP pool module mainly comprises the following five steps:
S1, obtaining several proxy IPs to form a proxy IP source queue;
S2, testing each proxy IP in the proxy IP source queue, and identifying and storing the valid proxy IPs;
S3, calculating the priority value of each valid proxy IP from its access success rate and access response time when accessing the target website;
S4, encapsulating a web application interface and selecting the proxy IP with the highest priority value as the request IP;
S5, dynamically updating the proxy IP pool module at a preset time interval.
Step S1 specifically comprises: periodically crawling several proxy IPs from multiple proxy IP websites, or purchasing multiple proxy IPs from a third-party service provider, to form the proxy IP source queue. Preferably, to ensure the validity of the proxy IPs, they should be obtained from different sources, high-anonymity proxies should be crawled wherever possible, and the obtained proxy IPs are added to the proxy IP source queue.
Step S2 specifically comprises: with the maximum capacity of the proxy IP pool module preset to T, traversing the proxy IP source queue, testing each proxy IP in the queue to determine whether it is valid, and storing the valid proxy IPs in the proxy IP pool module. Specifically, each proxy IP is used to access the target website; if the access succeeds, the proxy IP is judged valid and is put into the database of the proxy IP pool module; if the access fails, the proxy IP is judged invalid and is not put into the proxy IP pool module.
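The filtering in step S2 can be sketched as follows. This is only an illustration under stated assumptions: `build_pool` and `can_reach_target` are hypothetical names, the check is injected as a callable (in practice it would issue one request to the target website through the proxy), and stopping once T valid proxies are stored is an assumption, since the text only says that T is the maximum capacity.

```python
from collections import deque


def build_pool(source_queue, can_reach_target, capacity):
    """Keep proxies that can reach the target site, up to `capacity` (T).

    `can_reach_target` is a hypothetical callable: proxy -> bool.
    """
    pool = []
    pending = deque(source_queue)
    while pending and len(pool) < capacity:
        proxy = pending.popleft()
        if can_reach_target(proxy):   # one test access through the proxy
            pool.append(proxy)        # valid: store it in the pool
        # invalid proxies are simply dropped
    return pool
```

Passing the check as a parameter keeps the sketch independent of any particular HTTP library.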
In step S3, to maintain the validity of the proxy IP pool module, the proxy IPs in the database are tested at preset intervals, with a link on the target website set as the test link. So that the most effective proxy IP, rather than a randomly selected one, is used each time during crawling, the priority value of each valid proxy IP should be calculated in every test round from its access success rate and access response time when accessing the target website. The access success rate is the proportion of successful accesses among all accesses within a preset time; the access response time is the time from issuing a request to receiving the target website's response; and the priority value reflects how likely each proxy IP is to be scheduled.
The priority value is calculated as follows:
1. Aligning the direction of the indices. The access success rate is a positive index for the priority value: the higher the access success rate, the more effective the proxy IP and the larger the priority value. The access response time is a negative index: the shorter the access response time, the more effective the proxy IP and the larger the priority value. The reciprocal of the access response time is therefore taken as a new index in place of the access response time, so that both indices are positive indices for the proxy IP priority value; they are denoted $x_1$ and $x_2$ respectively.
2. Normalizing the index data. To eliminate the influence of units on the priority calculation, 0-1 normalization is used to map the two indices into the interval [0, 1]:
$x_{i1}^{*} = (x_{i1} - \min x_1)/(\max x_1 - \min x_1)$, $x_{i2}^{*} = (x_{i2} - \min x_2)/(\max x_2 - \min x_2)$
where $i = 1, 2, \ldots, n$ and $n$ is the number of proxy IPs; $x_{i1}$ and $x_{i2}$ are the access success rate of the i-th proxy IP and the reciprocal of its access response time, and $x_{i1}^{*}$ and $x_{i2}^{*}$ are their normalized values; $\max x_1$ and $\max x_2$ are the maximum access success rate and the maximum reciprocal access response time in the current proxy IP pool module; and $\min x_1$ and $\min x_2$ are the corresponding minimum values in the current proxy IP pool module.
3. Since the access success rate is more important than the access response time, the weights of the two indices are set to 0.7 and 0.3 respectively. The priority value of each valid proxy IP in the current proxy IP pool module can therefore be defined as $\mathrm{Priority}_i = 0.7\,x_{i1}^{*} + 0.3\,x_{i2}^{*}$, $i = 1, 2, \ldots, n$, where $n$ is the number of proxy IPs.
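The three steps above can be sketched in a few lines. The class and function names are illustrative, not from the source. One assumption is labeled in the code: when all proxies share the same value for an index, the normalization formula divides by zero, so this sketch returns 0 for that index.

```python
from dataclasses import dataclass


@dataclass
class ProxyStats:
    """Measured statistics for one proxy (field names are illustrative)."""
    address: str
    success_rate: float    # successful accesses / total accesses
    response_time: float   # seconds from request to response, > 0


def min_max(values):
    """0-1 normalization. Assumption: all-equal input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priorities(proxies, w_success=0.7, w_speed=0.3):
    """Return {address: Priority_i} with Priority_i = 0.7*x_i1* + 0.3*x_i2*."""
    x1 = min_max([p.success_rate for p in proxies])          # positive index
    x2 = min_max([1.0 / p.response_time for p in proxies])   # reciprocal time
    return {p.address: w_success * a + w_speed * b
            for p, a, b in zip(proxies, x1, x2)}
```

The request IP of step S4 is then the key with the largest value, for example `max(result, key=result.get)`.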
In step S4, obtaining proxy IP data by connecting directly to the database requires configuring the connection information, which can easily expose it. Valid proxy IPs are therefore obtained through an encapsulated web application interface (Web API) for accessing the proxy IP pool module. In addition, to ensure that the selected proxy IP is highly effective, the proxy IP with the highest priority value is chosen from the proxy IP pool module each time as the request IP (that is, the crawler's proxy IP).
In step S5, to ensure that the proxy IPs in the proxy IP pool module remain highly effective, the proxy IP pool module is updated once every preset time T0. Specifically, every preset time T0 a proxy IP source queue is obtained, consisting of proxy IPs crawled from multiple proxy IP websites, proxy IPs purchased from third-party service providers, and the proxy IPs already in the existing proxy IP pool. Each proxy IP in the queue is tested and invalid proxy IPs are discarded; for the valid proxy IPs, priority values are calculated one by one according to step S3. If the current proxy IP pool module is filled to less than 2/3 of its capacity, the valid proxy IPs are added; otherwise they replace proxy IPs with lower priority values, so as to build an efficient proxy IP pool module.
Referring to Fig. 3 in conjunction with Fig. 1, establishing the Cookies pool module mainly comprises the following four steps:
A, establishing an account pool for the target website;
B, at a preset time interval, randomly selecting a valid account, simulating a login to the target website, and recognizing the login verification code with a deep learning method, so as to obtain Cookies;
C, storing the valid Cookies in the Cookies pool module;
D, periodically checking the validity of each Cookie in the Cookies pool module, and deleting invalid Cookies from the Cookies pool module.
In step A, the account pool comprises a valid pool for storing valid account data and an expired pool for storing expired account data. The valid pool has a threshold that marks its minimum capacity; when the amount of account data in the valid pool falls below the threshold, account data is automatically fetched from the expired pool so that the amount of account data in the valid pool is at least twice the threshold. In addition, to keep the account data in the account pool valid and the Cookies pool module dynamically updated, the account data in the account pool is replaced once per cycle.
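The two-pool bookkeeping of step A might look like the sketch below. `AccountPool` and its method names are hypothetical, accounts are plain `(username, password)` tuples, and the expiry time mentioned in step B is left out to keep the example short.

```python
import random


class AccountPool:
    """Valid pool plus expired pool, with a minimum-capacity threshold."""

    def __init__(self, accounts, threshold):
        self.valid = list(accounts)   # accounts usable for login
        self.expired = []             # accounts whose login failed
        self.threshold = threshold    # marks the minimum capacity

    def pick(self):
        """Randomly pick a valid account for a simulated login."""
        return random.choice(self.valid)

    def mark_failed(self, account):
        """Move an account whose login failed into the expired pool."""
        self.valid.remove(account)
        self.expired.append(account)
        self.refill()

    def refill(self):
        """Below the threshold, pull accounts back until 2x the threshold."""
        if len(self.valid) >= self.threshold:
            return
        while self.expired and len(self.valid) < 2 * self.threshold:
            self.valid.append(self.expired.pop(0))   # oldest first
```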
Step B specifically comprises: at preset intervals, starting multiple account-fetching threads, each of which randomly takes one account (username and password) from the valid pool and simulates a login to the target website, using a suitably designed deep learning method to recognize the login verification code. If the login succeeds, the obtained Cookies information is returned to the storage module for storage. If the login fails, another account is chosen at random from the valid pool and the simulated login request is retried, while the account that failed to log in is moved into the expired pool and given an expiry time.
Step C specifically comprises: storing the valid Cookies information obtained in step B, together with the account's username and password, in the Cookies pool module, and moving accounts in the expired pool that have reached their expiry time back into the valid pool, so that account resources are used effectively. In addition, some data access interfaces should be provided to make operations such as querying and access easier.
Step D specifically comprises: checking the validity of each Cookie in the Cookies pool module at preset intervals and deleting any invalid Cookie from the Cookies pool module. Specifically, a scheduled checking module can be added and a corresponding test link set; all Cookies are traversed and used to request the test link, invalid ones are deleted, and valid ones are kept as Cookies available for use.
The task queue generation module builds the URL task queue to be crawled from information such as the page structure, rendering mode, and page content of the target website. In general, a website's URLs follow a certain pattern, and the URLs to be crawled can be "spliced" together in advance according to that pattern. For example, the search results generated by the Baidu search engine span many pages; the URLs of page i and page i+1 are mostly identical and differ only in the pn parameter, which controls paging: the pn value of page i is i*10, and the pn value of page i+1 is (i+1)*10. Similar patterns exist for other kinds of web pages.
Therefore, following this pattern, the task queue generation module can set in advance the URL pattern of the pages to be crawled and the number of pages to crawl, then generate a URL task queue with a loop, write it to the database, and set the crawl status of these URL tasks to "not crawled".
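The loop described above can be sketched as follows. Only the `pn` parameter and the `i*10` rule come from the text; the base URL, the `wd` query parameter, the status string, and counting pages from 0 are placeholders chosen for illustration, and the list stands in for the database table.

```python
from urllib.parse import urlencode


def build_url_tasks(base_url, keyword, pages, page_size=10):
    """Generate one task per result page; page i gets pn = i * page_size."""
    tasks = []
    for i in range(pages):   # assumption: pages are counted from 0
        query = urlencode({"wd": keyword, "pn": i * page_size})
        tasks.append({"url": f"{base_url}?{query}",
                      "status": "not_crawled"})
    return tasks
```

In a real deployment each dictionary would be written as one database row, with the status updated once a crawler thread finishes the page.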
The multithreaded crawler module starts multiple crawler threads that crawl the target website simultaneously, and receives the crawl information returned by the crawler threads or the target website. The multithreaded crawler module comprises a message queue module, which consists of exception messages formed by encapsulating the exception information found in the crawl information returned by the crawler threads. The message queue module sends the exception messages to the resource scheduling module and initiates resource requests to it. The resource scheduling module provides valid proxy IP and Cookie information to the multithreaded crawler module according to the content of the exception messages, on a first-in, first-out basis.
Specifically, the multithreaded crawler module receives the crawl information returned by the crawler threads and encapsulates any exception information in it into exception messages, forming a message queue. The message queue then initiates resource requests to the resource scheduling module. On a first-in, first-out (FIFO) basis, the resource scheduling module each time provides valid proxy IP and Cookie information to the crawler thread at the head of the queue, according to the scheduling strategy and the content of the exception message. After obtaining resources, each crawler thread randomly selects a valid URL from the URL task queue, crawls the target website, and performs a self-check on the crawl information returned by the target website. If the returned crawl information is normal, the thread continues crawling with the same resources and sends the normal crawl information to the database update module. If the returned crawl information is abnormal, the exception information is encapsulated into an exception message and added to the message queue, requesting the resource scheduling module to replace the proxy IP or Cookie information so that the crawl can be retried.
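The interaction between crawler threads and the resource scheduling module can be sketched with the standard `queue` and `threading` modules. This is an illustration only: `fetch`, `pick_proxy`, and `pick_cookie` are hypothetical callables supplied by the caller, URLs are taken in queue order rather than at random, a `None` sentinel stops the scheduler, and the retry limit is an added safeguard that is not in the text.

```python
import queue
import threading


def scheduler(messages, pick_proxy, pick_cookie):
    """Serve exception messages in FIFO order until a None sentinel arrives."""
    while True:
        msg = messages.get()
        if msg is None:
            break
        resource = {"proxy": pick_proxy(), "cookie": pick_cookie()}
        msg["reply"].put(resource)   # hand the resources back to the thread


def crawl_worker(urls, fetch, messages, results, max_retries=3):
    """Crawl URLs; on an abnormal result, request new resources and retry."""
    reply = queue.Queue(maxsize=1)
    resource = None
    while True:
        try:
            url = urls.get_nowait()
        except queue.Empty:
            return                   # no tasks left
        for _ in range(max_retries):
            if resource is None:     # enqueue a resource request
                messages.put({"url": url, "reply": reply})
                resource = reply.get()
            ok, body = fetch(url, resource)
            if ok:
                results.put((url, body))   # normal: keep the same resources
                break
            resource = None          # abnormal: ask for fresh resources
```

A typical run starts one `scheduler` thread and several `crawl_worker` threads that share the same `urls`, `messages`, and `results` queues.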
The self-check performed by the multithreaded crawler module covers two things: whether the request was a correct network request, determined by checking the server's response status code; and whether the request was identified by the server as a crawler request, determined mainly by checking whether the returned crawl information contains certain keywords such as "verification code" or "requests too frequent".
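A minimal sketch of this two-part self-check follows. The keyword strings are English stand-ins for the examples in the text, treating only status code 200 as a correct request is an assumption, and the returned labels are illustrative.

```python
# Illustrative keywords; a real deployment would use the target site's wording.
BLOCK_KEYWORDS = ("verification code", "requests too frequent")


def self_check(status_code, body, keywords=BLOCK_KEYWORDS):
    """Classify one response as 'ok', 'bad_status', or 'flagged_as_crawler'."""
    if status_code != 200:           # assumption: only 200 counts as correct
        return "bad_status"
    if any(k in body for k in keywords):
        return "flagged_as_crawler"  # the site answered with a block page
    return "ok"
```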
The multithreaded crawler module can thus crawl the target website with multiple threads and, combined with the message queue mechanism, improves crawling efficiency and makes the crawler more flexible and robust.
The resource scheduling module controls the intelligent scheduling of the proxy IP pool module and the Cookies pool module in real time.
Referring to Fig. 4 in conjunction with Fig. 1, the resource scheduling module contains a message queue listening thread that listens in real time for messages sent by the message queue module and schedules resources according to the message content. Specifically, if the message queue listening thread finds that the exception information returned by a crawler thread indicates a failure, it determines from the failure information whether the proxy IP or the Cookie has failed. If the proxy IP has failed, the priority-value scheduling algorithm is used to select the proxy IP with the highest priority value to replace the current crawler IP; if the Cookie has failed, the newest Cookie is selected as the replacement. The message queue listening thread has one main thread and multiple slave threads; when the background listening service is started, a slave thread can take over if the main thread cannot work.
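The replacement decision can be sketched as a small pure function. The message format (`{"failure": "proxy"}` or `{"failure": "cookie"}`) and the function name are hypothetical, and excluding the proxy that has just failed from the candidates is an added assumption.

```python
def reschedule(message, proxy_priorities, cookies, current):
    """Return updated resources for a crawler thread after a failure.

    proxy_priorities: dict address -> priority value.
    cookies: list ordered from oldest to newest.
    current: {'proxy': ..., 'cookie': ...} in use by the thread.
    """
    updated = dict(current)
    if message["failure"] == "proxy":
        # Assumption: skip the proxy that has just failed.
        candidates = {a: p for a, p in proxy_priorities.items()
                      if a != current["proxy"]}
        if candidates:
            updated["proxy"] = max(candidates, key=candidates.get)
    elif message["failure"] == "cookie":
        updated["cookie"] = cookies[-1]   # newest Cookie
    return updated
```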
The resource scheduling module also contains a proxy IP scheduling thread, a Cookies scheduling thread, and a resource transmission thread. The proxy IP scheduling thread fetches the proxy IP with the highest priority value from the proxy IP pool module; the Cookies scheduling thread fetches the newest Cookie information from the Cookies pool module; and the resource transmission thread passes the fetched proxy IP and Cookie information to the multithreaded crawler module, so that its crawler threads can crawl.
The background management module comprises a database update module and a background control module. The database update module receives the normal crawl information returned by the crawler threads and stores it in the relational database MySQL.
The background control module ensures that the crawler threads run stably in the background. Specifically, it monitors whether the crawler threads exist and whether their number reaches a threshold. If the number of crawler threads is at or above the threshold, the control script exits normally; otherwise it starts new crawler threads until the number of crawler threads reaches the threshold.
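The check performed by the background control module can be sketched as follows, assuming the crawler threads are Python `threading.Thread` objects held in a list. `ensure_crawlers` and `target` are hypothetical names, and a real control script would more likely inspect processes than in-memory thread objects.

```python
import threading


def ensure_crawlers(threads, threshold, target):
    """Drop dead threads and start new ones until `threshold` are alive."""
    alive = [t for t in threads if t.is_alive()]
    while len(alive) < threshold:
        t = threading.Thread(target=target, daemon=True)
        t.start()                    # start a replacement crawler thread
        alive.append(t)
    return alive                     # already at threshold: nothing started
```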
To keep the background control module running in the background, the crontab service of the Linux operating system can be used. The crontab command is common on Unix and Unix-like operating systems and is used to schedule commands for periodic execution. It reads instructions from standard input and stores them in a "crontab" file for later reading and execution. A scheduled task is added to crontab so that the background control script is executed once per minute.
When the high-anonymity crawler system with multithreaded intelligent scheduling of the present invention is used, the proxy IP pool module first obtains valid proxy IPs and the Cookies pool module obtains valid Cookies. The resource scheduling module then dispatches the proxy IP with the highest priority value from the proxy IP pool module and the newest Cookie information from the Cookies pool module.
Next, the message queue module is initialized. If there are messages in the message queue module, they are sent to the resource scheduling module (this can also be understood as sending resource requests). The resource scheduling module schedules resources according to the message content picked up by the message queue listening thread, that is, it returns the best resource information (the proxy IP with the highest priority value and the newest Cookie information) to the multithreaded crawler module.
Finally, the multithreaded crawler module obtains the URL task queue from the task queue generation module and randomly selects valid URLs to crawl the target website. The multithreaded crawler module starts multiple crawler threads that crawl simultaneously and performs a self-check on the crawl information returned by the crawler threads or the target website. If the returned crawl information is normal, crawling continues with the same resources, and the normal crawl information is sent to the database update module, which stores it in the relational database MySQL. If the returned crawl information is abnormal, the exception information is encapsulated into an exception message and added to the message queue module, requesting the resource scheduling module to replace the proxy IP or Cookie so that the crawl can be retried.
It should be understood that the proxy IP pool module of the present invention is managed by priority value, so that it provides the resource scheduling module with the most effective proxy IP. The Cookies pool module builds an account pool and selects valid accounts for simulated login to the target website, so that valid, up-to-date Cookie information is available to the resource scheduling module. The resource scheduling module fetches the proxy IP with the highest priority value and the newest Cookie information in real time. The multithreaded crawler module not only crawls the target website with multiple threads but also updates proxy IP and Cookie information in real time through the message queue mechanism, which improves crawling efficiency and makes the crawler more flexible and robust.
In summary, for target websites that require login and apply "anti-crawling" measures, the high-anonymity crawler system with multithreaded intelligent scheduling of the present invention builds a stable and efficient proxy IP pool and Cookies pool, and uses a priority-value scheduling strategy and a message queue mechanism to update the proxy IP and Cookie information used by the crawler threads in real time. This improves the crawler's crawling efficiency, robustness, and stability in a distributed crawler environment, so that web page information can be aggregated quickly and efficiently and a large search library can be built.
The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or equivalently replaced without departing from its spirit and scope.
Claims (10)
- 1. A highly concealed crawler system with multithreaded intelligent scheduling, characterized by comprising: a proxy IP pool module for obtaining valid proxy IPs; a Cookies pool module for obtaining valid Cookies information; a resource scheduling module, connected to the proxy IP pool module and the Cookies pool module respectively, for controlling the intelligent scheduling of the proxy IP pool module and the Cookies pool module in real time so as to dispatch the proxy IP with the highest priority value and the newest Cookie information; a multithreading crawler module, connected to the resource scheduling module, for initiating resource requests to the resource scheduling module and receiving the valid proxy IP and Cookie information that the resource scheduling module issues; a task queue generation module, connected to the multithreading crawler module, so that the multithreading crawler module randomly selects valid URLs from the URL task queue to crawl the target website; and a back-end management module comprising a database update module and a background control module, the background control module being connected to the multithreading crawler module to monitor whether the crawler threads exist and whether the number of crawler threads reaches a threshold.
- 2. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that establishing the proxy IP pool module mainly comprises the following steps: S1: obtaining a number of proxy IPs to form a proxy IP source queue; S2: testing each proxy IP in the proxy IP source queue, and identifying and storing the valid proxy IPs; S3: calculating the priority value of each valid proxy IP from its access success rate and access response time when accessing the target website; S4: exposing a web application interface that selects the proxy IP with the highest priority value as the request IP; S5: dynamically updating the proxy IP pool module at a preset time interval.
- 3. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 2, characterized in that step S1 specifically comprises: periodically crawling a number of proxy IPs from multiple proxy IP websites, or purchasing multiple proxy IPs from a third-party service provider, to form the proxy IP source queue.
- 4. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 2, characterized in that the priority value of a valid proxy IP in step S3 is calculated as $\mathrm{Priority}_i = 0.7\,x_{i1}^{*} + 0.3\,x_{i2}^{*}$, $i = 1, 2, \ldots, n$, with $x_{i1}^{*} = (x_{i1} - \min x_1)/(\max x_1 - \min x_1)$ and $x_{i2}^{*} = (x_{i2} - \min x_2)/(\max x_2 - \min x_2)$, where $n$ is the number of proxy IPs; $x_{i1}^{*}$ and $x_{i2}^{*}$ are, respectively, the normalized access success rate and the normalized reciprocal of the access response time of the $i$-th proxy IP; $\max x_1$ and $\max x_2$ are, respectively, the maximum access success rate and the maximum reciprocal of the access response time in the current proxy IP pool module; and $\min x_1$ and $\min x_2$ are the corresponding minimum values in the current proxy IP pool module.
- 5. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that establishing the Cookies pool module mainly comprises the following steps: A: establishing an account pool for the target website; B: at a preset time interval, randomly selecting a valid account, simulating login to the target website, and recognizing the login verification code with a deep learning method, to obtain Cookies; C: storing the valid Cookies in the Cookies pool module; D: periodically checking the validity of each Cookie in the Cookies pool module and deleting invalid Cookies from the Cookies pool module.
- 6. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 5, characterized in that the account pool in step A comprises a valid pool for storing valid account data and an expired pool for storing expired account data; a threshold marking the minimum capacity is set for the valid pool; and when the amount of account data in the valid pool falls below the threshold, account data is automatically obtained from the expired pool so that the amount of account data in the valid pool is kept at no less than twice the threshold.
- 7. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that the URL task queue is constructed by the task queue generation module according to information such as the web page architecture, rendering mode, and web page content of the target website.
- 8. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 1, characterized in that the multithreading crawler module starts multiple crawler threads that crawl simultaneously; the multithreading crawler module comprises a message queue module, the message queue module consisting of the exception messages returned while the crawler threads crawl; and the message queue module sends the exception messages to the resource scheduling module and initiates resource requests to the resource scheduling module.
- 9. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 8, characterized in that a message queue listener thread is provided in the resource scheduling module to monitor in real time the messages sent by the message queue module and to schedule resources according to the message content; and the resource scheduling module supplies valid proxy IP and Cookie information to the multithreading crawler module on a first-in, first-out basis.
- 10. The highly concealed crawler system with multithreaded intelligent scheduling according to claim 8, characterized in that the database update module is configured to receive the normal information returned while the crawler threads crawl and to store that normal information in a relational database.
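The valid-pool and expired-pool arrangement of claim 6 can be sketched as follows. This is only an illustration under assumptions: the class name `AccountPool` and its methods are hypothetical, and because the claim does not say how an expired account becomes usable again, the sketch simply moves the oldest expired accounts back into the valid pool.

```python
from collections import deque


class AccountPool:
    """Valid and expired account pools with a minimum-capacity threshold."""

    def __init__(self, valid, expired, threshold):
        self.valid = deque(valid)
        self.expired = deque(expired)
        self.threshold = threshold

    def mark_expired(self, account):
        """Move an account from the valid pool to the expired pool."""
        self.valid.remove(account)
        self.expired.append(account)
        self.refill()

    def refill(self):
        """Top up the valid pool once it drops below the threshold."""
        if len(self.valid) >= self.threshold:
            return
        # Pull the oldest expired accounts until the valid pool holds at
        # least twice the threshold, or the expired pool runs out.
        while len(self.valid) < 2 * self.threshold and self.expired:
            self.valid.append(self.expired.popleft())
```

With three valid accounts, four expired accounts, and a threshold of 3, expiring one valid account drops the valid pool to two entries, which triggers a refill up to six.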
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811481201.2A CN109508422A (en) | 2018-12-05 | 2018-12-05 | The height of multithreading intelligent scheduling is hidden crawler system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811481201.2A CN109508422A (en) | 2018-12-05 | 2018-12-05 | The height of multithreading intelligent scheduling is hidden crawler system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109508422A (en) | 2019-03-22 |
Family
ID=65752588
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811481201.2A Pending CN109508422A (en) | 2018-12-05 | 2018-12-05 | The height of multithreading intelligent scheduling is hidden crawler system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109508422A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110246531A1 (en) * | 2007-12-21 | 2011-10-06 | Mcafee, Inc., A Delaware Corporation | System, method, and computer program product for processing a prefix tree file utilizing a selected agent |
CN103902386A (en) * | 2014-04-11 | 2014-07-02 | 复旦大学 | Multi-thread network crawler processing method based on connection proxy optimal management |
CN105956175A (en) * | 2016-05-24 | 2016-09-21 | 考拉征信服务有限公司 | Webpage content crawling method and device |
CN107832355A (en) * | 2017-10-23 | 2018-03-23 | 北京金堤科技有限公司 | The method and device that a kind of agency of crawlers obtains |
CN108345642A (en) * | 2018-01-12 | 2018-07-31 | 深圳壹账通智能科技有限公司 | Method, storage medium and the server of website data are crawled using Agent IP |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110457556A (en) * | 2019-07-04 | 2019-11-15 | 重庆金融资产交易所有限责任公司 | Distributed reptile system architecture, the method and computer equipment for crawling data |
CN110457556B (en) * | 2019-07-04 | 2023-11-14 | 重庆金融资产交易所有限责任公司 | Distributed crawler system architecture, method for crawling data and computer equipment |
CN110489626A (en) * | 2019-08-05 | 2019-11-22 | 苏州闻道网络科技股份有限公司 | A kind of information collecting method and device |
WO2021047004A1 (en) * | 2019-09-11 | 2021-03-18 | 苏州朗动网络科技有限公司 | Ip proxy pool management method and device, and storage medium |
CN111104578A (en) * | 2019-12-18 | 2020-05-05 | 北京阿尔山区块链联盟科技有限公司 | Crawler system, method and server |
CN111711617A (en) * | 2020-05-29 | 2020-09-25 | 北京金山云网络技术有限公司 | Method and device for detecting web crawler, electronic equipment and storage medium |
CN111741141A (en) * | 2020-06-15 | 2020-10-02 | 重庆帮企科技集团有限公司 | Method and system for realizing efficient IP proxy pool and data acquisition method |
CN111881337A (en) * | 2020-08-06 | 2020-11-03 | 成都信息工程大学 | Data acquisition method and system based on Scapy framework and storage medium |
CN112416929A (en) * | 2020-11-17 | 2021-02-26 | 四川长虹电器股份有限公司 | Retrieval library management and data retrieval method based on mysql and java |
CN117633329A (en) * | 2024-01-26 | 2024-03-01 | 中国人民解放军军事科学院系统工程研究院 | Data acquisition method and system for multiple data sources |
CN117714537A (en) * | 2024-02-06 | 2024-03-15 | 湖南四方天箭信息科技有限公司 | Access method, device, terminal and storage medium |
CN117714537B (en) * | 2024-02-06 | 2024-04-16 | 湖南四方天箭信息科技有限公司 | Access method, device, terminal and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190322 |
|
RJ01 | Rejection of invention patent application after publication |