CN107704323A - Web crawler task scheduling method and device - Google Patents
- Publication number
- CN107704323A (application number CN201711088266.6A)
- Authority
- CN
- China
- Prior art keywords
- task
- memory
- crawler
- crawler task
- memory priority
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5021—Priority
Abstract
The invention discloses a web crawler task scheduling method and device in the field of software engineering. It addresses the problem that existing crawler task scheduling reads and writes the database frequently, so the database easily blocks and efficiency is low. The method includes: a first scheduler receives a first crawler task and determines the type of the first crawler task according to its state; when the type is confirmed to be delayed processing, the first scheduler determines the execution time corresponding to the delayed processing and stores the first crawler task in a cache database; a second scheduler traverses the cache database in each update cycle and, on determining that the execution time has been reached, sends the first crawler task corresponding to that execution time to an in-memory priority queue; a third scheduler uses a time-slice round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn, until the first crawler task has been taken out of the in-memory priority queue.
Description
Technical field
The present invention relates to the field of software engineering, and more particularly to a web crawler task scheduling method and device.
Background art
A web crawler is a program that automatically fetches web pages. It downloads pages from the World Wide Web for a search engine and is an important component of the search engine.
Existing crawler task scheduling consists of a single scheduling module, which performs a variety of time-consuming operations in one loop: crawler task persistence, database duplicate checking, task priority sorting, timed task execution, crawler task status statistics, and so on. When the number of crawler tasks exceeds a few dozen, the number of concurrent crawler tasks reaches the thousands. Handling these tasks requires the scheduling module to read and write the database frequently, which places a very heavy load on the database and makes the scheduler as a whole very inefficient. Moreover, because the scheduler blocks on database operations, the system is often dormant: all system resources go to database I/O, while CPU and network bandwidth utilization drop below 1%, which wastes server resources severely.
In summary, existing crawler task scheduling requires frequent database reads and writes, so the database easily blocks and efficiency is low.
Summary of the invention
Embodiments of the present invention provide a web crawler task scheduling method and device, to solve the problem that existing crawler task scheduling requires frequent database reads and writes, so the database easily blocks and efficiency is low.
An embodiment of the present invention provides a web crawler task scheduling method, including:
A first scheduler receives a first crawler task and determines the type of the first crawler task according to the state of the first crawler task; when the type of the first crawler task is confirmed to be delayed processing, the first scheduler determines the execution time corresponding to the delayed processing and stores the first crawler task in a cache database;
A second scheduler traverses the cache database in each update cycle and, on determining that the execution time has been reached, sends the first crawler task corresponding to the execution time to an in-memory priority queue, and confirms the position of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks already in the in-memory priority queue;
A third scheduler uses a time-slice round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn, until the first crawler task has been taken out of the in-memory priority queue.
Preferably, after the type of the first crawler task is determined according to the state of the first crawler task, the method further includes:
When the type of the first crawler task is confirmed to be immediate processing, sending the first crawler task to the in-memory priority queue.
Preferably, the cache database includes multiple in-memory priority queues;
Sending the first crawler task corresponding to the execution time to the in-memory priority queue further includes:
Determining the position of the in-memory priority queue holding the first crawler task among the multiple in-memory priority queues, according to the priority of that in-memory priority queue and the priorities of the multiple in-memory priority queues.
Preferably, the third scheduler using the time-slice round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn includes:
The third scheduler uses the time-slice round-robin algorithm to determine the dequeue quantity allocated to each of the multiple in-memory priority queues of different priority, and determines the dequeue order of the first crawler task from the in-memory priority queue holding it, according to the dequeue quantity of that queue and the position of the first crawler task in that queue.
An embodiment of the present invention also provides a web crawler task scheduling apparatus, including:
A first scheduler, configured to receive a first crawler task and determine the type of the first crawler task according to the state of the first crawler task; and, when the type of the first crawler task is confirmed to be delayed processing, to determine the execution time corresponding to the delayed processing and store the first crawler task in a cache database;
A second scheduler, configured to traverse the cache database in each update cycle and, on determining that the execution time has been reached, send the first crawler task corresponding to the execution time to an in-memory priority queue, and confirm the position of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks already in the in-memory priority queue;
A third scheduler, configured to use a time-slice round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn, until the first crawler task has been taken out of the in-memory priority queue.
Preferably, the first scheduler is further configured to: when the type of the first crawler task is confirmed to be immediate processing, send the first crawler task to the in-memory priority queue.
Preferably, the cache database includes multiple in-memory priority queues;
The second scheduler is further configured to: determine the position of the in-memory priority queue holding the first crawler task among the multiple in-memory priority queues, according to the priority of that in-memory priority queue and the priorities of the multiple in-memory priority queues.
Preferably, the third scheduler is specifically configured to:
Use the time-slice round-robin algorithm to determine the dequeue quantity allocated to each of the multiple in-memory priority queues of different priority, and determine the dequeue order of the first crawler task from the in-memory priority queue holding it, according to the dequeue quantity of that queue and the position of the first crawler task in that queue.
In the embodiments of the present invention, a web crawler task scheduling method is provided, including: a first scheduler receives a first crawler task and determines the type of the first crawler task according to the state of the first crawler task; when the type of the first crawler task is confirmed to be delayed processing, the first scheduler determines the execution time corresponding to the delayed processing and stores the first crawler task in a cache database; a second scheduler traverses the cache database in each update cycle and, on determining that the execution time has been reached, sends the first crawler task corresponding to the execution time to an in-memory priority queue, and confirms the position of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks already in the in-memory priority queue; a third scheduler uses a time-slice round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn, until the first crawler task has been taken out of the in-memory priority queue. In the above method, priority scheduling of crawler tasks is achieved with distributed in-memory priority queues and the third scheduler. The cache database is mainly written by the first scheduler and read by the second scheduler, so the first and second schedulers do not block each other on the write bottleneck of the cache database, and read/write splitting can even be applied to the cache database to improve read efficiency. The first scheduler performs fast deduplication with an in-memory set, so task deduplication no longer needs to go through the cache database; the cache database therefore does not need to store all tasks, only delayed tasks and failed tasks. This not only greatly reduces the storage volume of the cache database but also reduces its read/write pressure, thereby improving efficiency.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a web crawler task scheduling method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a web crawler task scheduling method provided by embodiment one of the present invention;
Fig. 3 is a schematic structural diagram of a web crawler task scheduling apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a schematic flowchart of a web crawler task scheduling method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step 101: a first scheduler receives a first crawler task and determines the type of the first crawler task according to the state of the first crawler task; when the type of the first crawler task is confirmed to be delayed processing, the first scheduler determines the execution time corresponding to the delayed processing and stores the first crawler task in a cache database;
Step 102: a second scheduler traverses the cache database in each update cycle and, on determining that the execution time has been reached, sends the first crawler task corresponding to the execution time to an in-memory priority queue, and confirms the position of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks already in the in-memory priority queue;
Step 103: a third scheduler uses a time-slice round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn, until the first crawler task has been taken out of the in-memory priority queue.
It should be noted that the embodiments of the present invention mainly provide a task distribution and scheduling method that combines three scheduling modules, a multi-crawler priority queue based on a cluster of in-memory priority queues, and a cache database.
The three scheduling modules are the first scheduler, the second scheduler and the third scheduler. The first scheduler is used to create the in-memory priority queue and the cache database. Further, the control end of the cache database is the first scheduler and its output end is the second scheduler; that is, the first scheduler is responsible for writing tasks to the cache database, and the second scheduler reads tasks from it.
Specifically, the first scheduler performs first-pass processing on the received web crawler tasks: it deduplicates them with an in-memory set to filter out repeated tasks, distinguishes the received crawler tasks by state, and handles each class accordingly. The second scheduler is mainly responsible for taking pending crawler tasks out of the cache database at the set time and putting them into the in-memory priority queue of the corresponding crawler. The third scheduler mainly performs time-slice round-robin scheduling according to the priorities of the different web crawlers and how idle the exit queue is, taking the corresponding number of web crawler tasks from each web crawler's queue and distributing them to the exit queue for downstream modules to process.
In step 101, after receiving the first crawler task, the first scheduler first determines the state of the first crawler task and, according to that state, decides whether to send the task to the cache database or to the in-memory priority queue. Specifically, the state of the first crawler task is one of two: delayed processing or immediate processing. When the first scheduler determines that the state is delayed processing, it further determines, from the state of the task, the execution time for the delayed processing and then sends the first crawler task to the cache database to be cached. When the first scheduler determines that the first crawler task is to be processed immediately, it sends the first crawler task to the in-memory priority queue at once.
The following takes delayed processing as the state of the first crawler task and describes how the task is handled in the cache database and the in-memory priority queue:
In step 102, after the first scheduler has sent the first crawler task to the cache database, the second scheduler traverses the cache database according to the set update cycle and looks for crawler tasks whose execution time has arrived. For example, when the second scheduler traverses the cache database in an update cycle and the execution time of the first crawler task has arrived, the second scheduler extracts the first crawler task from the cache database and sends it to the in-memory priority queue.
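One update cycle of the second scheduler might look like the following Python sketch. The row layout (`execute_at`, `task`), the function names, and the list standing in for the cache database are hypothetical; a real deployment would query an actual cache database instead.

```python
def collect_due_tasks(cache_db, now):
    """Split cache rows into (due, still_waiting) by execution time."""
    due = [row for row in cache_db if row["execute_at"] <= now]
    waiting = [row for row in cache_db if row["execute_at"] > now]
    return due, waiting


def run_update_cycle(cache_db, memory_queue, now):
    """One traversal of the cache database; returns how many tasks were moved."""
    due, waiting = collect_due_tasks(cache_db, now)
    # Tasks whose execution time has arrived go to the in-memory queue.
    memory_queue.extend(row["task"] for row in due)
    # Everything else stays in the cache database for a later cycle.
    cache_db[:] = waiting
    return len(due)
```

Calling `run_update_cycle` on a timer with the configured update cycle gives the polling behaviour described above; the second scheduler only reads and removes rows, while writes come from the first scheduler.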
It should be noted that, in practice, the in-memory priority queue may hold multiple crawler tasks, ordered by the priority of each crawler task. After the first crawler task is added to the priority queue, its priority is compared with the priorities of the crawler tasks already in the in-memory priority queue, and its position in the in-memory priority queue is determined accordingly.
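A binary heap is one common way to keep such a queue ordered by priority. The sketch below uses Python's standard `heapq` module; the class name, the "larger number means higher priority" convention, and the first-in-first-out tie-break among equal priorities are assumptions, since the patent does not specify them.

```python
import heapq
import itertools


class MemoryPriorityQueue:
    """Hypothetical in-memory priority queue: highest priority is served first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: keeps arrival order among tasks of equal priority.
        self._counter = itertools.count()

    def push(self, task, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), task))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

With this structure, inserting a task costs O(log n) and the comparison against existing tasks happens implicitly inside the heap.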
Further, there may be multiple in-memory priority queues, and each in-memory priority queue has its own priority. For example, if the in-memory priority queue holding the first crawler task has the highest priority, that queue is ranked first among the multiple in-memory priority queues; if the in-memory priority queue holding the first crawler task has the lowest priority, that queue is ranked last among the multiple in-memory priority queues.
It should be noted that when the state of the first crawler task is immediate processing, the first crawler task is sent to the priority queue, its priority is compared with the priorities of the crawler tasks already in the priority queue, and its position in the in-memory priority queue is determined accordingly.
In step 103, the third scheduler mainly performs time-slice round-robin scheduling according to the priorities of the different web crawlers and how idle the exit queue is, taking the corresponding number of web crawler tasks from each web crawler's queue and distributing them to the exit queue for downstream modules to process.
The third scheduler fetches crawler tasks from the in-memory priority queues in turn using the time-slice round-robin algorithm, until the first crawler task has been taken out of its in-memory priority queue. For example, suppose the in-memory priority queue holding the first crawler task has the lowest priority among the multiple in-memory priority queues, while the first crawler task has the highest priority within that queue, that is, it is ranked first in the queue. Because the third scheduler's time-slice round-robin algorithm allocates different dequeue quantities to queues of different priority in advance, the queue holding the first crawler task is allocated a dequeue quantity of only 1; even so, the third scheduler takes the first crawler task out of its queue on the first pass. If instead the queue holding the first crawler task has the lowest priority among the multiple in-memory priority queues and the first crawler task also has the lowest priority within that queue, then, with a dequeue quantity of only 1, the third scheduler cannot take the first crawler task out on the first pass; several scheduling passes are needed before the first crawler task is taken out of its queue.
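The two cases above can be reproduced with a short Python sketch of a quota-based round-robin pass. The quota values, queue names and helper functions are illustrative assumptions; the patent only states that queues of different priority receive different dequeue quantities, not how those quantities are computed.

```python
from collections import deque


def round_robin_pass(queues, quotas):
    """One scheduling pass: release up to quotas[name] tasks from each queue.

    queues: dict mapping queue name -> deque of tasks, highest priority first
    quotas: dict mapping queue name -> dequeue quantity per pass (assumed values)
    """
    exit_queue = []
    for name, queue in queues.items():
        for _ in range(min(quotas[name], len(queue))):
            exit_queue.append(queue.popleft())
    return exit_queue


def passes_until_dequeued(queues, quotas, target):
    """Count the passes needed before `target` reaches the exit queue."""
    passes = 0
    while any(queues.values()):
        passes += 1
        if target in round_robin_pass(queues, quotas):
            return passes
    return None


quotas = {"high": 3, "low": 1}

# Case 1: T is at the head of the lowest-priority queue.
case1 = {"high": deque(["a", "b", "c", "d"]), "low": deque(["T", "x", "y"])}
print(passes_until_dequeued(case1, quotas, "T"))  # 1

# Case 2: T is at the tail of the lowest-priority queue.
case2 = {"high": deque(["a", "b", "c", "d"]), "low": deque(["x", "y", "T"])}
print(passes_until_dequeued(case2, quotas, "T"))  # 3
```

A quota of 1 still guarantees that the lowest-priority queue makes progress on every pass, so its tasks are delayed rather than starved.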
After the third scheduler takes the first crawler task out of the in-memory priority queue, the first crawler task is processed by the downstream module. Downstream processing mainly includes making the HTTP request and processing the returned page; in the embodiments of the present invention, the specific way the downstream module processes the first crawler task is not limited.
While the downstream module processes the first crawler task, more crawler tasks are expanded, such as T1, T2, T3...Tn; these crawler tasks arrive at the first scheduler in turn and are processed accordingly.
In summary, the embodiments of the present invention provide a web crawler task scheduling method in which priority scheduling of crawler tasks is achieved with distributed in-memory priority queues and the third scheduler. The cache database is mainly written by the first scheduler and read by the second scheduler, so the first and second schedulers do not block each other on the write bottleneck of the cache database, and read/write splitting can even be applied to the cache database to improve read efficiency. The first scheduler performs fast deduplication with an in-memory set, so task deduplication no longer needs to go through the cache database; the cache database therefore does not need to store all tasks, only delayed tasks and failed tasks. This not only greatly reduces the storage volume of the cache database but also reduces its read/write pressure, thereby improving efficiency.
Fig. 2 is a schematic diagram of a web crawler task scheduling method provided by embodiment one of the present invention. Assume a crawler S with the lowest priority, which contains a crawler task T with the highest priority. Specifically:
Step 201: crawler task T arrives at the first scheduler, which judges the state of crawler task T. If it is an immediate task, crawler task T is put into the in-memory priority queue of the corresponding crawler S; if it is a delayed task, it is marked with an execution time and put into the cache database. Because crawler task T is a delayed task, it is marked with an execution time and put into the cache database of crawler S;
Step 202: the second scheduler traverses the cache database in each update cycle, looks for crawler tasks T whose execution time has arrived, puts them into the in-memory priority queue of the corresponding crawler S, and then waits for the next cycle. After the execution time of the delayed crawler task T arrives, crawler task T is taken out of the cache database of crawler S in some cycle of the update scheduler (the second scheduler) and put into the in-memory priority queue of crawler S;
The in-memory priority queue of crawler S already holds multiple crawler tasks, but crawler task T has the highest priority, so crawler task T is moved to the head of the in-memory priority queue;
Step 203: the third scheduler uses the time-slice round-robin algorithm to allocate dequeue quantities in advance to crawlers of different priority. Because crawler S has the lowest priority of all crawlers, it is allocated a dequeue quantity of only 1. The dequeue scheduler (the third scheduler) takes the 1 crawler task T at the head of S's in-memory priority queue and puts it into the exit queue, so crawler task T is taken out;
After step 203, crawler task T is processed by the downstream module, which includes making the HTTP request and processing the returned page. While processing T, the downstream module expands more crawler tasks T1, T2, T3...Tn; these tasks arrive at the main scheduler (the first scheduler) in turn and are processed accordingly.
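The flow of embodiment one can be traced end to end with a small Python simulation. All concrete numbers here are invented for illustration: the arrival time, the execution time of 5, the update cycle of 2 time units, the priorities, and the names of the tasks already waiting in S's queue.

```python
import heapq


def simulate_embodiment_one():
    """Trace task T through steps 201 to 203 with made-up times and priorities."""
    log = []
    cache_db = []                               # S's cache database: (execute_at, name, priority)
    s_queue = [(-3, "T_old1"), (-2, "T_old2")]  # S's queue already holds other tasks
    heapq.heapify(s_queue)

    # Step 201: T arrives as a delayed task and is marked with execution time 5.
    cache_db.append((5, "T", 9))
    log.append("201: T stored in cache database, execute_at=5")

    # Step 202: the second scheduler checks the cache database every 2 time units.
    for now in range(0, 10, 2):
        due = [row for row in cache_db if row[0] <= now]
        if due:
            for _, name, priority in due:
                heapq.heappush(s_queue, (-priority, name))
            cache_db = [row for row in cache_db if row[0] > now]
            log.append(f"202: T moved to S's in-memory queue at t={now}")
            break

    # Step 203: S has the lowest priority, so its dequeue quantity is 1.
    dequeue_quantity = 1
    released = [heapq.heappop(s_queue)[1] for _ in range(dequeue_quantity)]
    log.append(f"203: exit queue receives {released}")
    return released, log


released, log = simulate_embodiment_one()
print("\n".join(log))
```

Because T has the highest priority inside S's queue, the single slot allocated to S is enough to release it in the first pass after it becomes due.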
Based on the same inventive concept, an embodiment of the present invention provides a web crawler task scheduling apparatus. Because the principle by which the apparatus solves the technical problem is similar to that of the web crawler task scheduling method, the implementation of the apparatus may refer to the implementation of the method, and repeated parts are not described again.
Fig. 3 is a schematic structural diagram of a web crawler task scheduling apparatus provided by an embodiment of the present invention. As shown in Fig. 3, the apparatus includes: a first scheduler 301, a second scheduler 302 and a third scheduler 303.
The first scheduler 301 is configured to receive a first crawler task and determine the type of the first crawler task according to the state of the first crawler task; and, when the type of the first crawler task is confirmed to be delayed processing, to determine the execution time corresponding to the delayed processing and store the first crawler task in a cache database;
The second scheduler 302 is configured to traverse the cache database in each update cycle and, on determining that the execution time has been reached, send the first crawler task corresponding to the execution time to an in-memory priority queue, and confirm the position of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks already in the in-memory priority queue;
The third scheduler 303 is configured to use a time-slice round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn, until the first crawler task has been taken out of the in-memory priority queue.
Preferably, the first scheduler 301 is further configured to: when the type of the first crawler task is confirmed to be immediate processing, send the first crawler task to the in-memory priority queue.
Preferably, the cache database includes multiple in-memory priority queues;
The second scheduler 302 is further configured to: determine the position of the in-memory priority queue holding the first crawler task among the multiple in-memory priority queues, according to the priority of that in-memory priority queue and the priorities of the multiple in-memory priority queues.
Preferably, the third scheduler 303 is specifically configured to:
Use the time-slice round-robin algorithm to determine the dequeue quantity allocated to each of the multiple in-memory priority queues of different priority, and determine the dequeue order of the first crawler task from the in-memory priority queue holding it, according to the dequeue quantity of that queue and the position of the first crawler task in that queue.
It should be understood that the units included in the above web crawler task scheduling apparatus are only a logical division according to the functions the apparatus implements; in practice, these units may be combined or split. The functions implemented by the web crawler task scheduling apparatus provided by this embodiment correspond one to one with the web crawler task scheduling method provided by the above embodiment. The more detailed processing flow implemented by the apparatus has been described in detail in method embodiment one and is not repeated here.
It should be understood by those skilled in the art that, embodiments of the invention can be provided as method, system or computer program
Product.Therefore, the present invention can use the reality in terms of complete hardware embodiment, complete software embodiment or combination software and hardware
Apply the form of example.Moreover, the present invention can use the computer for wherein including computer usable program code in one or more
The computer program production that usable storage medium is implemented on (including but is not limited to magnetic disk storage, CD-ROM, optical memory etc.)
The form of product.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
Claims (8)
- 1. A web crawler task scheduling method, characterized by comprising: a first scheduler receives a first crawler task and determines the type of the first crawler task according to the state of the first crawler task; when the type of the first crawler task is confirmed to be delayed processing, the first scheduler determines the execution time corresponding to the delayed processing and stores the first crawler task in a cache database; a second scheduler traverses the cache database in each update cycle and, upon determining that the execution time has been reached, sends the first crawler task corresponding to the execution time to an in-memory priority queue, and confirms the arrangement order of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks contained in the in-memory priority queue; a third scheduler uses a round-robin algorithm to obtain crawler tasks from the in-memory priority queue in sequence until the first crawler task is taken out of the in-memory priority queue.
- 2. The scheduling method according to claim 1, characterized in that, after determining the type of the first crawler task according to the state of the first crawler task, the method further comprises: when the type of the first crawler task is confirmed to be current processing, sending the first crawler task to the in-memory priority queue.
- 3. The scheduling method according to claim 1, characterized in that the cache database includes multiple in-memory priority queues; sending the first crawler task corresponding to the execution time to the in-memory priority queue further comprises: determining, according to the priority of the in-memory priority queue where the first crawler task is located and the priorities of the multiple in-memory priority queues, the arrangement order of the in-memory priority queue where the first crawler task is located among the multiple in-memory priority queues.
- 4. The scheduling method according to claim 3, characterized in that the third scheduler using a round-robin algorithm to obtain crawler tasks from the in-memory priority queue in sequence comprises: the third scheduler uses the round-robin algorithm to determine the dequeue quantity allocated to the priority queues of different priorities among the multiple in-memory priority queues, and determines the dequeue order of the first crawler task from the in-memory priority queue where it is located according to the dequeue quantity of that in-memory priority queue and the arrangement order of the first crawler task in that in-memory priority queue.
- 5. A web crawler task scheduling device, characterized by comprising: a first scheduler, configured to receive a first crawler task, determine the type of the first crawler task according to the state of the first crawler task, and, when the type of the first crawler task is confirmed to be delayed processing, determine the execution time corresponding to the delayed processing and store the first crawler task in a cache database; a second scheduler, configured to traverse the cache database in each update cycle and, upon determining that the execution time has been reached, send the first crawler task corresponding to the execution time to an in-memory priority queue, and confirm the arrangement order of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks contained in the in-memory priority queue; a third scheduler, configured to use a round-robin algorithm to obtain crawler tasks from the in-memory priority queue in sequence until the first crawler task is taken out of the in-memory priority queue.
- 6. The scheduling device according to claim 5, characterized in that the first scheduler is further configured to: when the type of the first crawler task is confirmed to be current processing, send the first crawler task to the in-memory priority queue.
- 7. The scheduling device according to claim 5, characterized in that the cache database includes multiple in-memory priority queues; the second scheduler is further configured to: determine, according to the priority of the in-memory priority queue where the first crawler task is located and the priorities of the multiple in-memory priority queues, the arrangement order of the in-memory priority queue where the first crawler task is located among the multiple in-memory priority queues.
- 8. The scheduling device according to claim 7, characterized in that the third scheduler is specifically configured to: use the round-robin algorithm to determine the dequeue quantity allocated to the priority queues of different priorities among the multiple in-memory priority queues, and determine the dequeue order of the first crawler task from the in-memory priority queue where it is located according to the dequeue quantity of that in-memory priority queue and the arrangement order of the first crawler task in that in-memory priority queue.
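The three-scheduler flow recited in the claims above can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: the `CrawlTask` and `Scheduler` classes, the use of a plain list as a stand-in for the cache database, the convention that a lower number means higher priority, and the per-round dequeue quotas are all assumptions made for illustration only.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class CrawlTask:
    priority: int                       # assumption: lower value is dequeued first
    seq: int                            # tie-breaker that preserves arrival order
    url: str = field(compare=False)
    run_at: float = field(compare=False, default=0.0)  # execution time


class Scheduler:
    """Hypothetical combination of the first, second, and third schedulers."""

    def __init__(self, quotas):
        # quotas maps a queue level to its dequeue quantity per round
        # (assumed values; the claims do not specify how they are chosen).
        self.quotas = dict(quotas)
        self.queues = {level: [] for level in quotas}  # in-memory priority queues
        self.delayed = []               # stand-in for the cache database
        self._seq = itertools.count()

    def submit(self, url, level, priority, now, delay=0.0):
        """First scheduler: classify a task as delayed or current processing."""
        task = CrawlTask(priority, next(self._seq), url, now + delay)
        if delay > 0:
            self.delayed.append((level, task))          # store with execution time
        else:
            heapq.heappush(self.queues[level], task)    # enqueue immediately

    def promote_due(self, now):
        """Second scheduler: one update-cycle pass over the cached tasks."""
        waiting = []
        for level, task in self.delayed:
            if task.run_at <= now:
                # The heap places the task according to its priority
                # relative to the tasks already queued.
                heapq.heappush(self.queues[level], task)
            else:
                waiting.append((level, task))
        self.delayed = waiting

    def drain_round(self):
        """Third scheduler: one round-robin round over all queue levels."""
        out = []
        for level in sorted(self.queues):
            for _ in range(self.quotas[level]):
                if not self.queues[level]:
                    break
                out.append(heapq.heappop(self.queues[level]).url)
        return out
```

With quotas `{0: 2, 1: 1}`, each round takes up to two tasks from queue level 0 and one from level 1, so a task's dequeue order follows from its queue's quota and its position within that queue. A delayed task stays in the cache stand-in until a `promote_due` pass finds that its execution time has been reached.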
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711088266.6A CN107704323A (en) | 2017-11-07 | 2017-11-07 | A kind of web crawlers method for scheduling task and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107704323A (en) | 2018-02-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180216 |