CN107704323A - Web crawler task scheduling method and device - Google Patents

Web crawler task scheduling method and device

Info

Publication number
CN107704323A
Authority
CN
China
Prior art keywords
task
in-memory
crawler
crawler task
in-memory priority
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711088266.6A
Other languages
Chinese (zh)
Inventor
陈开冉
邓楚健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Trace Technology Co Ltd
Original Assignee
Guangzhou Trace Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Trace Technology Co Ltd
Priority to CN201711088266.6A
Publication of CN107704323A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5021 Priority

Abstract

The invention discloses a web crawler task scheduling method and device in the field of software engineering, intended to solve the problem that existing crawler task scheduling must read and write the database frequently, so that the database blocks easily and efficiency is low. The method includes: a first scheduler receives a first crawler task and determines the type of the first crawler task according to its state; when the type is confirmed to be delayed processing, the first scheduler determines the execution time corresponding to the delayed processing and stores the first crawler task in a cache database; a second scheduler traverses the cache database in each update cycle and, on determining that the execution time has arrived, sends the corresponding first crawler task to an in-memory priority queue; a third scheduler uses a round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn until the first crawler task has been taken out of the in-memory priority queue.

Description

Web crawler task scheduling method and device
Technical field
The present invention relates to the field of software engineering, and more particularly to a web crawler task scheduling method and device.
Background
A web crawler is a program that automatically fetches web pages. It downloads pages from the World Wide Web for a search engine and is an important component of search engines.
Existing crawler task scheduling consists of a single scheduler module, which performs all time-consuming operations in one loop: crawler task persistence, database deduplication checks, task priority sorting, execution of timed tasks, crawler task status statistics, and so on. When the number of crawler tasks grows beyond a few dozen, the number of concurrent crawler tasks reaches the thousands. To process them the scheduler module must read and write the database frequently, the database load becomes very heavy, and the overall efficiency of the scheduler drops sharply. Moreover, because the system blocks on database operations, it is often stalled: all system resources are spent on database I/O, and CPU and network bandwidth utilization fall below 1%, which wastes server resources.
In summary, existing crawler task scheduling must read and write the database frequently, so the database blocks easily and efficiency is low.
Summary of the invention
Embodiments of the present invention provide a web crawler task scheduling method and device to solve the problem that existing crawler task scheduling must read and write the database frequently, so that the database blocks easily and efficiency is low.
An embodiment of the present invention provides a web crawler task scheduling method, including:
A first scheduler receives a first crawler task and determines the type of the first crawler task according to the state of the first crawler task; when the type of the first crawler task is confirmed to be delayed processing, the first scheduler determines the execution time corresponding to the delayed processing and stores the first crawler task in a cache database;
A second scheduler traverses the cache database in each update cycle and, on determining that the execution time has arrived, sends the first crawler task corresponding to the execution time to an in-memory priority queue, and confirms the position of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks already in the in-memory priority queue;
A third scheduler uses a round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn until the first crawler task has been taken out of the in-memory priority queue.
Preferably, after the type of the first crawler task is determined according to the state of the first crawler task, the method further includes:
When the type of the first crawler task is confirmed to be immediate processing, sending the first crawler task to the in-memory priority queue.
Preferably, the cache database includes multiple in-memory priority queues;
Sending the first crawler task corresponding to the execution time to the in-memory priority queue further includes:
Determining, according to the priority of the in-memory priority queue containing the first crawler task and the priorities of the multiple in-memory priority queues, the position of the in-memory priority queue containing the first crawler task among the multiple in-memory priority queues.
Preferably, the third scheduler using a round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn includes:
The third scheduler uses the round-robin algorithm to determine the dequeue quota allocated to each of the multiple in-memory priority queues of different priority, and determines the order in which the first crawler task is dequeued from the in-memory priority queue containing it according to the dequeue quota of that queue and the position of the first crawler task within that queue.
An embodiment of the present invention also provides a web crawler task scheduling device, including:
A first scheduler, configured to receive a first crawler task and determine the type of the first crawler task according to the state of the first crawler task, and, when the type of the first crawler task is confirmed to be delayed processing, to determine the execution time corresponding to the delayed processing and store the first crawler task in a cache database;
A second scheduler, configured to traverse the cache database in each update cycle and, on determining that the execution time has arrived, to send the first crawler task corresponding to the execution time to an in-memory priority queue, and to confirm the position of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks already in the in-memory priority queue;
A third scheduler, configured to use a round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn until the first crawler task has been taken out of the in-memory priority queue.
Preferably, the first scheduler is further configured to: when the type of the first crawler task is confirmed to be immediate processing, send the first crawler task to the in-memory priority queue.
Preferably, the cache database includes multiple in-memory priority queues;
The second scheduler is further configured to: determine, according to the priority of the in-memory priority queue containing the first crawler task and the priorities of the multiple in-memory priority queues, the position of the in-memory priority queue containing the first crawler task among the multiple in-memory priority queues.
Preferably, the third scheduler is specifically configured to:
Use the round-robin algorithm to determine the dequeue quota allocated to each of the multiple in-memory priority queues of different priority, and determine the order in which the first crawler task is dequeued from the in-memory priority queue containing it according to the dequeue quota of that queue and the position of the first crawler task within that queue.
An embodiment of the present invention provides a web crawler task scheduling method, including: a first scheduler receives a first crawler task and determines the type of the first crawler task according to the state of the first crawler task; when the type of the first crawler task is confirmed to be delayed processing, the first scheduler determines the execution time corresponding to the delayed processing and stores the first crawler task in a cache database; a second scheduler traverses the cache database in each update cycle and, on determining that the execution time has arrived, sends the first crawler task corresponding to the execution time to an in-memory priority queue, and confirms the position of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks already in the queue; a third scheduler uses a round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn until the first crawler task has been taken out. In this method, distributed in-memory priority queues and the third scheduler together implement priority scheduling of crawler tasks. The cache database is written mainly by the first scheduler and read by the second scheduler, so the two schedulers do not block each other on the write bottleneck of the cache database, and read/write splitting can even be applied to the cache database to improve read efficiency. The first scheduler deduplicates quickly with an in-memory set, so task deduplication no longer has to be done against the cache database. The cache database therefore does not need to store all tasks; it only needs to store delayed tasks and failed tasks. This greatly reduces both the storage volume and the read/write pressure of the cache database, and so improves operating efficiency.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a web crawler task scheduling method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a web crawler task scheduling method provided by embodiment one of the present invention;
Fig. 3 is a schematic structural diagram of a web crawler task scheduling device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a schematic flow chart of a web crawler task scheduling method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step 101, the first scheduler receives a first crawler task and determines the type of the first crawler task according to the state of the first crawler task; when the type of the first crawler task is confirmed to be delayed processing, the first scheduler determines the execution time corresponding to the delayed processing and stores the first crawler task in a cache database;
Step 102, the second scheduler traverses the cache database in each update cycle and, on determining that the execution time has arrived, sends the first crawler task corresponding to the execution time to an in-memory priority queue, and confirms the position of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks already in the queue;
Step 103, the third scheduler uses a round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn until the first crawler task has been taken out of the in-memory priority queue.
It should be noted that the task distribution scheduling method in the embodiments of the present invention mainly combines three scheduler modules, a multi-crawler priority queue built on a cluster of in-memory priority queues, and a cache database.
The three scheduler modules are the first scheduler, the second scheduler and the third scheduler. The first scheduler creates the in-memory priority queues and the cache database. Further, the control end of the cache database is the first scheduler and its output end is the second scheduler: the first scheduler is responsible for writing tasks to the cache database, and the second scheduler reads tasks from it.
Specifically, the first scheduler performs the first pass over each received web crawler task: it deduplicates against an in-memory set to filter out repeated tasks, distinguishes the state of each received crawler task, and handles the tasks by category. The second scheduler is mainly responsible for taking crawler tasks that are due out of the cache database at the set time and putting them into the in-memory priority queue of the corresponding crawler. The third scheduler mainly performs time-slice round-robin scheduling according to the priorities of the different web crawlers and the idle state of the exit queue: it takes the corresponding number of web crawler tasks from each web crawler queue and distributes them to the exit queue so that downstream modules can process them.
In step 101, after the first scheduler receives the first crawler task, it first determines the state of the first crawler task and, according to that state, decides whether to send the task to the cache database or to the in-memory priority queue. Specifically, the first crawler task is in one of two states, delayed processing or immediate processing. When the first scheduler determines that the state is delayed processing, it further determines the execution time of the delayed processing from the state of the first crawler task and then sends the task to the cache database to be cached. When the first scheduler determines that the first crawler task is to be processed immediately, it sends the task to the in-memory priority queue at once.
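The routing decision in step 101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `CrawlTask` fields, the rule that `delay_seconds > 0` marks a delayed task, the use of the URL as the deduplication key, and the plain list standing in for the cache database are all assumptions made for the example.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlTask:
    url: str
    priority: int                # assumption: larger value means more urgent
    delay_seconds: float = 0.0   # assumption: > 0 marks delayed processing


class FirstScheduler:
    def __init__(self):
        self.seen = set()          # in-memory set used for deduplication
        self.delayed_store = []    # stand-in for the cache database
        self.ready_heap = []       # stand-in for the in-memory priority queue
        self._arrival = itertools.count()

    def receive(self, task, now):
        # Filter out repeated tasks without touching the cache database.
        if task.url in self.seen:
            return "duplicate"
        self.seen.add(task.url)
        if task.delay_seconds > 0:
            # Delayed processing: record the execution time and cache the task.
            self.delayed_store.append((now + task.delay_seconds, task))
            return "delayed"
        # Immediate processing: push straight into the priority queue.
        # The arrival counter breaks ties so tasks are never compared.
        entry = (-task.priority, next(self._arrival), task)
        heapq.heappush(self.ready_heap, entry)
        return "queued"
```

Only delayed tasks reach the store that stands in for the cache database, which mirrors the point that the database no longer has to hold every task.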
Taking the case where the state of the first crawler task is delayed processing as an example, the following describes how the first crawler task is handled in the cache database and in the in-memory priority queue:
In step 102, after the first scheduler has sent the first crawler task to the cache database, the second scheduler traverses the cache database at the set update cycle, looking for crawler tasks whose execution time has arrived. For example, when the second scheduler traverses the cache database in an update cycle and the execution time of the first crawler task has arrived, the second scheduler extracts that first crawler task from the cache database and sends it to the in-memory priority queue.
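One update cycle of the second scheduler can be sketched as below. This is an illustrative sketch under stated assumptions: the cache database is modeled as a plain list of `(execute_at, task)` pairs and the in-memory priority queue as a list, and the function name `promote_due_tasks` is hypothetical.

```python
def promote_due_tasks(delayed_store, ready_queue, now):
    """Run one update cycle: move tasks whose execution time has arrived.

    delayed_store is a list of (execute_at, task) pairs standing in for
    the cache database; ready_queue stands in for the in-memory queue.
    Returns the number of tasks promoted in this cycle.
    """
    still_waiting = []
    promoted = 0
    for execute_at, task in delayed_store:
        if execute_at <= now:
            ready_queue.append(task)
            promoted += 1
        else:
            still_waiting.append((execute_at, task))
    # Keep only the tasks that are not yet due.
    delayed_store[:] = still_waiting
    return promoted
```

In a real deployment this function would be called once per update cycle by a timer or loop; that driver is left out here to keep the sketch deterministic.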
It should be noted that, in practice, an in-memory priority queue may hold multiple crawler tasks, ordered by the priority of each task. After the first crawler task is added to the priority queue, its priority is compared with the priorities of the crawler tasks already in the in-memory priority queue to determine the position of the first crawler task in the queue.
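The ordering rule described above can be illustrated with a small sorted-list queue. This is a sketch only: the class name `InMemoryPriorityQueue`, the convention that a larger number means higher priority, and the first-come-first-served tie-break are assumptions that the source does not specify.

```python
import bisect
import itertools


class InMemoryPriorityQueue:
    def __init__(self):
        # Sorted list of (-priority, arrival_seq, task_name).
        self._items = []
        self._seq = itertools.count()

    def put(self, task_name, priority):
        """Insert a task and return its position (0 is the head)."""
        entry = (-priority, next(self._seq), task_name)
        # Comparing against existing entries fixes the task's position.
        position = bisect.bisect_left(self._items, entry)
        self._items.insert(position, entry)
        return position

    def get(self):
        """Remove and return the task at the head of the queue."""
        return self._items.pop(0)[2]

    def __len__(self):
        return len(self._items)
```

A binary heap (`heapq`) would give cheaper insertion for large queues; the sorted list is used here because it makes the resulting position explicit.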
Further, there may be multiple in-memory priority queues, and each in-memory priority queue has its own priority. For example, if the in-memory priority queue containing the first crawler task has the highest priority, that queue is ranked first among the multiple in-memory priority queues; if it has the lowest priority, that queue is ranked last among the multiple in-memory priority queues.
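Ranking the queues by priority, and deriving a per-queue dequeue quota from that rank, might look like the sketch below. The source only states that queues of different priority receive different dequeue quotas and that the lowest-priority queue may receive a quota of 1; the linear rule used here (each higher rank gets one more) is purely an assumption for illustration, as are the queue names.

```python
def assign_dequeue_quotas(queue_priorities):
    """Rank queues by priority and derive a dequeue quota for each.

    queue_priorities maps a queue name to its priority (larger is higher).
    Assumed rule: the lowest-ranked queue gets a quota of 1 and each
    higher rank gets one more.
    """
    ranked = sorted(queue_priorities, key=queue_priorities.get, reverse=True)
    quotas = {}
    for rank, name in enumerate(ranked):
        quotas[name] = len(ranked) - rank
    return ranked, quotas
```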
It should be noted that when the state of the first crawler task is immediate processing, the first crawler task is sent to the priority queue directly, and its priority is compared with the priorities of the crawler tasks already in the priority queue to determine the position of the first crawler task in the in-memory priority queue.
In step 103, the third scheduler mainly performs time-slice round-robin scheduling according to the priorities of the different web crawlers and the idle state of the exit queue: it takes the corresponding number of web crawler tasks from each web crawler queue and distributes them to the exit queue so that downstream modules can process them.
The third scheduler uses the time-slice round-robin algorithm to fetch crawler tasks from the in-memory priority queues in turn until the first crawler task has been taken out. For example, suppose the in-memory priority queue containing the first crawler task has the lowest priority among the multiple in-memory priority queues, while the first crawler task has the highest priority within that queue, that is, it is ranked first in the queue. Because the time-slice round-robin algorithm pre-allocates different dequeue quotas to queues of different priority, the queue containing the first crawler task is allocated a dequeue quota of only 1, and the third scheduler takes the first crawler task out of that queue in the first round. If instead the queue has the lowest priority among the multiple queues and the first crawler task also has the lowest priority within that queue, the queue is again allocated a dequeue quota of only 1, so the third scheduler cannot take the first crawler task out in the first round; several rounds of scheduling are needed before the first crawler task is taken out of its queue.
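The two cases in the example above can be reproduced with a short round-robin sketch. The queue names, task names and quota values are hypothetical, and the idle state of the exit queue, which the third scheduler also takes into account, is not modeled here.

```python
from collections import deque


def round_robin_pass(queues, quotas, exit_queue):
    """One scheduling round: take up to quotas[name] tasks from each queue."""
    for name, queue in queues.items():
        for _ in range(quotas.get(name, 0)):
            if not queue:
                break
            exit_queue.append(queue.popleft())


def rounds_until_dequeued(queues, quotas, target):
    """Count the rounds needed before target reaches the exit queue."""
    exit_queue = []
    rounds = 0
    while target not in exit_queue:
        before = len(exit_queue)
        round_robin_pass(queues, quotas, exit_queue)
        if len(exit_queue) == before:
            # No progress means the target is absent or its quota is zero.
            raise ValueError(f"{target!r} can never be dequeued")
        rounds += 1
    return rounds
```

With a quota of 1 for the lowest-priority queue, a task at the head of that queue leaves in the first round, while a task in third position needs three rounds.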
After the third scheduler takes the first crawler task out of the in-memory priority queue, the task is processed by a downstream module. Downstream processing mainly includes issuing the HTTP request and processing the returned page. The embodiments of the present invention place no limit on how the downstream module processes the first crawler task.
While the downstream module processes the first crawler task, more crawler tasks are expanded from it, such as T1, T2, T3...Tn. These crawler tasks arrive at the first scheduler in turn and are processed accordingly.
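The feedback loop described above, in which processing one task expands new tasks that flow back to the first scheduler, can be sketched without any network access. The function name `crawl_loop`, the `fetch_links` callback and the `max_tasks` cap are assumptions for the example; priorities and delays are left out so that only the expansion and deduplication are shown.

```python
from collections import deque


def crawl_loop(seed, fetch_links, max_tasks=100):
    """Process tasks in order, feeding expanded tasks back into the queue.

    fetch_links(url) stands in for the downstream module and returns the
    new tasks expanded from one page. The seen set plays the role of the
    first scheduler's in-memory deduplication.
    """
    seen = {seed}
    pending = deque([seed])
    processed = []
    while pending and len(processed) < max_tasks:
        url = pending.popleft()
        processed.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return processed
```

For example, with a link graph where T expands to T1 and T2, and T1 expands to T2 and T3, each task is processed exactly once even though T2 is discovered twice.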
In summary, the embodiments of the present invention provide a web crawler task scheduling method in which distributed in-memory priority queues and the third scheduler together implement priority scheduling of crawler tasks. The cache database is written mainly by the first scheduler and read by the second scheduler, so the two schedulers do not block each other on the write bottleneck of the cache database, and read/write splitting can even be applied to the cache database to improve read efficiency. The first scheduler deduplicates quickly with an in-memory set, so task deduplication no longer has to be done against the cache database. The cache database therefore does not need to store all tasks; it only needs to store delayed tasks and failed tasks. This greatly reduces both the storage volume and the read/write pressure of the cache database, and so improves operating efficiency.
Fig. 2 is a schematic diagram of a web crawler task scheduling method provided by embodiment one of the present invention. Assume a crawler S with the lowest priority, and a crawler task T that has the highest priority among the tasks of crawler S. Specifically:
Step 201, crawler task T arrives at the first scheduler, which judges the state of crawler task T. An immediate task is put into the in-memory priority queue of the corresponding crawler S; a delayed task is marked with its execution time and put into the cache database. Because crawler task T is a delayed task, it is marked with its execution time and put into the cache database of crawler S;
Step 202, the second scheduler traverses the cache database in each update cycle, looks for crawler tasks whose execution time has arrived, puts them into the in-memory priority queue of the corresponding crawler S, and then waits for the next cycle. After the execution time of the delayed crawler task T arrives, T is taken out of the cache database of crawler S during one of the cycles run by the update scheduler (that is, the second scheduler) and put into the in-memory priority queue of crawler S;
The in-memory priority queue of crawler S already holds multiple crawler tasks, but crawler task T has the highest priority, so crawler task T is moved to the head of the in-memory priority queue;
Step 203, the third scheduler uses the time-slice round-robin algorithm to pre-allocate dequeue quotas to crawlers of different priority. Because crawler S has the lowest priority among all crawlers, it is allocated a dequeue quota of only 1. The dequeue scheduler (that is, the third scheduler) takes the one crawler task at the head of the in-memory priority queue of S and puts it into the exit queue, so crawler task T is taken out;
After step 203, crawler task T is processed by the downstream module, which includes issuing the HTTP request and processing the returned page. More crawler tasks T1, T2, T3...Tn are expanded while T is processed downstream; these tasks arrive at the main scheduler (that is, the first scheduler) in turn and are processed accordingly.
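Steps 201 to 203 of embodiment one can be traced with a small simulation. Everything numeric here is an assumption chosen for illustration: the delay of 5 time units, the update cycles at t=2, 4 and 6, the priority values, and the two tasks already waiting in the queue of crawler S.

```python
import heapq


def run_embodiment_one():
    events = []
    cache_db = []                                # rows: (execute_at, priority, name)
    s_queue = [(-1, "old-1"), (-2, "old-2")]     # tasks already queued for crawler S
    heapq.heapify(s_queue)

    # Step 201: T is a delayed task, so it is stamped and cached.
    now = 0
    cache_db.append((now + 5, 9, "T"))
    events.append("T stored in cache database")

    # Step 202: each update cycle promotes rows whose time has arrived.
    for now in (2, 4, 6):
        due = [row for row in cache_db if row[0] <= now]
        for row in due:
            cache_db.remove(row)
            # Highest priority, so T becomes the head of S's queue.
            heapq.heappush(s_queue, (-row[1], row[2]))
            events.append(f"{row[2]} promoted at t={now}")

    # Step 203: crawler S has the lowest priority, so its quota is 1.
    quota_for_s = 1
    take = min(quota_for_s, len(s_queue))
    exit_queue = [heapq.heappop(s_queue)[1] for _ in range(take)]
    events.append(f"dequeued {exit_queue}")
    return events, exit_queue
```

Although crawler S only receives a quota of 1, task T leaves in the first round because it sits at the head of the queue of S.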
Based on the same inventive concept, an embodiment of the present invention provides a web crawler task scheduling device. Because the device solves the technical problem on the same principle as the web crawler task scheduling method, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again.
Fig. 3 is a schematic structural diagram of a web crawler task scheduling device provided by an embodiment of the present invention. As shown in Fig. 3, the device includes: a first scheduler 301, a second scheduler 302 and a third scheduler 303.
The first scheduler 301 is configured to receive a first crawler task and determine the type of the first crawler task according to the state of the first crawler task, and, when the type of the first crawler task is confirmed to be delayed processing, to determine the execution time corresponding to the delayed processing and store the first crawler task in a cache database;
The second scheduler 302 is configured to traverse the cache database in each update cycle and, on determining that the execution time has arrived, to send the first crawler task corresponding to the execution time to an in-memory priority queue, and to confirm the position of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks already in the queue;
The third scheduler 303 is configured to use a round-robin algorithm to fetch crawler tasks from the in-memory priority queue in turn until the first crawler task has been taken out of the in-memory priority queue.
Preferably, the first scheduler 301 is further configured to: when the type of the first crawler task is confirmed to be immediate processing, send the first crawler task to the in-memory priority queue.
Preferably, the cache database includes multiple in-memory priority queues;
The second scheduler 302 is further configured to: determine, according to the priority of the in-memory priority queue containing the first crawler task and the priorities of the multiple in-memory priority queues, the position of the in-memory priority queue containing the first crawler task among the multiple in-memory priority queues.
Preferably, the third scheduler 303 is specifically configured to:
Use the round-robin algorithm to determine the dequeue quota allocated to each of the multiple in-memory priority queues of different priority, and determine the order in which the first crawler task is dequeued from the in-memory priority queue containing it according to the dequeue quota of that queue and the position of the first crawler task within that queue.
It should be understood that the units included in the above web crawler task scheduling device are only a logical division according to the functions the device implements; in practice, these units may be combined or split. The functions implemented by the web crawler task scheduling device provided by this embodiment correspond one to one with the web crawler task scheduling method provided by the above embodiments. The more detailed processing flow implemented by the device has been described in detail in method embodiment one above and is not repeated here.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (8)

  1. A web crawler task scheduling method, characterized by comprising:
    a first scheduler receiving a first crawler task and determining the type of the first crawler task according to the state of the first crawler task; when the type of the first crawler task is confirmed to be delayed processing, determining an execution time corresponding to the delayed processing and storing the first crawler task in a cache database;
    a second scheduler traversing the cache database in an update cycle and, when it determines that the execution time has been reached, sending the first crawler task corresponding to the execution time to an in-memory priority queue, and confirming the arrangement order of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks included in the in-memory priority queue;
    a third scheduler using a round-robin algorithm to fetch the crawler tasks from the in-memory priority queue in sequence, until the first crawler task is taken out of the in-memory priority queue.
  2. The scheduling method according to claim 1, characterized in that, after determining the type of the first crawler task according to the state of the first crawler task, the method further comprises:
    when the type of the first crawler task is confirmed to be current processing, sending the first crawler task to the in-memory priority queue.
  3. The scheduling method according to claim 1, characterized in that the cache database includes multiple in-memory priority queues;
    sending the first crawler task corresponding to the execution time to the in-memory priority queue further comprises:
    determining, according to the priority of the in-memory priority queue in which the first crawler task is located and the priorities of the multiple in-memory priority queues, the arrangement order of the in-memory priority queue in which the first crawler task is located among the multiple in-memory priority queues.
  4. The scheduling method according to claim 3, characterized in that the third scheduler using a round-robin algorithm to fetch the crawler tasks from the in-memory priority queue in sequence comprises:
    the third scheduler using the round-robin algorithm to determine the dequeue quantity allocated to each in-memory priority queue of a different priority among the multiple in-memory priority queues, and determining, according to the dequeue quantity of the in-memory priority queue in which the first crawler task is located and the arrangement order of the first crawler task in that in-memory priority queue, the order in which the first crawler task is dequeued from the in-memory priority queue in which it is located.
  5. A web crawler task scheduling apparatus, characterized by comprising:
    a first scheduler, configured to receive a first crawler task and determine the type of the first crawler task according to the state of the first crawler task; and, when the type of the first crawler task is confirmed to be delayed processing, to determine an execution time corresponding to the delayed processing and store the first crawler task in a cache database;
    a second scheduler, configured to traverse the cache database in an update cycle and, when it determines that the execution time has been reached, to send the first crawler task corresponding to the execution time to an in-memory priority queue, and to confirm the arrangement order of the first crawler task in the in-memory priority queue according to the priority of the first crawler task and the priorities of the multiple crawler tasks included in the in-memory priority queue;
    a third scheduler, configured to use a round-robin algorithm to fetch the crawler tasks from the in-memory priority queue in sequence, until the first crawler task is taken out of the in-memory priority queue.
  6. The scheduling apparatus according to claim 5, characterized in that the first scheduler is further configured to: when the type of the first crawler task is confirmed to be current processing, send the first crawler task to the in-memory priority queue.
  7. The scheduling apparatus according to claim 5, characterized in that the cache database includes multiple in-memory priority queues;
    the second scheduler is further configured to: determine, according to the priority of the in-memory priority queue in which the first crawler task is located and the priorities of the multiple in-memory priority queues, the arrangement order of the in-memory priority queue in which the first crawler task is located among the multiple in-memory priority queues.
  8. The scheduling apparatus according to claim 7, characterized in that the third scheduler is specifically configured to:
    use a round-robin algorithm to determine the dequeue quantity allocated to each in-memory priority queue of a different priority among the multiple in-memory priority queues, and determine, according to the dequeue quantity of the in-memory priority queue in which the first crawler task is located and the arrangement order of the first crawler task in that in-memory priority queue, the order in which the first crawler task is dequeued from the in-memory priority queue in which it is located.
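For readers who want a concrete picture of the flow recited in claims 1 and 2, the following is a minimal single-process sketch under stated assumptions. It is not the claimed implementation: a plain dictionary stands in for the cache database, a single heap stands in for the in-memory priority queue, the caller supplies the clock value `now`, and the names `Pipeline`, `DELAYED`, `CURRENT`, and all task identifiers are hypothetical.

```python
import heapq
import itertools

DELAYED = "delayed"   # hypothetical label for the delayed-processing type
CURRENT = "current"   # hypothetical label for the current-processing type


class Pipeline:
    def __init__(self):
        # Stand-in for the cache database: task_id -> (execute_at, priority).
        self.cache = {}
        # Stand-in for one in-memory priority queue (smaller = higher priority).
        self.memory_queue = []
        self._seq = itertools.count()

    def _enqueue(self, task_id, priority):
        heapq.heappush(self.memory_queue, (priority, next(self._seq), task_id))

    def first_scheduler(self, task_id, priority, task_type, now, delay=0.0):
        """Route a task by type: delayed tasks go to the cache with an execution time."""
        if task_type == DELAYED:
            self.cache[task_id] = (now + delay, priority)
        else:
            self._enqueue(task_id, priority)

    def second_scheduler(self, now):
        """One update cycle: move every task whose execution time has arrived."""
        due = [tid for tid, (execute_at, _) in self.cache.items() if execute_at <= now]
        for tid in due:
            _, priority = self.cache.pop(tid)
            self._enqueue(tid, priority)
        return due

    def third_scheduler(self):
        """Take tasks out of the in-memory queue in order (single queue, so no rotation)."""
        while self.memory_queue:
            yield heapq.heappop(self.memory_queue)[2]


pipeline = Pipeline()
pipeline.first_scheduler("now-task", 2, CURRENT, now=0.0)
pipeline.first_scheduler("later-task", 1, DELAYED, now=0.0, delay=10.0)

moved_early = pipeline.second_scheduler(now=5.0)    # execution time not reached
moved_on_time = pipeline.second_scheduler(now=10.0)  # execution time reached
taken = list(pipeline.third_scheduler())
print(moved_early, moved_on_time, taken)
```

With only one in-memory queue the round-robin step degenerates to draining that queue in priority order; with several queues, the third scheduler would rotate among them according to their dequeue quantities.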
CN201711088266.6A 2017-11-07 2017-11-07 A kind of web crawlers method for scheduling task and device Pending CN107704323A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711088266.6A CN107704323A (en) 2017-11-07 2017-11-07 A kind of web crawlers method for scheduling task and device

Publications (1)

Publication Number Publication Date
CN107704323A true CN107704323A (en) 2018-02-16

Family

ID=61178780

Country Status (1)

Country Link
CN (1) CN107704323A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1828541A (en) * 2006-04-07 2006-09-06 浙江大学 Implementation method for timing task in Java operating system
CN1862575A (en) * 2005-08-19 2006-11-15 华为技术有限公司 Method for planing dispatching timing task
CN1873615A (en) * 2006-01-20 2006-12-06 华为技术有限公司 Method for servicing task of timer
CN104346215A (en) * 2013-08-07 2015-02-11 中兴通讯股份有限公司 Task scheduling service system and method
CN104407922A (en) * 2014-10-29 2015-03-11 中国建设银行股份有限公司 Asynchronous batch-processing dispatching method and system
CN105900064A (en) * 2014-11-19 2016-08-24 华为技术有限公司 Method and apparatus for scheduling data flow task
CN106020951A (en) * 2016-05-12 2016-10-12 中国农业银行股份有限公司 Task scheduling method and system
CN106547492A (en) * 2016-12-08 2017-03-29 北京得瑞领新科技有限公司 A kind of operational order dispatching method of NAND flash memory equipment and device
CN106775977A (en) * 2016-12-09 2017-05-31 北京小米移动软件有限公司 Method for scheduling task, apparatus and system
CN106970874A (en) * 2017-01-22 2017-07-21 阿里巴巴集团控股有限公司 A kind of task processing method, device and electronic equipment
CN106980543A (en) * 2017-04-05 2017-07-25 福建智恒软件科技有限公司 The distributed task dispatching method and device triggered based on event
CN107180050A (en) * 2016-03-11 2017-09-19 精硕科技(北京)股份有限公司 A kind of data grabber system and method

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968420A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Scheduling method and device for multi-crawler platform, storage medium and processor
CN110288993A (en) * 2019-06-26 2019-09-27 广州探迹科技有限公司 A kind of individualized intelligent voice interactive method and device based on container technique
CN111274013A (en) * 2020-01-16 2020-06-12 北京思特奇信息技术股份有限公司 Method and system for optimizing timed task scheduling based on memory database in container
CN112286655A (en) * 2020-10-19 2021-01-29 江苏银承网络科技股份有限公司 Distributed delay scheduling method, device and system
CN112416551A (en) * 2020-11-19 2021-02-26 清创网御(合肥)科技有限公司 Distributed crawler scheduling system
CN112231538A (en) * 2020-12-15 2021-01-15 中移(苏州)软件技术有限公司 Method, device, equipment and storage medium for updating scheduling task queue
CN112231538B (en) * 2020-12-15 2021-05-14 中移(苏州)软件技术有限公司 Method, device, equipment and storage medium for updating scheduling task queue
CN112596882A (en) * 2020-12-25 2021-04-02 上海悦易网络信息技术有限公司 Method, device and system for scheduling delayed tasks
CN116501502A (en) * 2023-06-25 2023-07-28 电子科技大学 Data parallel optimization method based on Pytorch framework
CN116501502B (en) * 2023-06-25 2023-09-05 电子科技大学 Data parallel optimization method based on Pytorch framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication. Application publication date: 20180216