CN111796922B - Method for scheduling tasks in batches based on programming language - Google Patents

Method for scheduling tasks in batches based on programming language

Publication number
CN111796922B
Authority
CN
China
Prior art keywords
task
links
loop
value
upper limit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010662645.7A
Other languages
Chinese (zh)
Other versions
CN111796922A (en)
Inventor
毕可骏
刘楚雄
唐娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd
Priority to CN202010662645.7A
Publication of CN111796922A
Application granted
Publication of CN111796922B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for scheduling tasks in batches based on a programming language. The method comprises obtaining access links from a target webpage, analyzing the pattern shared by links of the same type, creating the links in batches through an upper limit following algorithm, storing the links in a MongoDB database, creating and executing tasks in batches, and updating the links in the database according to the task execution results. Coroutine tasks are created in batches through loop nesting, which reduces the memory occupied by coroutine task creation and effectively reduces the business-processing and time losses caused when an interrupted coroutine task fails to return its results. The link strings are converted and stored directly in the database as KEY values, which effectively shortens the retrieval time of the strings required for task processing and improves the stability and efficiency of completing coroutine tasks.

Description

Method for scheduling tasks in batches based on programming language
Technical Field
The invention relates to the technical field of computer software, in particular to a method for scheduling tasks in batches based on a programming language.
Background
Coroutines are lightweight threads, and switching between coroutines is faster than switching between threads or processes. A coroutine runs the non-blocking parts of all tasks first, and the execution results of all tasks are returned together once those non-blocking parts have finished. While a coroutine executes tasks, specific character strings need to be stored in a database, and the coroutine is instantiated when those strings are fetched from the database. When the number of strings is large, this process occupies a large amount of storage space. When the non-blocking parts have finished, the execution results are returned and the database is updated by traversing the results. When the task volume is very large and the execution frequency must be controlled, the coroutine takes a long time to execute the tasks. If the program is interrupted by a network fault or another external factor while the non-blocking parts are executing, all previously processed tasks return failure, which wastes a great deal of time and puts pressure on the server when the work has to be processed again.
Disclosure of Invention
To solve the problems in the prior art, the invention aims to provide a method for scheduling tasks in batches based on a programming language. It addresses the technical problems that arise when a coroutine executes a large batch of tasks: long execution time, program errors easily caused by network or other factors, wasted time, and added server load.
To achieve this purpose, the invention adopts the following technical scheme:
A method for scheduling tasks in batches based on a programming language, comprising the steps of:
step A, creating links in batches: acquiring access links from a target webpage, analyzing the pattern of links of the same type, and creating the links in batches through an upper limit following algorithm;
step B, storing the links in a MongoDB database;
step C, creating and executing tasks in batches;
step D, updating the database: updating the links in the database according to the task execution results;
step E, finishing once all tasks have been executed.
Further, the tasks created in batches are coroutine tasks.
Further, in step C, once a batch of links has been created, the coroutine obtains the batch of links directly, instantiates an asynchronous coroutine object, places the links into crawler access tasks through the ensure_future() function, and starts executing the coroutine tasks.
Further, in step B, each link is converted by the re.sub() function and stored as a key in dict form.
Further, step D includes: first, querying the data through the find() function to obtain the keys; converting the keys through list() to obtain the links; and restoring each link string through the re.sub() function, thereby completing the update of the links.
Further, the step E includes: after a batch of tasks are executed, judging whether all the tasks are finished; if not, repeating the steps A-D, and re-establishing and executing the tasks until all the tasks are executed.
Further, the upper limit following algorithm is:
first, initializing four variables: an initial value a, a final value b, a batch scale h and an upper limit following value g; the upper limit following value is used for dynamically modifying the upper limit of the for loop;
second, testing in a while loop, and starting creation when the initial value a is less than or equal to the final value b;
third, through an if test inside the while loop, setting the upper limit following value g to b + 1 when g is greater than the final value b;
fourth, adding a for loop at the same level as the if test, with the initial value a as its lower limit and the upper limit following value g as its upper limit; the tasks of a batch are created inside the for loop, and the next statement at the same level as the for loop is an execution function, through which the tasks are executed;
fifth, adding the batch scale h to both the initial value a and the upper limit following value g, and repeating the second to fourth steps until batch creation of all links is completed.
Furthermore, the method for scheduling tasks in batches is implemented through loop nesting: variables are set for the inner loop and the execution function is placed in the outer loop, so that when the outer loop enters its next iteration after the first has executed, the upper and lower limits of the inner loop are updated automatically and the batch creation of links is completed.
The invention has the beneficial effects that:
Links and tasks are created in batches, and the links are obtained directly when tasks are created. This reduces the memory occupied by coroutine task creation, effectively reduces the business-processing and time losses caused by return failures when a coroutine is interrupted during its work, and ensures the safety of data acquisition.
Converting the character strings into KEY values and storing them in the database effectively shortens the retrieval time of the strings required for task processing and improves the update speed of the database.
Drawings
FIG. 1 is a flowchart of a method for scheduling tasks in batches based on a programming language according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the execution of all coroutine tasks at once according to an embodiment of the invention.
FIG. 3 is a diagram illustrating batch execution of coroutine tasks according to an embodiment of the invention.
Detailed Description
The following description is presented to disclose the invention so as to enable any person skilled in the relevant art to practice the invention. The embodiments described below are intended to be examples only, and other obvious modifications will occur to those skilled in the relevant art and are within the scope of the invention.
In a programming language, loop nesting is a common way to process instructions in batches, and character strings that follow a regular pattern can be generated with nested loops. First, variables are set for the inner loop. Because the upper and lower limits of the inner loop change with these variables, they are updated automatically each time the outer loop moves to its next iteration after the previous one has executed, so the limits of the inner loop follow the outer loop and the tasks are created in batches. The loop-nesting algorithm is further improved by placing the execution function in the outer loop, which amounts to opening up an execution space inside the nested loops for executing each batch of tasks. A batch of tasks is created by the inner loop and executed in the outer loop, after which creation and execution of the next batch begin, thereby realizing batch scheduling of tasks.
In one embodiment, as shown in fig. 1, a method for scheduling tasks in batches based on a programming language includes the following specific steps:
Step A, creating links in batches;
Access links are obtained from the target webpage, the pattern of links of the same type is analyzed, and links are created in batches through the upper limit following algorithm.
Step B, placing the created links into a MongoDB database;
The links created in batches are stored in a database in the execution space. The re module is imported, each link is converted by the re.sub() function and stored as a key in a dict, and the value corresponding to each key in the dict is set to any retrievable value. For example, the character "#" in a link is converted to another, uncommon identifier "##".
Because the links are stored in the database directly as KEY values, retrieving a KEY is faster than traversing and matching values, and the number of documents has very little influence on retrieval. This shortens the time needed to update the database and avoids the large amount of running space otherwise occupied by the update process.
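The conversion in step B can be illustrated with a minimal sketch. The helper names `link_to_key` and `links_to_document`, the placeholder value `1`, and the example.com links are assumptions made for illustration; the source only names re.sub() and a dict. The "#" to "##" replacement follows the example in the text, and the character to replace is left as a parameter.

```python
import re

def link_to_key(link: str, char: str = "#", token: str = "##") -> str:
    """Replace every occurrence of `char` in the link with `token`."""
    # re.escape makes `char` a literal pattern; the lambda keeps `token`
    # from being interpreted as a replacement template.
    return re.sub(re.escape(char), lambda _m: token, link)

def links_to_document(links, placeholder=1):
    """Build one dict whose keys are the converted links (hypothetical layout)."""
    return {link_to_key(link): placeholder for link in links}

doc = links_to_document([
    "http://example.com/list#1",
    "http://example.com/list#2",
])
# `doc` is a plain dict; handing it to a MongoDB driver is not shown here.
```

Keeping the replaced character as a parameter lets the same helper cover other characters that are awkward in keys.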
Step C, creating and executing coroutine tasks in batches;
When the links have been created, the coroutine obtains the links directly, instantiates an asynchronous coroutine object, places the links into crawler access tasks through the ensure_future() function, and starts executing the coroutine tasks. When the batch of tasks has been executed, the execution results are returned to the database and the strings in the database are updated.
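One way to picture step C is the sketch below, which assumes that the ensure_future() function in the text is Python's `asyncio.ensure_future`. The `visit` coroutine is a stand-in for a real crawler request and performs no network access; its name, the "ok" result, and the example.com links are hypothetical.

```python
import asyncio

async def visit(link: str) -> tuple:
    """Stand-in for a crawler request; a real crawler would do network I/O here."""
    await asyncio.sleep(0)  # yield control, simulating a non-blocking wait
    return link, "ok"

async def run_batch(links):
    # Wrap each coroutine in a Task so the whole batch is scheduled together.
    tasks = [asyncio.ensure_future(visit(link)) for link in links]
    # return_exceptions=True returns a failure as a value instead of raising,
    # so one failed link does not discard the other results of the batch.
    return await asyncio.gather(*tasks, return_exceptions=True)

batch = [f"http://example.com/page/{i}" for i in range(1, 4)]
results = asyncio.run(run_batch(batch))
```

`asyncio.gather` returns results in the order the tasks were passed in, so each result can be matched back to its link when the database is updated.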
Step D, updating the database;
The database is accessed and the data is queried through the find() function. The documents are output one by one in a loop; the keys of each document are obtained through keys() and converted into a list with list(), which yields the links; and each link string is restored through the re.sub() function, for example by converting the identifier "##" back to the original character. The links in the database are then updated according to the execution results.
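The read-back in step D might look like the following sketch. A list of plain dicts stands in for the documents that a find() query would return, so no database driver is needed to run it. The helper names and the sample documents are hypothetical; skipping `_id` reflects the identifier field that MongoDB adds to every document.

```python
import re

def key_to_link(key: str, char: str = "#", token: str = "##") -> str:
    """Undo the step B conversion by turning `token` back into `char`."""
    return re.sub(re.escape(token), lambda _m: char, key)

def links_from_documents(documents):
    """Collect the restored links from an iterable of documents."""
    links = []
    for doc in documents:              # output the documents one by one
        for key in list(doc.keys()):   # keys() converted with list()
            if key == "_id":           # MongoDB's own identifier field
                continue
            links.append(key_to_link(key))
    return links

# Stand-in for the result of a find() query.
stored = [
    {"_id": 0, "http://example.com/list##1": 1},
    {"_id": 1, "http://example.com/list##2": 1},
]
restored = links_from_documents(stored)
```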
Step E, finishing the execution;
After the batch of tasks in the execution space has been executed, the upper limit following method determines whether all tasks are finished. If not, steps A to D are repeated to create and execute further tasks; if so, all coroutine tasks have been executed.
With the method for scheduling tasks in batches based on a programming language, an initial value and an upper limit value are set for the strings to be created, the specified number of distinct strings is generated and stored in a database in the execution space of the outer loop, and creation and execution of the coroutine tasks begin. The creation command for the next iteration of the outer loop waits until the batch of tasks in the execution space has been processed, after which the outer loop enters its next iteration. Once the upper limit value has been reached and execution has finished, the outer loop exits and all batch-scheduled tasks are complete. The method is therefore suitable not only for coroutine tasks; thread tasks or other tasks can also be scheduled in batches in the same way.
The upper limit following algorithm is specifically as follows:
First, four variables are initialized: an initial value a, a final value b, a batch scale h and an upper limit following value g. The initial value a is the starting number, the final value b is the maximum number, the batch scale h is the number of tasks created in one batch, and the upper limit following value is the initial value plus the batch scale, i.e. g = a + h, which is used for dynamically modifying the upper limit of the for loop.
Second, a while loop tests the condition: when the initial value a is less than or equal to the final value b, creation starts; otherwise, no creation is performed.
Third, through an if test inside the while loop, the upper limit following value g is set to b + 1 when g is greater than the final value b.
Fourth, a for loop is added at the same level as the if test, with the initial value a as its lower limit and the upper limit following value g as its upper limit. The tasks of a batch are created inside the for loop, and the next statement at the same level as the for loop is the opened-up execution space, where the tasks are executed.
Fifth, the batch scale h is added to both the initial value a and the upper limit following value g. The second to fourth steps are repeated until batch creation of all links is completed.
With the upper limit following algorithm, after a batch of tasks has been executed, the while loop determines whether execution is finished and, if not, distributes tasks again; only the tasks to be executed need to be added at the opened-up execution space.
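The five steps above can be sketched in Python as follows. The function name `schedule_in_batches` and the two callbacks `create_task` and `execute_batch` are hypothetical names introduced here; the variables a, b, h and g are those of the text. In the usage lines, a = 1 and the link template are assumptions chosen so that exactly 123 numbered links are produced.

```python
def schedule_in_batches(a, b, h, create_task, execute_batch):
    """Create tasks numbered a..b in batches of at most h and execute each batch."""
    g = a + h                      # first step: upper limit following value
    results = []
    while a <= b:                  # second step: continue while work remains
        if g > b:                  # third step: clip the upper limit to b + 1
            g = b + 1
        batch = []
        for i in range(a, g):      # fourth step: inner loop creates one batch
            batch.append(create_task(i))
        # Statement at the same level as the for loop: the execution space.
        results.append(execute_batch(batch))
        a += h                     # fifth step: both limits follow by h
        g += h
    return results

# Usage: each "task" is a link string and "executing" a batch just counts it.
sizes = schedule_in_batches(
    1, 123, 30,
    create_task=lambda i: f"http://example.com/page/{i}",
    execute_batch=len,
)
# sizes == [30, 30, 30, 30, 3]
```

Because `range(a, g)` excludes g, clipping g to b + 1 makes the last batch end exactly at the final value b.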
In another embodiment, suppose a coroutine needs to access 123 links, each link request takes 1 second, and each blocking wait takes 9 seconds.
As shown in fig. 2, when all coroutine tasks are executed at once, the requests take 123 seconds, the blocking takes 9 seconds, and the total waiting time is 132 seconds.
As shown in fig. 3, when the coroutine tasks are executed in batches, 30 links are accessed per batch, i.e. a = 0, b = 123 and h = 30. The 123 links are divided into 5 batches, 4 of which contain 30 links and 1 of which contains the remaining 3. The requests take 123 seconds, the blocking takes 45 seconds, and the total waiting time is 168 seconds.
The total time is calculated as: total time = number of links × time per request + blocking time per batch × total number of batches.
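The arithmetic of this embodiment can be checked with a short sketch. The function name `total_wait` and its parameter names are hypothetical; the figures are those given in the text, with one blocking wait counted per batch.

```python
import math

def total_wait(n_links, request_s, block_s, batch_size=None):
    """Total waiting time in seconds: requests plus one blocking wait per batch."""
    # With no batch size, everything runs as a single batch.
    batches = 1 if batch_size is None else math.ceil(n_links / batch_size)
    return n_links * request_s + block_s * batches

one_shot = total_wait(123, 1, 9)      # 123 * 1 + 9 * 1 = 132
batched = total_wait(123, 1, 9, 30)   # 123 * 1 + 9 * 5 = 168
```

The 36 extra seconds of the batched run are the cost of four additional blocking waits.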
If a network problem occurs during one-time execution, all coroutine tasks fail to return, and data can be obtained only after every request has completed. With batch execution, only the batch currently being executed is affected; batches completed earlier are unaffected and their data has already been obtained, so a network problem no longer causes all data to fail to return. The fewer the batches, the less time is consumed, but the greater the risk of data loss. The method is suitable for processing a large number of coroutine tasks and ensures the safety of data acquisition.
With the method for scheduling tasks in batches based on a programming language, an execution space is combined with loop nesting. The upper limit on the number of iterations of a loop is far higher than the upper limit on the recursion depth of a recursive function, so the combined execution space can be used directly for repeated creation and execution, can take the place of a recursive function, and avoids the stack overflow to which recursive functions are prone. Creating tasks in batches reduces the number of tasks a coroutine executes in a single batch and improves the efficiency of data acquisition. It also reduces the memory occupied by coroutine task creation and effectively reduces the business-processing and time losses caused by return failures when a coroutine is interrupted during its work. Converting the strings and storing them directly in the database effectively shortens the retrieval time of the strings required for task processing and improves the update speed of the database.
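The comparison between loops and recursion can be illustrated in Python, where the interpreter enforces a recursion limit. This is only a sketch: both functions are hypothetical and merely count batches, standing in for the repeated create-and-execute cycle.

```python
import sys

def count_batches_recursive(a, b, h):
    """Recursive variant: one stack frame per batch."""
    if a > b:
        return 0
    return 1 + count_batches_recursive(a + h, b, h)

def count_batches_loop(a, b, h):
    """Loop variant: constant stack depth regardless of the number of batches."""
    n = 0
    while a <= b:
        n += 1
        a += h
    return n

# Choose more batches than the interpreter's recursion limit allows.
many = sys.getrecursionlimit() + 100
loop_count = count_batches_loop(1, many, 1)
try:
    count_batches_recursive(1, many, 1)
    recursion_ok = True
except RecursionError:
    recursion_ok = False
```

The loop version completes for any number of batches, while the recursive version fails once the number of batches exceeds the recursion limit.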
The above-mentioned embodiments only express the specific embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.

Claims (2)

1. A method for scheduling tasks in batches based on a programming language, comprising the steps of:
step A, creating links in batches, acquiring access links from a target webpage, analyzing the rules of the links of the same type, and creating the links in batches through an upper limit following algorithm;
the upper limit following algorithm is as follows:
firstly, initializing four variables, including an initial value a, a final value b, a batch scale h and an upper limit following value g; the upper limit following value is used for dynamically modifying the upper limit of the for loop;
secondly, judging through while circulation, and starting to execute creation when the initial value a is less than or equal to the final value b;
thirdly, through if judgment in while circulation, when the upper limit following value g is larger than the final value b, the upper limit following value g is modified into a final value b + 1;
adding a for loop with the if peer, setting a lower limit as an initial value a and an upper limit as an upper limit following value g through the for loop, wherein a task established in batches is in the for loop, and a next statement at the for loop peer is an execution function and executes the task through the execution function;
adding the initial value a and the upper limit following value g to the value of the batch scale h; repeating the judging processes from the second step to the fourth step until batch creation of all links is completed;
step B, storing the link into a MongoDB database;
in the step B, the link is converted through the re.sub() function, and the link is stored as a key in dict form;
step C, creating and executing tasks in batches;
the tasks created in batches are coroutine tasks; in the step C, when a batch of links has been created, the coroutine directly obtains the batch of links, instantiates an asynchronous coroutine object, puts the links into crawler access tasks through the ensure_future() function, and starts to execute the coroutine tasks;
step D, updating the database, and updating the link in the database according to the task execution result;
the step D comprises the following steps: firstly, data is queried through the find() function to obtain a key; the key is converted through list() to obtain the link; the link character string is restored through the re.sub() function, thereby completing updating of the link;
step E, finishing the execution of all tasks and ending;
the step E comprises the following steps: after a batch of tasks are executed, judging whether all the tasks are finished; if not, repeating the steps A-D, and re-establishing and executing the tasks until all the tasks are executed.
2. The programming language based task batch scheduling method of claim 1, wherein the task batch scheduling method is implemented by loop nesting, and the upper and lower limits of the inner loop are automatically updated to complete the batch creation of the link when the loop nesting enters the next loop after the execution of the first loop by setting variables for the inner loop of the loop nesting and placing an execution function in the outer loop of the loop nesting.
CN202010662645.7A 2020-07-10 2020-07-10 Method for scheduling tasks in batches based on programming language Active CN111796922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010662645.7A CN111796922B (en) 2020-07-10 2020-07-10 Method for scheduling tasks in batches based on programming language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010662645.7A CN111796922B (en) 2020-07-10 2020-07-10 Method for scheduling tasks in batches based on programming language

Publications (2)

Publication Number Publication Date
CN111796922A CN111796922A (en) 2020-10-20
CN111796922B true CN111796922B (en) 2022-02-01

Family

ID=72806760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010662645.7A Active CN111796922B (en) 2020-07-10 2020-07-10 Method for scheduling tasks in batches based on programming language

Country Status (1)

Country Link
CN (1) CN111796922B (en)

Also Published As

Publication number Publication date
CN111796922A (en) 2020-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant