CN116010053A - Task information pushing method and device, storage medium and electronic device


Info

Publication number
CN116010053A
Authority
CN
China
Prior art keywords
task
trained
training
tasks
task information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211686697.3A
Other languages
Chinese (zh)
Inventor
郑劭杰
吴立
江达秀
骆昕
王志骁
王冠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202211686697.3A
Publication of CN116010053A

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a task information pushing method and device, a storage medium and an electronic device, wherein the method comprises: acquiring a plurality of tasks to be trained created by a cloud platform and the task information of those tasks; and, in the case that the container cloud successfully schedules the plurality of tasks to be trained according to the task information, acquiring the running state corresponding to each task to be trained and pushing the task information corresponding to each task to be trained to the message queue corresponding to that task's running state. This solves the problem in the prior art that the task information of a task to be trained cannot be managed according to the running state of the task.

Description

Task information pushing method and device, storage medium and electronic device
Technical Field
The invention relates to the field of deep learning, in particular to a task information pushing method and device, a storage medium and an electronic device.
Background
With the increasing computing power of hardware oriented to Machine Learning (ML) and the further development of open-source machine learning frameworks, the scope of deep learning applications, and the demand for them, has been expanding rapidly.
In the prior art, algorithm training tasks may be scheduled in the following ways: deciding whether to schedule execution by comparing a task's execution time with the current time; or using a decentralized, distributed high-concurrency job scheduling system that sends data to a designated distributed message queue for other nodes to discover. However, based on the task information obtained during task scheduling, these approaches do not take into account that the tasks to be scheduled and allocated may be in different states, for example whether the training environment is ready or whether training has finished.
No effective solution has yet been proposed for problems in the prior art such as the inability to manage the task information of a task to be trained according to the running state of the task.
Disclosure of Invention
The embodiment of the invention provides a task information pushing method and device, a storage medium and an electronic device, which at least solve the problem in the related art that the task information of a task to be trained cannot be managed according to the running state of the task.
According to an aspect of the embodiment of the present invention, there is provided a method for pushing task information, including: acquiring a plurality of tasks to be trained created by a cloud platform and task information of the tasks to be trained; and, in the case that the container cloud successfully schedules the plurality of tasks to be trained according to the task information, acquiring the running states respectively corresponding to the plurality of tasks to be trained, and pushing the task information respectively corresponding to the plurality of tasks to be trained to the message queues corresponding to the running states of the plurality of tasks to be trained.
In an exemplary embodiment, pushing task information corresponding to the plurality of tasks to be trained to a message queue corresponding to an operation state of the plurality of tasks to be trained, includes: determining a first training task with an operating state being an initialization state from the plurality of tasks to be trained, wherein the initialization state is used for indicating that a training container for training the first training task is not created; acquiring task information of the first training task, and pushing the task information of the first training task to an initialized task message queue.
In an exemplary embodiment, pushing task information corresponding to the plurality of tasks to be trained to message queues corresponding to the running states of the plurality of tasks to be trained includes: determining, from the plurality of tasks to be trained, a second training task whose running state is the in-running state, wherein the in-running state is used for indicating that a training container for training the second training task has been created; and acquiring the task information of the second training task, and pushing the task information of the second training task to a running task message queue.
In an exemplary embodiment, after pushing the task information of the second training task to the running task message queue, the method further includes: acquiring a training start script in task information of the second training task; training the second training task in a training container of the second training task according to the training initiation script.
In an exemplary embodiment, pushing task information corresponding to the plurality of tasks to be trained to a message queue corresponding to an operation state of the plurality of tasks to be trained, includes: determining a third training task with an operation state being an ending state from the plurality of tasks to be trained; and acquiring the task information of the third training task, and pushing the task information of the third training task to an ending task message queue.
In an exemplary embodiment, after pushing task information corresponding to the plurality of tasks to be trained to message queues corresponding to the running states of the plurality of tasks to be trained, the method further includes: in the case that the message queues comprise at least one of an initialization task message queue, a running task message queue and an ending task message queue, determining a first training task corresponding to the task information in the initialization task message queue, a second training task corresponding to the task information in the running task message queue and a third training task corresponding to the task information in the ending task message queue; and determining the running conditions of the plurality of tasks to be trained according to the first training task, the second training task and the third training task.
In an exemplary embodiment, after acquiring the plurality of tasks to be trained and the task information of the tasks to be trained created by the cloud platform, the method further includes: acquiring target resources required by training of the task to be trained according to the task information; comparing the size of the target resource with the size of the residual resources of the server resource group corresponding to the task to be trained; and under the condition that the residual resources are larger than the target resources, determining that the task to be trained passes through a first resource check.
According to another aspect of the embodiment of the present invention, there is also provided a device for pushing task information, including: the acquisition module is used for acquiring a plurality of tasks to be trained and task information of the tasks to be trained, which are created by the cloud platform; and the pushing module is used for acquiring the running states corresponding to the tasks to be trained respectively under the condition that the container cloud successfully dispatches the tasks to be trained according to the task information, and pushing the task information corresponding to the tasks to be trained to the message queues corresponding to the running states of the tasks to be trained respectively.
According to another aspect of the embodiment of the present invention, there is further provided a computer-readable storage medium, where the computer-readable storage medium includes a stored program, and the program, when run, performs the above method for pushing task information.
According to still another aspect of the embodiment of the present invention, there is further provided an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to perform the above method for pushing task information.
In the embodiment of the invention, a plurality of tasks to be trained created by a cloud platform and task information of the tasks to be trained are acquired; and, in the case that the container cloud successfully schedules the plurality of tasks to be trained according to the task information, the running states respectively corresponding to the plurality of tasks to be trained are acquired, and the task information respectively corresponding to the plurality of tasks to be trained is pushed to the message queues corresponding to the running states of the plurality of tasks to be trained. That is, after the plurality of tasks to be trained are successfully scheduled by the container cloud, the task information of each task to be trained is pushed, according to the running state of that task, to the message queue corresponding to that running state. This solves the problem in the prior art that the task information of a task to be trained cannot be managed according to the running state of the task.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
fig. 1 is a hardware structure block diagram of a cloud platform of a task information pushing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative method of pushing task information according to an embodiment of the invention;
FIG. 3 is a task processing flow diagram of an alternative method of pushing task information according to an embodiment of the present invention;
FIG. 4 is a task state flow diagram of an alternative task information pushing method according to an embodiment of the present invention;
fig. 5 is a block diagram of a task information pushing device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method embodiment provided by the embodiment of the invention can be run on a cloud platform. Taking operation on the cloud platform as an example, fig. 1 is a hardware structure block diagram of a cloud platform for a task information pushing method according to an embodiment of the present invention. As shown in fig. 1, the cloud platform may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and in one exemplary embodiment may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and does not limit the structure of the cloud platform described above. For example, the cloud platform may further include more or fewer components than those shown in fig. 1, or have a different configuration with functions equivalent to or beyond those shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and modules, such as a computer program corresponding to a method for pushing task information in an embodiment of the present invention; the processor 102 executes various functional applications and data processing, that is, implements the above method, by running the computer program stored in the memory 104. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located with respect to the processor 102, which may be connected to the cloud platform via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. A specific example of the network described above may include a wireless network provided by a communication provider of the cloud platform. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC for short) that can connect to other network devices through a base station so as to communicate with the internet.
In this embodiment, a method for pushing task information is provided, which may be, but is not limited to being, applied to the above cloud platform. Fig. 2 is a flowchart of an alternative method for pushing task information according to an embodiment of the present invention, and the flow includes the following steps:
step S202, acquiring a plurality of tasks to be trained and task information of the tasks to be trained, which are created by a cloud platform;
step S204, under the condition that the container cloud successfully schedules the tasks to be trained according to the task information, acquiring running states corresponding to the tasks to be trained respectively, and pushing task information corresponding to the tasks to be trained to message queues corresponding to the running states of the tasks to be trained respectively.
Through the above steps, a plurality of tasks to be trained created by a cloud platform and task information of the tasks to be trained are acquired; and, in the case that the container cloud successfully schedules the plurality of tasks to be trained according to the task information, the running states respectively corresponding to the plurality of tasks to be trained are acquired, and the task information respectively corresponding to the plurality of tasks to be trained is pushed to the message queues corresponding to the running states of the plurality of tasks to be trained. That is, after the plurality of tasks to be trained are successfully scheduled by the container cloud, the task information of each task to be trained is pushed, according to the running state of that task, to the message queue corresponding to that running state. This solves the problem in the prior art that the task information of a task to be trained cannot be managed according to the running state of the task.
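For illustration only, the following minimal Python sketch shows the dispatch idea of steps S202 to S204; it is not part of the original disclosure, and the state names, queue names and the container-cloud client interface (get_state) are assumptions.

```python
from collections import defaultdict

# Hypothetical mapping from a task's running state to its message queue name.
STATE_TO_QUEUE = {
    "initializing": "init_task_queue",
    "running": "running_task_queue",
    "finished": "ending_task_queue",
}

class FakeContainerCloud:
    """Stand-in for the container cloud; get_state is an assumed interface."""
    def __init__(self, states):
        self._states = states

    def get_state(self, task_id):
        return self._states[task_id]

def push_by_state(tasks, container_cloud, queues):
    """Push each successfully scheduled task's information to the queue matching its state."""
    for task in tasks:
        state = container_cloud.get_state(task["task_id"])
        queues[STATE_TO_QUEUE[state]].append(task["info"])

if __name__ == "__main__":
    cloud = FakeContainerCloud({"001": "running", "002": "initializing"})
    tasks = [{"task_id": "001", "info": {"name": "resnet"}},
             {"task_id": "002", "info": {"name": "yolo"}}]
    queues = defaultdict(list)   # in-memory stand-ins for the message queues
    push_by_state(tasks, cloud, queues)
    print(dict(queues))
```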
Optionally, pushing task information corresponding to the plurality of tasks to be trained to message queues corresponding to running states of the plurality of tasks to be trained, including: determining a first training task with an operating state being an initialization state from the plurality of tasks to be trained, wherein the initialization state is used for indicating that a training container for training the first training task is not created; acquiring task information of the first training task, and pushing the task information of the first training task to an initialized task message queue.
It can be understood that after a task to be trained created by the cloud platform is successfully scheduled to the container cloud, the container cloud creates a unique task id for each task to be trained, and meanwhile each task enters the initialization state, so the task information of these first training tasks is pushed to the initialization task message queue. In the container cloud: 1) the initialization state is the process of creating a training environment and preparing training resources for a task to be trained, including downloading the data set required by the task and downloading the training code from the code repository onto the directory on which the container is mounted; this work is done by an initialization container that the container cloud creates in the same container runtime environment as the present training task, and at this point the training container for training the first training task has not yet been created. 2) The initialization task message queue receives the task information of first training tasks that have been successfully scheduled but whose training containers have not been created, and the initialization processing node continuously traverses the initialization task message queue.
Optionally, pushing task information corresponding to the plurality of tasks to be trained to message queues corresponding to running states of the plurality of tasks to be trained, including: determining a second training task with an operating state being an operating state from the plurality of tasks to be trained, wherein the operating state is used for indicating that a training container for training the second training task is created; and acquiring the task information of the second training task, and pushing the task information of the second training task to an operating task message queue.
It should be noted that the training container for training a second training task in the in-running state has already been created. A second training task arises as follows: the initialization processing node continuously traverses the initialization task message queue, that is, it continuously queries the container cloud for the state of each first training task according to its task id; when it finds that the state of a first training task has changed to running, i.e. the training container has been created, it pushes the task information of that first training task to the running task message queue. When traversing the initialization task message queue and/or the running task message queue, if the task state of a task to be trained is detected to be unchanged, its task information is not pushed and the next traversal is awaited.
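A minimal sketch of this traversal loop is shown below; it is illustrative only, the state strings and the get_state call are assumptions, and plain Python lists stand in for the message queues.

```python
import time

def init_node_loop(init_queue, running_queue, ending_queue, container_cloud, interval=5):
    """Repeatedly traverse the initialization task queue (a list of task-info dicts)
    and move tasks whose state has changed in the container cloud."""
    while True:
        for info in list(init_queue):                       # snapshot so we can remove safely
            state = container_cloud.get_state(info["task_id"])
            if state == "running":                          # training container created
                init_queue.remove(info)
                running_queue.append(info)
            elif state == "finished":                        # failed or ended early
                init_queue.remove(info)
                ending_queue.append(info)
            # state unchanged: keep the task and wait for the next traversal
        time.sleep(interval)
```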
Optionally, after pushing the task information of the second training task to the running task message queue, the method further includes: acquiring a training start script in task information of the second training task; training the second training task in a training container of the second training task according to the training initiation script.
It will be appreciated that the second training task may be trained upon creation of a training container in the container cloud. However, it should be noted that a training initiation script in the task information of the second training task needs to be determined, and then training of the second training task is initiated in the training container according to the training initiation script and the training code.
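As a rough illustration of launching training from the start script recorded in the task information, here is a small sketch; the field names (start_script, code_dir) and paths are assumptions, not the platform's actual schema.

```python
import subprocess

def start_training(task_info):
    """Launch training inside the training container using the start script
    recorded in the task information (field names are assumptions)."""
    script = task_info["start_script"]                  # e.g. "bash /workspace/train.sh"
    workdir = task_info.get("code_dir", "/workspace")   # directory holding the training code
    return subprocess.Popen(script, shell=True, cwd=workdir)
```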
Optionally, pushing task information corresponding to the plurality of tasks to be trained to message queues corresponding to running states of the plurality of tasks to be trained, including: determining a third training task with an operation state being an ending state from the plurality of tasks to be trained; and acquiring the task information of the third training task, and pushing the task information of the third training task to an ending task message queue.
Specifically, a second training task in the in-running state is in the process of running training and generating a model according to the start script of the training code. When such a task finishes training and has generated its model, its task state changes to the end state; when the query by task id shows that the task state has changed to the end state, the task information is pushed to the ending task message queue. The ending task processing node consumes the ending task message queue, then schedules the container cloud to delete the corresponding third training task, records the resource occupation of the third training task, and releases that resource occupation.
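A rough sketch of this ending-node behaviour follows; every interface used here (get_failure_reason, delete_task, relational_db.save and the resource bookkeeping) is a placeholder assumption, not the platform's real API.

```python
def ending_node_consume(ending_queue, container_cloud, relational_db, resource_usage):
    """Consume the ending task queue: record state/usage, delete the finished task
    from the container cloud and release its resources (all interfaces are assumed)."""
    while ending_queue:
        info = ending_queue.pop(0)
        task_id = info["task_id"]
        reason = container_cloud.get_failure_reason(task_id)   # None if it succeeded
        relational_db.save(task_id, state="finished", failure_reason=reason,
                           resources=info["resources"])        # record state and usage
        container_cloud.delete_task(task_id)                   # remove the finished task
        resource_usage[info["resource_group"]] -= info["resources"]  # release the quota
```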
Optionally, after pushing the task information corresponding to the plurality of tasks to be trained to the message queues corresponding to the running states of the plurality of tasks to be trained, the method further includes: the message queue comprises at least one of the following: under the conditions of initializing a task message queue, an operating task message queue and an ending task message queue, determining a first training task corresponding to task information in the initializing task message queue, a second training task corresponding to task information in the operating task message queue and a third training task corresponding to task information in the ending task message queue; and determining the running conditions of the plurality of training tasks according to the first training task, the second training task and the third training task.
It can be understood that, from the task information in the initialization task message queue, the running task message queue and the ending task message queue, the different task states of the tasks to be trained corresponding to the task information in the different message queues can be known clearly, and the running conditions of the plurality of tasks to be trained can thus be determined. For example, if a task to be trained is assigned task id 001 after successful scheduling, then by finding task information 001 in the running task message queue it can be determined that task 001, i.e. the second training task numbered 001, is being trained in its training container and is generating a model.
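The lookup described above can be sketched as follows (illustrative only; the queues are plain lists of task-info dicts and the returned phrases are informal):

```python
def task_condition(task_id, init_queue, running_queue, ending_queue):
    """Report what a task is currently doing by locating its task information
    in one of the three message queues."""
    if any(t["task_id"] == task_id for t in running_queue):
        return "training in its container and generating a model"
    if any(t["task_id"] == task_id for t in init_queue):
        return "preparing its data set and training code"
    if any(t["task_id"] == task_id for t in ending_queue):
        return "finished (or failed)"
    return "not yet scheduled"
```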
After the plurality of tasks to be trained created by the cloud platform and the task information of the tasks to be trained are acquired, resource verification needs to be performed on the task information. This specifically comprises a first resource check and a second resource check, wherein 1) the first resource check comprises: acquiring, according to the task information, the target resources required by the task to be trained for training; comparing the size of the target resources with the size of the remaining resources of the server resource group corresponding to the task to be trained; and, in the case that the remaining resources are larger than the target resources, determining that the task to be trained passes the first resource check.
Optionally, acquiring, according to the task information, a target resource required by the task to be trained for training, including: acquiring user group quota information in the task information, wherein the user group quota information is used for indicating resources required by training a task to be trained; and acquiring the target resource according to the user group quota information.
It should be noted that the server resource group corresponding to the task to be trained results from adding the server cluster of the deployed equipment to the container cloud so that the container cloud performs unified management, dividing and abstracting the cluster into resource groups according to their ML hardware; servers under the same resource group have the same ML hardware. Taking task 001 as an example of the first resource check: after the task information of task 001 is obtained, the user group quota information in the task information is read, from which the resources required for training task 001 can be obtained, for example 20% of the resource group where task 001 is located. If the remaining resource ratio of that resource group is more than 20%, the first resource check is passed; if the remaining resource ratio is less than 20%, the first resource check is not passed.
2) Before or after determining that the task to be trained passes the first resource check, the method further comprises: acquiring the number of servers required by training the task to be trained according to the task information; comparing the server number with the available server number of the server resource group corresponding to the task to be trained; and under the condition that the number of the available servers is larger than the number of the servers, determining that the task to be trained passes the second resource check.
Optionally, the obtaining the number of servers required by the task to be trained for training according to the task information includes: acquiring server number information in the task information, wherein the server number information is used for indicating the number of servers required by the task to be trained for training; and determining the server number information as the number of servers required by training the task to be trained.
It may be understood that, similarly to the first resource check and taking task 001 as an example, in the case that task 001 passes the first resource check, the number of servers required by task 001, for example 3 servers, can be obtained from its task information; in the case that the number of remaining servers of the server resource group where task 001 is located is greater than 3, task 001 passes the second resource check.
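As a concrete illustration of the two checks, here is a minimal Python sketch; the field names (user_group_quota as a fraction of the group, server_count, available_servers) are assumptions, and the strictly-greater comparisons simply follow the wording above.

```python
def first_resource_check(task_info, group):
    """Quota check: the fraction of the resource group the task needs must be
    smaller than the fraction still free (field names are assumptions)."""
    remaining = 1.0 - group["used_fraction"]
    return remaining > task_info["user_group_quota"]

def second_resource_check(task_info, group):
    """Server-count check: the group must have more available servers than the
    task asks for (strictly greater, following the wording above)."""
    return group["available_servers"] > task_info["server_count"]

# Task 001: needs 20% of its resource group and 3 servers.
task_001 = {"user_group_quota": 0.20, "server_count": 3}
group = {"used_fraction": 0.65, "available_servers": 5}
print(first_resource_check(task_001, group))   # True: 35% remaining > 20% needed
print(second_resource_check(task_001, group))  # True: 5 available > 3 needed
```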
It should be noted that, in the process of resource verification, the task may not meet the requirements of the first resource verification and/or the second resource verification, and in this case, the task information may be added to the waiting task dictionary and the waiting reason dictionary of the memory database to wait for the next resource verification.
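If the memory database is something Redis-like, the waiting task dictionary and waiting reason dictionary could be kept as hashes, roughly as sketched below; the key names and the local Redis connection are assumptions, not the platform's actual layout.

```python
import json
import redis

r = redis.Redis()  # the in-memory database; connection details are assumptions

def park_waiting_task(task_info, reason):
    """Put a task that failed a resource check into the waiting-task and
    waiting-reason dictionaries until the next check (key names are assumptions)."""
    task_id = task_info["task_id"]
    r.hset("waiting_tasks", task_id, json.dumps(task_info))
    r.hset("waiting_reasons", task_id, reason)

def release_waiting_task(task_id):
    """Remove the task from both dictionaries once it passes the checks."""
    r.hdel("waiting_tasks", task_id)
    r.hdel("waiting_reasons", task_id)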
After the training task passes the resource verification, the task information is packaged into a target template which is allowed to be processed by the container cloud of the cloud platform, and the method comprises the following steps: determining whether a training starting script exists in the task information under the condition that the task to be trained passes through first resource verification and second resource verification, wherein the training starting script is used for starting a process for training the task to be trained; and in the presence of the training initiation script, packaging the task information into a target template which is allowed to be processed by a container cloud of the cloud platform.
Specifically, for a task to be trained that has passed resource verification, a compliance check is performed on its task information, that is, it is checked whether the task information contains a training start script for starting the process that trains the task; if the training start script exists, the task information of the task to be trained is packaged into a target template that the container cloud can process, namely a K8s standard template. In addition, if the compliance check is not passed, the pushing of the task information ends directly, and the task information of the task to be trained may optionally be pushed to the ending task message queue.
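The sketch below shows one way such a compliance check and packaging step could look; the generic Kubernetes Job manifest is only a stand-in, not the platform's actual template, and the field names in task_info are assumptions.

```python
def package_for_container_cloud(task_info):
    """Compliance check plus packaging into a Kubernetes-style Job template.
    The manifest below is a generic sketch, not the platform's actual template."""
    if not task_info.get("start_script"):
        raise ValueError("task information lacks a training start script")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"train-{task_info['task_id']}"},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "train",
                        "image": task_info["framework_image"],
                        "command": ["bash", "-c", task_info["start_script"]],
                    }],
                    "restartPolicy": "Never",
                }
            }
        },
    }
```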
Optionally, acquiring the running state of the task to be trained includes: acquiring a task identifier of the task to be trained, which is generated by the container cloud according to the target template; and inquiring the running state in the container cloud according to the task identification.
It can be understood that after the target template is sent to the container cloud, the container cloud generates a unique task id for each task to be trained; according to the task id, the running state of the task can be queried, where the task state includes the initialization state, the in-running state and the end state.
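One possible way to map what the container cloud reports onto the three task states is sketched below; the phase strings follow common Kubernetes pod phases, but the mapping and the get_phase client call are assumptions.

```python
# Assumed mapping from container-cloud phases to the three task states used above.
PHASE_TO_TASK_STATE = {
    "Pending": "initializing",
    "Running": "running",
    "Succeeded": "finished",
    "Failed": "finished",
}

def query_task_state(container_cloud, task_id):
    """Query the container cloud by the task id it generated and map the raw
    phase to initialization / running / end (mapping is an assumption)."""
    phase = container_cloud.get_phase(task_id)   # assumed client call
    return PHASE_TO_TASK_STATE.get(phase, "initializing")
```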
Optionally, the different message queues include at least one of: the system comprises a priority queue, a task message queue to be scheduled, an initialization task message queue, an on-the-fly task message queue and an ending task message queue.
Specifically, as shown in fig. 4, the processing nodes corresponding in sequence to the priority queue, the task message queue to be scheduled, the initialization task message queue, the running task message queue and the ending task message queue are: a resource check processing node, a task scheduling processing node, a task initialization processing node, a task running processing node and a task ending processing node.
The resource check processing node performs resource checks on each resource group's priority queue (i.e. the resource check processing node in fig. 4 consumes each resource group priority queue); a task to be trained that does not pass the resource check waits for the next resource check, and a task that passes enters the task message queue to be scheduled. The task scheduling processing node consumes the to-be-scheduled message queue: it first performs a compliance check on the task information; if the compliance check passes, the task to be trained can be scheduled and enters the initialization task message queue; if the compliance check fails or scheduling fails, the task enters the ending task message queue directly. The task initialization processing node consumes the initialization task message queue, performs the training preparation work of the task to be trained and queries its state; if the task state is failed or ended it enters the ending task message queue, and if the task state is running it enters the running task message queue. The task running processing node consumes the running task message queue, carries out the training work of the task to be trained and queries the task state; if the task state is finished or failed, the task enters the ending task message queue. The task ending processing node consumes the ending task message queue and queries the container cloud by task id for the failure reason of the task; if there is one, the failure reason is recorded in a relational database; the container cloud is then scheduled to delete the corresponding task; meanwhile, the state of the task and its resource usage are recorded in the relational database, and finally the resource usage of the resource group corresponding to the task is deleted from the memory database.
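The queue-to-node wiring described above can be summarized in a small sketch; every name here is illustrative and the node objects with a consume method are assumptions.

```python
# Sketch of the queue-to-node pipeline described around fig. 4 (names are illustrative).
PIPELINE = [
    ("priority_queue",        "resource_check_node"),
    ("to_be_scheduled_queue", "scheduling_node"),
    ("init_task_queue",       "initialization_node"),
    ("running_task_queue",    "running_node"),
    ("ending_task_queue",     "ending_node"),
]

def run_pipeline_once(queues, nodes):
    """One pass: each processing node consumes its own queue and, depending on the
    task state, pushes the task information to the next queue in the pipeline."""
    for queue_name, node_name in PIPELINE:
        nodes[node_name].consume(queues[queue_name])
```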
It should be noted that after successfully processing a task to be trained, both the task initialization processing node and the task running processing node query the task state from the container cloud and update the state of the task to be trained accordingly.
In order to better understand the implementation manner of the task information pushing method, in an alternative embodiment, a scheme is further provided, and the scheme is used for explaining.
An alternative embodiment of the present invention provides a method for pushing task information, and fig. 3 is a task processing flow chart of an alternative method for pushing task information in an embodiment of the present invention, where the method includes the following steps:
step 1: using a cloud deep learning platform;
When an algorithm task (equivalent to the task to be trained in the above embodiment) is newly created on the cloud deep learning platform, algorithm personnel can select a resource group cluster and a training framework image, and select the required data set through cloud storage, while training files may optionally be imported through cloud storage or a code repository. The task information finally leads to the selection of server nodes and to execution through the training scheduling solution of the cloud deep learning platform. Through the state processing nodes in the training scheduling scheme, each stage of the task can be handled in a targeted way, thereby realizing pipelined task scheduling and execution.
Step 2: submitting a task;
step 3: task information checking and resource checking;
note that 1) regarding the resource check:
each resource group has a priority queue in the memory database for storing task information that an algorithm person selectively submits training tasks (corresponding to the tasks to be trained in the above embodiments) to the resource group. The resource check processing node consumes each resource group priority queue.
As shown in fig. 4, the process by which the resource check processing node consumes the priority queues is as follows: the priority queues of all resource groups in the memory database are acquired and traversed, and all uploaded algorithm training task information in each priority queue is acquired. This information is traversed; the initial check first verifies whether the used and required resources meet the requirements, that is, it checks the user group quota and compares it with the maximum resources of the resource group. The final check compares the number of servers required by the algorithm training task with the number of servers available in the resource group, and checks whether the available resources on those servers are sufficient.
During the check, the usage of each resource group and of each server in the resource group is cached in the memory database. In the process of checking resources, a task may not meet the resource quota requirement; in that case its related information is added to the waiting task dictionary and the waiting reason dictionary of the memory database to wait for the next consumption. A task that passes the series of resource checks can be scheduled: its task information is pushed to the to-be-scheduled queue in the message queue, and its task information in the priority queue, the waiting task dictionary and the waiting reason dictionary of the memory database is deleted. In summary, the resource check processing node cyclically traverses all the priority queues and then all the tasks in each priority queue.
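One cycle of that traversal could be sketched as follows; plain dicts and lists stand in for the memory database, and passes_checks is assumed to wrap the two resource checks sketched earlier, returning a verdict and a reason.

```python
def resource_check_cycle(queues_by_group, to_be_scheduled, waiting, waiting_reason, passes_checks):
    """One cycle of the resource check node: traverse every resource group's priority
    queue; tasks that pass the checks move to the to-be-scheduled queue, the rest are
    parked with a reason for the next cycle (all structures are illustrative)."""
    for group_name, queue in queues_by_group.items():
        for info in list(queue):
            ok, reason = passes_checks(info)
            if not ok:
                waiting[info["task_id"]] = info            # wait for the next cycle
                waiting_reason[info["task_id"]] = reason
                continue
            to_be_scheduled.append(info)                   # push to the to-be-scheduled queue
            queue.remove(info)                             # delete from the priority queue
            waiting.pop(info["task_id"], None)             # and from both waiting dictionaries
            waiting_reason.pop(info["task_id"], None)
```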
2) Checking task information:
each resource group has a queue of task messages to be scheduled. The message is pushed by the resource check processing node to pass the checked task information.
As shown in fig. 4, the scheduling processing node consumes the task message queue to be scheduled. Specifically, in order to ensure reasonable and orderly scheduling of a large number of algorithm tasks, the scheduling processing node adopts depth-first traversal to the to-be-scheduled task message queues corresponding to each resource group. First, compliance checking is performed on task information of submitted algorithm tasks, i.e. whether necessary training start scripts specified by the platform exist. And then, the task information is packaged into a task template which can be processed by the container cloud, and finally, the task template is transmitted to the container cloud for actual task scheduling.
Step 4: whether the requirements are met;
If the task information check and the resource check meet the requirements, the flow proceeds to step 5.
When the requirements are not met, there are two cases: 1) case one: the resource check is not passed; the flow returns to step 3 and the resource check is performed again; 2) case two: the task information check is not passed; the flow proceeds to step 9 and the task ends.
Step 5: task scheduling;
It should be noted that when the task is scheduled, the resources used by the task are recorded in the resource group usage dictionary of the memory database.
Step 6: whether scheduling to the container cloud (equivalent to the container cloud in the above embodiment) was successful;
If scheduling succeeds, the flow proceeds to step 7; if scheduling fails, the flow proceeds to step 9.
That is, as shown in fig. 4, successful scheduling pushes the task information scheduled this time to the initialization task message queue, while tasks that fail to be scheduled in the process are placed in the ending task message queue.
Further, from successful scheduling by the container cloud to the end of its run, a task has three states in the container cloud: initialization, running and end. A task in the initialization state is in the ongoing process of environment creation and resource preparation in the container cloud; at this stage, some preparatory work can be done for the container corresponding to the training task, including downloading the data set required by the training task and downloading the training code from the code repository to the directory on which the container is mounted. This preparatory work is completed by an initialization container that the container cloud creates in the same container runtime environment as this training task. While this preparatory work is downloading and running, the corresponding state of the task in the container cloud is initialization; when the preparatory work has been processed, the corresponding state of the task in the container cloud changes from initialization to running. A task in the running state is being trained by a training container in the container cloud, and when training finishes running normally the state in the container cloud changes to end. The main difference between the initialization state and the running state is that in the initialization state the corresponding training container has not yet been created and run; the training container is created and run only after the preparatory environment work is completed. A task in the end state is a training task that has already finished running in the container cloud; its training container has stopped, and after the platform's ending processing node consumes it and records the corresponding information of the training, the ending processing node can delete the task information from the container cloud.
Step 7: task initialization status checking;
The initialization processing node can be divided into two threads to process tasks successfully scheduled by the scheduling processing node. As shown in fig. 4, one thread first consumes the initialization task message queue, whose messages are the successfully scheduled task information pushed by the scheduling processing node. These messages are synchronously stored into a non-relational database and a memory database: storage in the non-relational database is for persistence, and storage in the memory database is a cache optimization for the traversal performed by the other thread. The other thread traverses the information just synchronized to the memory database and, according to the globally unique task id generated after the container cloud creates each task, queries the container cloud for the state of the corresponding task. If the queried state is running, the task is put into the running task message queue. If the state is ended, the task whose state has changed is put into the ending task message queue, the corresponding record among the initialization tasks stored in the memory database is deleted, and the latest state of the task is synchronized back to the non-relational database. If the state is still the initialization state, the initialization processing node performs no additional operation; the task information remains in the initialization task records in the memory database and continues to be consumed after the next cycle of traversal.
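The two-thread structure of this node could look roughly like the sketch below; in-memory structures stand in for the queues and caches, and non_rel_db.save/update_state and container_cloud.get_state are assumed interfaces.

```python
import threading, time

def init_node(init_queue, running_queue, ending_queue, cache, container_cloud, non_rel_db):
    """Two-thread initialization node (a sketch with in-memory stand-ins): thread A
    drains the init task queue into the cache and the non-relational store; thread B
    traverses the cache and moves tasks whose state has changed in the container cloud."""
    def sync_thread():
        while True:
            while init_queue:
                info = init_queue.pop(0)
                cache[info["task_id"]] = info        # memory-database copy for traversal
                non_rel_db.save(info)                # persistent copy
            time.sleep(1)

    def traverse_thread():
        while True:
            for task_id, info in list(cache.items()):
                state = container_cloud.get_state(task_id)
                if state == "running":
                    running_queue.append(info)
                elif state == "finished":
                    ending_queue.append(info)
                else:
                    continue                         # still initializing: keep and retry
                del cache[task_id]                   # state changed: drop the cached record
                non_rel_db.update_state(task_id, state)
            time.sleep(1)

    for target in (sync_thread, traverse_thread):
        threading.Thread(target=target, daemon=True).start()
```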
Step 8: checking the state in task operation;
The running processing node consumes the running task message queue. The process is similar to initialization and can also be divided into two threads. One thread first consumes the running task message queue, whose messages are pushed by the initialization processing node after successful initialization. These messages are synchronously stored into the non-relational database and the memory database: storage in the non-relational database is for persistence, and storage in the memory database is a cache optimization for the traversal performed by the other thread. The other thread traverses the information just synchronized to the memory database and, according to the globally unique task id generated after the container cloud creates each task, queries the container cloud for the state of the corresponding task. If the queried state is ended, the task is put into the ending task message queue, the corresponding record in the memory database is deleted, and the latest state of the task is synchronized back to the non-relational database; if the state is still running, the running processing node performs no additional operation, the task information remains in the running task records in the memory database, and consumption continues after the next cycle of traversal.
It should be noted, however, that the main work of an algorithm training task in the initialization state is to download the relevant data set from cloud storage, pull the framework image selected for training from the image repository, and download the training code from the code repository. An algorithm training task in the running state is in the process of running training and generating a model according to the start script of the training code.
Step 9: the task ends.
As shown in fig. 4, the ending processing node consumes the ending task message queue and queries the container cloud by task id for the failure reason of the task. If there is a failure reason, it is recorded in a relational database. The container cloud is then scheduled to delete the corresponding task. Meanwhile, the state of the task and its resource usage are recorded in the relational database. Finally, the resource usage of the resource group corresponding to the task is deleted from the memory database.
In order to solve the above technical problem, 1) the present embodiment is composed, as shown in fig. 4, of cloud storage, message queues, relational/non-relational databases and a plurality of task state processing nodes. The training files to be uploaded by algorithm personnel are uniformly placed in cloud storage: the uploaded file data is uniformly stored in the storage device, the various information about the uploaded files is stored in the non-relational database, and the uploaded files are found through the information in the database. 2) The server cluster of the deployed equipment is uniformly managed by the container cloud and is divided and abstracted into resource groups by ML hardware; servers under the same resource group have the same ML hardware, and each server interacts with cloud storage through volume mounting. Based on the container cloud, the GPU usage of the resources on each server can be obtained in real time. 3) As shown in fig. 4, each task state processing node consumes its corresponding message queue in a single-process mode and performs data synchronization while consuming, storing the messages into the memory database for caching and to facilitate subsequent retry processing; the persistence strategy of the memory database is enabled so that task data is not lost. Furthermore, the task state can be subdivided into resource checking, waiting for scheduling, initializing, running and ended. Each state processing node circulates the task state through the message queues, realizing pipelined processing of tasks.
Further, the present embodiment has the following advantages: 1) by dividing the algorithm training task into effective states and adopting a suitable processing method for each state, training tasks can finally be scheduled efficiently and reasonably on a large-scale computing server cluster; 2) the computing server cluster is divided into resource groups by hardware, and through containerized management of the cluster, flexible scheduling and deployment of services based on virtual machines or containers can be realized and a unified data resource pool is provided; 3) the data in the consumption queues is synchronously stored in the memory database, and fields commonly used in the task scheduling process, such as resource usage and user quota, are also stored in the memory database, so that effective scheduling of multiple training tasks can be realized. Meanwhile, re-queuing of tasks that do not pass the check is added at the resource check processing node, ensuring that training tasks are scheduled to the greatest extent possible.
Further, 1) the present embodiment provides a proper state partitioning of algorithm training tasks and a pipelined design of the task processing procedure, which improves the resource utilization of the whole server cluster while ensuring reasonable queuing of the algorithm training tasks submitted by users to the cloud deep learning platform. 2) Through the reasonable design of the consumption queues and the coordinated use of the databases, this embodiment increases the task processing speed of the training scheduling system and reduces task queuing time.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to run the method of the embodiments of the present invention.
Fig. 5 is a block diagram of a task information pushing device according to an embodiment of the present invention; as shown in fig. 5, includes:
the acquisition module 50 is used for acquiring a plurality of tasks to be trained and task information of the tasks to be trained, which are created by the cloud platform;
and the pushing module 52 is configured to obtain running states corresponding to the plurality of tasks to be trained respectively when the container cloud successfully schedules the plurality of tasks to be trained according to the task information, and push task information corresponding to the plurality of tasks to be trained respectively to message queues corresponding to the running states of the plurality of tasks to be trained.
Through the above device, a plurality of tasks to be trained created by a cloud platform and task information of the tasks to be trained are acquired; and, in the case that the container cloud successfully schedules the plurality of tasks to be trained according to the task information, the running states respectively corresponding to the plurality of tasks to be trained are acquired, and the task information respectively corresponding to the plurality of tasks to be trained is pushed to the message queues corresponding to the running states of the plurality of tasks to be trained. That is, after the plurality of tasks to be trained are successfully scheduled by the container cloud, the task information of each task to be trained is pushed, according to the running state of that task, to the message queue corresponding to that running state. This solves the problem in the prior art that the task information of a task to be trained cannot be managed according to the running state of the task.
Optionally, the pushing module 52 is further configured to determine, from the plurality of tasks to be trained, a first training task whose running state is an initialization state, where the initialization state is used to indicate that a training container for training the first training task is not created; acquiring task information of the first training task, and pushing the task information of the first training task to an initialized task message queue.
It can be understood that after a task to be trained created by the cloud platform is successfully scheduled to the container cloud, the container cloud creates a unique task id for each task to be trained, and meanwhile each task enters the initialization state, so the task information of these first training tasks is pushed to the initialization task message queue. In the container cloud: 1) the initialization state is the process of creating a training environment and preparing training resources for a task to be trained, including downloading the data set required by the task and downloading the training code from the code repository onto the directory on which the container is mounted; this work is done by an initialization container that the container cloud creates in the same container runtime environment as the present training task, and at this point the training container for training the first training task has not yet been created. 2) The initialization task message queue receives the task information of first training tasks that have been successfully scheduled but whose training containers have not been created, and the initialization processing node continuously traverses the initialization task message queue.
Optionally, the pushing module 52 is further configured to determine, from the plurality of tasks to be trained, a second training task whose running state is the in-running state, where the in-running state is used to indicate that a training container for training the second training task has been created; and acquire the task information of the second training task, and push the task information of the second training task to a running task message queue.
It should be noted that the training container for training a second training task in the in-running state has already been created. A second training task arises as follows: the initialization processing node continuously traverses the initialization task message queue, that is, it continuously queries the container cloud for the state of each first training task according to its task id; when it finds that the state of a first training task has changed to running, i.e. the training container has been created, it pushes the task information of that first training task to the running task message queue. When traversing the initialization task message queue and/or the running task message queue, if the task state of a task to be trained is detected to be unchanged, its task information is not pushed and the next traversal is awaited.
Optionally, the pushing module 52 is further configured to obtain a training initiation script in the task information of the second training task after pushing the task information of the second training task to the running task message queue; training the second training task in a training container of the second training task according to the training initiation script.
It will be appreciated that the second training task may be trained upon creation of a training container in the container cloud. However, it should be noted that a training initiation script in the task information of the second training task needs to be determined, and then training of the second training task is initiated in the training container according to the training initiation script and the training code.
Optionally, the pushing module 52 is further configured to determine a third training task with an operation state being an end state from the plurality of tasks to be trained; and acquiring the task information of the third training task, and pushing the task information of the third training task to an ending task message queue.
Specifically, a second training task in the running state is running training and generating a model according to the start script of the training code. Once training is completed and the model has been generated, the task state of the second training task changes to the end state; when the task-id query detects this change, the task information is pushed to the ending task message queue, and the task becomes a third training task. The ending task processing node consumes the ending task message queue, schedules the container cloud to delete the corresponding third training task, records the resource occupation of the third training task, and releases the resources it occupied.
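An illustrative sketch of the ending task processing node is given below, where container_cloud and database stand in for whatever clients the platform actually uses; their method names are assumed interfaces, not APIs defined in this application.

def ending_processing_node_once(ending_queue, container_cloud, database) -> None:
    """Consume the ending task message queue and clean up third training tasks."""
    while not ending_queue.empty():
        task_info = ending_queue.get()
        task_id = task_info["task_id"]
        container_cloud.delete_task(task_id)                  # delete the finished task
        database.record_resource_usage(task_id, task_info)    # record its resource occupation
        database.release_resources(task_id)                   # release the occupied resources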
Optionally, after task information corresponding to the plurality of tasks to be trained has been pushed to the message queues corresponding to the running states of the plurality of tasks to be trained, where the message queues include at least one of an initialization task message queue, a running task message queue and an ending task message queue, the pushing module 52 is further configured to determine the first training task corresponding to the task information in the initialization task message queue, the second training task corresponding to the task information in the running task message queue and the third training task corresponding to the task information in the ending task message queue; and to determine the running conditions of the plurality of tasks to be trained according to the first training task, the second training task and the third training task.
It can be understood that, from the task information in the initialization task message queue, the running task message queue and the ending task message queue, the different task states of the tasks to be trained corresponding to that task information can be identified, and the running conditions of the plurality of tasks to be trained can then be determined. For example, suppose the task id 001 is allocated to a task to be trained after it is successfully scheduled; finding task information 001 in the running task message queue shows that task 001, i.e. the second training task numbered 001, is being trained in its training container and is generating a model.
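For illustration, the lookup below determines a task's running condition from which queue currently holds its task information, assuming each queue can be snapshotted as a set of task ids; the wording of the returned strings is only descriptive.

def running_condition(task_id: str, init_ids: set, running_ids: set, ended_ids: set) -> str:
    """Map a task id to its running condition based on which queue holds it."""
    if task_id in running_ids:
        return "running: the training container is training and generating a model"
    if task_id in init_ids:
        return "initializing: the training environment and resources are being prepared"
    if task_id in ended_ids:
        return "ended: training has finished or failed"
    return "unknown"

print(running_condition("001", set(), {"001"}, set()))  # task 001 is in the running task message queue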
After the plurality of tasks to be trained created by the cloud platform and their task information have been acquired, resource verification needs to be performed on the task information; this verification specifically comprises a first resource check and a second resource check. The obtaining module 50 is further configured to acquire, according to the task information, the target resources required for training a task to be trained; to compare the size of the target resources with the size of the remaining resources of the server resource group corresponding to the task to be trained; and to determine that the task to be trained passes the first resource check when the remaining resources are larger than the target resources.
Optionally, the obtaining module 50 is further configured to obtain user group quota information in the task information, where the user group quota information is used to indicate the resources required for training the task to be trained; and to acquire the target resources according to the user group quota information.
It should be noted that the server resource group corresponding to the task to be trained is obtained by adding the deployed server cluster to the container cloud for unified management and dividing and abstracting it into resource groups according to the ML hardware; servers in the same resource group have the same ML hardware. Taking the first resource check on task 001 as an example: after the task information of task 001 is obtained, the user group quota information in the task information is read, which gives the resources required for training task 001, for example 20% of the resource group in which task 001 is located. If the remaining resource ratio of that resource group is greater than 20%, the first resource check passes; if it is less than 20%, the first resource check fails.
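A minimal sketch of this check is shown below, expressing the quota as a fraction of the resource group as in the 20% example; the function name and signature are assumptions.

def first_resource_check(required_ratio: float, remaining_ratio: float) -> bool:
    """Pass when the resource group's remaining share exceeds what the task needs."""
    return remaining_ratio > required_ratio

print(first_resource_check(0.20, 0.35))  # True: task 001 needs 20%, 35% of the group is free
print(first_resource_check(0.20, 0.10))  # False: only 10% of the group remains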
The obtaining module 50 is further configured to acquire, according to the task information, the number of servers required for training the task to be trained, before or after the task to be trained passes the first resource check; to compare this number of servers with the number of available servers in the server resource group corresponding to the task to be trained; and to determine that the task to be trained passes the second resource check when the number of available servers is larger than the number of required servers.
Optionally, the obtaining module 50 is further configured to obtain server number information in the task information, where the server number information is used to indicate the number of servers required for training the task to be trained; and to determine the server number information as the number of servers required for training the task to be trained.
It can be understood that, similarly to the first resource check and again taking task 001 as an example, once task 001 passes the first resource check, the number of servers it requires, for example 3, is obtained from its task information; if the number of remaining servers in the server resource group where task 001 is located is greater than 3, task 001 passes the second resource check.
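Analogously, a sketch of the second resource check with the strict comparison described above; names are assumptions.

def second_resource_check(required_servers: int, available_servers: int) -> bool:
    """Pass when more servers are available than the task requires."""
    return available_servers > required_servers

print(second_resource_check(3, 5))  # True: task 001 needs 3 servers and 5 are available
print(second_resource_check(3, 3))  # False under the strict greater-than comparison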
It should be noted that, during resource verification, a task may fail the first resource check and/or the second resource check; in this case, its task information may be added to the waiting task dictionary and the waiting reason dictionary of the in-memory database to wait for the next resource check.
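A hedged sketch of parking such a task follows, assuming the in-memory database is a key-value store such as Redis; the dictionary names mirror the description above, but the storage layout is an assumption.

import json
import redis

r = redis.Redis()  # assumed in-memory database client

def park_waiting_task(task_info: dict, reason: str) -> None:
    """Record a task that failed resource verification until the next check."""
    task_id = task_info["task_id"]
    r.hset("waiting_task_dict", task_id, json.dumps(task_info))
    r.hset("waiting_reason_dict", task_id, reason)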
Optionally, the obtaining module 50 is further configured to determine, when the task to be trained passes the first resource check and the second resource check, whether a training initiation script exists in the task information, where the training initiation script is used to initiate the process for training the task to be trained; and, when the training initiation script exists, to package the task information into a target template that the container cloud of the cloud platform is able to process.
Specifically, a compliance check is performed on the task information of a task to be trained that has passed resource verification, i.e. it is checked whether the task information contains a training initiation script for starting the process that trains the task; if the script exists, the task information is packaged into a target template that the container cloud can process, i.e. a K8s standard template. If the compliance check fails, the pushing of the task information ends directly, and the task information of the task to be trained may optionally be pushed to the ending task message queue.
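A hedged sketch of this packaging step is given below; the structure shown is one plausible shape of a K8s Job template, and the specific fields and image name are assumptions rather than the template actually used by the container cloud.

def package_as_k8s_template(task_info: dict):
    """Compliance check followed by packaging into a K8s-style Job description."""
    if not task_info.get("start_script"):     # compliance check: the script must exist
        return None                           # caller may push the task to the ending task message queue
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "train-" + task_info["task_id"]},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "training",
                        "image": task_info.get("image", "training:latest"),
                        "command": ["bash", task_info["start_script"]],
                    }],
                    "restartPolicy": "Never",
                }
            }
        },
    }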
Optionally, the obtaining module 50 is further configured to acquire the task identifier of the task to be trained that the container cloud generates according to the target template; and to query the running state in the container cloud according to the task identifier.
It can be understood that after the target template is sent to the container cloud, the container cloud generates a unique task ID for each task to be trained; the running state of the task can then be queried according to this task ID, where the task states include the initialization state, the running state and the end state.
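For illustration only, a mapping from container-cloud phases to these three task states is sketched below; the phase names are assumptions loosely modeled on Kubernetes pod phases, not values defined in this application.

def to_task_state(phase: str) -> str:
    """Collapse a container-cloud phase into one of the three task states."""
    if phase in ("Pending", "ContainerCreating"):
        return "initialization state"
    if phase == "Running":
        return "running state"
    return "end state"  # e.g. Succeeded, Failed or deleted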
Optionally, the different message queues include at least one of: a priority queue, a task message queue to be scheduled, an initialization task message queue, a running task message queue and an ending task message queue.
Specifically, as shown in fig. 4, the different message queues include the priority queue, the task message queue to be scheduled, the initialization task message queue, the running task message queue and the ending task message queue, and the processing nodes corresponding to them are, in sequence: a resource check processing node, a task scheduling processing node, a task initialization processing node, a task running processing node and a task ending processing node.
The resource check processing node performs resource checking on each resource group priority queue (that is, in fig. 4 the resource check processing node consumes each resource group priority queue): a task to be trained that fails the resource check waits for the next resource check, while a task that passes enters the task message queue to be scheduled. The task scheduling processing node consumes the message queue to be scheduled and first performs a compliance check on the task information: if the compliance check passes, the task to be trained is scheduled and enters the initialization task message queue; if the compliance check fails or scheduling fails, the task enters the ending task message queue directly. The task initialization processing node consumes the initialization task message queue, performs the training preparation work for the task to be trained and queries its state: if the task state is failed or ended, the task enters the ending task message queue; if the task state is running, it enters the running task message queue. The processing node in task operation consumes the running task message queue, carries out the training work of the task to be trained and queries the task state: if the task has ended or failed, it enters the ending task message queue. The task ending processing node consumes the ending task message queue, obtains the failure reason of the task from the container cloud by task id and, if there is one, records it in the relational database; it then schedules the container cloud to delete the corresponding task, records the task state and resource usage in the relational database, and finally deletes the resource group usage corresponding to the task from the in-memory database.
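As a compact illustration of this flow, the routing table below paraphrases which queue a task moves to after each processing node; the node and queue names follow the description above, while the outcome labels are assumptions chosen for readability.

ROUTES = {
    ("resource check node", "check passed"): "task message queue to be scheduled",
    ("resource check node", "check failed"): "wait for the next resource check",
    ("task scheduling node", "compliance passed and scheduled"): "initialization task message queue",
    ("task scheduling node", "compliance failed or scheduling failed"): "ending task message queue",
    ("task initialization node", "state is running"): "running task message queue",
    ("task initialization node", "state is failed or ended"): "ending task message queue",
    ("node in task operation", "state is ended or failed"): "ending task message queue",
}

def next_queue(node: str, outcome: str) -> str:
    """Return the queue a task moves to, or keep it where it is until the next traversal."""
    return ROUTES.get((node, outcome), "stay in the current queue until the next traversal")

print(next_queue("task initialization node", "state is running"))  # -> running task message queue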
It should be noted that after the task initialization node and the node in task operation have successfully processed a task to be trained, they query the task state from the container cloud and update the state of the task to be trained accordingly.
An embodiment of the present invention also provides a storage medium including a stored program, wherein, when the program runs, the method of any one of the above embodiments is performed.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store program code for executing the steps of:
s1, acquiring a plurality of tasks to be trained and task information of the tasks to be trained, which are created by a cloud platform;
s2, under the condition that the container cloud successfully dispatches the tasks to be trained according to the task information, acquiring running states corresponding to the tasks to be trained respectively, and pushing task information corresponding to the tasks to be trained to message queues corresponding to the running states of the tasks to be trained respectively.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a usb disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing a computer program.
For specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations; details are not repeated here.
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Alternatively, in this embodiment, the above-mentioned processor may be configured to execute the following steps by a computer program:
s1, acquiring a plurality of tasks to be trained and task information of the tasks to be trained, which are created by a cloud platform;
s2, under the condition that the container cloud successfully dispatches the tasks to be trained according to the task information, acquiring running states corresponding to the tasks to be trained respectively, and pushing task information corresponding to the tasks to be trained to message queues corresponding to the running states of the tasks to be trained respectively.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations; details are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general-purpose computing system; they may be centralized in a single computing system or distributed across a network of computing systems; and they may be implemented in program code executable by a computing system, so that they are stored in a memory system and, in some cases, executed in an order different from that shown or described, or they may each be implemented as an individual integrated circuit module, or multiple modules or steps among them may be implemented as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A task information pushing method, characterized by comprising the following steps:
acquiring a plurality of tasks to be trained and task information of the tasks to be trained, which are created by a cloud platform;
and under the condition that the container cloud successfully dispatches the tasks to be trained according to the task information, acquiring running states corresponding to the tasks to be trained respectively, and pushing task information corresponding to the tasks to be trained to message queues corresponding to the running states of the tasks to be trained respectively.
2. The method for pushing task information according to claim 1, wherein pushing task information corresponding to the plurality of tasks to be trained to message queues corresponding to running states of the plurality of tasks to be trained, respectively, includes:
determining a first training task with an operating state being an initialization state from the plurality of tasks to be trained, wherein the initialization state is used for indicating that a training container for training the first training task is not created;
acquiring task information of the first training task, and pushing the task information of the first training task to an initialized task message queue.
3. The method for pushing task information according to claim 1, wherein pushing task information corresponding to the plurality of tasks to be trained to message queues corresponding to running states of the plurality of tasks to be trained, respectively, includes:
determining a second training task with an operating state being an in-running state from the plurality of tasks to be trained, wherein the in-running state is used for indicating that a training container for training the second training task is created;
and acquiring the task information of the second training task, and pushing the task information of the second training task to a running task message queue.
4. The method for pushing task information according to claim 3, wherein after pushing the task information of the second training task to the running task message queue, the method further comprises: acquiring a training initiation script in the task information of the second training task;
training the second training task in a training container of the second training task according to the training initiation script.
5. The method for pushing task information according to claim 1, wherein pushing task information corresponding to the plurality of tasks to be trained to message queues corresponding to running states of the plurality of tasks to be trained, respectively, includes:
determining a third training task with an operating state being an end state from the plurality of tasks to be trained;
and acquiring the task information of the third training task, and pushing the task information of the third training task to an ending task message queue.
6. The method for pushing task information according to claim 1, wherein after pushing task information corresponding to the plurality of tasks to be trained to message queues corresponding to running states of the plurality of tasks to be trained, the method further comprises:
in a case that the message queues comprise at least one of an initialization task message queue, a running task message queue and an ending task message queue, determining a first training task corresponding to task information in the initialization task message queue, a second training task corresponding to task information in the running task message queue and a third training task corresponding to task information in the ending task message queue;
and determining the running conditions of the plurality of tasks to be trained according to the first training task, the second training task and the third training task.
7. The method for pushing task information according to claim 1, wherein after obtaining a plurality of tasks to be trained created by a cloud platform and task information of the tasks to be trained, the method further comprises:
acquiring target resources required by training of the task to be trained according to the task information;
comparing the size of the target resource with the size of the residual resources of the server resource group corresponding to the task to be trained;
and under the condition that the remaining resources are larger than the target resources, determining that the task to be trained passes a first resource check.
8. A pushing device for task information, comprising:
the acquisition module is used for acquiring a plurality of tasks to be trained and task information of the tasks to be trained, which are created by the cloud platform;
and the pushing module is used for acquiring the running states corresponding to the tasks to be trained respectively under the condition that the container cloud successfully dispatches the tasks to be trained according to the task information, and pushing the task information corresponding to the tasks to be trained to the message queues corresponding to the running states of the tasks to be trained respectively.
9. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein, when the program runs, the method according to any one of claims 1 to 7 is performed.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the method according to any of the claims 1 to 7 by means of the computer program.