CN111882056B - Deep learning training task management and control method and device based on copy mode - Google Patents
- Publication number
- CN111882056B (application number CN202010566486.0A)
- Authority
- CN
- China
- Prior art keywords
- training task
- deep learning
- learning training
- copy
- monitoring module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention provides a deep learning training task management and control method based on a copy mode, which comprises the following steps: binding a deep learning training task input by a user with a training task copy; the monitoring module monitors the training task copy and sends a control instruction according to the operation input by the user; the execution module receives the control instruction sent by the monitoring module, processes the deep learning training task according to the control instruction, and returns the processing result to the monitoring module.
Description
Technical Field
The invention relates to the field of deep learning task management, in particular to a deep learning training task management and control method and device based on a copy mode.
Background
Deep learning learns the intrinsic patterns and representation hierarchies of sample data, and the information obtained in the learning process is of great help in interpreting data such as text, images and sound. Its ultimate goal is to give machines the ability to analyze and learn like humans, so that they can recognize data such as text, images and sound. Deep learning has produced many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization, and other related fields. It enables machines to imitate human activities such as seeing, hearing and thinking, solves many complex pattern recognition problems, and has driven great progress in artificial intelligence technology.
As more and more industries adopt deep learning and it plays an increasingly important role, more and more algorithm engineers enter the field. Before an algorithm engineer can carry out deep learning training, he or she must first build a suitable training environment, attending not only to the configuration of that environment but also to management operations on training tasks, such as creating and stopping tasks.
This preparation work often consumes a great deal of the algorithm engineer's time and energy and hinders the efficient execution of deep learning tasks.
Disclosure of Invention
To solve the above problems in the prior art, the invention provides a deep learning training task management and control method and device based on a copy mode, which eliminates the large amounts of time and energy that existing deep learning task training requires and improves the efficiency of deep learning task training.
The invention provides a deep learning training task management and control method based on a copy mode, which comprises the following steps:
binding a deep learning training task input by a user with a training task copy;
the monitoring module monitors the training task copy and sends a control instruction according to the operation input by the user;
the execution module receives the control instruction sent by the monitoring module, processes the deep learning training task according to the control instruction, and returns the processing result to the monitoring module.
Optionally, the binding the deep learning training task input by the user with the training task copy specifically includes:
acquiring deep learning training task parameter information input by a user;
and generating and storing a training task copy of the corresponding training task according to the deep learning training task parameter information input by the user, wherein the training task copy comprises the deep learning training task parameter information input by the user.
Further, the deep learning training task parameter information input by the user comprises: the number of iterations, the training framework, the batch size, the number of CPUs to use, and the number of GPUs to use.
Optionally, the monitoring module monitors the training task copy, and sending the control instruction according to the operation input by the user includes:
the monitoring module monitors the training task copy in real time, acquires a user deep learning training task creation instruction in the training task copy, and sends the training task copy and a control instruction for creating the deep learning training task;
the monitoring module monitors the training task copies in real time, acquires a sequencing instruction and latest sequencing information of deep learning training tasks in the training task copies, updates the sequencing information in the corresponding task training copies, and sends control instructions corresponding to the training task copies and the sequencing of the deep learning training tasks;
the monitoring module monitors the training task copies in real time, acquires a user deep learning training task deleting instruction in the training task copies, and sends a control instruction corresponding to the training task copies and the deep learning training task deleting.
Further, the step that the execution module receives the control instruction sent by the monitoring module, processes the deep learning training task according to the control instruction, and returns the processing result to the monitoring module specifically comprises:
the execution module receives the training task copy sent by the monitoring module and the control instruction created by the deep learning training task, creates a corresponding deep learning training task according to the control instruction created by the deep learning training task and the training task copy, and returns the created result to the user through the monitoring module;
the execution module receives the corresponding training task copies and the control instructions corresponding to the deep learning training task sequences sent by the monitoring module, reorders the corresponding deep learning training tasks according to the control instructions corresponding to the deep learning training task sequences and the corresponding training task copies, and returns the sequencing results to the user through the monitoring module;
the execution module receives the control instruction for deleting the corresponding training task copy and the deep learning training task sent by the monitoring module, deletes the corresponding deep learning training task and the corresponding training task copy according to the control instruction for deleting the deep learning training task and the corresponding training task copy, and returns the deletion result to the user through the monitoring module.
Further, reordering the corresponding deep learning training tasks according to the control instruction corresponding to the deep learning training task sequencing and the corresponding training task copy is specifically: recreating the corresponding deep learning training tasks one by one, in order, according to the latest sequencing information in the control instruction corresponding to the deep learning training task sequencing and the corresponding training task copy.
The invention provides a deep learning training task management and control device based on a copy mode, which comprises:
the binding unit is used for binding the deep learning training task input by the user with the training task copy;
the monitoring unit is used for monitoring the training task copy through the monitoring module and sending a control instruction according to the operation input by the user;
and the execution unit is used for receiving, through the execution module, the control instruction sent by the monitoring module, processing the deep learning training task according to the control instruction, and returning the processing result to the monitoring module.
Optionally, the binding unit specifically includes:
the acquisition subunit acquires deep learning training task parameter information input by a user;
and the generating subunit is used for generating and storing a training task copy of the corresponding training task according to the deep learning training task parameter information input by the user, wherein the training task copy comprises the deep learning training task parameter information input by the user.
Optionally, the monitoring unit comprises:
the first monitoring subunit: the monitoring module monitors the training task copy in real time, acquires a user deep learning training task creation instruction in the training task copy, and sends the training task copy and a control instruction for creating the deep learning training task;
the second monitoring subunit: the monitoring module monitors the training task copy in real time, acquires the sequencing instruction and the latest sequencing information of the deep learning training tasks in the training task copy, updates the sequencing information in the corresponding training task copy, and sends the control instruction corresponding to the training task copy and the sequencing of the deep learning training tasks;
and the third monitoring subunit: the monitoring module monitors the training task copy in real time, acquires a user deep learning training task deletion instruction in the training task copy, and sends a control instruction corresponding to the training task copy and the deletion of the deep learning training task.
Further, the execution unit specifically includes:
the first execution subunit is used for receiving the training task copy sent by the monitoring module and the control instruction created by the deep learning training task, creating a corresponding deep learning training task according to the control instruction created by the deep learning training task and the training task copy, and returning the created result to the user through the monitoring module;
the second execution subunit receives the corresponding training task copies and the control instructions corresponding to the deep learning training task sequences sent by the monitoring module, reorders the corresponding deep learning training tasks according to the control instructions corresponding to the deep learning training task sequences and the corresponding training task copies, and returns the sequencing results to the user through the monitoring module;
and the third execution subunit receives the control instruction for deleting the corresponding training task copy and the deep learning training task sent by the monitoring module, deletes the corresponding deep learning training task and the corresponding training task copy according to that control instruction, and returns the deletion result to the user through the monitoring module.
The technical scheme adopted by the invention comprises the following technical effects:
1. The invention eliminates the large amounts of time and energy that existing deep learning task training requires, and improves the efficiency of deep learning task training.
2. It helps users better manage their own deep learning training tasks and reduces the difficulty of managing them.
3. The method binds a deep learning training task to a training task copy: the user interacts only with the copy, the monitoring module monitors the copy's information, and the execution module is invoked to operate on the actual underlying training task as described by the copy. In this way the user's operations on a training task are decoupled from the operations on the underlying training task: the user only reports an operation, and the monitoring module and execution module handle it. This makes the life-cycle management logic of the whole training task clearer, reduces system coupling, and increases system extensibility.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; obviously, those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic flow diagram of a process embodying aspects of the present invention;
FIG. 2 is a diagram of the interaction between the user and the underlying layer in the first embodiment of the present invention;
fig. 3 is a schematic flow chart illustrating step S1 in a method according to an embodiment of the present invention;
fig. 4 is a schematic flow chart illustrating step S2 in a method according to an embodiment of the present invention;
fig. 5 is a schematic flow chart illustrating step S3 in a method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus according to a second embodiment of the present invention;
fig. 7 is a schematic structural diagram of a binding unit in a second apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a monitoring unit in a second apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an execution unit in a second apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to clearly explain the technical features of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.
Example one
As shown in fig. 1-2, the present invention provides a deep learning training task management and control method based on a replica mode, including:
s1, binding the deep learning training task input by the user with the training task copy;
s2, the monitoring module monitors the training task copy and sends a control instruction according to the operation input by the user;
and S3, the execution module receives the control instruction sent by the monitoring module, processes the deep learning training task according to the control instruction, and returns the processing result to the monitoring module.
As shown in fig. 3, step S1 specifically includes:
s11, acquiring the deep learning training task parameter information input by the user;
and S12, generating and storing a training task copy of the corresponding training task according to the deep learning training task parameter information input by the user, wherein the training task copy comprises the deep learning training task parameter information input by the user.
In step S11, the deep learning training task parameter information input by the user may include: the number of iterations, the training framework, the batch size, the number of CPUs to use, and the number of GPUs to use.
In step S12, the deep learning training task parameter information input by the user is written into a corresponding training task copy (duplicate). The training task copy stores the full training task information, and the user can change the information in the copy; the generated training task copy of the corresponding training task is stored in system memory.
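The training task copy described in step S12 can be sketched as a simple in-memory record holding the full parameter information. All class, field and variable names below are illustrative assumptions, not part of the patent:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TrainingTaskCopy:
    """In-memory copy (duplicate) storing the full training task information."""
    task_id: str
    iterations: int        # number of iterations
    framework: str         # training framework, e.g. "pytorch"
    batch_size: int        # batch size
    cpu_count: int         # number of CPUs to use
    gpu_count: int         # number of GPUs to use
    pending_instruction: Optional[str] = None  # e.g. "create", "sort", "delete"

# Generated copies are kept in system memory, keyed by task id.
copy_store: Dict[str, TrainingTaskCopy] = {}

task_copy = TrainingTaskCopy("task-1", iterations=1000, framework="pytorch",
                             batch_size=32, cpu_count=4, gpu_count=1)
copy_store[task_copy.task_id] = task_copy
```

The user edits only this record; the `pending_instruction` field is one possible way for the monitoring module to discover what the user has requested.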
Further, as shown in fig. 4, step S2 includes:
s21, the monitoring module (listener) monitors the training task copy in real time, acquires a user deep learning training task creating instruction in the training task copy, and sends the training task copy and a control instruction created by the deep learning training task;
s22, the monitoring module monitors the training task copies in real time, acquires the sequencing instruction and the latest sequencing information of the deep learning training tasks in the training task copies, updates the sequencing information in the corresponding task training copies, and sends the control instruction corresponding to the training task copies and the sequencing of the deep learning training tasks;
s23, the monitoring module monitors the training task copy in real time, acquires a user deep learning training task deleting instruction in the training task copy, and sends a control instruction corresponding to the training task copy and the deep learning training task deleting.
Further, as shown in fig. 5, step S3 specifically includes:
s31, the execution module receives the training task copy sent by the monitoring module and the control instruction created by the deep learning training task, creates a corresponding deep learning training task according to the control instruction created by the deep learning training task and the training task copy, and returns the created result to the user through the monitoring module;
s32, the execution module receives the corresponding training task copies and the control instructions corresponding to the deep learning training task sequences sent by the monitoring module, reorders the corresponding deep learning training tasks according to the control instructions corresponding to the deep learning training task sequences and the corresponding training task copies, and returns the sequencing results to the user through the monitoring module;
and S33, the execution module receives the control instruction for deleting the corresponding training task copy and the deep learning training task sent by the monitoring module, deletes the corresponding deep learning training task and the corresponding training task copy according to the control instruction for deleting the deep learning training task and the corresponding training task copy, and returns the deletion result to the user through the monitoring module.
In step S32, the specific step of reordering the deep learning training tasks according to the control instruction corresponding to the deep learning training task ordering and the corresponding training task copy is: and according to the latest sequencing information in the control instruction corresponding to the deep learning training task sequencing and the corresponding training task copy, sequentially recreating the corresponding deep learning training tasks according to the sequence.
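The reordering rule just described, recreating the corresponding tasks one by one following the latest sequencing information, can be sketched as follows; the function names and the ordering-map representation are assumptions:

```python
def reorder_tasks(task_ids, latest_order, recreate):
    """Recreate tasks sequentially according to the latest ordering information.

    latest_order maps task id -> desired position (0-based); recreate is the
    callable that rebuilds one underlying training task.
    """
    for task_id in sorted(task_ids, key=lambda t: latest_order[t]):
        recreate(task_id)

created = []
reorder_tasks(["a", "b", "c"], {"a": 2, "b": 0, "c": 1}, created.append)
# Tasks are recreated in the new order: b, c, a.
```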
In steps S31-S33, the execution module returns the creation, sequencing and deletion results to the user through the monitoring module as follows: the execution module returns the execution result to the monitoring module, and the monitoring module returns the received execution result to the user by way of a callback.
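The result path described above, where the execution module reports to the monitoring module and the monitoring module forwards the result to the user via a callback, can be sketched like this; the class and method names are hypothetical:

```python
class MonitoringModule:
    def __init__(self, user_callback):
        self._user_callback = user_callback  # invoked when a result arrives

    def on_execution_result(self, result):
        # Callback mode: forward the execution module's result to the user.
        self._user_callback(result)

class ExecutionModule:
    def __init__(self, monitor):
        self._monitor = monitor

    def create_task(self, task_copy):
        # Create the underlying training task (elided), then report back.
        result = {"task": task_copy["task_id"], "status": "created"}
        self._monitor.on_execution_result(result)

received = []
monitor = MonitoringModule(received.append)
ExecutionModule(monitor).create_task({"task_id": "task-1"})
```

The same reporting path would be reused for sequencing and deletion results.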
It should be noted that, in the technical solution of the present invention, both the monitoring module and the execution module can be implemented in software whose logic corresponds to the steps and functions described above; they may also be implemented in other ways, which are not limited herein.
The invention eliminates the large amounts of time and energy that existing deep learning task training requires, and improves the efficiency of deep learning task training.
The invention helps users better manage their own deep learning training tasks and reduces the difficulty of managing them.
The method binds a deep learning training task to a training task copy: the user interacts only with the copy, the monitoring module monitors the copy's information, and the execution module is invoked to operate on the actual underlying training task as described by the copy. In this way the user's operations on a training task are decoupled from the operations on the underlying training task: the user only reports an operation, and the monitoring module and execution module handle it. This makes the life-cycle management logic of the whole training task clearer, reduces system coupling, and increases system extensibility.
Example two
As shown in fig. 6, the technical solution of the present invention further provides a deep learning training task management and control device based on a replica mode, including:
the binding unit 101 binds a deep learning training task input by a user with a training task copy;
the monitoring unit 102 is used for monitoring the training task copy by a monitoring module and sending a control instruction according to the operation input by the user;
and the execution unit 103 is used for receiving the control instruction sent by the monitoring module, processing the deep learning training task according to the control instruction and returning a processing result to the monitoring module.
As shown in fig. 7, the binding unit 101 specifically includes:
an acquiring subunit 1011, acquiring deep learning training task parameter information input by a user;
the generating subunit 1012, according to the deep learning training task parameter information input by the user, generates and stores a training task copy of a corresponding training task, where the training task copy includes the deep learning training task parameter information input by the user.
Further, as shown in fig. 8, the monitoring unit 102 includes:
the first monitoring subunit 1021, the monitoring module monitors the training task copy in real time, acquires a user deep learning training task creating instruction in the training task copy, and sends the training task copy and a control instruction for creating the deep learning training task;
the second monitoring subunit 1022, the monitoring module monitors the training task copy in real time, obtains the ranking instruction and the latest ranking information of the deep learning training tasks in the training task copy, updates the ranking information in the training copy of the corresponding task, and sends the control instruction corresponding to the training task copy and the ranking of the deep learning training tasks;
and the third monitoring subunit 1023 is used for monitoring the training task copies by the monitoring module in real time, acquiring a user deep learning training task deletion instruction in the training task copies, and sending a control instruction corresponding to the training task copies and the deep learning training task deletion.
Further, as shown in fig. 9, the execution unit 103 specifically includes:
the first execution subunit 1031, where the execution module receives the training task copy sent by the monitoring module and the control instruction created by the deep learning training task, creates a corresponding deep learning training task according to the control instruction created by the deep learning training task and the training task copy, and returns the creation result to the user through the monitoring module;
the second execution subunit 1032 receives the corresponding training task copy and the control instruction corresponding to the deep learning training task ranking sent by the monitoring module, reorders the corresponding deep learning training task according to the control instruction corresponding to the deep learning training task ranking and the corresponding training task copy, and returns a ranking result to the user through the monitoring module;
the third execution subunit 1033, where the execution module receives the control instruction for deleting the corresponding training task copy and the deep learning training task sent by the monitoring module, deletes the corresponding deep learning training task and the corresponding training task copy according to the control instruction for deleting the deep learning training task and the corresponding training task copy, and returns the deletion result to the user through the monitoring module.
The invention eliminates the large amounts of time and energy that existing deep learning task training requires, and improves the efficiency of deep learning task training.
The invention helps users better manage their own deep learning training tasks and reduces the difficulty of managing them.
The device binds a deep learning training task to a training task copy: the user interacts only with the copy, the monitoring module monitors the copy's information, and the execution module is invoked to operate on the actual underlying training task as described by the copy. In this way the user's operations on a training task are decoupled from the operations on the underlying training task: the user only reports an operation, and the monitoring module and execution module handle it. This makes the life-cycle management logic of the whole training task clearer, reduces system coupling, and increases system extensibility.
Although embodiments of the present invention have been described with reference to the accompanying drawings, they do not limit the scope of the invention; those skilled in the art should understand that various modifications and variations made on the basis of the technical solution of the invention without creative effort still fall within its scope.
Claims (6)
1. A deep learning training task management and control method based on a copy mode is characterized by comprising the following steps:
binding a deep learning training task input by a user with a training task copy;
the monitoring module monitors the training task copy and sends a control instruction according to the operation input by the user; the monitoring module monitors the training task copies and sends control instructions according to the operation input by the user, wherein the control instructions comprise:
the monitoring module monitors the training task copy in real time, acquires a user deep learning training task creation instruction in the training task copy, and sends the training task copy and a control instruction for creating the deep learning training task;
the monitoring module monitors the training task copies in real time, acquires a sequencing instruction and latest sequencing information of deep learning training tasks in the training task copies, updates the sequencing information in the corresponding task training copies, and sends the corresponding training task copies and control instructions corresponding to the sequencing of the deep learning training tasks;
the monitoring module monitors the training task copies in real time, acquires a user deep learning training task deleting instruction in the training task copies, and sends a control instruction corresponding to the training task copies and the deep learning training task deleting;
the execution module receives the control instruction sent by the monitoring module, processes the deep learning training task according to the control instruction, and returns a processing result to the monitoring module; the step of the execution module receiving the control instruction sent by the monitoring module, processing the deep learning training task according to the control instruction, and returning the processing result to the monitoring module specifically comprises:
the execution module receives the training task copy and the control instruction for creating the deep learning training task sent by the monitoring module, creates the corresponding deep learning training task according to the control instruction for creating the deep learning training task and the training task copy, and returns the creation result to the user through the monitoring module;
the execution module receives the corresponding training task copies and the control instruction corresponding to the sequencing of the deep learning training tasks sent by the monitoring module, reorders the corresponding deep learning training tasks according to the control instruction corresponding to the sequencing of the deep learning training tasks and the corresponding training task copies, and returns the sequencing result to the user through the monitoring module;
the execution module receives the corresponding training task copy and the control instruction for deleting the deep learning training task sent by the monitoring module, deletes the corresponding deep learning training task and the corresponding training task copy according to the control instruction for deleting the deep learning training task and the corresponding training task copy, and returns the deletion result to the user through the monitoring module.
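The monitor/execute split of claim 1 can be illustrated with a minimal Python sketch. All names here (`TaskCopy`, `Monitor`, `Executor`, the `pending_op` field, the queue-based handoff) are assumptions for illustration; the patent does not prescribe an implementation.

```python
import queue

class TaskCopy:
    """A training task copy bound to a user's deep learning training task."""
    def __init__(self, task_id, params):
        self.task_id = task_id      # identifier of the bound training task
        self.params = params        # user-supplied parameter information
        self.pending_op = None      # pending operation: "create" / "sort" / "delete"

class Monitor:
    """Watches task copies and forwards control instructions to the executor."""
    def __init__(self, instruction_queue):
        self.queue = instruction_queue

    def poll(self, copies):
        for copy in copies:
            if copy.pending_op is not None:
                op = copy.pending_op
                # Send the copy together with the matching control instruction.
                self.queue.put((copy, op))
                copy.pending_op = None

class Executor:
    """Processes control instructions and reports results back via the monitor."""
    def __init__(self, instruction_queue):
        self.queue = instruction_queue
        self.tasks = {}             # currently managed training tasks

    def run_once(self):
        copy, op = self.queue.get()
        if op == "create":
            self.tasks[copy.task_id] = copy.params
        elif op == "delete":
            self.tasks.pop(copy.task_id, None)
        return copy.task_id, op     # result returned through the monitor
```

A user operation marks a copy with a pending operation; the monitor forwards it, and the executor applies it to the managed task set.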
2. The deep learning training task management and control method based on the copy mode as claimed in claim 1, wherein the binding of the deep learning training task input by the user with the training task copy specifically comprises:
acquiring deep learning training task parameter information input by a user;
and generating and storing a training task copy of the corresponding training task according to the deep learning training task parameter information input by the user, wherein the training task copy comprises the deep learning training task parameter information input by the user.
3. The deep learning training task management and control method based on the copy mode as claimed in claim 2, wherein the deep learning training task parameter information input by the user comprises: the number of iterations, the training framework, the batch number, the number of CPUs used, and the number of GPUs used.
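The parameter information enumerated in claim 3 could be modeled as a small record from which a training task copy is generated and stored (claim 2). The field names below are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass, asdict

@dataclass
class TrainingTaskParams:
    """User-input parameter information for a deep learning training task."""
    iterations: int      # number of iterations
    framework: str       # training framework, e.g. "tensorflow"
    batch_size: int      # batch number
    cpu_count: int       # number of CPUs to use
    gpu_count: int       # number of GPUs to use

def make_task_copy(task_id: str, params: TrainingTaskParams) -> dict:
    """Generate a training task copy containing the user's parameter information."""
    return {"task_id": task_id, "params": asdict(params)}
```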
4. The deep learning training task management and control method based on the copy mode as claimed in claim 1, wherein the reordering of the corresponding deep learning training tasks according to the control instruction corresponding to the sequencing of the deep learning training tasks and the corresponding training task copy specifically comprises: sequentially recreating the corresponding deep learning training tasks in order according to the latest sequencing information in the control instruction corresponding to the sequencing of the deep learning training tasks and the corresponding training task copy.
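Claim 4's recreate-in-sequence reordering might look like the following minimal sketch. The dict-based copy store and the `sequence` field written into each copy are assumptions.

```python
def reorder_tasks(copies: dict, latest_order: list) -> list:
    """Recreate deep learning training tasks in the newly requested sequence.

    copies: mapping of task_id -> training task copy (a dict)
    latest_order: task_ids in the latest sequencing order
    """
    recreated = []
    for task_id in latest_order:
        copy = copies[task_id]
        copy["sequence"] = len(recreated)  # update sequencing info in the copy
        recreated.append(task_id)          # recreate tasks one by one, in order
    return recreated
```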
5. A deep learning training task management and control device based on a copy mode, characterized by comprising:
the binding unit is used for binding the deep learning training task input by the user with the training task copy;
the monitoring unit is used for monitoring the training task copies and sending control instructions according to operations input by the user; the monitoring unit specifically comprises:
the monitoring module monitors the training task copies in real time, acquires a user's deep learning training task creation instruction from the training task copies, and sends the training task copy and the control instruction for creating the deep learning training task;
the monitoring module monitors the training task copies in real time, acquires the sequencing instruction and the latest sequencing information for the deep learning training tasks in the training task copies, updates the sequencing information in the corresponding training task copies, and sends the corresponding training task copies and the control instruction corresponding to the sequencing of the deep learning training tasks;
the monitoring module monitors the training task copies in real time, acquires a user's deep learning training task deletion instruction from the training task copies, and sends the corresponding training task copy and the control instruction for deleting the deep learning training task;
the execution unit is used for receiving the control instruction sent by the monitoring module, processing the deep learning training task according to the control instruction and returning a processing result to the monitoring module; the execution unit specifically includes:
the first execution subunit is used for receiving the training task copy sent by the monitoring module and the control instruction created by the deep learning training task, creating a corresponding deep learning training task according to the control instruction created by the deep learning training task and the training task copy, and returning the created result to the user through the monitoring module;
the second execution subunit is used for receiving the corresponding training task copies and the control instruction corresponding to the sequencing of the deep learning training tasks sent by the monitoring module, reordering the corresponding deep learning training tasks according to the control instruction corresponding to the sequencing of the deep learning training tasks and the corresponding training task copies, and returning the sequencing result to the user through the monitoring module;
and the third execution subunit is used for receiving the corresponding training task copy and the control instruction for deleting the deep learning training task sent by the monitoring module, deleting the corresponding deep learning training task and the corresponding training task copy according to the control instruction for deleting the deep learning training task and the corresponding training task copy, and returning the deletion result to the user through the monitoring module.
6. The deep learning training task management and control device based on the copy mode as claimed in claim 5, wherein the binding unit specifically comprises:
the acquisition subunit is used for acquiring deep learning training task parameter information input by the user;
and the generating subunit is used for generating and storing a training task copy of the corresponding training task according to the deep learning training task parameter information input by the user, wherein the training task copy comprises the deep learning training task parameter information input by the user.
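The binding unit of claim 6, with its acquisition and generation subunits, could be sketched as below. The in-memory store and the list of parameter keys are assumptions for illustration.

```python
class BindingUnit:
    """Binds a user's deep learning training task to a stored training task copy."""

    PARAM_KEYS = ("iterations", "framework", "batch_size", "cpu_count", "gpu_count")

    def __init__(self):
        self.store = {}  # stored training task copies, keyed by task id

    def acquire(self, raw_input: dict) -> dict:
        """Acquisition subunit: extract the parameter information from user input."""
        return {k: raw_input[k] for k in self.PARAM_KEYS}

    def generate(self, task_id: str, raw_input: dict) -> dict:
        """Generation subunit: build and store the training task copy."""
        params = self.acquire(raw_input)
        copy = {"task_id": task_id, "params": params}
        self.store[task_id] = copy
        return copy
```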
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010566486.0A CN111882056B (en) | 2020-06-19 | 2020-06-19 | Deep learning training task management and control method and device based on copy mode |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111882056A (en) | 2020-11-03 |
CN111882056B (en) | 2022-07-08 |
Family
ID=73157269
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010566486.0A Active CN111882056B (en) | 2020-06-19 | 2020-06-19 | Deep learning training task management and control method and device based on copy mode |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111882056B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109634736A (en) * | 2018-12-20 | 2019-04-16 | 郑州云海信息技术有限公司 | Data training method and device in deep learning system |
CN110389834A (en) * | 2019-06-28 | 2019-10-29 | 苏州浪潮智能科技有限公司 | A kind of method and apparatus for submitting deep learning training mission |
CN110647999A (en) * | 2019-08-23 | 2020-01-03 | 苏州浪潮智能科技有限公司 | Method and device for improving deep learning training speed based on topological structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||