CN112508191A - Method and device for training deep learning model, electronic equipment and storage medium

Method and device for training deep learning model, electronic equipment and storage medium

Info

Publication number
CN112508191A
CN112508191A (application CN202011467358.7A)
Authority
CN
China
Prior art keywords: training, deep learning, learning model, trained, current deep
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011467358.7A
Other languages
Chinese (zh)
Inventor
赵明
韩来鹏
陈阳雪
柳笛
杜艳冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Information Technology Co Ltd
Original Assignee
Beijing Horizon Information Technology Co Ltd
Application filed by Beijing Horizon Information Technology Co Ltd
Priority to CN202011467358.7A
Publication of CN112508191A
Legal status: Pending

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F11/00 Error detection; Error correction; Monitoring
                    • G06F11/30 Monitoring
                        • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
                            • G06F11/3024 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
                        • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
                        • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/044 Recurrent networks, e.g. Hopfield networks
                            • G06N3/045 Combinations of networks
                        • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
                • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present disclosure provides a method and a device for training a deep learning model, an electronic device and a storage medium, and relates to the technical field of deep learning. The method includes the following steps: determining the running state of the current deep learning model to be trained; determining training parameters related to the current deep learning model to be trained; and training the current deep learning model to be trained based on the running state and the training parameters. According to the embodiments of the present application, problems that would affect training are avoided when the current deep learning model to be trained is in different running states, so that training efficiency is improved, the user's time and effort are not wasted, and GPU and CPU resources are not wasted.

Description

Method and device for training deep learning model, electronic equipment and storage medium
Technical Field
The application relates to the technical field of deep learning, in particular to a method and a device for training a deep learning model, electronic equipment and a storage medium.
Background
With the continuous deepening of research, deep learning has achieved great success in fields such as computer vision, speech recognition and text processing, and has brought great convenience to people's lives. However, when a user submits a deep learning model training task to a GPU training platform, the task generally takes hours or even days to complete. Over the whole training process, many problems can occur both before the training runs and while it is running, causing training to be interrupted abnormally or to proceed inefficiently. This wastes not only the user's time and effort but also GPU and CPU resources.
Disclosure of Invention
The present application is proposed to solve the above technical problems. The embodiments of the present application provide a method and a device for training a deep learning model, an electronic device, and a storage medium.
According to an aspect of an embodiment of the present application, there is provided a method for training a deep learning model, including: determining the running state of the current deep learning model to be trained; determining training parameters related to the current deep learning model to be trained; and training the current deep learning model to be trained based on the running state and the training parameters.
According to another aspect of the embodiments of the present application, there is provided an apparatus for training a deep learning model, including: a first determination module configured to determine the running state of the current deep learning model to be trained; a second determination module configured to determine training parameters related to the current deep learning model to be trained; and a training module configured to train the current deep learning model to be trained based on the running state and the training parameters.
According to another aspect of the embodiments of the present application, there is provided a computer-readable storage medium storing a computer program for executing the method for training a deep learning model according to the above embodiments.
According to another aspect of embodiments of the present application, there is provided an electronic device including: a processor; a memory for storing the processor-executable instructions; the processor is configured to execute the method for training the deep learning model according to the above embodiment.
According to the method for training a deep learning model provided by the embodiments of the present application, the running state of the current deep learning model to be trained is determined, the training parameters related to the current deep learning model to be trained are determined, and finally the current deep learning model to be trained is trained based on the running state and the training parameters. In this way, problems that would affect training can be avoided when the current deep learning model to be trained is in different running states, which improves training efficiency and avoids wasting the user's time and effort as well as GPU and CPU resources.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a schematic diagram of an implementation scenario provided in an embodiment of the present application.
Fig. 2 is a flowchart illustrating a method for training a deep learning model according to an exemplary embodiment of the present application.
Fig. 3 is a flowchart illustrating a method for training a deep learning model according to an exemplary embodiment of the present application.
Fig. 4 is a schematic diagram of the training resource utilization information and training log collection process provided by an exemplary embodiment of the present application.
Fig. 5 is a flowchart illustrating a method for training a deep learning model according to an exemplary embodiment of the present application.
Fig. 6 is a flowchart illustrating a method for training a deep learning model according to an exemplary embodiment of the present application.
Fig. 7 is a flowchart illustrating a method for training a deep learning model according to an exemplary embodiment of the present application.
Fig. 8 is a flowchart illustrating a method for training a deep learning model according to an exemplary embodiment of the present application.
Fig. 9 is a flowchart illustrating a method for training a deep learning model according to an exemplary embodiment of the present application.
Fig. 10 is a block diagram of an apparatus for training a deep learning model according to an exemplary embodiment of the present application.
Fig. 11 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
A GPU training platform in the industry generally includes a GPU resource pool, a scheduling system and a storage system. A user submits a deep learning model to the GPU training platform and, depending on the model, training generally takes hours or even days to complete. Many problems can arise while the deep learning model is actually being trained: for example, syntax errors, misspelled data directories and inconsistencies between the training environment and the development environment can seriously reduce training efficiency or even cause training to exit abnormally. A deep learning model that trains inefficiently occupies a large amount of GPU and CPU resources, wasting those resources as well as the user's time and effort.
To solve the above problems, an embodiment of the present disclosure provides a method for training a deep learning model. By determining the running state of the current deep learning model to be trained, determining training parameters related to the current deep learning model to be trained, and finally training the current deep learning model to be trained based on the running state and the training parameters, problems that would affect training in the different running states can be avoided, which improves training efficiency and avoids wasting the user's time and effort as well as GPU and CPU resources.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings.
Exemplary System
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. The implementation environment includes a server 140 and a plurality of terminal devices 110, 120, 130.
The terminals 110, 120, 130 may be mobile terminal devices such as mobile phones, game consoles, tablet computers, cameras, video cameras and in-vehicle computers, or they may be personal computers (PCs) such as laptop and desktop computers. Those skilled in the art will appreciate that the terminals 110, 120, 130 may be of the same or different types, and that there may be more or fewer of them; for example, there may be a single terminal, or dozens or hundreds of terminals, or more. The number and device types of the terminals are not limited in the embodiments of the present application.
The terminals 110, 120, 130 and the server 140 are connected via a communication network. Optionally, the communication network is a wired network or a wireless network.
The server 140 may be a single server, a cluster of multiple servers, a virtualization platform, or a cloud computing service center.
In some optional embodiments, the server 140 receives a deep learning model training request submitted by the terminal 110, 120 or 130, determines the running state of the deep learning model and the training parameters related to the deep learning model, and finally trains the deep learning model based on the running state and the training parameters. However, the embodiments of the present application are not limited to this: in alternative embodiments, the terminal 110, 120 or 130 receives the training request, determines the running state of the deep learning model and the related training parameters, and trains the deep learning model based on them.
With this implementation scenario, problems that would affect training can be avoided when the deep learning model is in different running states, which improves training efficiency and avoids wasting the user's time and effort as well as GPU and CPU resources.
Exemplary method
Fig. 2 is a flowchart illustrating a method for training a deep learning model according to an exemplary embodiment of the present application. The method may be applied to the implementation scenario shown in Fig. 1 and executed by the terminal device 110, 120 or 130 shown in Fig. 1; however, the embodiments of the present disclosure are not limited to this, and the method may also be executed by the server 140. The following exemplary description takes execution by the server as an example. As shown in Fig. 2, the method for training the deep learning model includes the following steps.
S210: Determining the running state of the current deep learning model to be trained.
In an embodiment, the current deep learning model to be trained may be a deep model based on deep learning, such as at least one of a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), a Deep Residual Network (ResNet), or a Dense Convolutional Network (DenseNet). However, the present embodiment is not limited to these.
In an embodiment, the running states of the current deep learning model to be trained together form its life cycle, namely a pre-run state, a running state, and a post-run state.
In an embodiment, the pre-run state refers to queuing and pre-run checks after the current deep learning model to be trained has been submitted to the deep learning platform and before formal training begins. The running state means that the current deep learning model to be trained is being formally trained; it enters the running state after queuing and the pre-run checks are complete and cluster resources have been allocated. The post-run state means that the current deep learning model to be trained has completed training, that is, the run has ended.
However, it should be noted that how the running state is determined is not specifically limited in the embodiments of the present application, and a person skilled in the art may choose differently according to actual requirements. For example, the pre-run, running and post-run states may be distinguished by checking whether a training log or other data is being output, or by checking the GPU and CPU resource utilization rates.
S220: Determining training parameters related to the current deep learning model to be trained.
In one embodiment, the training parameters may be abnormal training parameters or normal training parameters. For example, when the current deep learning model to be trained runs abnormally in the pre-run or running stage, the training parameters may be abnormal training parameters; when the current deep learning model to be trained is in the post-run stage, the training parameters may be normal training parameters. The embodiments of the present application are not limited to this.
However, it should be noted that how to determine the training parameters is not specifically limited in the embodiments of the present application, and a person skilled in the art may make different selections according to actual needs, for example, the training parameters in different operating states may be determined by judging the output condition of the training log or the GPU utilization rate and the CPU utilization rate.
S230: Training the current deep learning model to be trained based on the running state and the training parameters.
In one embodiment, the current deep learning model to be trained is trained according to training parameters in different running states. For example, when the training parameter in the operating state is an abnormal training parameter, the training may be resumed, or the training of the current deep learning model to be trained may be ended. The embodiments of the present application are not limited thereto.
In this way, training the current deep learning model to be trained according to the training parameters in different running states avoids problems that would affect training in those states, which improves training efficiency and avoids wasting the user's time and effort as well as GPU and CPU resources.
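To make the three-step flow of Fig. 2 concrete, the sketch below (Python) dispatches from the running state determined in S210 to state-specific handling in S230. The handler names, the task dictionary and its fields are illustrative assumptions and are not defined by the patent.

    def handle_pre_run(task, params):
        # S510-S550: pre-run check in the reduced pre-run resource pool.
        print("pre-run check:", task["name"], params)

    def handle_running(task, params):
        # Figs. 6-8: monitor utilization, tolerate known faults, auto-stop if needed.
        print("monitor and train:", task["name"], params)

    def handle_post_run(task, params):
        # Fig. 9: classify the end training state and push it to the user.
        print("classify end training state:", task["name"], params)

    HANDLERS = {"pre-run": handle_pre_run,
                "running": handle_running,
                "post-run": handle_post_run}

    def train_current_model(task):
        state = task["run_state"]                  # S210: running state, determined elsewhere
        params = task.get("training_params", {})   # S220: related training parameters
        HANDLERS[state](task, params)              # S230: train according to state and parameters

    train_current_model({"name": "demo-job", "run_state": "pre-run",
                         "training_params": {"learning_rate": 0.1}})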
In another embodiment of the present application, as shown in Fig. 3, step S210 shown in Fig. 2 includes the following.
S310: training resource utilization information and/or training logs are obtained.
In an embodiment, the training resource utilization information refers to GPU utilization and CPU utilization of all working nodes of the deep learning model to be trained currently. The training log usually includes information such as the current number of training rounds, training speed, training precision, and testing precision.
In one embodiment, as shown in Fig. 4, training of the current deep learning model to be trained is performed on three GPU servers (GPU server 410, GPU server 420 and GPU server 430) as an example. When resources are allocated, GPU server 410, GPU server 420 and GPU server 430 each start the training program of the current deep learning model to be trained and at the same time inject a collection agent process (that is, an Agent process). The collection agent processes injected on GPU server 410, GPU server 420 and GPU server 430 collect the training resource utilization information of the working nodes of the current deep learning model to be trained and send it to the training resource collection server 440 (that is, the training resource Metric collection server); at the same time, the collection agent processes monitor the standard output, that is, the training log, and send it to the training log collection server 450 in real time.
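A minimal sketch of such a per-node collection agent is shown below, assuming the metric and log collectors expose simple HTTP endpoints; the URLs, payload shapes and the use of nvidia-smi and psutil are illustrative assumptions only.

    import subprocess
    import time

    import psutil    # third-party: pip install psutil
    import requests  # third-party: pip install requests

    METRIC_SERVER = "http://metric-collector:9090/metrics"   # assumed endpoint
    LOG_SERVER = "http://log-collector:9091/logs"             # assumed endpoint

    def gpu_utilization():
        # Per-GPU utilization in percent, read via nvidia-smi.
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"], text=True)
        return [int(line) for line in out.splitlines() if line.strip()]

    def run_agent(job_id, train_cmd, report_interval_s=60):
        # Start the training program; its standard output is the training log.
        proc = subprocess.Popen(train_cmd, stdout=subprocess.PIPE, text=True)
        next_report = 0.0
        for line in proc.stdout:
            # Forward every log line to the training log collection server.
            requests.post(LOG_SERVER, json={"job": job_id, "line": line}, timeout=5)
            now = time.time()
            if now >= next_report:
                # Periodically report CPU/GPU utilization to the metric server.
                requests.post(METRIC_SERVER, json={
                    "job": job_id,
                    "cpu_util": psutil.cpu_percent(interval=None),
                    "gpu_util": gpu_utilization(),
                }, timeout=5)
                next_report = now + report_interval_s
        proc.wait()

    # Example invocation (the command is a placeholder):
    # run_agent("job-42", ["python", "train.py"])

A production agent would typically report metrics on a separate timer thread so that a silent training program still produces utilization samples; the single loop above is kept simple for illustration.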
S320: and determining the running state of the current deep learning model to be trained based on training resource utilization information and/or a training log and a fourth preset condition.
In an embodiment, the fourth preset condition may refer to determining whether the training resource collection server 440 collects the training resource utilization information and/or the training log collection server 450 collects the training logs.
When the training resource collection server 440 has not started collecting training resource utilization information and/or the training log collection server 450 has not started collecting training logs, the current deep learning model to be trained is in the pre-run state.
When the training resource collection server 440 is collecting training resource utilization information and/or the training log collection server 450 is collecting training logs, the current deep learning model to be trained is in the running state.
When the training resource collection server 440 has stopped collecting training resource utilization information and/or the training log collection server 450 has stopped collecting training logs, the current deep learning model to be trained is in the post-run state.
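One possible reading of the fourth preset condition is sketched below: the running state is inferred from whether the collection servers have ever received data for the job and how long ago the last sample arrived. The idle timeout is an assumed value.

    import time
    from enum import Enum

    class RunState(Enum):
        PRE_RUN = "pre-run"
        RUNNING = "running"
        POST_RUN = "post-run"

    def infer_run_state(first_sample_ts, last_sample_ts, now=None, idle_timeout_s=600.0):
        # first_sample_ts / last_sample_ts: UNIX timestamps of the first and last
        # metric or log sample received for this job; None if nothing arrived yet.
        now = time.time() if now is None else now
        if first_sample_ts is None:
            return RunState.PRE_RUN                   # collection has not started
        if now - last_sample_ts > idle_timeout_s:
            return RunState.POST_RUN                  # collection has stopped
        return RunState.RUNNING                       # collection is ongoing

    print(infer_run_state(None, None))                # RunState.PRE_RUN
    print(infer_run_state(0.0, 100.0, now=120.0))     # RunState.RUNNING
    print(infer_run_state(0.0, 100.0, now=10_000.0))  # RunState.POST_RUN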
In another embodiment of the present application, the method shown in Fig. 5 is an example of the method shown in Fig. 2 and further includes the following.
S510: and determining the running state of the current deep learning model to be trained as pre-running.
In an embodiment, the running state of the current deep learning model to be trained may be determined by the method of the embodiment shown in Fig. 3: when the fourth preset condition is met because the training resource collection server 440 has not started collecting training resource utilization information and/or the training log collection server 450 has not started collecting training logs, the running state of the current deep learning model to be trained is determined to be the pre-run state.
S520: and in a pre-operation resource pool, pre-operating the current deep learning model to be trained to obtain a pre-operation result, wherein the resource amount used by the pre-operation resource pool is less than that used by the formal operation resource pool.
In an embodiment, because the total resources of the GPU cluster are limited and must be shared by many users, a job enters a queuing stage after the user submits the current deep learning model to be trained to the deep learning platform. In general, the more resources the current deep learning model to be trained applies for, the longer it has to queue. However, after queuing for several hours, the model often fails and exits shortly after it starts running, which greatly wastes the user's effort.
Therefore, the configuration parameters of the current deep learning model to be trained are first analyzed and the amount of resources it uses is reduced to a reasonable level. For example, a model that originally applies for 8 GPU servers with 8 cards each can, after analysis, be reduced to 8 GPU servers with 1 card each. How the reduction ratio is determined from the configuration parameters is not specifically limited in the embodiments of the present application, and those skilled in the art may choose differently according to actual needs.
After the resource amount is reduced, the current deep learning model to be trained is placed into a separate pre-run resource pool (which uses fewer resources than the formal run resource pool) for a pre-debugging, pre-run check, so as to obtain a pre-run result.
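The following sketch illustrates such a reduction (for example, 8 servers with 8 cards each scaled down to 8 servers with 1 card each) before submitting to the pre-run pool; the job fields, the fixed pre-run time limit and the reduction policy are assumptions for illustration.

    def scale_down_for_pre_run(job, gpus_per_node=1, max_runtime_s=600):
        # Keep the same number of nodes so distributed code paths are exercised,
        # but request far fewer GPUs per node and cap the runtime to a smoke test.
        pre_run_job = dict(job)
        pre_run_job["gpus_per_node"] = min(job["gpus_per_node"], gpus_per_node)
        pre_run_job["queue"] = "pre-run"       # separate, smaller resource pool
        pre_run_job["max_runtime_s"] = max_runtime_s
        return pre_run_job

    full_job = {"name": "resnet50", "nodes": 8, "gpus_per_node": 8, "queue": "formal"}
    print(scale_down_for_pre_run(full_job))
    # {'name': 'resnet50', 'nodes': 8, 'gpus_per_node': 1, 'queue': 'pre-run', 'max_runtime_s': 600}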
In an embodiment, the pre-run result may be the pre-debugging result obtained after the current deep learning model to be trained has been pre-run, that is, either no error or an error such as a syntax error or a training configuration error.
S530: and determining training parameters related to the current deep learning model to be trained based on the pre-operation result.
In an embodiment, since the pre-run result contains the relevant training parameters, the training parameters related to the current deep learning model to be trained, that is, those related to the pre-run result such as the syntax or the training configuration, may be obtained directly from the pre-run result.
S540: and pushing the pre-operation result so as to facilitate the user to update the training parameters of the current deep learning model to be trained.
In one embodiment, by pre-running in the pre-run resource pool, error feedback from the pre-run result can be obtained quickly without wasting hours of queuing time. The pre-run result is then pushed, for example through a pop-up window, so that the user can update the training parameters of the current deep learning model to be trained that are related to the pre-run result.
S550: and training the current deep learning model to be trained according to the updated training parameters.
In an embodiment, the deep learning model to be trained currently can be trained in the formal running resource pool according to the updated training parameters.
In this way, pre-running the current deep learning model to be trained in the pre-run resource pool avoids putting it directly into the formal run resource pool, only to hit syntax errors or training configuration errors after hours of queuing, and thus avoids wasting the user's time and effort.
In an embodiment of the present application, the running state is the longest phase of the training process and generally lasts several hours or even days. The user typically looks at the status and log of the current deep learning model to be trained only occasionally, to confirm that training matches the design expectations. However, the user cannot always check the running state and the training log in time, so the current deep learning model to be trained may in practice not behave as expected and may fail to stop normally, wasting GPU and CPU resources.
In another embodiment of the present application, the method shown in Fig. 6 is an example of the method shown in Fig. 2 and further includes the following.
S610: and determining the running state of the current deep learning model to be trained as running.
In an embodiment, the running state of the current deep learning model to be trained may be determined by the method of the embodiment shown in Fig. 3: when the fourth preset condition is met because the training resource collection server 440 is collecting training resource utilization information and/or the training log collection server 450 is collecting training logs, the running state of the current deep learning model to be trained is determined to be the running state.
S620: and detecting the resource utilization rate of the current deep learning model to be trained during operation according to a first preset rule.
In an embodiment, resource utilization may refer to GPU utilization and/or CPU utilization.
In an embodiment, the first preset rule may refer to performing a first preset number of detections on the resource utilization rate of the current deep learning model to be trained during running, for example, 3 times, where a time interval between any two adjacent detections is a preset time interval, for example, 5 minutes. However, the value of the preset time interval and the value of the first preset number are not specifically limited in the embodiment of the present application, and those skilled in the art may select the values differently according to actual requirements.
In an embodiment, the first preset rule may also refer to continuously detecting a resource utilization rate of the current deep learning model to be trained during running in a preset time interval. However, the embodiment of the present application does not specifically limit the value of the preset time interval, and those skilled in the art may make different selections according to actual requirements.
It should be noted that the embodiment of the present application does not specifically limit the specific implementation manner of the first preset rule, and besides the two manners mentioned above, a person skilled in the art may also make different selections according to actual needs.
S630: and determining training parameters related to the current deep learning model to be trained based on the resource utilization rate.
In an embodiment, since the resource utilization rate includes the training parameters, the training parameters related to the current deep learning model to be trained, that is, the training parameters related to the resource utilization rate, can be directly obtained based on the resource utilization rate.
S640: and pushing training reference information according to the resource utilization rate and a first preset condition so that a user can update the training parameters of the current deep learning model to be trained according to the training reference information.
In an embodiment, the first preset condition may be whether the resource utilization rate of the current deep learning model to be trained detected each time is lower than a preset threshold. When the resource utilization rate obtained in a second preset number of detections is lower than the preset threshold, training reference information is pushed so that the user can update the training parameters of the current deep learning model to be trained that are related to the resource utilization rate.
The value of the second preset number is not specifically limited in the embodiments of the present application; it may cover all of the detections or only some of them.
In an embodiment, the training reference information may be pushed through a pop-up window or similar means to suggest that the user update the training parameters related to the resource utilization rate; the embodiments of the present application are not limited to this.
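The sketch below combines the first preset rule (a fixed number of checks at a fixed interval) with the first preset condition (utilization below a threshold in a given number of those checks). The 3 checks and 5-minute interval follow the examples in the text; the 20% threshold and the callable interfaces are assumptions.

    import time

    def check_utilization(read_gpu_util, push_reference_info,
                          checks=3, interval_s=300.0,
                          threshold_pct=20.0, low_checks_needed=3):
        # First preset rule: `checks` detections, `interval_s` apart.
        # First preset condition: utilization below `threshold_pct` in at least
        # `low_checks_needed` detections triggers a push of training reference info.
        low = 0
        for i in range(checks):
            if read_gpu_util() < threshold_pct:
                low += 1
            if i < checks - 1:
                time.sleep(interval_s)
        if low >= low_checks_needed:
            push_reference_info(
                f"GPU utilization was below {threshold_pct}% in {low} of {checks} "
                "checks; consider updating the training parameters.")

    # Example wiring with stand-in callables (zero interval so it runs instantly):
    check_utilization(lambda: 5.0, print, interval_s=0.0)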
S650: and continuing to train the current deep learning model to be trained according to the updated training parameters.
In an embodiment, the current deep learning model to be trained may be continuously trained in the formal running resource pool according to the updated training parameters.
Therefore, the resource utilization rate of the current deep learning model to be trained is detected through the first preset rule, and the user can be reminded to update the training parameters when the resource utilization rate is low, so that the training efficiency is improved, and the waste of GPU and CPU resources is avoided.
In another embodiment of the present application, the method shown in Fig. 7 is an example of the method shown in Fig. 2 and further includes the following.
S710: and determining the running state of the current deep learning model to be trained as running.
In an embodiment, the running state of the current deep learning model to be trained may be determined by the method of the embodiment shown in Fig. 3: when the fourth preset condition is met because the training resource collection server 440 is collecting training resource utilization information and/or the training log collection server 450 is collecting training logs, the running state of the current deep learning model to be trained is determined to be the running state.
S720: and judging abnormal information of the current deep learning model to be trained when the current deep learning model to be trained runs according to a second preset rule.
In one embodiment, the current deep learning model to be trained needs to read a large amount of training data while running and therefore relies heavily on the stability of the network and the storage system, but 100% reliability cannot be guaranteed in a distributed system. The abnormality information may therefore refer to an abnormal exit of the current deep learning model to be trained, for example one caused by jitter on the private network or by a timeout when reading from the storage system; this is not specifically limited in the embodiments of the present application.
In an embodiment, the second preset rule may be to measure how long the training resource collection server 440 shown in Fig. 4 has stopped receiving training resource utilization information and/or how long the training log collection server 450 shown in Fig. 4 has stopped receiving training logs. When this time exceeds a preset time threshold, it is determined that the current deep learning model to be trained has exited abnormally while running.
However, it should be noted that, in the embodiment of the present application, the value of the preset time and the value of the preset time threshold are not specifically limited, and those skilled in the art may make different selections according to actual requirements.
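A minimal sketch of the second preset rule under this reading: if neither metrics nor log lines have been received for longer than a preset time threshold, the run is treated as an abnormal exit. The 10-minute threshold is an assumed value.

    import time

    def looks_like_abnormal_exit(last_metric_ts, last_log_ts, now=None, threshold_s=600.0):
        # True when neither metrics nor log lines have arrived within the threshold.
        now = time.time() if now is None else now
        return (now - max(last_metric_ts, last_log_ts)) > threshold_s

    # Example: the last sample was seen 15 minutes ago, so the run is treated as abnormal.
    print(looks_like_abnormal_exit(last_metric_ts=0.0, last_log_ts=0.0, now=900.0))  # True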
S730: and determining training parameters related to the current deep learning model to be trained based on the abnormal information.
In an embodiment, since the abnormality information includes the training parameters, the training parameters related to the current deep learning model to be trained, that is, the training parameters related to the abnormality information, can be obtained directly from the abnormality information.
S740: and determining a check point corresponding to the abnormal information according to the abnormal information and a second preset condition.
In an embodiment, the second preset condition may be whether the abnormality information belongs to a known, tolerable fault type, judged from the results of analyzing past training logs that contained abnormality information. During running, the training parameters of the current deep learning model to be trained are saved continuously to generate checkpoint files; when the abnormality information is of a known, tolerable error type, a checkpoint is looked up in the checkpoint files to determine the checkpoint corresponding to the abnormality.
However, it should be noted that how to determine whether the abnormality information is of a known, tolerable error type is not specifically limited in the embodiments of the present application. For example, a network model may be trained on different kinds of abnormality information caused by different reasons, and the trained network model may then be used to determine whether the abnormality that occurred in the current deep learning model to be trained is of a known, tolerable type.
S750: and resuming training of the current deep learning model to be trained from the check point.
In one embodiment, when the abnormality information is determined to be of a known, tolerable error type, the state of the current deep learning model to be trained is changed automatically, its resources are not released, and training progress is restored from the latest checkpoint so as to resume training of the current deep learning model to be trained.
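The sketch below illustrates this resume step, assuming checkpoints are written periodically to a directory and that known, tolerable fault types are recognized by simple keyword matching on the log tail; the keyword list and file layout are assumptions.

    import glob
    import os

    TOLERABLE_PATTERNS = ("connection reset", "read timed out", "network is unreachable")

    def is_tolerable_fault(log_tail):
        # A known, tolerable fault is recognized by keywords in the last log lines.
        text = log_tail.lower()
        return any(pattern in text for pattern in TOLERABLE_PATTERNS)

    def latest_checkpoint(ckpt_dir):
        # Checkpoints are assumed to be saved periodically as *.ckpt files.
        files = glob.glob(os.path.join(ckpt_dir, "*.ckpt"))
        return max(files, key=os.path.getmtime) if files else None

    def maybe_resume(log_tail, ckpt_dir, resume_fn):
        # Resume from the newest checkpoint without releasing resources if the
        # fault is tolerable; otherwise leave the decision to the user.
        ckpt = latest_checkpoint(ckpt_dir)
        if ckpt is not None and is_tolerable_fault(log_tail):
            resume_fn(ckpt)
            return True
        return False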
In this way, tolerating such abnormalities minimizes the impact of accidental abnormal exits on the user and saves the time of resubmitting the task, which avoids wasting the user's time and effort and also makes the training of the current deep learning model to be trained more robust.
In another embodiment of the present application, the method shown in Fig. 8 is an example of the method shown in Fig. 2 and further includes the following.
S810: and determining the running state of the current deep learning model to be trained as running.
In an embodiment, the running state of the current deep learning model to be trained may be determined by the method of the embodiment shown in Fig. 3: when the fourth preset condition is met because the training resource collection server 440 is collecting training resource utilization information and/or the training log collection server 450 is collecting training logs, the running state of the current deep learning model to be trained is determined to be the running state.
S820: and acquiring training resource utilization information and/or a training log of the current deep learning model to be trained during running according to a third preset rule.
In an embodiment, the third preset rule may be to collect the training resource utilization information and/or training log of the current deep learning model to be trained while it is running according to the method shown in Fig. 4; for implementation details, refer to the embodiment shown in Fig. 4, which is not repeated here.
S830: determining training parameters related to the current deep learning model to be trained based on the training resource utilization information and/or the training log.
In an embodiment, since the training resource utilization information and/or the training log includes the training parameters, the training parameters related to the current deep learning model to be trained, that is, the training parameters related to the training resource utilization information and/or the training log, may be directly obtained based on the training resource utilization information and/or the training log.
S840: and finishing training the current deep learning model to be trained according to the training resource utilization information and/or the training log and a third preset condition.
In an embodiment, the third preset condition may be whether the training log is still being updated in a rolling manner while the training resource utilization information (that is, the GPU utilization and CPU utilization) is continuously detected, and/or whether the training precision and testing precision in the training log improve after multiple rounds of training. Either one of these conditions or both may be used; this is not particularly limited in the embodiments of the present application.
In one embodiment, training of the current deep learning model to be trained is ended when the training log stops being updated in a rolling manner (for example, because one of multiple processes exited abnormally) and the continuously detected training resource utilization remains low; or when the training precision and testing precision in the training log do not improve after multiple rounds of training; or when all of these conditions hold, that is, the training log stops rolling, the continuously detected training resource utilization is low, and the training precision and testing precision do not improve after multiple rounds of training. The embodiments of the present application are not limited to these cases.
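A sketch of such a stop condition is given below; the stall window, utilization threshold and patience value are assumptions, not values fixed by the patent.

    def should_end_training(seconds_since_last_log, recent_gpu_util, recent_test_accuracy,
                            log_stall_s=1800.0, util_threshold_pct=10.0, patience=5):
        # Condition 1: the log has stopped rolling and utilization stays low.
        log_stalled = seconds_since_last_log > log_stall_s
        util_low = bool(recent_gpu_util) and max(recent_gpu_util) < util_threshold_pct
        # Condition 2: test precision has not improved for `patience` rounds.
        no_improvement = (len(recent_test_accuracy) > patience and
                          max(recent_test_accuracy[-patience:]) <=
                          max(recent_test_accuracy[:-patience]))
        return (log_stalled and util_low) or no_improvement

    # Example: stalled log, idle GPUs and flat precision lead to ending the run.
    print(should_end_training(3600.0, [2.0, 1.0, 3.0],
                              [0.70, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71]))  # True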
Therefore, by judging whether the training of the current deep learning model to be trained should be automatically finished or not, the GPU and the CPU resources can be reasonably recovered and released, so that the waste of time and energy of a user is avoided, and the waste of the GPU and the CPU resources is avoided.
In another embodiment of the present application, the method shown in Fig. 9 is an example of the method shown in Fig. 2 and further includes the following.
S910: and determining the running state of the current deep learning model to be trained as running.
In an embodiment, the running state of the current deep learning model to be trained may be determined by the method of the embodiment shown in Fig. 3: when the fourth preset condition is met because the training resource collection server 440 has stopped collecting training resource utilization information and/or the training log collection server 450 has stopped collecting training logs, the running state of the current deep learning model to be trained is determined to be the post-run state, that is, the run has ended.
S920: and acquiring the training ending state of the current deep learning model to be trained after operation through a classification model according to the historical training data of the current deep learning model to be trained.
In an embodiment, after the deep learning model to be trained is trained, the attribute information of the deep learning model to be trained, such as configuration information, resource application information, runtime length information, average resource utilization rate information, etc., is collected and persistently stored in the database for use in subsequently determining the training completion state.
In an embodiment, the historical training data refers to multidimensional data, and the historical training data may include training resource utilization information acquired by the method described in the embodiment shown in fig. 4, may also include training logs acquired by the method described in the embodiment shown in fig. 4, and may also include attribute information of the current deep learning model to be trained, which is read from a database. The specific type of the historical training data is not limited in the embodiments of the present application.
In an embodiment, the classification model is a trained model, and is used to automatically classify an end training state of a deep learning model to be trained currently.
In an embodiment, historical training data of the current deep learning model to be trained is input into the classification model, and a training ending state of the current deep learning model to be trained after running can be automatically acquired.
In one embodiment, the end training state may be, for example: ended normally with high efficiency; ended normally but inefficiently; ended abnormally due to a path configuration error; ended abnormally because the GPU memory limit was exceeded; or ended abnormally due to private network jitter. This is not specifically limited in the embodiments of the present application.
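The sketch below shows how historical training data might be turned into a feature vector and classified into one of these end training states by an already-trained classifier; the feature set and label encoding are assumptions.

    END_STATES = [
        "ended normally - efficient",
        "ended normally - inefficient",
        "ended abnormally - path configuration error",
        "ended abnormally - GPU memory exceeded",
        "ended abnormally - private network jitter",
    ]

    def build_features(history):
        # history: configuration, requested resources, runtime, average utilization
        # and a digest of the last lines of the training log, read from the database.
        return [
            history["requested_gpus"],
            history["runtime_hours"],
            history["avg_gpu_util"],
            history["avg_cpu_util"],
            float("out of memory" in history["log_tail"].lower()),
            float("no such file" in history["log_tail"].lower()),
        ]

    def classify_end_state(history, classifier):
        # classifier: any fitted model exposing predict(), e.g. from scikit-learn.
        label_index = int(classifier.predict([build_features(history)])[0])
        return END_STATES[label_index]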
S930: and determining training parameters related to the current deep learning model to be trained based on the training ending state.
In an embodiment, since the end training state includes the training parameters, the training parameters related to the deep learning model, that is, the training parameters related to the end training state, such as a path configuration error or a GPU video memory exceeding a limit, may be directly obtained based on the end training state.
S940: and pushing the training ending state so that the user can update the training parameters of the current deep learning model to be trained according to the training ending state.
In an embodiment, once the end training state is obtained it may be pushed directly, so that the user can update the training parameters of the current deep learning model to be trained that are related to the end training state.
In an embodiment, the end training state may be pushed through a pop-up window or the like; the embodiments of the present application are not limited to this.
S950: and according to the updated training parameters, retraining the current deep learning model to be trained.
In an embodiment, according to the updated training parameters, the current deep learning model to be trained is retrained in the formal running resource pool according to the methods described in the embodiments shown in fig. 2 to 8, so as to obtain a deep learning model with better performance.
After training of the current deep learning model to be trained is completed, the trained model and a corresponding training log are generated. Based on the actual performance of the trained model and the training log, the user can confirm whether the requirements are met and decide on subsequent optimization and experiment strategies. The user can also determine the reason the current deep learning model to be trained exited with an error, for example an incompatible data format, an incorrect model storage directory, an incorrect parameter configuration, or exceeding the maximum configured training time. However, when some abnormal conditions occur, the user cannot determine a clear reason for the end of training from the training log, and therefore cannot update the parameters of the current deep learning model to be trained in a targeted manner to avoid the error.
Therefore, by determining the end training state of the current deep learning model to be trained, the reason training ended can be determined, so the user can update the training parameters of the current deep learning model to be trained in a targeted manner to prevent the error from appearing again, which improves training efficiency and also improves the performance of the current deep learning model to be trained.
In another embodiment of the present application, the method further comprises: acquiring historical training sample data of each historical training model in a plurality of historical training models; and training the classification model according to the multi-dimensional historical training sample data.
In one embodiment, each time a training model is trained, attribute information of the training model, such as configuration information, application resource information, runtime length information, average resource utilization information, and the like, is collected and persistently stored in the database for use by a subsequently trained classification model.
In an embodiment, the historical training sample data refers to multi-dimensional sample data, and the historical training sample data may include training resource utilization information of each historical training model acquired by the method described in the embodiment shown in fig. 4, may also include training logs acquired by the method described in the embodiment shown in fig. 4, and may also include attribute information of each historical training model read from a database. The embodiment of the present application is not particularly limited to this.
In one embodiment, some features, for example the parameter configuration of the historical training model, the distribution of its run durations, its resource utilization rate, and a keyword digest of the last N lines of its training log, are extracted from the historical training sample data of each historical training model, and the classification model learns from these extracted features to obtain the end training state of each historical training model.
However, it should be noted that the embodiment of the present application is not limited to the specific number of the extracted partial features, and those skilled in the art may make different selections according to actual needs.
In an embodiment, the classification model may be a shallow machine learning model, such as an SVM classifier, a logistic regression classifier or a linear regression classifier; this is not specifically limited in the embodiments of the present application. A classification model obtained from a shallow machine learning model can classify quickly and thus improves the efficiency of model classification.
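A minimal training sketch for such a shallow classifier is shown below, using scikit-learn logistic regression on hand-made feature vectors; the features and labels are illustrative only, and the real feature engineering is left to the platform.

    from sklearn.linear_model import LogisticRegression

    def train_end_state_classifier(feature_vectors, end_state_labels):
        # feature_vectors: one feature vector per historical training job.
        # end_state_labels: integer indices into the end-state label list.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(feature_vectors, end_state_labels)
        return clf

    # Tiny made-up example (features: GPUs, hours, GPU%, CPU%, OOM flag, path-error flag):
    X = [[8, 12.0, 85.0, 40.0, 0.0, 0.0],   # 0 = ended normally - efficient
         [8,  0.1,  2.0,  5.0, 1.0, 0.0]]   # 3 = ended abnormally - GPU memory exceeded
    y = [0, 3]
    model = train_end_state_classifier(X, y)
    print(model.predict([[8, 0.2, 3.0, 6.0, 1.0, 0.0]]))  # likely [3]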
In an embodiment, the classification model may also be a deep model based on deep learning, for example at least one of network structures such as a Convolutional Neural Network (CNN), a Deep Neural Network (DNN) or a Recurrent Neural Network (RNN); these networks may use ResNet, ResNeXt or DenseNet as the backbone network, and the classification model may include neural network layers such as an input layer, convolutional layers, pooling layers and fully connected layers. A classification model obtained from a deep model based on deep learning can improve the accuracy of model classification.
However, it should be noted that the embodiment of the present application does not specifically limit the specific type of the classification model.
As the historical training sample data of the plurality of historical training models keeps accumulating, the classification accuracy of the classification model can keep improving, so the end training state obtained through the classification model becomes more accurate. The classification model can also help the user locate training problems quickly and improve the efficiency of the deep learning model.
In another embodiment of the present application, the method further includes: determining the end training states of the historical deep learning models of all users according to the classification model; and determining the training priority of the current deep learning model to be trained of each user according to those end training states, so that the current deep learning model to be trained of a user with higher priority is trained first.
The end training states of a user's historical deep learning models reflect the quality of the deep learning models that user designs: if those end training states show low training efficiency, other deep learning models the user submits to the GPU training platform are also likely to train inefficiently, and training them preferentially would waste GPU and CPU resources as well as other users' time and effort.
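A priority-scoring sketch under this idea is shown below; the scores assigned to each end training state and the neutral default are assumptions.

    END_STATE_SCORES = {
        "ended normally - efficient": 1.0,
        "ended normally - inefficient": 0.3,
        # abnormal end states score 0.0 by default
    }

    def user_priority(historical_end_states):
        # No history gives a neutral priority; an efficient history raises it.
        if not historical_end_states:
            return 0.5
        total = sum(END_STATE_SCORES.get(state, 0.0) for state in historical_end_states)
        return total / len(historical_end_states)

    queue = sorted(
        [("alice", ["ended normally - efficient"] * 4),
         ("bob", ["ended abnormally - GPU memory exceeded", "ended normally - inefficient"])],
        key=lambda item: user_priority(item[1]),
        reverse=True)
    print([name for name, _ in queue])  # ['alice', 'bob']: alice's jobs are scheduled first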
In conclusion, the approach of user-defined rules combined with model analysis can reduce the human effort and time the user spends on the key steps of model training, as well as the consumption of GPU and CPU hardware resources.
It should be noted that the embodiments of the present application do not specifically limit the order in which the methods of Fig. 5, Fig. 6, Fig. 7, Fig. 8 and Fig. 9 are executed. For example, the pre-run-stage method of Fig. 5 may be executed first, then the running-stage methods of Fig. 6, Fig. 7 and Fig. 8 (either at the same time or one after another), and finally the post-run-stage method of Fig. 9. As another example, the pre-run-stage method of Fig. 5, the running-stage methods of Fig. 6, Fig. 7 and Fig. 8, and the post-run-stage method of Fig. 9 may all be executed at the same time. Those skilled in the art can choose differently according to actual needs.
Exemplary devices
The embodiment of the device can be used for executing the embodiment of the method. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 10 is a block diagram of an apparatus for training a deep learning model according to an exemplary embodiment of the present application. As shown in fig. 10, the apparatus 1000 includes the following.
The first determining module 1010 is configured to determine an operating state of the deep learning model to be trained currently.
A second determining module 1020 configured to determine a training parameter related to the current deep learning model to be trained.
A training module 1030 configured to train the current deep learning model to be trained based on the operating state and the training parameters.
In an optional embodiment, the second determining module 1020 may include a pre-running unit 1021 configured to pre-run the current deep learning model to be trained in a pre-run resource pool to obtain a pre-run result, where an amount of resources used by the pre-run resource pool is less than an amount of resources used by a formal run resource pool; and determining training parameters related to the current deep learning model to be trained based on the pre-operation result.
In an optional embodiment, the training module 1030 may include a first training unit 1031 configured to push the pre-run result, so as to facilitate a user to update the training parameters of the current deep learning model to be trained; and training the current deep learning model to be trained according to the updated training parameters.
In an optional embodiment, the second determining module 1020 may further include a first running unit 1022 configured to detect, according to a first preset rule, a resource utilization rate of the current deep learning model to be trained during running; and determining training parameters related to the current deep learning model to be trained based on the resource utilization rate.
In an optional embodiment, the training module 1030 may further include a second training unit 1032 configured to push training reference information according to the resource utilization rate and a first preset condition, so that a user may update the training parameters of the current deep learning model to be trained according to the training reference information; and continuing to train the current deep learning model to be trained according to the updated training parameters.
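A minimal sketch of such in-run monitoring is given below: the resource utilization is sampled periodically (standing in for the first preset rule) and, when it falls below a floor (standing in for the first preset condition), training reference information is generated for pushing to the user. psutil is used for the CPU reading; the thresholds, the advice text and the absence of a real GPU probe (which would typically use NVML) are assumptions made for illustration.

```python
from typing import Optional

import psutil


def read_utilization() -> dict:
    """Sample current CPU and memory utilization; a real unit would also query the GPUs."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1.0),
        "mem_percent": psutil.virtual_memory().percent,
    }


def training_reference_info(util: dict, cpu_floor: float = 20.0) -> Optional[str]:
    """Return advice to push to the user when utilization is below the preset floor."""
    if util["cpu_percent"] < cpu_floor:
        return ("CPU utilization is low; consider increasing the batch size or the "
                "number of data-loading workers before continuing training.")
    return None


if __name__ == "__main__":
    advice = training_reference_info(read_utilization())
    if advice:
        print("push to user:", advice)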
In an optional embodiment, the second determining module 1020 may further include a second running unit 1023 configured to determine, according to a second preset rule, abnormal information of the deep learning model to be trained when running; and determining training parameters related to the current deep learning model to be trained based on the abnormal information.
In an optional embodiment, the training module 1030 may further include a third training unit 1033 configured to determine, according to the abnormal information and a second preset condition, a check point corresponding to the abnormal information; and resuming training of the current deep learning model to be trained from the check point.
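The checkpoint-recovery behaviour of the second running unit 1023 and the third training unit 1033 could look roughly like the sketch below: the abnormal information is judged against a simple rule (standing in for the second preset condition), the corresponding checkpoint is located, and training resumes from it. The checkpoint file layout, the isinstance test and the directory convention are assumptions; a real implementation would map specific anomaly types to specific checkpoints.

```python
import glob
import os
from typing import Optional

import torch


def latest_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Pick the most recently written checkpoint file in the directory, if any."""
    candidates = glob.glob(os.path.join(ckpt_dir, "*.pt"))
    return max(candidates, key=os.path.getmtime) if candidates else None


def resume_from_anomaly(model, optimizer, anomaly: Exception, ckpt_dir: str) -> int:
    """If the anomaly is judged recoverable, reload the latest checkpoint and return
    the epoch to resume from; otherwise re-raise the anomaly."""
    recoverable = isinstance(anomaly, (RuntimeError, OSError))  # stand-in for the second preset condition
    ckpt_path = latest_checkpoint(ckpt_dir)
    if not recoverable or ckpt_path is None:
        raise anomaly
    state = torch.load(ckpt_path, map_location="cpu")  # assumed layout: {"model", "optimizer", "epoch"}
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```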
In an optional embodiment, the second determining module 1020 may further include a third running unit 1024 configured to obtain, according to a third preset rule, training resource utilization information and/or a training log of the current deep learning model to be trained during running; determining training parameters related to the current deep learning model to be trained based on the training resource utilization information and/or the training log.
In an optional embodiment, the training module 1030 may further include a fourth training unit 1034 configured to finish training the deep learning model to be trained according to the training resource utilization information and/or the training log and a third preset condition.
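One plausible reading of this fourth training unit 1034 is sketched below: training is ended when the training log shows no recent improvement in the loss, or when the recorded GPU utilization stays near zero (a stand-in for the third preset condition). The patience window, the 5% utilization floor and the representation of the log as plain lists are illustrative assumptions.

```python
def should_end_training(loss_history: list, gpu_util_history: list,
                        patience: int = 5, util_floor: float = 5.0) -> bool:
    """Return True when the recent losses stop improving or the GPU sits idle."""
    if len(loss_history) > patience:
        recent = loss_history[-patience:]
        if min(recent) >= min(loss_history[:-patience]):
            return True  # no improvement within the patience window
    if len(gpu_util_history) >= patience:
        if all(u < util_floor for u in gpu_util_history[-patience:]):
            return True  # GPU has been essentially idle
    return False


if __name__ == "__main__":
    losses = [1.0, 0.8, 0.7, 0.69, 0.70, 0.71, 0.70, 0.72, 0.70, 0.71]
    print(should_end_training(losses, [60.0] * 10))  # True: the loss has plateaued
```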
In an optional embodiment, the second determining module 1020 may further include a post-operation unit 1025, configured to obtain, according to the historical training data of the current deep learning model to be trained, an end training state of the current deep learning model to be trained after operation through a classification model; and determining training parameters related to the current deep learning model to be trained based on the training ending state.
In an optional embodiment, the training module 1030 may further include a fifth training unit 1035 configured to push the end training state so that the user can update the training parameters of the current deep learning model to be trained according to the end training state, and to retrain the current deep learning model to be trained according to the updated training parameters.
In an optional embodiment, the apparatus further comprises: a classification model training module 1040 configured to obtain historical training sample data of each of the plurality of historical training models; and training the classification model according to the multi-dimensional historical training sample data.
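The classification model training module 1040 could be realized along the lines of the sketch below, which fits a classifier to multi-dimensional historical training sample data (for example requested resources, run duration and average resource utilization) labelled with the end-of-training state. The feature columns, the labels and the choice of a scikit-learn random forest are assumptions; the embodiment does not fix a particular model family or feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed feature columns: requested_gpus, run_hours, avg_gpu_util_percent
historical_samples = np.array([
    [8, 2.0, 15.0],
    [4, 20.0, 85.0],
    [2, 5.0, 70.0],
    [8, 1.0, 10.0],
])
# Assumed labels: 0 = ended normally, 1 = ended with low training efficiency
end_states = np.array([1, 0, 0, 1])

classifier = RandomForestClassifier(n_estimators=50, random_state=0)
classifier.fit(historical_samples, end_states)

# After a run, the post-operation unit could feed the current model's own history
# to the classifier to obtain its likely end training state for pushing to the user.
print(classifier.predict([[6, 1.5, 12.0]]))  # e.g. [1] -> likely low efficiency
```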
In an optional embodiment, the first determining module 1010 is specifically configured to determine the operating state of the current deep learning model to be trained based on the training resource utilization information and/or the training log and a fourth preset condition.
The apparatus for training a deep learning model provided by the embodiment of the present application first determines the running state of the current deep learning model to be trained, then determines the training parameters related to the current deep learning model to be trained, and finally trains the current deep learning model to be trained based on the running state and the training parameters. In this way, problems that affect training can be avoided while the current deep learning model to be trained is in different running states, thereby improving training efficiency, avoiding waste of the user's time and effort, and avoiding waste of GPU and CPU resources.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 11. Fig. 11 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure. As shown in fig. 11, electronic device 1100 includes one or more processors 1110 and memory 1120.
The processor 1110 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 1100 to perform desired functions.
The memory 1120 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 1110 to implement the methods of training deep learning models of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as training resource utilization information, training logs and configuration information, application resource information, run-time length information, average resource utilization information, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 1100 may further include: an input device 1130 and an output device 1140, which are interconnected by a bus system and/or other form of connection mechanism (not shown). The input device 1130 includes, but is not limited to, a keyboard, a mouse, a camera, and the like.
Of course, for simplicity, only some of the components of the electronic device 1100 relevant to the present disclosure are shown in fig. 11, omitting components such as buses, input/output interfaces, and the like. In addition, electronic device 1100 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in a method of training a deep learning model according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
Program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of training a deep learning model according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, or configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A method of training a deep learning model, comprising:
determining the running state of the current deep learning model to be trained;
determining training parameters related to the current deep learning model to be trained;
and training the current deep learning model to be trained based on the running state and the training parameters.
2. The method of claim 1, wherein the determining training parameters related to the current deep learning model to be trained comprises:
in a pre-operation resource pool, pre-operating the current deep learning model to be trained to obtain a pre-operation result, wherein the resource amount used by the pre-operation resource pool is less than that used by a formal operation resource pool;
determining training parameters related to the current deep learning model to be trained based on the pre-operation result;
wherein the training the current deep learning model to be trained based on the operating state and the training parameters comprises:
pushing the pre-operation result so as to facilitate a user to update the training parameters of the current deep learning model to be trained;
and training the current deep learning model to be trained according to the updated training parameters.
3. The method of claim 1, wherein the determining training parameters related to the current deep learning model to be trained comprises:
detecting the resource utilization rate of the current deep learning model to be trained during operation according to a first preset rule;
determining training parameters related to the current deep learning model to be trained based on the resource utilization rate;
wherein the training the current deep learning model to be trained based on the operating state and the training parameters comprises:
pushing training reference information according to the resource utilization rate and a first preset condition so that a user can update the training parameters of the current deep learning model to be trained according to the training reference information;
and continuing to train the current deep learning model to be trained according to the updated training parameters.
4. The method of claim 1, wherein the determining training parameters related to the current deep learning model to be trained comprises:
judging, according to a second preset rule, abnormal information of the current deep learning model to be trained during running;
determining training parameters related to the current deep learning model to be trained based on the abnormal information;
wherein the training the current deep learning model to be trained based on the operating state and the training parameters comprises:
determining a check point corresponding to the abnormal information according to the abnormal information and a second preset condition;
and resuming training of the current deep learning model to be trained from the check point.
5. The method of claim 1, wherein the determining training parameters related to the current deep learning model to be trained comprises:
acquiring training resource utilization information and/or a training log of the current deep learning model to be trained during operation according to a third preset rule;
determining training parameters related to the current deep learning model to be trained based on the training resource utilization information and/or the training log;
wherein the training the current deep learning model to be trained based on the operating state and the training parameters comprises:
and finishing training the current deep learning model to be trained according to the training resource utilization information and/or the training log and a third preset condition.
6. The method of claim 1, wherein the determining training parameters related to the current deep learning model to be trained comprises:
acquiring a training ending state of the current deep learning model to be trained after running through a classification model according to historical training data of the current deep learning model to be trained;
determining training parameters related to the current deep learning model to be trained based on the training ending state;
wherein the training the current deep learning model to be trained based on the operating state and the training parameters comprises:
pushing the training ending state so that the user can update the training parameters of the current deep learning model to be trained according to the training ending state;
and according to the updated training parameters, retraining the current deep learning model to be trained.
7. The method of claim 6, further comprising:
acquiring historical training sample data of each historical training model in a plurality of historical training models;
and training the classification model according to the multi-dimensional historical training sample data.
8. An apparatus for training a deep learning model, comprising:
the first determination module is configured to determine the running state of the current deep learning model to be trained;
a second determination module configured to determine a training parameter related to the current deep learning model to be trained;
and the training module is configured to train the current deep learning model to be trained based on the running state and the training parameters.
9. A computer-readable storage medium, the storage medium storing a computer program for executing the method of training a deep learning model according to any one of the preceding claims 1 to 7.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to perform the method for training a deep learning model according to any one of claims 1 to 7.
CN202011467358.7A 2020-12-14 2020-12-14 Method and device for training deep learning model, electronic equipment and storage medium Pending CN112508191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011467358.7A CN112508191A (en) 2020-12-14 2020-12-14 Method and device for training deep learning model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011467358.7A CN112508191A (en) 2020-12-14 2020-12-14 Method and device for training deep learning model, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112508191A true CN112508191A (en) 2021-03-16

Family

ID=74972901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011467358.7A Pending CN112508191A (en) 2020-12-14 2020-12-14 Method and device for training deep learning model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112508191A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200210844A1 (en) * 2018-12-29 2020-07-02 Canon Kabushiki Kaisha Training and application method of a multi-layer neural network model, apparatus and storage medium
CN110008853A (en) * 2019-03-15 2019-07-12 华南理工大学 Pedestrian detection network and model training method, detection method, medium, equipment
CN110163368A (en) * 2019-04-18 2019-08-23 腾讯科技(深圳)有限公司 Deep learning model training method, apparatus and system based on mixed-precision
CN110427998A (en) * 2019-07-26 2019-11-08 上海商汤智能科技有限公司 Model training, object detection method and device, electronic equipment, storage medium
CN110705705A (en) * 2019-09-25 2020-01-17 浪潮电子信息产业股份有限公司 Convolutional neural network model synchronous training method, cluster and readable storage medium
CN111079892A (en) * 2019-10-30 2020-04-28 华为技术有限公司 Deep learning model training method, device and system
CN110889492A (en) * 2019-11-25 2020-03-17 北京百度网讯科技有限公司 Method and apparatus for training deep learning models
CN111768006A (en) * 2020-06-24 2020-10-13 北京金山云网络技术有限公司 Artificial intelligence model training method, device, equipment and storage medium
CN111753997A (en) * 2020-06-28 2020-10-09 北京百度网讯科技有限公司 Distributed training method, system, device and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160009A (en) * 2021-03-31 2021-07-23 北京大米科技有限公司 Information pushing method, related device and computer medium
CN114692829A (en) * 2022-03-24 2022-07-01 西安交通大学 DNN model-based checkpoint selection method, equipment and storage medium
CN114692829B (en) * 2022-03-24 2024-04-02 西安交通大学 DNN model-based checkpoint selection method, device and storage medium

Similar Documents

Publication Publication Date Title
US10805151B2 (en) Method, apparatus, and storage medium for diagnosing failure based on a service monitoring indicator of a server by clustering servers with similar degrees of abnormal fluctuation
US8516499B2 (en) Assistance in performing action responsive to detected event
US20190243743A1 (en) Unsupervised anomaly detection
CN110046706B (en) Model generation method and device and server
CN110673936B (en) Breakpoint continuous operation method and device for arrangement service, storage medium and electronic equipment
WO2022089202A1 (en) Fault identification model training method, fault identification method, apparatus and electronic device
CN112508191A (en) Method and device for training deep learning model, electronic equipment and storage medium
WO2022001125A1 (en) Method, system and device for predicting storage failure in storage system
WO2020140624A1 (en) Method for extracting data from log, and related device
CN113342589B (en) Method and device for pressure testing of server
US9645873B2 (en) Integrated configuration management and monitoring for computer systems
CN114265857A (en) Query statement processing method and device
CN113342588A (en) Method and device for carrying out pressure test on server based on dynamic adjustment load
CN116821638B (en) Data analysis method and system for AI chip application optimization design
CN113835918A (en) Server fault analysis method and device
CN113641544A (en) Method, apparatus, device, medium and product for detecting application status
CN115130112A (en) Quick start-stop method, device, equipment and storage medium
CN113593546B (en) Terminal equipment awakening method and device, storage medium and electronic device
CN112685390B (en) Database instance management method and device and computing equipment
CN114385496A (en) Test method, test device, electronic equipment and computer readable storage medium
CN113836005A (en) Virtual user generation method and device, electronic equipment and storage medium
CN110674839B (en) Abnormal user identification method and device, storage medium and electronic equipment
CN110334244B (en) Data processing method and device and electronic equipment
CN112114972B (en) Data inclination prediction method and device
US11983085B1 (en) Dynamic usage-based segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination