CN114528070A - Convolutional neural network layered training method and system based on containerization and virtualization - Google Patents


Info

Publication number
CN114528070A
CN114528070A (publication) · CN202210141065.2A (application)
Authority
CN
China
Prior art keywords
template
layer
neural network
convolutional neural
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210141065.2A
Other languages
Chinese (zh)
Inventor
江居正
张勇
石光银
蔡卫卫
高传集
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd filed Critical Inspur Cloud Information Technology Co Ltd
Priority to CN202210141065.2A priority Critical patent/CN114528070A/en
Publication of CN114528070A publication Critical patent/CN114528070A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45591Monitoring or debugging support


Abstract

The invention discloses a convolutional neural network layered training method and system based on containerization and virtualization, belonging to the technical field of machine learning. The system comprises a segmented machine learning framework, abstract modular templates, POD and virtual machine creation and scheduling, and a monitoring platform. The segmented machine learning framework provides machine learning programs with communication capability and generates complete machine learning model code from the templates. The abstract modular templates provide a standard template for each module of the convolutional neural network, based on the layered character of the network structure, and define the structure of each layer declaratively. POD and virtual machine creation and scheduling covers machine learning code generation, POD and virtual machine creation, and POD and virtual machine scheduling. The method can fully exploit the performance of the computing equipment, reduce the cost for a model designer to train a model, improve the reuse rate of cloud service provider equipment, and simplify the design process of a convolutional neural network model.

Description

Convolutional neural network layered training method and system based on containerization and virtualization
Technical Field
The invention relates to the technical field of machine learning, in particular to a convolutional neural network layered training method and system based on containerization and virtualization.
Background
As computer hardware performance has improved, more and more capable convolutional neural network models have been proposed in succession, from the early LeNet to the later AlexNet, VGG, GoogLeNet, ResNet, DenseNet, and so on. The number of layers of convolutional neural network models keeps growing, and model structures have become more complicated. In general, the total parameter count of a convolutional neural network tends to grow exponentially with the number of layers.
The training of a model relies on certain computer hardware resources. Whether a model can complete one round of training depends mainly on whether the computer hardware running it can supply the computing and storage resources the model needs at its moment of peak demand. Over the whole training process, the hardware is pushed to its limit only at those peak moments and sits partly idle the rest of the time. On the premise of ensuring that the model can be trained normally, a model designer therefore usually has to trade off computer hardware resources against total training time. Even so, the designer still has to pay for the time during which the hardware resources cannot be fully used, which undoubtedly increases the training cost of the model.
Disclosure of Invention
The technical task of the invention is to provide, in view of the above defects, a convolutional neural network layered training method and system based on containerization and virtualization, which can fully exploit the performance of computing equipment, reduce the cost for a model designer to train a model, improve the reuse rate of cloud service provider equipment, and simplify the design flow of a convolutional neural network model.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A convolutional neural network layered training method based on containerization and virtualization comprises a segmented machine learning framework, abstract modular templates, POD and virtual machine creation and scheduling, and a monitoring platform;
the segmented machine learning framework is used for providing a machine learning program with communication capability and generating complete machine learning model codes according to the template;
the abstract modular templates provide a standard template for each module of the convolutional neural network, based on the layered character of the convolutional neural network structure; the structure of each layer is defined declaratively, and the declaration specifies the module type and related configuration according to the template type, for example: resource type (virtual machine or POD), computing resource scale (memory, CPU, GPU, bandwidth), scheduling order, and so on;
POD and virtual machine creation and scheduling exploits the different computing and storage resources required by the modules of the convolutional neural network, creating the corresponding POD or virtual machine from the abstracted template information; it covers machine learning code generation, POD and virtual machine creation, and POD and virtual machine scheduling;
the monitoring platform covers POD and virtual machine performance monitoring and model training process monitoring; it collects POD and virtual machine monitoring data from the communication modules of each layer and aggregates the data into its data dashboard.
According to the hierarchical relationship among the modules of the convolutional neural network, the method designs a segmented machine learning framework and a set of templates abstracting the modules of the network, provides a way to create PODs or virtual machines whose computing and storage resources match those required by the abstracted modules, and builds a platform for creating, scheduling and monitoring the PODs and virtual machines.
Preferably, the segmented machine learning framework encapsulates the codes of the machine learning modules in different HTTP services respectively; the segmented machine learning framework includes a library of code modules and a communication module,
the code module library defines the code of each machine learning module, including a convolutional layer module, a pooling layer module and a fully-connected layer module;
the communication module mainly provides data transmission capability for each module of the model, and comprises the steps of transmitting data to the message queue, pulling data to the message queue and feeding back real-time data to the monitoring platform.
The segmented machine learning framework encapsulates each machine learning module, in modular form, as a module with communication capability; the modules can be flexibly combined into a complete machine learning network, and a machine learning model suitable for single-machine deployment is generated automatically from the imported templates.
Because each machine learning module is packaged with communication capability, the modules can run on different machines, achieving segmented training of the model.
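The modular packaging described above can be sketched in a few lines. This is a hypothetical toy, not the patent's implementation: the class and method names are assumptions, and an in-process queue stands in for the real message queue and HTTP services.

```python
# Hypothetical sketch of the segmented framework's module wrapper: each
# machine learning module carries a communication shim so it can run on a
# separate machine and forward its output downstream.
from queue import Queue

class LayerModule:
    def __init__(self, name, fn, out_queue):
        self.name = name            # template name of this layer
        self.fn = fn                # the layer's computation (conv, pool, ...)
        self.out_queue = out_queue  # stands in for the real message queue

    def handle(self, data):
        """Process one batch and forward the result to the next layer."""
        result = self.fn(data)
        self.out_queue.put({"from": self.name, "data": result})
        return result

# Compose two toy "layers" into a pipeline, as the framework would from templates.
q1, q2 = Queue(), Queue()
conv = LayerModule("conv1", lambda x: [v * 2 for v in x], q1)
pool = LayerModule("pool1", lambda x: [max(x)], q2)

conv.handle([1, 2, 3])
msg = q1.get()
pool.handle(msg["data"])
```

In a deployment following the document, each `LayerModule` would be wrapped in its own HTTP service inside a POD or virtual machine.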
Preferably, the abstract modular templates include a convolutional layer template, a batch normalization template, an activation template, a pooling template, a dropout template, a fully-connected template, a data template, and a scheduling template, where the scheduling template is generated automatically from the other templates. The main usage of the other templates is as follows:
The convolutional layer template configures the parameters of a convolutional layer, mainly:
in_channels: the number of input channels;
out_channels: the number of output channels;
kernel_size: the size of the convolution kernel; if the parameter is an integer n, the kernel size is n x n;
stride: the step size of the kernel's movement during convolution, 1 by default; the kernel generally moves over the input image from left to right and from top to bottom, and an integer value applies to both the horizontal and vertical directions; if the parameter is stride=(2, 1), the 2 is the row (height, h) step and the 1 is the column (width, w) step;
padding: padding, all-zero padding by default;
dilation: dilation; normally the kernel acts on a same-sized region of the input, so a 3 x 3 kernel acts on a 3 x 3 region each time, meaning no dilation (the default);
groups: grouping of the input channels; if groups=1 the input forms one group and the output one group; if groups=2 the input is split into two groups and the output likewise into two groups; in_channels and out_channels must both be divisible by groups;
bias: a bias parameter of bool type; when bias=True, a bias b learned through backpropagation is applied;
padding_mode: padding_mode='zeros' denotes zero padding;
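A declarative convolutional-layer template of the kind described above might look as follows. This is an illustrative sketch: the concrete key/value syntax and the validation helper are assumptions, not the patent's format.

```python
# A hypothetical declarative conv-layer template; field names follow the
# parameter list above.
conv_template = {
    "name": "conv1", "step": 1,
    "in_channels": 3, "out_channels": 64,
    "kernel_size": 3,          # an integer 3 means a 3x3 kernel
    "stride": (2, 1),          # row step 2, column step 1
    "padding": 0, "dilation": 1, "groups": 1,
    "bias": True, "padding_mode": "zeros",
}

def validate_conv(t):
    # groups must evenly divide both channel counts, as the text requires
    if t["in_channels"] % t["groups"] or t["out_channels"] % t["groups"]:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    # an integer kernel_size n is shorthand for an n x n kernel
    k = t["kernel_size"]
    return (k, k) if isinstance(k, int) else k

kernel = validate_conv(conv_template)   # (3, 3)
```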
The batch normalization template configures the parameters of a batch normalization layer, mainly:
name, step: the name of the template and its position in the execution order;
The activation template configures the parameters of an activation layer, mainly:
type: the activation function type; commonly used activation functions include sigmoid, tanh, relu, and so on;
The pooling template configures the parameters of a pooling layer, mainly:
kernel_size: the size of the pooling window;
stride: the step size of the pooling window's movement; the default value is kernel_size;
padding: the number of zero-padding layers added to each input edge;
dilation: a parameter controlling the stride of elements within the window;
return_indices: if True, the index of the output maximum is also returned;
ceil_mode: if True, an output size that would be fractional is rounded up instead of the default rounding down;
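The effect of these pooling parameters on the output size follows the usual pooling arithmetic; a minimal sketch (the helper function is an assumption, but the formula is the standard one, with ceil_mode switching the final rounding):

```python
import math

# Output length of one pooled dimension, given the parameters listed above.
def pool_out(n, kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False):
    stride = kernel_size if stride is None else stride  # default stride = kernel_size
    span = n + 2 * padding - dilation * (kernel_size - 1) - 1
    rnd = math.ceil if ceil_mode else math.floor
    return rnd(span / stride) + 1

pool_out(7, 2)                  # floor rounding: 3
pool_out(7, 2, ceil_mode=True)  # ceil rounding:  4
```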
The dropout template configures the parameters of a dropout layer, mainly:
rate: the drop probability;
The fully-connected template configures the parameters of a fully-connected layer, mainly:
inputs: the input data;
units: the number of neural units in the layer;
activation: the activation function;
use_bias: bool; whether a bias term is used;
kernel_initializer: the initializer of the weight kernel;
bias_initializer: the initializer of the bias term, initialized to 0 by default;
kernel_regularizer: optional regularization of the weight kernel;
bias_regularizer: optional regularization of the bias term;
activity_regularizer: the regularization function applied to the output;
trainable: bool; whether the layer's parameters participate in training;
reuse: bool; whether to reuse parameters;
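The `units` and `use_bias` fields determine the layer's trainable parameter count, which matters later when resources are matched to each layer. A small sketch (the helper name is an assumption; the arithmetic is standard):

```python
# Trainable parameters implied by a fully-connected template: one weight per
# (input, unit) pair, plus an optional bias per unit.
def fc_param_count(inputs, units, use_bias=True):
    return inputs * units + (units if use_bias else 0)

fc_param_count(4096, 1000)                  # 4_097_000
fc_param_count(4096, 1000, use_bias=False)  # 4_096_000
```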
The data template configures data preprocessing information, mainly:
input_src: the storage path of the data;
output_src: the path where the model configuration is saved;
rate: the ratio of training set to test set;
count: the number of model iterations;
The parameters common to all templates are:
cpu: the required CPU resources, in millicores (m);
gpu: the required GPU resources;
memory: the required memory resources, in mebibytes (Mi);
name, step: the name of the template and its position in the execution order;
create_type: whether to create a POD or a virtual machine.
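How these common fields might be turned into a concrete resource request for a POD or virtual machine can be sketched as follows. The parsing function, field spellings, and the POD-by-default fallback are assumptions for illustration; only the units (millicores, mebibytes) come from the text.

```python
# Hypothetical parser from common template fields to a resource spec.
def parse_resources(t):
    cpu_millicores = int(t["cpu"].rstrip("m"))          # "500m" -> 500
    memory_mib = int(t["memory"].lower().rstrip("mi"))  # "512mi" -> 512
    return {
        "cpu_m": cpu_millicores,
        "gpu": t.get("gpu", 0),
        "memory_mi": memory_mib,
        "kind": t.get("create_type", "pod"),  # assumed default: create a POD
        "step": t["step"],
    }

spec = parse_resources({"name": "conv1", "step": 1, "cpu": "500m",
                        "memory": "512mi", "create_type": "vm"})
```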
The parameter count of a convolutional neural network largely determines the storage resources a module needs, such as memory and GPU memory; the computation amount determines the computing resources it needs, such as CPU and GPU. Because the modules of a convolutional neural network have a clear hierarchical relationship and differ greatly in parameter count and computation amount, their inconsistent hardware requirements waste a large amount of storage and computing resources and increase the model training cost. Abstracting each module of the network into a template addresses this under-utilization of computer hardware during training. Based on this templating approach, a set of standard templates for the modules of the convolutional neural network is provided, and each template configures its module's parameters in a declaratively defined manner.
Preferably, in POD and virtual machine creation and scheduling:
code generation is based on the segmented machine learning framework; by parsing the templates of the model's modules, the corresponding deep learning code is matched from the framework;
the segmented machine learning framework automatically generates, from the parsed templates, the corresponding machine learning code segments with communication capability; the generated code of each module runs in a POD or virtual machine as a program; for convenience of later deployment, the method also supports automatically generating a machine learning model suitable for single-machine deployment from the imported templates;
POD and virtual machine creation instantiates the templates produced in the template definition part and provides automatic scaling services;
POD and virtual machine scheduling is realized through the scheduling strategy in the scheduling template and the message queue.
Further, for POD and virtual machine creation, whether a POD or a virtual machine is created for each layer of the convolutional neural network can be specified by the model designer through the create_type field of the corresponding template; the cpu, gpu and memory fields can likewise be set to configure the computing and storage resources of the POD or virtual machine. If no field value is configured, PODs or virtual machines matched to each layer's computation amount and parameter count are created layer by layer, starting from the first layer, according to a preset rule, so that the corresponding computing and storage resources satisfy the current module;
the matching rule is based on the following principles:
the theoretical GPU peak is the number of GPU chips x GPU Boost clock frequency x core count x floating-point operations per clock cycle;
the CPU single-cycle double-precision floating-point capability is the number of FMA units x 2 x 512/64, where an FMA unit is a fused floating-point vector multiply-add unit whose default count is 2;
the CPU single-cycle single-precision floating-point capability is the number of FMA units x 2 x 512/32;
parameter values are generally of float type, each occupying 4 bytes, so the storage resource estimate is 4 x parameter count x a, where a is a tuning coefficient;
the computing resource estimate is computation amount x b, where b is a tuning coefficient;
the computing and storage resources required by each layer are estimated from that layer's parameter count and computation amount;
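A sketch of this per-layer estimation for a convolutional layer follows. The parameter and multiply-add formulas are the standard ones; the helper names and the example values of the tuning coefficients a and b are assumptions, since the document leaves them open.

```python
# Per-layer resource estimation following the matching rules above.
def conv_params(in_c, out_c, k, groups=1, bias=True):
    # weights per group times output channels, plus optional biases
    return (in_c // groups) * k * k * out_c + (out_c if bias else 0)

def conv_flops(in_c, out_c, k, out_h, out_w, groups=1):
    # one multiply-add per kernel element per output position
    return (in_c // groups) * k * k * out_c * out_h * out_w

def estimate(in_c, out_c, k, out_h, out_w, a=1.2, b=1.2):
    p = conv_params(in_c, out_c, k)
    return {"storage_bytes": 4 * p * a,                 # 4 bytes per float param
            "compute_flops": conv_flops(in_c, out_c, k, out_h, out_w) * b}

conv_params(3, 64, 3)            # 3*3*3*64 + 64 = 1792
conv_flops(3, 64, 3, 32, 32)     # 1728 * 32 * 32 = 1769472
```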
the created POD or virtual machine mainly provides a reliable running environment for each module's code program. If a POD is created, the program runs in the POD; if a virtual machine is created, the creation and scheduling module copies the module's code into the virtual machine through an operations tool such as Ansible or SaltStack and starts the program as a process.
Furthermore, the scheduling order of the PODs and virtual machines is realized through the scheduling strategy in the scheduling template and the message queue. The scheduling template is generated automatically by the creation and scheduling module after all PODs and virtual machines required by the model have been created successfully; it records the IP addresses for accessing each POD and virtual machine, their scheduling order, and the number of model iterations. After the scheduling template is generated, the creation and scheduling module copies it into the configuration files of all PODs and virtual machines created for the model, so that every POD and virtual machine knows the transmission flow of the data;
the specific scheduling steps are as follows:
1) according to the scheduling strategy in the scheduling template, first start the data module, load the corresponding data, and preprocess it;
2) after a layer finishes processing its data, send the scheduling-strategy information together with the processed data to the message queue;
3) the other layers periodically poll the messages in the queue; if a message belongs to a layer, that layer pulls it down and processes the data according to its own data processing flow; when processing finishes, repeat step 2) until the scheduling strategy has been fully executed;
when model training finishes, the model's configuration information is saved to the path specified by the output_src field of the data template, providing configuration parameters for later deployment;
when a layer's computing or storage resources prove insufficient or excessive during training, the POD or virtual machine is automatically scaled according to the specified rule.
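The scheduling steps above can be simulated in one process. This toy is an assumption-laden sketch: a deque stands in for the message broker, the layer order comes from a hand-written "scheduling template", and the handlers are trivial stand-ins for real layers.

```python
# Toy simulation of the scheduling loop: each message is tagged with the layer
# it is addressed to; layers poll the shared queue and pull only their own.
from collections import deque

queue = deque()
order = ["data", "conv1", "pool1"]               # from the scheduling template
handlers = {
    "data":  lambda x: [1.0, 2.0, 3.0],          # step 1: load + preprocess
    "conv1": lambda x: [v * 2 for v in x],
    "pool1": lambda x: [max(x)],
}

def run_one_iteration():
    queue.append({"to": order[0], "data": None})  # start the data module
    result = None
    while queue:
        msg = queue.popleft()                     # step 3: poll and pull
        layer = msg["to"]
        out = handlers[layer](msg["data"])
        nxt = order.index(layer) + 1
        if nxt < len(order):                      # step 2: forward downstream
            queue.append({"to": order[nxt], "data": out})
        else:
            result = out                          # last layer: iteration done
    return result

run_one_iteration()   # [6.0]
```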
The method supports flexible combinations of creation and scheduling: POD creation with POD scheduling, virtual machine creation with virtual machine scheduling, and mixed POD/virtual machine creation with mixed scheduling.
And creating corresponding PODs or virtual machines according to a template abstracted from the convolutional neural network model, scheduling the PODs or the virtual machines according to a scheduling sequence defined in the scheduling template, and realizing communication between the PODs and the virtual machines based on a message queue.
Preferably, the monitoring platform's POD and virtual machine performance monitoring mainly covers CPU, GPU, memory and network traffic;
monitoring of the model training state refers to monitoring the training process: each layer's communication module sends related log information to the monitoring platform after each round of data processing; the platform sorts the collected per-layer logs by time and iteration count and dynamically displays the training state to the model's trainer. Each log transmitted by a layer contains the following information:
name: the template name of the current layer;
data_from: the layer the received data came from;
receive_data: the received data (optional);
create_time: the time at which the data was received;
end_time: the time at which the data was sent to the next layer;
to: the next layer the data is sent to;
send_data: the data sent to the next layer (optional);
message: other information, such as error reports.
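A log record with these fields might be assembled as below; the helper function and the sample timestamp values are illustrative assumptions, while the field names follow the list above.

```python
# Hypothetical constructor for the per-transmission log record the monitoring
# platform sorts by time and iteration count.
def make_log(name, data_from, to, create_time, end_time,
             receive_data=None, send_data=None, message=""):
    return {
        "name": name,                 # template name of the current layer
        "data_from": data_from,       # layer the received data came from
        "receive_data": receive_data, # optional
        "create_time": create_time,   # when the data was received
        "end_time": end_time,         # when it was sent onward
        "to": to,                     # next layer
        "send_data": send_data,       # optional
        "message": message,           # e.g. error details
    }

log = make_log("conv1", "data", "pool1", 100.0, 100.5)
```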
The monitoring platform can monitor the performance of all created PODs and virtual machines and the training state of the model, and ensures that the model can be trained according to the scheduling sequence defined in the scheduling template.
The invention also claims a convolutional neural network layered training system based on containerization and virtualization, comprising a segmented machine learning framework, abstract modular templates, a POD and virtual machine creation and scheduling module, and a monitoring platform;
the system realizes the convolutional neural network layered training method based on containerization and virtualization.
The invention also claims a convolutional neural network layered training device based on containerization and virtualization, which comprises: at least one memory and at least one processor;
the at least one memory to store a machine readable program;
the at least one processor is used for calling the machine readable program to execute the convolutional neural network layered training method based on containerization and virtualization.
The present invention also claims a computer readable medium having stored thereon computer instructions that, when executed by a processor, cause the processor to perform the containerization and virtualization-based convolutional neural network layered training method described above.
Compared with the prior art, the convolutional neural network layered training method and system based on containerization and virtualization have the following beneficial effects:
compared with the traditional convolutional neural network model training method, the method has great benefits for designers of the model and providers of cloud platform services:
for a model designer, the model is designed by adopting the method, related model codes do not need to be compiled, and only corresponding parameters are filled in the declarative template according to the type of the template, so that the design and development process of the model is greatly simplified; the model training cost can be saved to the maximum extent by adopting the method to train the model;
for a cloud service provider, the method provides combined service for creating the corresponding POD and the virtual machine for the user, and can improve the use efficiency of equipment of the cloud service provider.
Drawings
Fig. 1 is a schematic structural diagram of a convolutional neural network layered training method based on containerization and virtualization according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a segmented machine learning framework provided by embodiments of the present invention;
fig. 3 is a diagram illustrating the amount of parameters and the amount of calculation provided by the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
The embodiment of the invention provides a convolutional neural network layered training method based on containerization and virtualization, comprising a segmented machine learning framework, abstract modular templates, POD and virtual machine creation and scheduling, and a monitoring platform, where the POD and virtual machine creation and scheduling is carried out on the basis of the abstract modular templates;
the segmented machine learning framework provides machine learning programs with communication capability and generates complete machine learning model code from the templates; it packages each machine learning module, in modular form, as a module with communication capability, the modules can be flexibly combined into a complete machine learning network, and a machine learning model suitable for single-machine deployment is generated automatically from the imported templates;
the abstract modular templates provide a standard template for each module of the convolutional neural network, based on the layered character of the network structure; each layer's structure is defined declaratively, and the declaration specifies the module type and related configuration according to the template type, for example: resource type (virtual machine or POD), computing resource scale (memory, CPU, GPU, bandwidth), scheduling order, and so on;
POD and virtual machine creation and scheduling exploits the different computing and storage resources required by the modules, creating the corresponding POD or virtual machine from the abstracted template information; it covers machine learning code generation, POD and virtual machine creation, and POD and virtual machine scheduling;
the monitoring platform covers POD and virtual machine performance monitoring and model training process monitoring; it collects POD and virtual machine monitoring data from the communication modules of each layer and aggregates the data into its data dashboard.
According to the hierarchical relationship among the modules of the convolutional neural network, the method designs a segmented machine learning framework and a set of templates abstracting the modules of the network, provides a way to create PODs or virtual machines whose computing and storage resources match those required by the abstracted modules, and builds a platform for creating, scheduling and monitoring the PODs and virtual machines.
The segmented machine learning framework encapsulates each machine learning module as a module with communication capability, so that the modules can run on different machines, achieving segmented training of the model.
Because the computing resources and the storage resources required by each module are different, the method has a mechanism for flexibly creating the POD and the virtual machine with the best performance. The service developer can define the computing resource, the storage resource, the POD or the virtual machine; or the system possesses certain computing resources according to the characteristics of the template, and stores POD or virtual machine of the resources.
The method supports flexible combinations of creation and scheduling: POD creation with POD scheduling; virtual machine creation with virtual machine scheduling; and mixed POD/virtual machine creation with mixed scheduling.
Corresponding PODs or virtual machines are created according to the templates abstracted from the convolutional neural network model, the PODs and virtual machines are scheduled in the order defined in the scheduling template, and communication between them is realized through a message queue.
The monitoring platform can monitor the performance of all created PODs and virtual machines and the training state of the model, and ensure that the model can be trained according to the scheduling sequence defined in the scheduling template.
The specific technical scheme is realized as follows:
First, segmented machine learning framework
The segmented machine learning framework comprises a code module library and a communication module. Its main functions are to provide machine learning programs with communication capability and to generate complete machine learning model code from the templates; in essence, the framework packages the code of each machine learning module into a separate HTTP service.
The code module library mainly defines the code of each machine learning module (such as the convolutional layer, pooling layer, fully-connected layer and the like).
The communication module mainly provides data transmission capability for each module of the model, such as pushing data to the message queue, pulling data from the message queue, and feeding real-time data back to the monitoring platform.
A working schematic of the segmented machine learning framework is shown in fig. 2.
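The push/pull behaviour of the communication module can be sketched in Python. This is an illustrative sketch only: the class and method names are not from the patent, and an in-memory `queue.Queue` plus a plain list stand in for the real message queue and monitoring platform.

```python
import queue

class CommunicationModule:
    """Wraps a machine-learning module so it can exchange data with other
    layers through a message queue (sketch: an in-memory queue stands in
    for the real broker, a list for the monitoring platform)."""

    def __init__(self, layer_name, mq, monitor_log):
        self.layer_name = layer_name
        self.mq = mq                    # shared message queue
        self.monitor_log = monitor_log  # stands in for the monitoring platform

    def send(self, to_layer, data):
        # Transmit processed data to the message queue, tagged with its destination.
        self.mq.put({"to": to_layer, "data": data})

    def poll(self):
        # Pull only the first message addressed to this layer; requeue the rest.
        pending, msg_for_me = [], None
        while not self.mq.empty():
            msg = self.mq.get()
            if msg_for_me is None and msg["to"] == self.layer_name:
                msg_for_me = msg
            else:
                pending.append(msg)
        for msg in pending:
            self.mq.put(msg)
        return msg_for_me

    def report(self, info):
        # Feed real-time status back to the monitoring platform.
        self.monitor_log.append({"name": self.layer_name, **info})
```

Each layer's program would wrap its computation between a `poll` and a `send`, and call `report` after every processing step.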
Second, abstract modular template definition
The hierarchical relationship among the modules of a convolutional neural network is obvious, and the parameter counts and computation amounts of the modules differ greatly; a comparison of the parameter count and computation amount of each layer of a conventional convolutional neural network is shown in figure 3. The parameter count largely determines the storage resources, such as memory and video memory, required by a module, while the computation amount determines the computing resources, such as CPU and GPU. The inconsistent hardware requirements of the modules cause a large amount of storage and computing resources to be wasted and increase the model training cost. Abstracting each module of the convolutional neural network into a template addresses the problem that computer hardware resources are not fully utilized during model training.
Based on this approach of abstracting the modules of the convolutional neural network into templates, the method provides a set of standard sample templates for the modules. Each template configures the parameters of its module in a declarative manner. The templates mainly comprise:
convolutional layer templates (convolutional.yaml), batch normalization templates (bn.yaml), activation templates (activation.yaml), pooling templates (pooling.yaml), discard templates (dropout.yaml), fully-connected templates (fc.yaml), data templates (data.yaml), scheduling templates (scheduling.yaml), and the like.
The scheduling template is automatically generated based on other templates, and the main usage of the other templates is as follows:
the convolutional layer template is mainly used for configuring parameters of a convolutional neural network convolutional layer, and mainly comprises the following parameters:
in_channels: the number of input channels;
out_channels: the number of output channels;
kernel_size: the size of the convolution kernel; if the parameter is an integer n, the kernel size is n × n;
stride: step size, representing the step of movement in the convolution process, 1 by default; the convolution kernel generally moves over the input image from left to right and from top to bottom; if the parameter is a single integer, that step is used in both the horizontal and vertical directions; if the parameter is a tuple such as stride=(2, 1), the first value 2 is the step along the height (h) and the second value 1 is the step along the width (w);
padding: padding, all-zero padding by default;
dilation: dilation; in the usual case the convolution kernel and the region of the input image it acts on have the same size, e.g. a 3 × 3 kernel acts on a 3 × 3 region each time, in which case the dilation is 0;
groups: grouping of the input channels; if groups is 1, the input is one group and the output is one group; if groups is 2, the input is divided into two groups and the output is correspondingly two groups; in_channels and out_channels must both be divisible by groups;
bias: the bias parameter, of bool type; when bias is True, a bias parameter b learned during backpropagation is applied;
padding_mode: the padding mode; padding_mode='zeros' means zero padding.
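The arithmetic implied by these parameters can be illustrated with a small helper that computes the output size of a convolutional layer along one dimension. This is the standard convolution size formula, given as an illustrative sketch rather than part of the patent; note it uses the common convention that dilation=1 denotes an undilated kernel (where the text above writes dilation 0).

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0, dilation=1):
    """Output spatial size of a convolution along one dimension, from the
    parameters a convolutional-layer template declares (kernel_size,
    stride, padding, dilation). Standard formula; dilation=1 here means
    an undilated kernel."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (in_size + 2 * padding - effective_kernel) // stride + 1
```

For example, a 3 × 3 kernel with padding 1 and stride 1 preserves a 28-pixel input, while stride 2 roughly halves it.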
The batch standardization template is mainly used for configuring parameters of a batch standardization layer of the convolutional neural network, and mainly comprises the following parameters:
name-step: for defining the name of the template and the order in which the steps are performed by the template.
The activation template is mainly used for configuring parameters of an activation layer of the convolutional neural network, and the activation layer template mainly comprises the following parameters:
type: the type of activation function; commonly used types are sigmoid, tanh, relu, etc.
The pooling template is mainly used for configuring parameters of a pooling layer of the convolutional neural network, and mainly comprises the following parameters:
kernel_size: the size of the pooling window;
stride: the step size of the pooling window's movement; the default value is kernel_size;
padding: the number of zero-padding layers added to each input edge;
dilation: a parameter controlling the stride of elements within the window;
return_indices: if True, the indices of the maximum values are returned along with the outputs;
ceil_mode: if True, the output size is computed with ceiling (rounding up) instead of the default floor (rounding down).
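A corresponding sketch for the pooling parameters shows how stride defaults to kernel_size and how ceil_mode switches the final rounding. The formula is the standard pooling size computation, not taken from the patent text; names mirror the template fields.

```python
import math

def pool_output_size(in_size, kernel_size, stride=None, padding=0,
                     dilation=1, ceil_mode=False):
    """Output size of a pooling layer along one dimension; stride defaults
    to kernel_size, and ceil_mode switches the division from floor to
    ceiling, as the template parameters describe."""
    if stride is None:
        stride = kernel_size          # default value is kernel_size
    effective_kernel = dilation * (kernel_size - 1) + 1
    num = in_size + 2 * padding - effective_kernel
    if ceil_mode:
        return math.ceil(num / stride) + 1
    return num // stride + 1
```

A 2 × 2 window over a 7-pixel input yields 3 outputs with the default floor, but 4 with ceil_mode, because the last partial window is kept.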
The abandon template is mainly used for configuring parameters of a abandon layer of the convolutional neural network, and the abandon layer template mainly comprises the following parameters:
and (3) Rate: probability of rejection.
The full-connection template is mainly used for configuring parameters of a full-connection layer of the convolutional neural network, and mainly comprises the following parameters:
inputs: the input data;
units: the number of neural units in the layer;
activation: the activation function;
use_bias: bool type, whether a bias term is used;
kernel_initializer: the initializer of the convolution kernel;
bias_initializer: the initializer of the bias term, initialized to 0 by default;
kernel_regularizer: optional regularization of the convolution kernel;
bias_regularizer: optional regularization of the bias term;
activity_regularizer: the regularization function applied to the output;
trainable: bool type, whether the parameters of this layer participate in training;
reuse: bool type, whether to reuse the parameters.
The data template is mainly used for configuring data preprocessing information, and the data layer template mainly comprises the following parameters:
input_src: the storage path of the data;
output_src: the path where the model configuration is saved;
rate: the proportion of the training set to the test set;
count: the number of model iterations.
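As an illustrative sketch (assuming rate denotes the fraction of samples assigned to the training set, which the text does not state exactly), the data template's split could be applied like this:

```python
def split_dataset(samples, rate):
    """Split samples into training and test sets according to the `rate`
    field of the data template (assumed here to be the training fraction)."""
    cut = int(len(samples) * rate)
    return samples[:cut], samples[cut:]
```

With rate=0.8 and ten samples, eight go to the training set and two to the test set.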
The parameters common to the templates are as follows:
cpu: defines the required CPU resources, in units of m (millicores);
gpu: defines the required GPU resources;
memory: defines the required memory resources, in units of mi (mebibytes);
name-step: defines the name of the template and the order in which the template's step is executed;
create_type: defines whether a POD or a virtual machine is created.
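A hypothetical parser for these common fields might look as follows. The `name-step` encoding ("conv1-1"), the `m`/`mi` suffix handling, and the default `create_type` of "pod" are all assumptions for illustration; the text does not pin down the exact syntax.

```python
def parse_common_fields(template):
    """Read the fields shared by all templates: cpu in millicores (m),
    memory in mebibytes (mi), create_type selecting POD or virtual
    machine. The 'name-step' format 'name-N' is an assumed encoding."""
    name, _, step = template["name-step"].partition("-")
    return {
        "name": name,
        "step": int(step),
        "cpu_millicores": int(template["cpu"].rstrip("m")),
        "gpu": int(template.get("gpu", 0)),
        "memory_mib": int(template["memory"].rstrip("mi")),
        "create_type": template.get("create_type", "pod"),  # assumed default
    }
```

The returned dictionary is what a creation module could hand to the container or virtualization layer.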
Third, POD and virtual machine creation and scheduling
This part comprises code generation, POD and virtual machine creation, and POD and virtual machine scheduling. Code generation is based on the segmented machine learning framework: the templates of the model's modules are parsed and matched to the corresponding deep learning code in the framework. POD and virtual machine creation is based on the templates defined in the template definition part and provides automatic scaling. POD and virtual machine scheduling is realized through the scheduling policy in the scheduling template and the message queue.
Code generation:
the segmented machine learning framework automatically generates the corresponding machine learning code segments with communication capability by parsing the templates of the model's modules. The generated code of each module runs as a program in a POD or virtual machine. For convenience of later deployment, the method also supports automatically generating a machine learning model suitable for single-machine deployment from the imported templates.
POD and virtual machine creation:
the method provides a mechanism for flexibly creating PODs and virtual machines. Whether each layer of the convolutional neural network is created as a POD or a virtual machine can be specified by the model designer through the create_type field of the corresponding template; the computing and storage resources required by the POD or virtual machine can be configured by setting the cpu, gpu and memory fields. If no field values are configured, the method creates, layer by layer from the first layer and according to preset rules, PODs or virtual machines whose capacity matches the computation amount and parameter count of each layer, so that the computing and storage resources meet the requirements of the current module.
The principle of the matching rule is as follows:
the theoretical peak of the GPU = number of GPU chips × GPU Boost clock frequency × number of cores × floating-point operations that can be processed per clock cycle;
the single-cycle double-precision floating-point capability of the CPU = number of FMA units × 2 × 512/64, where FMA refers to the fused multiply-add unit for floating-point vectors and its default number is 2;
the single-cycle single-precision floating-point capability of the CPU = number of FMA units × 2 × 512/32;
parameter values are generally of float type and occupy 4 bytes each, so the storage resource = 4 × parameter count × a, where a is a tuning coefficient;
the computing resource = computation amount × b, where b is a tuning coefficient;
and estimating the computing resources and storage resources required by each layer based on the parameters and the calculated amount of each layer.
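The matching rules can be sketched numerically. The default values of the tuning coefficients a and b, and treating the CPU formula per core with 512-bit vector units, are illustrative assumptions, not values given in the text.

```python
def estimate_resources(param_count, flops, a=1.2, b=1.0):
    """Estimate per-layer resources from the matching rules:
    storage = 4 bytes per float parameter x tuning coefficient a;
    compute = computation amount x tuning coefficient b.
    The defaults for a and b are illustrative."""
    storage_bytes = 4 * param_count * a
    compute_flops = flops * b
    return storage_bytes, compute_flops

def cpu_peak_flops_per_cycle(fma_units=2, precision_bits=64):
    """Single-cycle floating-point capability of one core:
    FMA units x 2 (multiply + add) x 512-bit vector width / precision,
    matching the CPU formulas above (512/64 double, 512/32 single)."""
    return fma_units * 2 * 512 // precision_bits
```

With the default two FMA units this gives 32 double-precision or 64 single-precision operations per cycle, which a creation module could compare against each layer's estimated demand.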
The main function of a POD or virtual machine created by the method is to provide a reliable running environment for the code of each module. If a POD is created, the program runs in the POD; if a virtual machine is created, the creation and scheduling module copies the module's code into the virtual machine through operation and maintenance tools such as Ansible or SaltStack and starts the program as a process.
POD and virtual machine scheduling:
the scheduling order of the PODs and virtual machines created by the method is realized through the scheduling policy in the scheduling template and the message queue. The scheduling template is generated automatically by the creation and scheduling module after all PODs and virtual machines required by the model have been created successfully; it records the IP addresses for accessing each POD and virtual machine, their scheduling order, the number of model iterations, and other information. After the scheduling template is generated, the creation and scheduling module copies it into the configuration files of all PODs and virtual machines created for the model, so that each POD and virtual machine knows where to forward its data.
The specific scheduling steps are as follows:
1) according to the scheduling policy in the scheduling template, the data module is started first to load and preprocess the corresponding data;
2) after this layer finishes processing the data, the scheduling-policy information and the processed data are sent to the message queue;
3) the other layers periodically poll the messages in the message queue, pull the messages belonging to them, process the data according to their own data processing flow, and repeat step 2) until the scheduling policy has been executed completely.
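The scheduling steps above can be sketched with an in-memory queue standing in for the real message queue; the layer names and functions are illustrative.

```python
import queue

def run_schedule(schedule, layer_fns, data):
    """Execute layers in the order a scheduling template defines, passing
    intermediate results through a message queue (sketch: an in-memory
    queue stands in for the real broker)."""
    mq = queue.Queue()
    # 1) The data module starts first, loads and preprocesses the data.
    first = schedule[0]
    mq.put({"to": schedule[1], "data": layer_fns[first](data)})
    # 2)-3) Each later layer pulls the message addressed to it, processes
    # the data, and forwards the result to the next layer in the schedule.
    for i, layer in enumerate(schedule[1:], start=1):
        msg = mq.get()
        assert msg["to"] == layer       # message is addressed to this layer
        result = layer_fns[layer](msg["data"])
        if i + 1 < len(schedule):
            mq.put({"to": schedule[i + 1], "data": result})
    return result
```

A real deployment would run each iteration of the loop in a separate POD or virtual machine; here the loop simply plays each layer in turn.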
When the model training is finished, the configuration information of the model is stored in a path specified by an output _ src field of the data template so as to provide configuration parameters for later model deployment.
When the computing or storage resources of a layer are insufficient or excessive during model training, the method automatically scales the POD or virtual machine according to the specified rules.
Fourth, monitoring platform
The monitoring platform part comprises POD and virtual machine performance monitoring and model training state monitoring. The monitoring data of the PODs and virtual machines come from the communication module of each layer, and the communication modules aggregate the data into the data dashboard of the monitoring platform.
The performance monitoring of the PODs and virtual machines mainly covers CPU, GPU, memory and traffic monitoring.
The monitoring of the model training state refers to monitoring the model training process. After the communication module of each layer processes data once, it sends the related log information to the monitoring platform; the platform sorts the collected per-layer logs by time and iteration count and dynamically displays the training state to the model's trainer. Each log a layer sends contains the following information:
name: the template name of the current layer;
data_from: which layer the received data came from;
receive_data: the received data (optional);
create_time: the time at which the data was received;
end_time: the time at which the data was sent to the next layer;
to: the next layer the data is sent to;
send_data: the data sent to the next layer (optional);
message: other information, such as error reports.
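A log record with these fields might be assembled as follows. This is an illustrative sketch: the timestamp handling and JSON encoding are assumptions, only the field names come from the text.

```python
import json
import time

def make_layer_log(name, data_from, to, receive_data=None, send_data=None,
                   message=""):
    """Build the per-transmission log record each layer's communication
    module sends to the monitoring platform (field names from the text;
    encoding and timestamps are assumed)."""
    now = time.time()
    return json.dumps({
        "name": name,                 # template name of the current layer
        "data_from": data_from,       # which layer the data came from
        "receive_data": receive_data, # optional
        "create_time": now,           # time the data was received
        "end_time": now,              # time the data was sent on (stub)
        "to": to,                     # next layer
        "send_data": send_data,       # optional
        "message": message,           # other info, e.g. error reports
    })
```

The monitoring platform could then sort these records by create_time and iteration to reconstruct the training timeline.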
The embodiment of the invention also provides a convolutional neural network layered training system based on containerization and virtualization, which comprises a segmented machine learning framework, an abstract modular template, a POD and virtual machine creation and scheduling module, and a monitoring platform,
the system realizes the convolutional neural network layered training method based on containerization and virtualization in the embodiment of the invention.
The embodiment of the invention also provides a convolutional neural network layered training device based on containerization and virtualization, which comprises: at least one memory and at least one processor;
the at least one memory to store a machine readable program;
the at least one processor is configured to invoke the machine readable program to execute the convolutional neural network hierarchical training method based on containerization and virtualization according to the above embodiment of the present invention.
The embodiment of the present invention further provides a computer readable medium storing computer instructions which, when executed by a processor, cause the processor to execute the containerization and virtualization-based convolutional neural network layered training method of the above embodiment. Specifically, a system or apparatus equipped with a storage medium on which software program code realizing the functions of any of the above embodiments is stored may be provided, and a computer (or CPU or MPU) of the system or apparatus reads out and executes the program code stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, optical disks (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD + RW), magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer by a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
While the invention has been shown and described in detail in the drawings and in the preferred embodiments, it is not intended to limit the invention to the embodiments disclosed, and it will be apparent to those skilled in the art that various combinations of the technical features in the embodiments described above may be used to obtain further embodiments of the invention, which are also within the scope of the invention.

Claims (10)

1. A convolutional neural network layered training method based on containerization and virtualization, characterized by comprising a segmented machine learning framework, an abstract modular template, POD and virtual machine creation and scheduling, and a monitoring platform;
the segmented machine learning framework is used for providing a machine learning program with communication capability and generating complete machine learning model codes according to the template;
the abstract modular template provides a standard template of each module of the abstract convolutional neural network based on the hierarchical characteristics of the convolutional neural network structure, and defines the structure of each layer of the convolutional neural network in a declarative definition mode;
the POD and virtual machine creation and scheduling exploits the different computing and storage resources required by the modules of the convolutional neural network and creates the corresponding POD or virtual machine from the abstracted template information; it generates the machine learning code, creates the PODs and virtual machines, and schedules them;
and the monitoring platform covers POD and virtual machine performance monitoring and model training process monitoring; the monitoring platform collects POD and virtual machine monitoring data from the communication module of each layer and aggregates the data into its data dashboard.
2. The convolutional neural network layered training method based on containerization and virtualization of claim 1, wherein the segmented machine learning framework encapsulates the codes of the machine learning modules in different HTTP services respectively; the segmented machine learning framework includes a library of code modules and a communication module,
the code module library mainly defines codes of all modules of machine learning, and comprises a convolution layer module, a pooling layer module and a full-connection layer module;
the communication module mainly provides data transmission capability for each module of the model, including pushing data to the message queue, pulling data from the message queue and feeding real-time data back to the monitoring platform.
3. The convolutional neural network layered training method based on containerization and virtualization of claim 1, wherein the abstract modular templates comprise convolutional layer templates, batch normalization templates, activation templates, pooling templates, discard templates, fully-connected templates, data templates and scheduling templates, wherein the scheduling template is automatically generated based on the other templates, and the main usage of the other templates is as follows:
the convolutional layer template is mainly used for configuring parameters of a convolutional neural network convolutional layer, and mainly comprises the following parameters:
in_channels: the number of input channels;
out_channels: the number of output channels;
kernel_size: the size of the convolution kernel; if the parameter is an integer n, the kernel size is n × n;
stride: step size, representing the step of movement in the convolution process, 1 by default; padding: padding, all-zero padding by default;
dilation: dilation; if the convolution kernel size is 3 × 3, the region it acts on in the input image each time is also 3 × 3, in which case the dilation is 0;
groups: grouping of the input channels; if groups is 1, the input is one group and the output is one group; if groups is 2, the input is divided into two groups and the output is correspondingly two groups; in_channels and out_channels must both be divisible by groups;
bias: the bias parameter, of bool type; when bias is True, a bias parameter b learned during backpropagation is applied;
padding_mode: the padding mode; padding_mode='zeros' represents zero padding;
the batch standardization template is mainly used for configuring parameters of a batch standardization layer of the convolutional neural network, and mainly comprises the following parameters:
name-step: the method comprises the steps of defining the name of a template and the sequence of executing steps of the template;
the activation template is mainly used for configuring parameters of an activation layer of the convolutional neural network, and the activation layer template mainly comprises the following parameters:
type: activation function types, commonly used activation functions include sigmoid, tanh, relu;
the pooling template is mainly used for configuring parameters of a pooling layer of the convolutional neural network, and mainly comprises the following parameters:
kernel_size: the size of the pooling window;
stride: the step size of the pooling window's movement; the default value is kernel_size;
padding: the number of zero-padding layers added to each input edge;
dilation: a parameter controlling the stride of elements within the window;
return_indices: if True, the indices of the maximum values are returned;
ceil_mode: if True, the output size is computed with ceiling (rounding up) instead of the default floor (rounding down);
the abandon template is mainly used for configuring parameters of a abandon layer of the convolutional neural network, and the abandon layer template mainly comprises the following parameters:
rate: the probability of dropping a unit;
the full-connection template is mainly used for configuring parameters of a full-connection layer of the convolutional neural network, and mainly comprises the following parameters:
inputs: the input data;
units: the number of neural units in the layer;
activation: the activation function;
use_bias: bool type, whether a bias term is used;
kernel_initializer: the initializer of the convolution kernel;
bias_initializer: the initializer of the bias term, initialized to 0 by default;
kernel_regularizer: optional regularization of the convolution kernel;
bias_regularizer: optional regularization of the bias term;
activity_regularizer: the regularization function applied to the output;
trainable: bool type, whether the parameters of this layer participate in training;
reuse: bool type, whether to reuse the parameters;
the data template is mainly used for configuring data preprocessing information, and the data layer template mainly comprises the following parameters:
input_src: the storage path of the data;
output_src: the path where the model configuration is saved;
rate: the proportion of the training set to the test set;
count: the number of model iterations;
the parameters common to the templates are as follows:
cpu: defines the required CPU resources, in units of m (millicores);
gpu: defines the required GPU resources;
memory: defines the required memory resources, in units of mi (mebibytes);
name-step: defines the name of the template and the order in which the template's step is executed;
create_type: defines whether a POD or a virtual machine is created.
4. The convolutional neural network layered training method based on containerization and virtualization of claim 1 or 2, wherein, in the POD and virtual machine creation and scheduling,
the code generation is obtained based on a sectional type machine learning framework, and corresponding deep learning codes are matched from the sectional type machine learning framework through analyzing templates of all modules of the model;
the creation of POD and virtual machine is based on the template created by the template definition part to generate and provide automatic expansion and contraction capacity service;
the scheduling of POD and virtual machine is realized based on the scheduling strategy in the scheduling template and the message queue.
5. The convolutional neural network layered training method based on containerization and virtualization of claim 4, wherein, in the POD and virtual machine creation, whether each layer of the convolutional neural network is created as a POD or a virtual machine can be specified by the model designer through the create_type field of the corresponding template; the computing and storage resources required by the POD or virtual machine can be configured by setting the cpu, gpu and memory fields; if no field values are configured, PODs or virtual machines whose capacity matches the computation amount and parameter count of each layer are created layer by layer from the first layer according to preset rules, so that the computing and storage resources meet the requirements of the current module;
the principle of the matching rule is as follows:
the theoretical peak of the GPU = number of GPU chips × GPU Boost clock frequency × number of cores × floating-point operations that can be processed per clock cycle;
the single-cycle double-precision floating-point capability of the CPU = number of FMA units × 2 × 512/64, where FMA refers to the fused multiply-add unit for floating-point vectors and its default number is 2;
the single-cycle single-precision floating-point capability of the CPU = number of FMA units × 2 × 512/32;
the storage resource = 4 × parameter count × a, where a is a tuning coefficient;
the computing resource = computation amount × b, where b is a tuning coefficient;
estimating the computing resources and storage resources required by each layer based on the parameter and the calculated amount of each layer;
if the created object is a POD, the program runs in the POD; if the virtual machine is created, the creating and scheduling module copies the code of the module into the virtual machine through a related operation and maintenance tool, and starts the program in the form of a process.
6. The convolutional neural network layered training method based on containerization and virtualization according to claim 4, wherein the scheduling order of the PODs and virtual machines is realized through the scheduling policy in the scheduling template and the message queue; the scheduling template is generated automatically by the creation and scheduling module after all PODs and virtual machines required by the model have been created successfully, and records the IP addresses for accessing each POD and virtual machine, their scheduling order, and the model iteration count; after the scheduling template is generated, the creation and scheduling module copies it into the configuration files of all PODs and virtual machines created for the model, so that each POD and virtual machine knows where to forward its data;
the specific scheduling steps are as follows:
1) according to the scheduling policy in the scheduling template, the data module is started first to load and preprocess the corresponding data;
2) after this layer finishes processing the data, the scheduling-policy information and the processed data are sent to the message queue;
3) the other layers periodically poll the messages in the message queue; if a message belongs to a layer, that layer pulls the message, processes the data according to its own data processing flow, and repeats step 2) until the scheduling policy has been executed completely;
when the model training is finished, the configuration information of the model is stored in a path specified by an output _ src field of a data template so as to provide configuration parameters for later model deployment;
and when the computing or storage resources of a layer are insufficient or excessive during model training, the POD or virtual machine is automatically scaled according to the specified rules.
7. The convolutional neural network layered training method based on containerization and virtualization of claim 4, wherein the performance monitoring of the PODs and virtual machines by the monitoring platform mainly comprises CPU, GPU, memory and traffic monitoring;
the monitoring of the model training state refers to monitoring the model training process: after the communication module of each layer processes data once, it sends the related log information to the monitoring platform; the platform sorts the collected per-layer logs by time and iteration count and dynamically displays the training state to the model's trainer; each log a layer sends contains the following information:
name: template name of current layer;
data _ from: from which layer the received data came;
receive _ data: the received data can be selected and matched as follows:
create _ time: the time at which the data was received;
end _ time: time of data transmission to the next layer;
to: sending the data to the next layer;
send _ data: data sent to the next layer is optionally matched;
message: other information.
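The per-transmission log record listed above can be given a concrete shape as follows. The field names follow the claim; the layer names, timestamps, and helper function are made up for illustration, since the patent does not specify a serialization format.

```python
# Illustrative builder for the per-transmission log record of claim 7;
# field names come from the claim, everything else is a hypothetical example.
import json

def make_layer_log(name, data_from, to, create_time, end_time,
                   receive_data=None, send_data=None, message=""):
    """Build one log entry; receive_data and send_data default to None
    because the claim marks those two fields as optional."""
    return {
        "name": name,                # template name of the current layer
        "data_from": data_from,      # layer the received data came from
        "receive_data": receive_data,
        "create_time": create_time,  # when the data was received
        "end_time": end_time,        # when the data was sent onward
        "to": to,                    # next layer the data is sent to
        "send_data": send_data,
        "message": message,          # any other information
    }

log = make_layer_log(
    name="conv-layer-1", data_from="data-module", to="pool-layer-1",
    create_time="2022-02-16T08:00:00Z", end_time="2022-02-16T08:00:03Z",
    message="iteration 3")
print(json.dumps(log, indent=2))
```

Sorting such records by create_time and an iteration counter is enough for the monitoring platform to reconstruct the training timeline per layer.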
8. A convolutional neural network layered training system based on containerization and virtualization, characterized by comprising a segmented machine learning framework, abstract modular templates, a POD and virtual machine creation and scheduling module, and a monitoring platform,
wherein the system implements the containerization and virtualization-based convolutional neural network layered training method of any one of claims 1 to 7.
9. A convolutional neural network layered training device based on containerization and virtualization, comprising: at least one memory and at least one processor;
the at least one memory configured to store a machine-readable program;
the at least one processor configured to invoke the machine-readable program to perform the containerization and virtualization-based convolutional neural network layered training method of any one of claims 1 to 7.
10. A computer-readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the containerization and virtualization-based convolutional neural network layered training method of any one of claims 1 to 7.
CN202210141065.2A 2022-02-16 2022-02-16 Convolutional neural network layered training method and system based on containerization and virtualization Pending CN114528070A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210141065.2A CN114528070A (en) 2022-02-16 2022-02-16 Convolutional neural network layered training method and system based on containerization and virtualization


Publications (1)

Publication Number Publication Date
CN114528070A true CN114528070A (en) 2022-05-24

Family

ID=81622638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210141065.2A Pending CN114528070A (en) 2022-02-16 2022-02-16 Convolutional neural network layered training method and system based on containerization and virtualization

Country Status (1)

Country Link
CN (1) CN114528070A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982110A (en) * 2023-03-21 2023-04-18 北京探境科技有限公司 File operation method and device, computer equipment and readable storage medium
CN115982110B (en) * 2023-03-21 2023-08-29 北京探境科技有限公司 File running method, file running device, computer equipment and readable storage medium
CN117909840A (en) * 2024-03-19 2024-04-19 之江实验室 Model training method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination