CN113886026B - Intelligent modeling method and system based on dynamic parameter configuration and process supervision



Publication number
CN113886026B
CN113886026B (application CN202111480477.0A)
Authority
CN
China
Prior art keywords
training
container
model
service
user
Prior art date
Legal status
Active
Application number
CN202111480477.0A
Other languages
Chinese (zh)
Other versions
CN113886026A
Inventor
徐伟民
崔隽
吴姗姗
后弘毅
郝大鑫
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202111480477.0A
Publication of CN113886026A
Application granted
Publication of CN113886026B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06F2009/45575Starting, stopping, suspending or resuming virtual machine instances

Abstract

The invention provides an intelligent modeling method and system based on dynamic parameter configuration and process supervision. For developers, a visual interface is provided to dynamically edit and adjust the parameters of the model-building code script, so that the script's run-time behavior adapts dynamically to the developer's intent and an intelligent model meeting the developer's needs is built and trained. Meanwhile, a monitoring service is started through the TensorBoard framework methods integrated into the code script; as the model-building training script runs, changes in key data and parameter indexes are recorded, providing process supervision of the intelligent modeling task, so that developers can make decisions according to how important parameters change during model training and readjust the script parameters for a second round of training. These mechanisms complement each other, enabling developers to build intelligent models efficiently and conveniently.

Description

Intelligent modeling method and system based on dynamic parameter configuration and process supervision
Technical Field
The invention belongs to the field of deep learning and neural network construction, and particularly relates to an intelligent modeling method and system based on dynamic parameter configuration and process supervision.
Background
Docker is an open-source container project implemented in the Go language, launched in early 2013 by dotCloud. Several related projects (including the Docker "three musketeers", Kubernetes, etc.) have evolved into an ecosystem around Docker containers. Docker's concept is to manage the encapsulation, distribution, deployment, and operation life cycle of applications so that an application component can be packaged once and run everywhere. The application component may be a Web application, a compiling environment, a set of database platform services, or even an operating system or cluster. Built on a number of open-source technologies on the Linux platform, Docker provides an efficient, agile, and lightweight container scheme and supports deployment to local environments and many mainstream cloud platforms.
Kubernetes (K8s) is an open-source container cluster management system from Google. Its main functions include: container-based application deployment, maintenance, and rolling upgrades; load balancing and service discovery; cross-machine and cross-region cluster scheduling; automatic scale-out and scale-in; stateless and stateful services; extensive Volume support; and a plug-in mechanism that ensures extensibility. The two most central design concepts of the Kubernetes system are fault tolerance and scalability. Fault tolerance is the basis for the stability and safety of the K8s system, and scalability is the basis for K8s being friendly to change and able to rapidly iterate new functions.
Deep Learning is a machine learning paradigm that has emerged in recent years. It uses a multi-layer (deep) neural network structure to learn, from big data, representations of things in the real world (such as objects in images or sounds in audio) that can be used directly for computer computation, and it is considered a possible "brain structure" for intelligent machines. In research fields such as speech recognition, image recognition, and natural language processing, deep learning has set off a wave of enthusiasm. On some long-standing problems it has overturned methods that were in use for years; on other frontier problems it differs entirely from the previously popular methods and shows remarkable improvements. Neural networks are parameterized functions that can be fitted to different target functions by adjusting their parameters, and machine learning is the process of letting a computer automatically adjust a function's parameters to fit the desired function. Multiple parameterized functions can be nested to form a multi-layer neural network, which can better fit the functions needed in practical problems. The essence of deep learning is machine learning with a multi-layer neural network.
TensorFlow is a symbolic mathematical system based on dataflow programming, applied to the programming implementation of machine learning algorithms. It has a multi-level architecture, can be deployed on various servers, PC terminals, and web pages, supports high-performance numerical computation on GPUs and TPUs, and is widely used in intelligent product development and scientific research in many fields.
For most people, a deep neural network is like a black box: its internal organization, structure, and training process are difficult to understand, which greatly hinders both understanding the principles of deep neural networks and engineering with them. In particular, during intelligent model training there is a lack of intuitive perception and understanding of the changes and progress the network makes. TensorBoard is a visualization tool built into TensorFlow that makes TensorFlow programs easier to understand, run, and debug by visually displaying the log file information they output. However, TensorBoard visualization depends on a log file output while a TensorFlow program runs, and the log file is normally obtained only after the program has finished; the current process and the whole model training process are thus split from each other, and real-time supervision of the model training process cannot be achieved. Therefore, a method that can dynamically configure parameters and visually supervise the whole intelligent modeling process can effectively overcome the defects and shortcomings in existing modeling work and improve modeling efficiency and completeness.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide an intelligent modeling method and system based on dynamic parameter configuration and process supervision. Based on an input-output interactive visual interface, the parameters of the model-building code script are dynamically edited and adjusted during intelligent model building, so that the parameters take effect in the code script efficiently, the script's run-time behavior adapts dynamically to the developer's intent, and an intelligent model meeting the developer's needs is built and trained. The intelligent modeling task can also be supervised in process, so that developers can make decisions according to how important parameters change during model training.
The technical scheme is as follows: in order to achieve the above object, the invention provides an intelligent modeling method based on dynamic parameter configuration and process supervision, which mainly comprises the following steps:
receiving, through an interactive interface, the basic information of a modeling task, a training algorithm, a training data set, and training container resources configured by a user; the training algorithm is associated with a code script and a parameter configuration list, and the user configures the adjustable parameters associated with the selected training algorithm through the interface;
acquiring, according to the training algorithm selected by the user, the related container image, the code script storage path, the in-container mount path, the training data set mount path, the model output path, and the command for running the algorithm in the container; if the training algorithm has a corresponding pre-trained model, also acquiring the storage path and in-container mount path of the pre-trained model;
calling the method for starting a container task in the API (application programming interface) of the container cluster management system, setting the start image and start command, mounting the code script, training data set, and pre-trained model on the corresponding mount paths, setting the CPU, GPU, and memory resources used by the container at run time, setting the model output path, and starting the training container; during model training, logging data points ("dotting") in real time through the TensorBoard framework and recording how the model training parameters change;
calling the method for starting a container service in the API of the container cluster management system, setting the TensorBoard container image, setting the working directory of the TensorBoard service container, mounting the directory where the training container's run data is stored onto the working directory of the TensorBoard service container, starting the container service, and providing a callable service address to the outside;
and providing the service address to the user; when the user accesses the service address, a visual chart of the changes in operating parameters during model training is displayed.
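The steps above can be sketched as a Kubernetes Pod specification assembled by the back-end service. The following is an illustrative sketch only, not the patent's actual implementation: every image name, NFS server, path, and resource figure is a hypothetical placeholder.

```python
# Illustrative sketch of the Pod specification a back-end service might
# submit to the Kubernetes API to start the training container.
# All image names, server names, paths, and resource figures are
# hypothetical assumptions, not taken from the patent.

def build_training_pod(task_name, image, run_cmd, script_path, dataset_path,
                       model_out_path, cpu, gpu, memory, pretrained_path=None):
    """Assemble a Kubernetes Pod manifest (as a plain dict) for one training task."""
    volume_mounts = [
        {"name": "code", "mountPath": "/workspace/code"},     # code script mount path
        {"name": "dataset", "mountPath": "/workspace/data"},  # training data set mount path
        {"name": "output", "mountPath": model_out_path},      # model output path
    ]
    volumes = [
        {"name": "code", "nfs": {"server": "nfs.example.local", "path": script_path}},
        {"name": "dataset", "nfs": {"server": "nfs.example.local", "path": dataset_path}},
        {"name": "output", "nfs": {"server": "nfs.example.local", "path": model_out_path}},
    ]
    if pretrained_path:  # optional pre-trained model mount
        volume_mounts.append({"name": "pretrained", "mountPath": "/workspace/pretrained"})
        volumes.append({"name": "pretrained",
                        "nfs": {"server": "nfs.example.local", "path": pretrained_path}})
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": task_name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "train",
                "image": image,              # start image
                "command": run_cmd,          # start command for the algorithm
                "volumeMounts": volume_mounts,
                # CPU, GPU, and memory resources used by the container at run time
                "resources": {"limits": {"cpu": cpu, "memory": memory,
                                         "nvidia.com/gpu": gpu}},
            }],
            "volumes": volumes,
        },
    }

pod = build_training_pod(
    task_name="modeling-task-1", image="tensorflow/tensorflow:2.8.0-gpu",
    run_cmd=["python", "/workspace/code/train.py"],
    script_path="/nfs/scripts/task1", dataset_path="/nfs/datasets/task1",
    model_out_path="/workspace/output", cpu="4", gpu="1", memory="16Gi",
    pretrained_path="/nfs/pretrained/task1")
```

In a real deployment this dict would be submitted via the cluster's "start container task" API method; here it is only built and inspected.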
Preferably, the code script, the training data set, and the pre-trained model are stored in an NFS file system, with storage locations recorded in a database table; the database also stores the parameter configuration lists and default values of the code scripts.
Preferably, a Kubernetes-managed container cluster environment is adopted: when a training container is created, the Kubernetes cluster performs unified scheduling and resource allocation and monitors the running state and progress of the training container; when training finishes, the Kubernetes cluster removes the container and reclaims its resources.
Preferably, the user interaction interface uses the Vue front-end development framework, and when the user configures the training container resources, the back-end service queries the remaining available CPU, GPU, and memory resources from the container cluster management system for the user to choose from.
Preferably, the modeling task is started by the user with one click, and real-time visual supervision of the modeling task's training process is realized; after model training finishes or the user deletes the task, the resources occupied by the training container are reclaimed automatically.
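The "remaining available resources" query in the preferred embodiment above can be sketched as a simple aggregation over per-node capacity and currently requested amounts. This is a hedged sketch of the arithmetic only; the real figures would come from the container cluster management system, and the field names here are invented for illustration.

```python
# Hypothetical sketch: compute cluster-wide remaining CPU cores, GPUs, and
# memory (GiB) that the interface can offer the user. The node records and
# field names are illustrative assumptions, not a real cluster API.

def remaining_resources(nodes):
    """Sum (capacity - requested) over all nodes for each resource kind."""
    totals = {"cpu": 0.0, "gpu": 0, "memory_gib": 0.0}
    for node in nodes:
        for key in totals:
            totals[key] += node["capacity"][key] - node["requested"][key]
    return totals

free = remaining_resources([
    {"capacity": {"cpu": 32, "gpu": 4, "memory_gib": 128},
     "requested": {"cpu": 20, "gpu": 3, "memory_gib": 96}},
    {"capacity": {"cpu": 16, "gpu": 2, "memory_gib": 64},
     "requested": {"cpu": 4, "gpu": 0, "memory_gib": 16}},
])
# free: 24.0 CPU cores, 3 GPUs, 80.0 GiB memory remaining
```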
Based on the same inventive concept, the intelligent modeling system based on dynamic parameter configuration and process supervision provided by the invention mainly comprises the following parts: a front-end interactive interface, a back-end service, a container cluster management system, a file system, a code script management module, and a TensorBoard monitoring service.
The front-end interactive interface is used by the user to configure the basic information of a modeling task, the training algorithm, the training data set, and the training container resources; the training algorithm is associated with a code script and a parameter configuration list, and after the user selects a training algorithm, the associated adjustable parameters are displayed for the user to adjust.
The back-end service receives the information configured by the user through the front-end interactive interface and, according to the training algorithm selected by the user, acquires the related container image, the code script storage path, the in-container mount path, the training data set mount path, the model output path, and the command for running the algorithm in the container; if the training algorithm has a corresponding pre-trained model, it also acquires the storage path and in-container mount path of the pre-trained model. It calls the method for starting a container task in the API of the container cluster management system, sets the start image and start command, mounts the code script, training data set, and pre-trained model on the corresponding mount paths, sets the CPU, GPU, and memory resources used by the container at run time, sets the model output path, and starts the training container; during model training, data points are logged ("dotting") in real time through the TensorBoard framework, recording how the model training parameters change. It then calls the method for starting a container service in the API of the container cluster management system, sets the TensorBoard container image, sets the working directory of the TensorBoard service container, mounts the directory where the training container's run data is stored onto the working directory of the TensorBoard service container, starts the container service, and provides a callable service address to the outside.
The file system is used for storing the code script, the training data set, and the pre-trained model, with storage locations recorded in a database table.
The code script management module is used for storing the association between code scripts and parameter configuration lists.
The TensorBoard monitoring service is used for monitoring how parameter indexes change during modeling and displaying the changes through visual change curves.
Preferably, the front-end interactive interface is implemented with the Vue front-end development framework and mainly provides the human-computer interface used during intelligent modeling. Through it the user can enter modeling information, select code scripts, select images, load training data sets, and configure training container resources; most importantly, according to the code script the user selects, the interface automatically recognizes and brings out the corresponding code adjustment parameters for display, so that the user can re-edit and adjust the default parameters.
The back-end service uses Spring Boot. It is the trunk of the system and plays a carrying and switching role: it receives requests from the front-end interface, parses them, and invokes the corresponding logic to handle them. For example: when a code script is selected, the background service queries the data table to obtain the script's storage location in the file system, reads the code file, finds the default parameter configuration list in the file, and finally returns a response to the front-end interface; when a training data set is loaded, the background service queries the data table, obtains the data set's storage location in the file system, and reads the data set files. Most importantly, in response to the front end's request to create a modeling task, the background service calls the API of the container cluster management system, loads the code script, training data set, and container image, configures the container's usable resource limits, and starts the container environment for model training. When the modeling container environment is started, the back-end service also creates a container for the TensorBoard monitoring service at the same time to supervise the modeling process.
The container cluster management system adopts Kubernetes. Kubernetes cluster management mainly manages the container cluster environment created by modeling tasks, so that container creation, running, destruction, and similar processes are managed automatically; the Kubernetes framework API can be called by the back-end service to complete this series of operations.
The file system is where the data used by the modeling method is stored and managed; it records information such as data storage locations in a database table and stores the data with NFS. Code scripts and training data set files are stored and managed in the file system, and when data needs to be loaded it is located via the file system's index information.
The code script management module not only uses the file system to manage code storage, but also extracts the parameter list of each scenario's code script into a data table and maps the code file to its parameter list, so that when a code script is selected during modeling task configuration, the corresponding parameter list and default values are loaded automatically for the user to edit or view.
The TensorBoard monitoring service mainly monitors how important parameter indexes in the code change during modeling and displays them through visual change curves. When a modeling task is started, the back-end service starts a TensorBoard monitoring service container and supervises the whole modeling process.
The intelligent modeling method for the intelligent modeling system to interact with the user comprises the following steps:
step 1, a user opens a modeling task interface from an intelligent modeling task entrance, and the first step is a basic information interface of a task and inputs the name and description information of the modeling task.
Step 2, the user clicks Next and enters the training algorithm selection interface. For the algorithm type, an algorithm built into the system is selected; the option supporting training visualization is selected, and a flag field for using visual supervision of the modeling process is added to the training task information. When a specific algorithm is selected, the back-end service queries the algorithm's mapped association information and the configurable running parameter list with default values, and the system brings the algorithm's running parameter configuration list out to the interface so the user can change the parameter items to be changed.
Step 3, the user clicks Next and enters the data set selection interface; if "existing data set" is chosen as the data set type, a prepared data set built into the system is used as the training set for model training.
Step 4, the user clicks Next and enters the training resource configuration interface, where the physical resources (CPU, GPU, and memory) used by the container to be started for running the training algorithm are configured.
Step 5, the user clicks Confirm, the information is saved, and the task is submitted. The front-end interactive interface sends the task information entered by the user to the back-end service, which first queries, according to the specific system-internal algorithm selected by the user, the image used to start the algorithm's container, the algorithm's storage path, the in-container mount path, the model output path, and the algorithm run command.
Step 6, the back-end service calls the method for starting a container task in the Kubernetes API, sets the start image and start command, mounts the code script, training data set, and pre-trained model on the corresponding mount paths, sets the CPU, GPU, and memory resources used by the container at run time, sets the model output path, and starts the training container.
Step 7, while running, the training container logs data points ("dotting") through the TensorBoard framework methods, recording the key running parameters of the training process and saving the files to a runs directory under the model output path.
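In TensorFlow, the "dotting" in step 7 would normally be done with `tf.summary` writers, which emit binary event files under the runs directory. The sketch below is a deliberately simplified stand-in that only illustrates the pattern — record a named scalar at each training step into a file under `<model output path>/runs` — and is not TensorBoard's actual event format (that is a protobuf stream). All names here are invented for illustration.

```python
import os
import tempfile

class RunRecorder:
    """Simplified stand-in for a tf.summary writer: appends one
    (step, tag, value) record per 'dot' to a text log under runs/.
    TensorBoard's real event files are binary protobufs; this sketch
    only illustrates the dotting pattern of step 7."""

    def __init__(self, model_output_path):
        # files land in the runs directory under the model output path
        self.runs_dir = os.path.join(model_output_path, "runs")
        os.makedirs(self.runs_dir, exist_ok=True)
        self.path = os.path.join(self.runs_dir, "events.log")

    def dot(self, step, tag, value):
        """Record one data point for a named training parameter."""
        with open(self.path, "a") as f:
            f.write(f"{step}\t{tag}\t{value}\n")

out = tempfile.mkdtemp()        # stand-in for the model output path
rec = RunRecorder(out)
for step, loss in enumerate([0.9, 0.5, 0.3]):   # pretend training loop
    rec.dot(step, "loss", loss)
```

With the real framework the equivalent calls would be `tf.summary.create_file_writer(...)` and `tf.summary.scalar(...)` inside the training loop.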
Step 8, after the training container is started, the back-end service checks the flag field for visual supervision of the modeling process in the modeling task information. If it is set, the back-end service calls the method for starting a container service in the API of the container cluster management system, sets the TensorBoard container image, sets the working directory of the TensorBoard service container, and mounts the /runs directory onto the working directory of the TensorBoard service container, so that the container can read the file of real-time running parameter changes exported by the model training container. The CPU, GPU, and memory resources used to start the container are set, then the start command is set, and the service's exposed internal port is 6006. After startup, the back-end service starts the container service in the Kubernetes container cluster, runs the TensorBoard monitoring service container, and provides a callable service address to the outside in the form "ip + external port".
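Step 8 can be sketched as the container spec and address the back-end service might assemble for the monitoring service. This is a hedged illustration under assumptions: the image name, NFS server, and port mapping are placeholders, and only the internal port 6006 and the "ip + external port" address form come from the text above.

```python
# Illustrative sketch of the TensorBoard service container spec and the
# callable address from step 8. Image and server names are hypothetical.

def tensorboard_service(runs_host_dir, node_ip, external_port, workdir="/logs"):
    """Build (as plain dicts) the TensorBoard service container spec, its
    runs-directory volume, and the 'ip + external port' service address."""
    container = {
        "name": "tensorboard",
        "image": "tensorflow/tensorflow:2.8.0",  # assumed to bundle tensorboard
        "workingDir": workdir,                   # working directory of the service
        # serve the mounted runs directory on internal port 6006
        "command": ["tensorboard", "--logdir", workdir,
                    "--host", "0.0.0.0", "--port", "6006"],
        "volumeMounts": [{"name": "runs", "mountPath": workdir}],
        "ports": [{"containerPort": 6006}],
    }
    # the training container's /runs directory, mounted onto the workdir
    volume = {"name": "runs",
              "nfs": {"server": "nfs.example.local", "path": runs_host_dir}}
    address = f"{node_ip}:{external_port}"       # 'ip + external port' form
    return container, volume, address

container, volume, addr = tensorboard_service(
    "/nfs/output/task1/runs", "10.0.0.5", 30606)
```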
Step 9, the TensorBoard service address is recorded into the modeling task information; when the user clicks the training visualization button in the modeling task list, a TensorBoard monitoring service page pops up and displays visual charts of the key running parameter changes during model training.
Step 10, after the training container finishes running, i.e., model training is complete, the user judges and evaluates the quality of the modeling process according to the visual supervision and display of the training process, and decides whether to change the configured running parameters again for a second round of training.
Step 11, after all the above steps are completed, the model file generated by the modeling task is exported from the model output path, yielding the trained intelligent model.
Based on the same inventive concept, the invention provides a computer system comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor; when loaded into the processor, the computer program implements the above intelligent modeling method based on dynamic parameter configuration and process supervision.
Has the advantages that: based on an input-output interactive visual interface, the method and system dynamically edit and adjust the parameters of the model-building code script during intelligent model building, so that the parameters take effect in the code script conveniently and efficiently, the script's run-time behavior adapts dynamically to the developer's intent, and an intelligent model meeting the developer's needs is built and trained. Developers no longer need to manually start a container on the server back end to run the code scripts; they simply select the code script to run and the required training data set to achieve one-click configuration, and through a few simple operation steps the system completes work that previously required extensive manual operation. The container cluster performs unified resource management of the training containers running the code scripts: when a script needs to run, the system applies to the container cluster for resources to create a container environment, and after the script finishes, the container cluster automatically stops the container and releases the resources it occupied, realizing unified scheduling and allocation of resources.
Meanwhile, a TensorBoard monitoring service is started through the TensorBoard framework methods integrated into the code script during model building; as the model-building training script runs, changes in key data and parameter indexes are recorded, providing process supervision of the intelligent modeling task, so that developers can make decisions according to how important parameters change during model training and readjust the script parameters for a second round of training. These mechanisms complement each other, enabling developers to build intelligent models efficiently and conveniently.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention.
FIG. 2 is a schematic diagram of interaction between a front-end interactive interface and a back-end service in an embodiment of the present invention.
FIG. 3 is a diagram of back-end service, K8s API, and TensorBoard interactions in an embodiment of the present invention.
FIG. 4 is a diagram of an effect achieved by a modeling task front-end interactive interface in an embodiment of the present invention (first step).
FIG. 5 is a diagram of the effect achieved by the modeling task front-end interactive interface in the embodiment of the present invention (second step).
FIG. 6 is a diagram of an effect achieved by a modeling task front-end interactive interface in the embodiment of the present invention (third step).
FIG. 7 is a diagram of an effect achieved by a modeling task front-end interactive interface in the embodiment of the present invention (fourth step).
FIG. 8 is a flowchart of intelligent modeling task operation execution in an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
As shown in FIG. 1, an intelligent modeling method based on dynamic parameter configuration and process supervision according to an embodiment of the present invention first receives the basic modeling task information, training algorithm, training data set, training container resources, and the like configured by a user through an interactive interface; then, according to the training algorithm selected by the user, acquires the related container image, the code script storage path, the in-container mount path, the training data set mount path, the model output path, and the command for running the algorithm in the container; if the training algorithm has a pre-trained model, also acquires the storage path and in-container mount path of the pre-trained model; calls the method for starting a container task in the API of the container cluster management system, sets the start image and start command, mounts the code script, training data set, and pre-trained model on the corresponding mount paths, sets the CPU, GPU, and memory resources used by the container at run time, sets the model output path, and starts the training container; during model training, logs data points in real time through the TensorBoard framework, recording how the model training parameters change; calls the method for starting a container service in the API of the container cluster management system, sets the TensorBoard container image, sets the working directory of the TensorBoard service container, mounts the directory where the training container's run data is stored onto the working directory of the TensorBoard service container, starts the container service, and provides a callable service address to the outside; and finally, provides the service address to the user, displaying a visual chart of the changes in operating parameters during model training when the user accesses the service address.
The intelligent modeling system based on dynamic parameter configuration and process supervision that realizes the method mainly comprises the following parts: a front-end interactive interface, a Spring Boot back-end service, a Kubernetes container cluster management system, a file system, a code script management module, and a TensorBoard monitoring service. The specific construction process is as follows:
(1) A visual interface is developed with the Vue front-end framework for the user to interact with the modeling work. It lets the user enter the key information for modeling, such as: selecting the code script to be used for training, selecting the training data set used for modeling, selecting the image required to create the training container, setting the system resource limits required to start and run the container, and adjusting the running parameters used by the code script during modeling training. For the specific look and feel, refer to the front-end interactive interfaces of FIGS. 4-7.
(2) A file system is built with NFS to store the code scripts, training data sets and pre-trained model data used by the modeling method. The storage information of these data is recorded in a database table; during modeling training, the index information is obtained by querying this table and the data are looked up, and when the modeling method creates a training container, the data are mounted into the container along in-container paths, so the code script can conveniently access them at runtime.
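The storage-index lookup described in (2) can be sketched as follows; the table name `asset_store`, its columns, and the example paths are illustrative assumptions, not the actual schema of the system:

```python
import sqlite3

# Illustrative schema: each row records where an asset (code script,
# training data set, or pre-trained model) lives on the shared NFS
# volume and where it should be mounted inside the training container.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE asset_store (
    asset_id   TEXT PRIMARY KEY,
    asset_type TEXT,   -- 'code' | 'dataset' | 'pretrained'
    nfs_path   TEXT,   -- location on the NFS share
    mount_path TEXT)   -- default mount point inside the container
""")
conn.executemany(
    "INSERT INTO asset_store VALUES (?, ?, ?, ?)",
    [("alg-001", "code", "/nfs/code/detect/train.py", "/workspace/code"),
     ("ds-001", "dataset", "/nfs/datasets/coco-mini", "/workspace/data"),
     ("pm-001", "pretrained", "/nfs/models/resnet50.pth", "/workspace/pretrained")])

def resolve_mounts(asset_ids):
    """Query index info for the selected assets and build mount specs."""
    placeholders = ",".join("?" * len(asset_ids))
    rows = conn.execute(
        f"SELECT nfs_path, mount_path FROM asset_store "
        f"WHERE asset_id IN ({placeholders}) ORDER BY asset_id", asset_ids)
    return [{"hostPath": src, "containerPath": dst} for src, dst in rows]

mounts = resolve_mounts(["alg-001", "ds-001"])
```

The returned mount specs would then feed the container creation request when the training container is started.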
(3) As shown in Table 1, a database table holds the mapping of each training algorithm's related information, associating the code script file with the list of externally exposed, adjustable running parameters. As shown in fig. 4, when the corresponding training code script is selected on the training configuration interface, its running parameters are automatically brought out and displayed to the user through the interface for configuration and modification.
TABLE 1 training Algorithm-related information mapping Table
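The association recorded by Table 1 can be modeled as a mapping from each built-in algorithm to its code script and its externally exposed running parameters; all names, paths and default values below are illustrative assumptions:

```python
# Illustrative in-memory version of the Table 1 mapping: each built-in
# algorithm is associated with its code script file and the running
# parameters it exposes for adjustment, together with their defaults.
ALGORITHM_REGISTRY = {
    "image-object-detection": {
        "script": "/nfs/code/detect/train.py",
        "params": {"BASE_LR": 0.001, "MAX_ITER": 90000, "IMS_PER_BATCH": 2},
    },
    "image-classification": {
        "script": "/nfs/code/classify/train.py",
        "params": {"BASE_LR": 0.01, "MAX_ITER": 30000, "IMS_PER_BATCH": 32},
    },
}

def lookup_algorithm(name):
    """Return the script path and editable parameter list for an algorithm,
    as the training configuration interface would when the user selects it."""
    entry = ALGORITHM_REGISTRY[name]
    return entry["script"], dict(entry["params"])  # copy so edits don't leak

script, params = lookup_algorithm("image-object-detection")
```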
(4) A Kubernetes container cluster environment is established: multiple servers are deployed cooperatively to form a cluster with unified resource management and virtualized container management, and all system resources available to the modeling method are registered into this cluster environment. When the modeling method creates a training container, the cluster performs unified scheduling and resource allocation, and also monitors the running state and running process of the training container. At the end of the modeling method, the cluster likewise removes the containers and reclaims the resources.
(5) A back-end service is developed with the Spring Boot framework. It receives user requests from the front-end interactive interface and queries the display data the interface needs to show the user; for the front-end/back-end interaction, refer to the schematic diagram of FIG. 2. The back-end service also connects to the API of the Kubernetes container cluster and handles work such as starting the modeling-task training container and loading its data and resources; for the interaction between the back end and K8s, refer to the back-end service / K8s API / tensorboard interaction diagram of FIG. 3.
(6) The system registers a container image packaged on the basis of tensorflow, used to start the tensorboard container service that provides monitoring of the modeling process when training of a modeling task begins.
As shown in fig. 8, the intelligent modeling method based on the system has the following operation and execution steps:
Step 1, the user opens a modeling task interface from the intelligent modeling task entrance; the first page is the task's basic-information interface, where the name of the modeling task and any other description of the task requiring special explanation is entered first.
Step 2, the user clicks Next and enters the training-algorithm selection interface. A system built-in algorithm is selected as the algorithm type, the function option supporting training visualization is selected, and a flag bit field for modeling-process visual monitoring is added to the training task information. Taking an image object detection scenario as an example: when 'image - object detection' is chosen as the specific algorithm, the back-end service queries the associated information mapped to that algorithm, namely the list of configurable running parameters and their default values. The system then brings the algorithm's running-parameter configuration list out to the interface, where the user can change any parameter item to be changed. Example parameter items of the image object detection algorithm:
{
    "PRE_NMS_TOP_N_TRAIN": "number of initial regression boxes",
    "ANCHOR_SIZES": "preset box sizes",
    "ASPECT_RATIOS": "preset box aspect ratios",
    "BASE_LR": "base learning rate",
    "STEPS": "learning rate adjustment points",
    "MAX_ITER": "maximum number of iterations",
    "IMS_PER_BATCH": "pictures per batch",
    "CHECKPOINT_PERIOD": "model checkpoint save period"
}
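When the user edits items in this list, the back end can overlay the changed values on the algorithm's defaults before launching training. A minimal sketch follows; the default values are assumptions, only the parameter names follow the example above:

```python
# Assumed defaults for the illustrative detection algorithm.
DEFAULTS = {
    "PRE_NMS_TOP_N_TRAIN": 2000,
    "BASE_LR": 0.001,
    "STEPS": [60000, 80000],
    "MAX_ITER": 90000,
    "IMS_PER_BATCH": 2,
    "CHECKPOINT_PERIOD": 2500,
}

def apply_user_overrides(defaults, overrides):
    """Overlay the parameter items the user changed on the default
    configuration, rejecting keys the algorithm does not expose."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"parameters not exposed by this algorithm: {unknown}")
    merged = dict(defaults)
    merged.update(overrides)
    return merged

config = apply_user_overrides(DEFAULTS, {"BASE_LR": 0.0005, "MAX_ITER": 120000})
```

Rejecting unknown keys keeps the container start command limited to parameters the code script actually reads.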
Step 3, the user clicks Next and enters the data-set selection interface. 'Existing data set' is chosen as the data set type, and a prepared data set built into the system is used as the training set for model training.
Step 4, the user clicks Next and enters the training-resource configuration interface, where the physical resources (CPU, GPU and memory) used by the container to be started for running the training algorithm are configured.
Step 5, the user clicks Confirm; the information is saved and the task is submitted. The front-end interactive interface sends the task information entered by the user to the back-end service. According to the specific system built-in algorithm selected by the user, the back-end service first queries the image used to start the container, the file storage address of the training algorithm, the file storage address of the pre-trained model, the default in-container mount paths of the training data set, the pre-trained model and the executed code script, the default in-container mount path for the model output, and the start command for running the code script in the container.
Step 6, the back-end service calls the method for starting a container task in the Kubernetes API. It sets the start image and start command; mounts the algorithm training script, the training data set file and the pre-trained model file onto their default in-container mount paths; sets the CPU, GPU and memory resources used to start and run the container; and sets the final model output path in the container, so that the trained model file can conveniently be exported from the container to the server's local storage.
After this method call, the back-end service starts a container task in the Kubernetes container cluster and begins running the algorithm script code.
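The container-task request assembled in steps 5 and 6 can be sketched as the pod specification a back end would hand to the cluster API. This is a plain dictionary rather than real client calls, and the image name, mount names and resource amounts are illustrative:

```python
def build_training_pod(task_id, image, mounts, cpu, gpu, memory_gi, command):
    """Assemble a Kubernetes-style pod spec for the training container:
    start image, start command, code/dataset/pretrained-model mounts,
    and CPU/GPU/memory limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"train-{task_id}"},
        "spec": {
            "containers": [{
                "name": "train",
                "image": image,
                "command": command,
                "volumeMounts": [
                    {"name": m["name"], "mountPath": m["containerPath"]}
                    for m in mounts],
                "resources": {"limits": {
                    "cpu": str(cpu),
                    "memory": f"{memory_gi}Gi",
                    "nvidia.com/gpu": str(gpu)}},
            }],
            "restartPolicy": "Never",  # a training task runs once to completion
        },
    }

pod = build_training_pod(
    task_id="42",
    image="registry:5000/public/centos74c:v1.02",
    mounts=[{"name": "code", "containerPath": "/workspace/code"},
            {"name": "data", "containerPath": "/workspace/data"}],
    cpu=4, gpu=1, memory_gi=16,
    command=["python3", "/workspace/code/train.py"])
```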
Step 7, while running, the training container logs data points through the tensorboard framework, recording the key running parameters of the training process and saving the files into the /runs directory under the model output path. 'Logging a data point' means that, during the run of the training code script, each time a model file iteration completes, the time point and key model parameters such as the convergence status are recorded in a log file, which is then written to the specified path.
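The logging behavior of step 7, recording a timestamped entry of the key parameters each time a model iteration completes, can be illustrated with a minimal stand-in logger that appends JSON lines under a /runs directory in the output path. In the actual method this role is played by the tensorboard framework's event files; the class and file names here are illustrative:

```python
import json
import os
import tempfile
import time

class RunLogger:
    """Minimal stand-in for tensorboard-style data-point logging: append
    one timestamped record of the key training parameters per iteration."""
    def __init__(self, output_path):
        self.run_dir = os.path.join(output_path, "runs")
        os.makedirs(self.run_dir, exist_ok=True)
        self.log_file = os.path.join(self.run_dir, "scalars.jsonl")

    def add_scalar(self, tag, value, step):
        record = {"time": time.time(), "tag": tag, "value": value, "step": step}
        with open(self.log_file, "a") as f:
            f.write(json.dumps(record) + "\n")

# During training, log convergence info at every checkpoint iteration.
out_dir = tempfile.mkdtemp()
logger = RunLogger(out_dir)
for step, loss in enumerate([0.9, 0.5, 0.3]):
    logger.add_scalar("loss", loss, step)
```

Because the records land under /runs, a monitoring container that mounts this directory can read them while training is still in progress.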
Step 8, after the training container is started, the back-end service checks the flag bit field for modeling-process visual monitoring in the modeling task information. If its use was selected, the back-end service calls the method for starting a container service in the Kubernetes API: it sets the tensorboard container image, sets the working directory of the tensorboard service container, and mounts the /runs directory onto that working directory, so the container can read the real-time running-parameter files exported by the model training container. The CPU, GPU and memory resources used to start and run the container are set, then the start command is set; the service exposes internal port 6006. After this call, the back-end service starts a container service in the Kubernetes container cluster, begins running the tensorboard monitoring service container, and provides to the outside a callable service address in the form 'ip + external port'.
An example of the code implementing the method that creates the container is as follows:
public HasMetadata createTensorboardDeployment(IdaCodeTraining codeTraining) {
String appName = String.format("ida%s-%s", codeTraining.getId(), "tensorboard");
Map<String, String> labels = new ImmutableMap.Builder<String, String>().put(LABEL_NAME, appName).build();
K8sSimpleContainer simpleContainer = new K8sSimpleContainer.Builder()
.name(appName)
.image("registry:5000/public/centos74c:v1.02")
.workingDir("/runs")
.commands(new ImmutableList.Builder<String>()
.add("python3").add("/usr/local/bin/tensorboard").add("--logdir=/runs")
.build())
.requestResource(new ImmutableMap.Builder<String, Quantity>()
.put(K8sConstant.CPU, new Quantity(String.valueOf(1)))
.put(K8sConstant.MEMORY, new Quantity(String.format("%sGi", 1)))
.put(K8sConstant.GPU, new Quantity(String.valueOf(0)))
.build())
.limitResource(new ImmutableMap.Builder<String, Quantity>()
.put(K8sConstant.CPU, new Quantity(String.valueOf(1)))
.put(K8sConstant.MEMORY, new Quantity(String.format("%sGi", 1)))
.put(K8sConstant.GPU, new Quantity(String.valueOf(0)))
.build())
.containerPortList(new ImmutableSet.Builder<ContainerPort>()
.add((new K8sSimplePort(6006)).transfer())
.build())
.volumeMountList(new ImmutableSet.Builder<VolumeMount>()
.add(new VolumeMountBuilder()
.withMountPath("/runs")
.withName("share")
.withSubPath(String.format("project/%s/codetraining/%s/%s/runs", codeTraining.getProjectId(), codeTraining.getId(), CodeTrainingConstant.CODE))
.build())
.build())
.build();
K8sSimpleDeployment simpleDeployment = new K8sSimpleDeployment();
simpleDeployment.setName(appName);
simpleDeployment.setNamespace(codeTraining.getProjectId());
simpleDeployment.setLabels(labels);
simpleDeployment.setReplicas(1);
simpleDeployment.setImagePullSecrets(new ImmutableList.Builder<LocalObjectReference>().add(new LocalObjectReference(codeTraining.getProjectId())).build());
simpleDeployment.addContainer(simpleContainer);
simpleDeployment.setVolumes(new ImmutableList.Builder<Volume>()
.add(new VolumeBuilder()
.withName("share")
.withPersistentVolumeClaim(new PersistentVolumeClaimVolumeSourceBuilder()
.withClaimName(String.format("pvc-share-%s", codeTraining.getProjectId()))
.build())
.build())
.build());
simpleDeployment.setNodeSelector(new ImmutableMap.Builder<String, String>()
.put("runtype", "deployment")
.build());
return k8sService.createResource(simpleDeployment);
}
Step 9, the tensorboard service address is recorded into the modeling task information. When the user clicks the training-visualization button in the modeling task list, a tensorboard monitoring service page pops up, displaying a visual chart of the changes of the key running parameters during model training. Based on these changes, the user judges whether training needs to be reconfigured; if the training effect is not satisfactory, the task can be deleted, which stops the training container and releases the training resources it occupied, after which the training parameters can be reconfigured and training restarted.
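The flag-bit check of step 8 and the 'ip + external port' address recorded into the task information in step 9 can be sketched as follows; the field name, node IP and external port values are illustrative:

```python
TENSORBOARD_INTERNAL_PORT = 6006  # port the service exposes inside the cluster

def service_address(node_ip, external_port):
    """Compose the externally callable 'ip + external port' address that
    is recorded into the modeling-task information for the user."""
    return f"http://{node_ip}:{external_port}"

def maybe_start_monitor(task_info, node_ip, external_port):
    """Start the tensorboard monitoring service only when the task's
    process-supervision flag bit is set; otherwise return None."""
    if not task_info.get("visual_monitor_enabled"):
        return None
    # ...here the back end would call the cluster API to start the
    # tensorboard container with /runs mounted as its working directory...
    return service_address(node_ip, external_port)

addr = maybe_start_monitor(
    {"visual_monitor_enabled": True}, "10.0.0.8", 30606)
```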
Step 10, after the training container finishes running, i.e. after model training is complete, the container cluster stops the container and releases the occupied training resources back into the cluster-managed resource pool, to be used when other training containers apply for them. Based on the visual supervision and display of the model training process, the user judges and evaluates the quality of the modeling process and decides whether to change the configured running parameters for a second round of training.
Step 11, after all the steps are completed, the model file generated by the modeling task is exported from the model output path, yielding the trained intelligent model.
An embodiment of the invention also provides a computer system comprising a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the computer program, when loaded into the processor, implements the intelligent modeling method based on dynamic parameter configuration and process supervision.

Claims (9)

1. An intelligent modeling method based on dynamic parameter configuration and process supervision is characterized by comprising the following steps:
step 1, a user opens a modeling task interface from an intelligent modeling task entrance, wherein the first step is a basic information interface of a task and inputs the name and description information of the modeling task;
step 2, the user clicks Next and enters the training-algorithm selection interface; a system built-in algorithm is selected as the algorithm type, the function option supporting training visualization is selected, and a flag bit field for modeling-process visual monitoring is added to the training task information; when a specific algorithm is selected, the back-end service queries the associated information mapped to the algorithm, namely the list of configurable running parameters and their default values; the system brings the algorithm's running-parameter configuration list out to the interface, and the user can change the parameter items to be changed;
step 3, the user clicks Next and enters the data-set selection interface; if the existing-data-set type is selected, a data set built into the system is used as the training set for model training;
step 4, clicking the next step by the user, and entering a training resource configuration interface; configuring a training algorithm on the interface to run physical resources used by a container to be started, wherein the physical resources comprise a CPU (central processing unit), a GPU (graphic processing unit) and a memory;
step 5, the user clicks Confirm, the information is saved and the task is submitted; the front-end interactive interface sends the task information entered by the user to the back-end service, and according to the specific system built-in algorithm selected by the user, the back-end service first queries the image used to start the container for running the algorithm, the storage path of the code script related to the algorithm, the in-container mount path, the model output path and the command for running the algorithm; if the training algorithm has a corresponding pre-trained model, the storage path and the in-container mount path of the pre-trained model are also obtained;
step 6, calling an API (application program interface) of the container cluster management system by a back-end service to start a container task, setting a starting mirror image, setting a starting command, mounting a code script, a training data set and a pre-training model on a corresponding mounting path, setting a CPU (central processing unit), a GPU (graphic processing unit) and a memory resource used by the container in operation, setting a path output by the model, and starting the training container;
step 7, while running, the training container logs data points through the tensorboard framework, recording the key running parameters of the training process and saving the files into a /runs directory under the model output path;
step 8, after the training container is started, the back-end service judges the flag bit field for modeling-process visual monitoring in the modeling task information; if its use was selected, the back-end service calls the container cluster management system API to start a container service, sets the tensorboard container image, sets the working directory of the tensorboard service container, and mounts the /runs directory onto the working directory of the tensorboard service container, so that the container can read the real-time running-parameter files exported by the model training container; the CPU, GPU and memory resources used to start and run the container are set, then the start command is set, with the service exposing internal port 6006; after the modeling task is executed, the back-end service starts a container service in the container cluster, begins running the tensorboard monitoring service container, and provides to the outside a callable service address in the form of 'ip + external port';
step 9, the tensorboard service address is recorded into the modeling task information; when the user clicks the training-visualization button in the modeling task list, a tensorboard monitoring service page pops up, displaying a visual chart of the changes of the key running parameters during model training;
step 10, after the training container finishes running, the user, according to the visual supervision and display of the model training process, judges and evaluates the quality of the modeling process and decides whether to change the configured running parameters for a second round of training;
and step 11, after all the steps are finished, the model file generated by the modeling task is exported from the model output path, and the trained intelligent model is obtained.
2. The intelligent modeling method based on dynamic parameter configuration and process supervision according to claim 1, characterized in that the code script, the training data set and the pre-training model are stored in the NFS file system, and the storage location is recorded by a database table; the database also stores parameter configuration lists and default values of the code scripts.
3. The intelligent modeling method based on dynamic parameter configuration and process supervision according to claim 1, characterized in that a Kubernetes-managed container cluster environment is adopted; when a training container is created, the Kubernetes cluster performs unified scheduling and resource allocation and monitors the running state and running process of the training container, and when training finishes, the Kubernetes cluster removes the container and reclaims the resources.
4. The intelligent modeling method based on dynamic parameter configuration and process supervision of claim 3, characterized in that the user interaction interface uses Vue front-end development framework, when the user configures the training container resources, the back-end service queries the remaining available CPU, GPU and memory resources from the container cluster management system for the user to select.
5. The intelligent modeling method based on dynamic parameter configuration and process supervision according to claim 1, characterized in that the user can start a modeling task with one click and visually supervise the training process of the modeling task in real time; after model training finishes or the user deletes the task, the resources occupied by the training container are automatically reclaimed.
6. An intelligent modeling system based on dynamic parameter configuration and process supervision, comprising: a front-end interactive interface, a back-end service, a container cluster management system, a file system, a code script management module and a tensorboard monitoring service;
the front-end interactive interface is used for configuring basic information of a modeling task, a training algorithm, a training data set and training container resources by a user; the training algorithm is associated with the code script and the parameter configuration list, and after the user selects the training algorithm, the associated adjusting parameters are displayed for the user to adjust;
the back-end service is used for receiving the information configured by the user through the front-end interactive interface and, according to the training algorithm selected by the user, acquiring the associated container image, code script storage path, in-container mount path, training data set mount path, model output path and the command for running the algorithm in the container; if the training algorithm has a corresponding pre-trained model, the storage path and in-container mount path of the pre-trained model are also acquired; calling the container cluster management system API to start a container task, setting the start image and start command, mounting the code script, training data set and pre-trained model on the corresponding mount paths, setting the CPU, GPU and memory resources used by the container at runtime, setting the model output path, and starting the training container; logging data points in real time through the tensorboard framework during model training and recording the changes of the model training parameters; calling the container cluster management system API to start a container service, setting the tensorboard container image, setting the working directory of the tensorboard service container, mounting the directory where the training container's runtime data is stored onto the working directory of the tensorboard service container, starting the container service and providing a callable service address to the outside;
the file system is used for storing the code script, the training data set and the pre-training model and recording the storage position through a database table;
the code script management module is used for storing the association relation between the code script and the parameter configuration list;
the tensorboard monitoring service is used for monitoring the changes of the parameter indexes during the modeling process and displaying them through visual change curve charts;
the intelligent modeling method for the intelligent modeling system to interact with the user comprises the following steps:
step 1, a user opens a modeling task interface from an intelligent modeling task entrance, wherein the first step is a basic information interface of a task and inputs the name and description information of the modeling task;
step 2, the user clicks Next and enters the training-algorithm selection interface; a system built-in algorithm is selected as the algorithm type, the function option supporting training visualization is selected, and a flag bit field for modeling-process visual monitoring is added to the training task information; when a specific algorithm is selected, the back-end service queries the associated information mapped to the algorithm, namely the list of configurable running parameters and their default values; the system brings the algorithm's running-parameter configuration list out to the interface, and the user can change the parameter items to be changed;
step 3, the user clicks Next and enters the data-set selection interface; if the existing-data-set type is selected, a data set built into the system is used as the training set for model training;
step 4, clicking the next step by the user, and entering a training resource configuration interface; configuring a training algorithm on the interface to run physical resources used by a container to be started, wherein the physical resources comprise a CPU (central processing unit), a GPU (graphic processing unit) and a memory;
step 5, the user clicks Confirm, the information is saved and the task is submitted; the front-end interactive interface sends the task information entered by the user to the back-end service, and according to the specific system built-in algorithm selected by the user, the back-end service first queries the image used to start the container for running the algorithm, the storage path related to the algorithm, the in-container mount path, the model output path and the command for running the algorithm;
step 6, calling an API (application program interface) of the container cluster management system by a back-end service to start a container task, setting a starting mirror image, setting a starting command, mounting a code script, a training data set and a pre-training model on a corresponding mounting path, setting a CPU (central processing unit), a GPU (graphic processing unit) and a memory resource used by the container in operation, setting a path output by the model, and starting the training container;
step 7, while running, the training container logs data points through the tensorboard framework, recording the key running parameters of the training process and saving the files into a /runs directory under the model output path;
step 8, after the training container is started, the back-end service judges the flag bit field for modeling-process visual monitoring in the modeling task information; if its use was selected, the back-end service calls the container cluster management system API to start a container service, sets the tensorboard container image, sets the working directory of the tensorboard service container, and mounts the /runs directory onto the working directory of the tensorboard service container, so that the container can read the real-time running-parameter files exported by the model training container; the CPU, GPU and memory resources used to start and run the container are set, then the start command is set, with the service exposing internal port 6006; after the modeling task is executed, the back-end service starts a container service in the container cluster, begins running the tensorboard monitoring service container, and provides to the outside a callable service address in the form of 'ip + external port';
step 9, the tensorboard service address is recorded into the modeling task information; when the user clicks the training-visualization button in the modeling task list, a tensorboard monitoring service page pops up, displaying a visual chart of the changes of the key running parameters during model training;
step 10, after the training container finishes running, the user, according to the visual supervision and display of the model training process, judges and evaluates the quality of the modeling process and decides whether to change the configured running parameters for a second round of training;
and step 11, after all the steps are finished, the model file generated by the modeling task is exported from the model output path, and the trained intelligent model is obtained.
7. The intelligent modeling system based on dynamic parameter configuration and process supervision according to claim 6, characterized in that the container cluster management system is a Kubernetes container cluster management system; when a training container is created, the Kubernetes cluster performs unified scheduling and resource allocation and monitors the running state and running process of the training container, and when training finishes, the Kubernetes cluster removes the container and reclaims the resources.
8. The intelligent modeling system based on dynamic parameter configuration and process supervision of claim 6, wherein the user interaction interface uses Vue front end development framework, and when the user configures the training container resources, the back end service queries the remaining available CPU, GPU and memory resources from the container cluster management system for selection by the user.
9. A computer system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when loaded into the processor, implements the intelligent modeling method based on dynamic parameter configuration and process supervision according to any of claims 1-5.
CN202111480477.0A 2021-12-07 2021-12-07 Intelligent modeling method and system based on dynamic parameter configuration and process supervision Active CN113886026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111480477.0A CN113886026B (en) 2021-12-07 2021-12-07 Intelligent modeling method and system based on dynamic parameter configuration and process supervision


Publications (2)

Publication Number Publication Date
CN113886026A CN113886026A (en) 2022-01-04
CN113886026B true CN113886026B (en) 2022-03-15


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860141A (en) * 2022-12-23 2023-03-28 深圳市魔数智擎人工智能有限公司 Automatic machine learning interactive black box visual modeling method and system
CN116069318B (en) * 2023-03-07 2023-05-30 北京麟卓信息科技有限公司 Rapid construction and deployment method and system for intelligent application
CN117075867A (en) * 2023-08-17 2023-11-17 唐山启奥科技股份有限公司 Development framework system for computer basic technology

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617567A (en) * 2013-12-05 2014-03-05 国家电网公司 On-line safe and reliable interaction operation visualization system of electric power system
CN109388479A (en) * 2018-11-01 2019-02-26 郑州云海信息技术有限公司 The output method and device of deep learning data in mxnet system
CN110502340A (en) * 2019-08-09 2019-11-26 广东浪潮大数据研究有限公司 A kind of resource dynamic regulation method, device, equipment and storage medium
CN111814911A (en) * 2020-08-17 2020-10-23 安徽南瑞继远电网技术有限公司 Electric power AI training platform based on containerization management and training method thereof
CN112667221A (en) * 2020-11-10 2021-04-16 中国科学院计算技术研究所 Deep learning model construction method and system for developing IDE (integrated development environment) based on deep learning
CN112967240A (en) * 2021-02-26 2021-06-15 江南大学 Medical image generation method based on deep 3D network and transfer learning
US20210248487A1 (en) * 2018-12-27 2021-08-12 Shenzhen Yuntianlifei Technology Co., Ltd. Framework management method and apparatus
CN113360383A (en) * 2021-06-09 2021-09-07 中电科思仪科技股份有限公司 Automatic test software data dynamic synchronization method of comprehensive detection equipment

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN105183987B (en) * 2015-09-07 2018-04-27 北京航空航天大学 Complex engineering multidisciplinary integrated design optimization software platform system
CN108279890B (en) * 2017-01-06 2021-12-24 阿里巴巴集团控股有限公司 Component publishing method, component constructing method and graphical machine learning algorithm platform
CN111652380B (en) * 2017-10-31 2023-12-22 第四范式(北京)技术有限公司 Method and system for optimizing algorithm parameters for a machine learning algorithm


Non-Patent Citations (1)

Title
AI Training Service: UAI Train Product Introduction; UAI Train Documentation Center; https://docs.ucloud.cn/uai-train/introduction; 2021-07-07; pp. 1-28 *

Also Published As

Publication number Publication date
CN113886026A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN113886026B (en) Intelligent modeling method and system based on dynamic parameter configuration and process supervision
CN110377278B (en) Visual programming tool system based on artificial intelligence and Internet of things
CN110543537B (en) Intelligent planning space-time cloud GIS platform based on Docker container and micro-service architecture
Zhang et al. Research on lightweight MVC framework based on spring MVC and mybatis
CN106101176A (en) Integrated converged media cloud production and delivery system and method
WO2015149505A1 (en) Sdn application integration, management and control method, system and device
US9367289B2 (en) Method and apparatus for enabling agile development of services in cloud computing and traditional environments
CN112148494B (en) Processing method and device for operator service, intelligent workstation and electronic equipment
CN113504902B (en) Industrial APP integrated development system and related equipment
CN112199086A (en) Automatic programming control system, method, device, electronic device and storage medium
CN109784708A (en) Cloud service system for coupled multi-model computation in the water industry
CN112035516B (en) Processing method and device for operator service, intelligent workstation and electronic equipment
CN114265680A (en) Mass data processing method and device, electronic equipment and storage medium
CN115860143A (en) Operator model generation method, device and equipment
CN113742014A (en) Interface rendering method and device, electronic equipment and storage medium
CN112329184A (en) Network architecture configuration information generation method and device, storage medium and electronic equipment
CN115658039A (en) Application framework oriented to power grid digital twin visual scene
CN110011827A (en) Multi-user big data analysis service system and method oriented to medical consortia
CN113467771B (en) Model-based industrial edge cloud collaboration system and method
CN112130827A (en) Model development method and platform based on cloud modularization technology and intelligent terminal
CN116030108A (en) Configuration and scheduling method and system based on three-dimensional live-action reconstruction algorithm
CN115794400A (en) Memory management method, device and equipment of deep learning model and storage medium
CN114489955A (en) System and method for realizing GitOps by Docker Swarm cluster
CN112817581A (en) Lightweight intelligent service construction and operation support method
CN113760464A (en) Artificial intelligence model development platform based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant