CN118095494A - Model training method, device, computer equipment and readable storage medium - Google Patents

Model training method, device, computer equipment and readable storage medium

Info

Publication number
CN118095494A
Authority
CN
China
Prior art keywords
model training
task
node
training task
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410365832.7A
Other languages
Chinese (zh)
Inventor
马海龙
张继超
刘俊
章峰
胡家豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dark Matter Beijing Intelligent Technology Co ltd
DMAI Guangzhou Co Ltd
Original Assignee
Dark Matter Beijing Intelligent Technology Co ltd
DMAI Guangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dark Matter Beijing Intelligent Technology Co ltd, DMAI Guangzhou Co Ltd filed Critical Dark Matter Beijing Intelligent Technology Co ltd
Priority to CN202410365832.7A
Publication of CN118095494A


Classifications

    • G06N 20/00: Machine learning (Physics; Computing; Computing arrangements based on specific computational models)
    • G06F 9/5077: Logical partitioning of resources; management or configuration of virtualized resources (Electric digital data processing; program control; multiprogramming arrangements; allocation of resources)
    • G06F 9/45558: Hypervisor-specific management and integration aspects (arrangements for executing specific programs; emulation; virtualisation; hypervisors or virtual machine monitors)
    • G06F 2009/45562: Creating, deleting, cloning virtual machine instances (hypervisor-specific management and integration aspects)
    • G06F 2009/45595: Network integration; enabling network access in virtual machine instances (hypervisor-specific management and integration aspects)

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a model training method, a model training device, a computer device and a readable storage medium. A Kubernetes cluster comprising at least one node is built; the model code and its dependencies are packaged into a Docker image, and the Docker image is uploaded to a Harbor image repository; a model training task is constructed based on the Kubernetes cluster and the Harbor image repository, wherein the Kubernetes cluster provides the container resources for the model training task and the Harbor image repository provides the training image for the model training task; and in response to a model training instruction input by a user, the model training task is executed by each target node indicated by the model training instruction. By adopting the method, model training is ensured to proceed normally while the model training duration is reduced and the model training speed is increased.

Description

Model training method, device, computer equipment and readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a model training method, a model training device, a computer device, and a readable storage medium.
Background
With the development of computer technology, data processing using machine learning is becoming more and more common. The machine learning process generally refers to a process in which a computer device builds an initial model, inputs sample data into the initial model, analyzes the input sample data through a series of algorithms, and updates model parameters of the initial model through iterative training to obtain a final suitable model.
In the prior art, a single computer device is usually used to train the model to be trained with model training data such as a training sample set, i.e., to perform the model training task. However, as the accuracy required of models for data processing becomes higher, the number of training samples also grows; if only a single computer is used to perform the model training task, model training is likely to take a long time due to the insufficient computing resources of a single machine. When the computing pressure is too high, the computer processor may even crash or go down, so that model training cannot proceed normally.
Disclosure of Invention
Accordingly, the present invention is directed to a model training method, apparatus, computer device and readable storage medium, so as to ensure that model training is performed normally, reduce model training duration, and increase model training speed.
In a first aspect, an embodiment of the present application provides a model training method, where the method includes:
constructing a Kubernetes cluster, wherein the Kubernetes cluster comprises at least one node;
packaging model code and its dependencies into a Docker image, and uploading the Docker image to a Harbor image repository;
constructing a model training task based on the Kubernetes cluster and the Harbor image repository, wherein the Kubernetes cluster provides the container resources for the model training task, and the Harbor image repository provides the training image for the model training task;
And responding to a model training instruction input by a user, and executing the model training task through each target node indicated by the model training instruction.
Optionally, the model training task includes a single-node training task and a distributed training task; when the model training task is the single-node training task, there is one target node, and when the model training task is the distributed training task, there are a plurality of target nodes.
Optionally, before each target node indicated by the model training instruction performs the model training task, the method further comprises:
creating a MinIO object storage server and a PVC (Persistent Volume Claim) persistent storage volume declaration;
connecting the MinIO object storage server with the PVC persistent storage volume declaration;
Mounting the PVC into the Kubernetes cluster;
In performing the model training task by each target node indicated by the model training instructions, the method further comprises:
And storing model training data generated when each target node in the Kubernetes cluster executes the model training task to the MinIO through the PVC.
Optionally, when each target node indicated by the model training instruction performs the model training task, the method further comprises:
Collecting system information of each node, wherein the system information comprises CPU utilization rate, memory utilization rate and disk utilization rate;
determining the alarm strategy of each node according to the system information of each node and the preset alarm rule;
and alarming based on the alarming strategy of each node.
Optionally, when each target node indicated by the model training instruction performs the model training task, the method further comprises:
and collecting log data generated when each target node executes the model training task.
Optionally, when each target node indicated by the model training instruction performs the model training task, the method further comprises:
screening out abnormal log data of ERROR level from log data generated when each target node executes the model training task at preset time intervals;
and sending the abnormal log data to a target mailbox.
Optionally, when each target node indicated by the model training instruction performs the model training task, the method further comprises:
and cleaning resources which are in a failed state and have timed out from the Kubernetes cluster at preset time intervals, wherein the resources comprise Pod container groups.
In a second aspect, an embodiment of the present application provides a model training apparatus, where the apparatus includes:
The cluster building module is used for building a Kubernetes cluster, wherein the Kubernetes cluster comprises at least one node;
the image uploading module is used for packaging model code and its dependencies into a Docker image and uploading the Docker image to the Harbor image repository;
the task construction module is used for constructing a model training task based on the Kubernetes cluster and the Harbor image repository, wherein the Kubernetes cluster provides the container resources for the model training task, and the Harbor image repository provides the training image for the model training task;
and the task execution module is used for responding to a model training instruction input by a user and executing the model training task through each target node indicated by the model training instruction.
Optionally, the model training task includes a single-node training task and a distributed training task; when the model training task is the single-node training task, there is one target node, and when the model training task is the distributed training task, there are a plurality of target nodes.
Optionally, the task execution module is further configured to:
Creating a MinIO object storage server and a PVC persistent storage volume declaration before executing the model training task by each target node indicated by the model training instruction;
Connecting the MinIO object storage server with the PVC persistent storage volume declaration;
Mounting the PVC into the Kubernetes cluster;
the task execution module is further configured to: and when each target node indicated by the model training instruction executes the model training task, storing model training data generated when each target node in the Kubernetes cluster executes the model training task to the MinIO through the PVC.
Optionally, the task execution module is further configured to:
Collecting system information of each node when each target node indicated by the model training instruction executes the model training task, wherein the system information comprises CPU (central processing unit) utilization rate, memory utilization rate and disk utilization rate;
determining the alarm strategy of each node according to the system information of each node and the preset alarm rule;
and alarming based on the alarming strategy of each node.
Optionally, the task execution module is further configured to:
And collecting log data generated when each target node executes the model training task when each target node indicated by the model training instruction executes the model training task.
Optionally, the task execution module is further configured to:
When each target node indicated by the model training instruction executes the model training task, abnormal log data of ERROR level is screened out from log data generated when each target node executes the model training task at intervals of preset time length;
and sending the abnormal log data to a target mailbox.
Optionally, the task execution module is further configured to:
And when each target node indicated by the model training instruction executes the model training task, cleaning resources which are in a failed state and have timed out from the Kubernetes cluster at preset time intervals, wherein the resources comprise Pod container groups.
In a third aspect, an embodiment of the present application provides a computer apparatus, including: a processor, a memory and a bus, the memory storing machine readable instructions executable by the processor, the processor and the memory in communication via the bus when the computer device is running, the machine readable instructions when executed by the processor performing the steps of the model training method as described in any of the alternative embodiments of the first aspect above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the model training method described in any of the alternative embodiments of the first aspect.
The technical scheme provided by the application comprises the following beneficial effects:
According to the application, by constructing a Kubernetes cluster comprising at least one node, multiple computing resources can be provided for executing the model training task, so that model training no longer depends on a single computer. The model code and its dependencies are then packaged into a Docker image and uploaded to a Harbor image repository, and a model training task is constructed based on the Kubernetes cluster and the Harbor image repository, wherein the Kubernetes cluster provides the container resources for the model training task and the Harbor image repository provides the training image for the model training task, so that each node can be supplied with the model training data needed to execute the model training task. Finally, in response to a model training instruction input by a user, the model training task is executed through each target node indicated by the model training instruction. Because the model training task can be executed in parallel by multiple computer nodes, the situation in which a single computer's resources cannot meet the required training resources due to an excessive amount of model training data, so that model training cannot proceed normally, is avoided, and normal model training is ensured. Meanwhile, because the model training task can be executed by multiple nodes simultaneously, the model training duration can be reduced and the model training speed increased compared with executing the training task on a single machine.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a model training method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of model training based on a Kubernetes cluster according to an embodiment of the present invention;
fig. 3 shows a flowchart of a Kubernetes cluster preparation method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a node alarm method according to an embodiment of the present invention;
FIG. 5 is a flowchart of an anomaly log data transmission method according to an embodiment of the present invention;
FIG. 6 is a flow chart illustrating a Pod resource cleaning provided by one embodiment of the present invention;
FIG. 7 illustrates a flow chart of a PV resource clean-up provided in accordance with a first embodiment of the present invention;
FIG. 8 is a flow chart illustrating a method for cleaning PVC resources according to a first embodiment of the invention;
FIG. 9 is a flow chart of training anomaly log monitoring provided in accordance with one embodiment of the present invention;
FIG. 10 is a flow chart illustrating the creation of a mirror image according to one embodiment of the present invention;
FIG. 11 is a flow chart of a creation model provided by a first embodiment of the present invention;
FIG. 12 is a flow chart of a method for creating a data set according to a first embodiment of the present invention;
FIG. 13 shows a flow chart of a creation training provided by a first embodiment of the present invention;
FIG. 14 is a flow chart of one embodiment of the present invention for running training;
Fig. 15 shows a schematic structural diagram of a model training device according to a second embodiment of the present invention;
fig. 16 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
Example 1
For the convenience of understanding the present application, the following describes the first embodiment of the present application in detail with reference to the flowchart of the first embodiment of the present application shown in fig. 1.
Referring to fig. 1, fig. 1 shows a flowchart of a model training method according to a first embodiment of the present invention, where the method includes steps S101 to S104:
S101: and constructing a Kubernetes cluster, wherein the Kubernetes cluster comprises at least one node.
Specifically, the kubeadm tool (a tool for rapidly deploying Kubernetes clusters) is used to build a Kubernetes cluster including at least one node (a cloud server or a physical machine), and each node in the Kubernetes cluster belongs to the same local area network or virtual network. The construction process comprises the following steps: installing and configuring Docker and the kubeadm tool; initializing the Kubernetes cluster; adding each node into the cluster; and installing the network plug-in (Kubernetes requires a network plug-in to enable communication and load balancing between Pods; the present application uses the Calico plug-in). The construction of the Kubernetes cluster is then complete.
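As an illustration of the initialization step, a minimal kubeadm configuration sketch is shown below. The kubeadm v1beta3 configuration API is assumed; the Kubernetes version and the pod CIDR (which here matches Calico's default) are placeholders, not values prescribed by the application:

```yaml
# cluster-config.yaml - a minimal sketch; values are illustrative
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: "v1.27.0"        # placeholder version
networking:
  podSubnet: "192.168.0.0/16"       # pod CIDR; 192.168.0.0/16 is Calico's default
```

The control node would run `kubeadm init --config cluster-config.yaml`, the Calico manifest would then be applied as the network plug-in, and each worker node would join the cluster with the `kubeadm join` command printed by the init step.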
S102: and packaging the model code and its dependencies into a Docker image, and uploading the Docker image to a Harbor image repository.
Specifically, Harbor is an open-source container image management system that can be used to store, manage and pull Docker images. The model code and its dependencies are packaged into a Docker image and stored in the Harbor image repository, so that the Kubernetes cluster can pull the Docker image from the Harbor image repository for model training.
Before executing step S102, a Harbor image repository needs to be built. The process of building the Harbor image repository includes: configuring the Helm tool (Helm is a package management tool for Kubernetes); adding the Helm repository of Harbor; creating a namespace for Harbor; and deploying the Harbor image repository. This completes the configuration of the Harbor image repository used by Kubernetes for image management.
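Such a Helm deployment is typically driven by a values override file. The sketch below is illustrative only; the field names follow the public goharbor/harbor Helm chart and should be checked against the chart version actually used, and the URL and password are placeholder assumptions:

```yaml
# harbor-values.yaml - assumed fields from the public harbor Helm chart
expose:
  type: nodePort            # or ingress / loadBalancer
  tls:
    enabled: false
externalURL: http://harbor.example.com:30002   # placeholder address
harborAdminPassword: "ChangeMe123"             # placeholder credential
persistence:
  enabled: true
```

Deployment would then follow the steps above: add the Harbor Helm repository, create the Harbor namespace, and install the chart with these values.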
After the Harbor image repository is built, a Dockerfile is written and the model training code and dependencies are packed into a Docker image. The specific writing and packaging process includes: creating a file named Dockerfile in the directory where the model training code is located, which defines the build rules of the Docker image. For example, TensorFlow (a symbolic mathematical library programmed around data flow) may be used as the base image, and the required Python libraries and other dependencies are installed into it, where requirements.txt is a text file listing the Python libraries that need to be installed, and train.py is the entry file of the model training code.
Finally, the Docker image is uploaded to the Harbor image repository through the Harbor API (Application Programming Interface).
Through steps S101 to S102, the Kubernetes cluster is built, a Harbor image repository is deployed in it, and the model training code and dependencies are packaged into an image using Docker and uploaded into the Harbor image repository for subsequent pulling and management. Next, the model training task needs to be deployed in the Kubernetes cluster and managed through step S103.
S103: and constructing a model training task based on the Kubernetes cluster and the Harbor image repository, wherein the Kubernetes cluster provides the container resources for the model training task, and the Harbor image repository provides the training image for the model training task.
Specifically, the model training task is a single-node training task or a distributed training task. For ordinary single-node training, the task can be constructed directly using the Job function of Kubernetes (a Job is a Kubernetes resource object that defines a set of Pods running a run-to-completion workload). For distributed training, the task can be constructed based on the DeepSpeed framework (DeepSpeed is a distributed training toolkit for the PyTorch framework that conveniently supports multiple distributed training modes such as data parallelism, model parallelism and pipeline parallelism; PyTorch is an open-source Python machine learning library), and deployed in the Kubernetes cluster in the form of Pods.
For a single-node training task, the Job function may be used for deployment in Kubernetes. When all Pods of the Job complete successfully, the Job is complete. When constructing a single-node training task based on the Kubernetes cluster and the Harbor image repository, a Docker image including the model training code, environment dependencies and the like needs to be prepared first and uploaded to Docker Hub (a public image repository) or a private image repository. A Kubernetes Job is then created, i.e., a YAML file is created defining the name, image, command, etc. of the Job. The Kubernetes configuration file is then applied using the kubectl apply command, and finally the running state of the Job is checked using kubectl commands.
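A minimal Kubernetes Job sketch for such a single-node training task might look as follows; the job name, image path and resource request are hypothetical, and the image is assumed to be the one pushed in step S102:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: single-node-training            # hypothetical name
spec:
  backoffLimit: 2                       # retry failed Pods at most twice
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: harbor.example.com/ml/train:latest   # placeholder image from the Harbor repository
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/gpu: 1         # only if GPU nodes and the device plugin are available
```

The file would be applied with `kubectl apply -f job.yaml`, and the status checked with `kubectl get jobs` and `kubectl logs`, as described above.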
For a distributed training task, construction may be based on the DeepSpeed framework. When constructing a distributed training task based on the Kubernetes cluster and the Harbor image repository, a Docker image containing the DeepSpeed environment and the model training code needs to be prepared first and uploaded to Docker Hub or a private image repository. Kubernetes Pods are then created, i.e., information such as the number of containers, container images and commands is defined in the Pod definition file. The Kubernetes configuration file is then applied using the kubectl apply command. Finally, the running state of the Pods is checked using kubectl commands.
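For the distributed case, each worker can be deployed as a Pod running the DeepSpeed launcher. The sketch below shows a single illustrative worker Pod; the names, image and launcher arguments are assumptions, since the actual arguments depend on the training script and DeepSpeed configuration used:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: deepspeed-worker-0              # hypothetical; one such Pod per worker
  labels:
    app: deepspeed-training
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: harbor.example.com/ml/deepspeed-train:latest   # placeholder image
      command: ["deepspeed", "--num_gpus", "1", "train.py", "--deepspeed", "ds_config.json"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

In practice the worker Pods also need to discover each other (for example through a headless Service and a hostfile), which is omitted here for brevity.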
S104: and responding to a model training instruction input by a user, and executing the model training task through each target node indicated by the model training instruction.
Specifically, when a user needs to perform a model training task through the Kubernetes cluster, a model training instruction may be input through a control node in the Kubernetes cluster. The Kubernetes cluster responds to the model training instruction input by the user and executes the model training task through each target node indicated by the model training instruction. When the model training task is the single-node training task, there is one target node; when the model training task is the distributed training task, there are a plurality of target nodes.
In practical application, through steps S101 to S104 provided by the present application, model training can be performed based on the Kubernetes cluster. Referring to fig. 2, fig. 2 shows a flowchart of model training based on a Kubernetes cluster provided by the first embodiment of the present application. First, Docker is installed on the Kubernetes (i.e., K8S) cluster, K8S cluster initialization is performed using the kubeadm tool, and each node in the K8S cluster (sub-node 1, sub-node 2 and sub-node 3, where each sub-node can communicate and balance load through Calico) is added to the master node (control node) of the K8S cluster for unified management and resource allocation, thereby providing container resources for starting training. The model training image is uploaded to the Harbor image system for storage, providing the training image for starting training. When model training needs to be started, the user triggers the start of model training through a certain node and then selects the training type: if normal training (single-node training) is selected, training data (training data set) preparation is carried out directly; if distributed training is selected, the distributed training framework is configured first and then training data preparation is carried out. After preparation is completed, K8S training resources are allocated according to system information such as the CPU (Central Processing Unit), GPU (Graphics Processing Unit), memory and disk usage of each node, and the target nodes for executing model training are determined. Training is then started by the target nodes based on the provided container resources and training image. K8S log monitoring and K8S resource monitoring are carried out during training, and at the same time whether the model data are stored in the MinIO object storage server is verified in real time. The model is output after model training is finished, and the model training task ends.
In an alternative embodiment, the model training task includes a single-node training task and a distributed training task; when the model training task is the single-node training task, there is one target node, and when the model training task is the distributed training task, there are a plurality of target nodes.
In an alternative implementation, referring to fig. 3, fig. 3 shows a flowchart of a Kubernetes cluster preparation method according to an embodiment of the present invention, where before each target node indicated by the model training instruction performs the model training task, the method further includes steps S301 to S303:
S301: a MinIO object storage server and a PVC persistent storage volume declaration are created.
Specifically, minIO is an open-source object storage server, which can be used in a local or cloud environment. The server MinIO needs to be configured before the training data is stored using MinIO. The MinIO server configuration process comprises the following steps: downloading the MinIO server of the latest version in the Kubernetes cluster; start MinIO; a new bucket is created and the file is uploaded through the Web network interface of MinIO. PVC (Persistent Volume Claim) is a persistent storage volume declaration that expresses a user's request for storage.
S302: and connecting the MinIO object storage server with the PVC persistent storage volume declaration.
In particular, in Kubernetes, a PVC may be used to connect to the MinIO server and provide persistent storage for Pods. The PVC can map one or more buckets in MinIO to volumes inside Kubernetes. The PVC creation process comprises: creating a YAML file defining the name, size, storage class and MinIO access information of the PVC; and applying the PVC configuration file using the kubectl apply command.
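A PVC sketch of this kind is shown below. Note that exposing MinIO buckets as volumes requires an S3-compatible CSI driver or a similarly backed StorageClass; the storage class name here is therefore a placeholder assumption:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc               # hypothetical name
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: minio-s3            # placeholder; provided by an S3/MinIO-backed CSI driver
  resources:
    requests:
      storage: 100Gi
```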
S303: the PVC is mounted into the Kubernetes cluster.
Specifically, when building the model training task, the PVC may be mounted into Pods of the Kubernetes cluster, so that data can be written into the PVC during model training. The PVC mounting process comprises: creating a YAML file defining information such as the name, image, command and mounted PVC of the Pod; and applying the Pod configuration file using the kubectl apply command.
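A sketch of mounting that PVC into a training Pod might look like this; the names, image and paths are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod                    # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: harbor.example.com/ml/train:latest          # placeholder image
      command: ["python", "train.py", "--output-dir", "/mnt/output"]
      volumeMounts:
        - name: training-output
          mountPath: /mnt/output        # training data/checkpoints written here land on the PVC
  volumes:
    - name: training-output
      persistentVolumeClaim:
        claimName: training-data-pvc    # the PVC declared above
```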
In performing the model training task by each target node indicated by the model training instructions, the method further comprises:
And storing model training data generated when each target node in the Kubernetes cluster executes the model training task to the MinIO through the PVC.
Specifically, model training data generated when each target node in the Kubernetes cluster performs a model training task is stored to MinIO by the PVC, and meanwhile, kubectl commands can be used to check the running states of the PVC and Pod to confirm whether the PVC has been successfully mounted in the Pod, and whether the data has been written into the bucket in MinIO.
Through the steps, the training data can be successfully stored by MinIO, and the training data is connected to the Kubernetes cluster through PVC, so that the reliability and the flexibility of data storage are improved.
In an alternative implementation, referring to fig. 4, fig. 4 shows a flowchart of a node alarm method according to an embodiment of the present invention, where, when each target node indicated by the model training instruction performs the model training task, the method further includes steps S401 to S403:
S401: collecting system information of each node, wherein the system information comprises CPU utilization rate, memory utilization rate and disk utilization rate.
Specifically, for each node in the Kubernetes cluster, system information of each node is collected respectively, including actual CPU usage, actual memory usage and actual disk usage of each node. And monitoring the node disk and the node memory utilization rate and the node CPU utilization rate according to the system information.
S402: and determining the alarm strategy of each node according to the system information of each node and the preset alarm rules.
Specifically, the alarm rules include: triggering an alarm when the CPU usage of a node exceeds a first specified threshold; triggering an alarm when the memory usage of a node exceeds a second specified threshold; and triggering an alarm when the disk usage of a node exceeds a third specified threshold. When none of the alarm-triggering conditions is met, no alarm is triggered.
For each node, when the system information of the node meets any alarm rule, the alarm strategy of the node is determined as triggering an alarm. For example, for a first node, the CPU usage is 20%, the memory usage is 40%, and the disk usage is 40%; for a second node, the CPU usage is 20%, the memory usage is 20%, and the disk usage is 40%. The alarm rules set the first specified threshold (CPU) to 20%, the second specified threshold (memory) to 25%, and the third specified threshold (disk) to 70%. For the first node, although its CPU usage does not exceed the first specified threshold and its disk usage does not exceed the third specified threshold, its memory usage exceeds the second specified threshold; therefore, the alarm strategy of the first node is determined as triggering an alarm. For the second node, its CPU usage does not exceed the first specified threshold, its memory usage does not exceed the second specified threshold, and its disk usage does not exceed the third specified threshold; therefore, the alarm strategy of the second node is determined as not triggering an alarm.
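Expressed as Prometheus alerting rules over Node-Exporter metrics (Node-Exporter and Prometheus are deployed in the implementation steps below), the three thresholds of this example might be sketched as follows; the rule names, evaluation windows and exact expressions are illustrative assumptions:

```yaml
groups:
  - name: node-alerts
    rules:
      - alert: NodeCpuUsageHigh
        expr: '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 20'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 20% on {{ $labels.instance }}"
      - alert: NodeMemoryUsageHigh
        expr: '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 25'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 25% on {{ $labels.instance }}"
      - alert: NodeDiskUsageHigh
        expr: '(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 70'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 70% on {{ $labels.instance }}"
```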
S403: and alarming based on the alarming strategy of each node.
Specifically, when the alarm strategy of the node is determined to trigger the alarm, the abnormal alarm of the node is carried out, and when the alarm strategy of the node is determined to not trigger the alarm, the abnormal alarm of the node is not carried out.
In addition to the above Kubernetes-based resource statistics and alarms, the CPU, memory and other resource usage of each node in the cluster can be monitored, and the system information of each node can be displayed in chart form using the Grafana visualization tool, so that the state of the cluster can be grasped in time.
In practical application, the node alarm in steps S401 to S403 is realized by the following steps:
Step one: node-Exporter is deployed on the Kubernetes cluster. Node-Exporter is a exporter (a specific format or structure for data transmission) written based on Go language, and is used for collecting system information on nodes, such as CPU usage, memory usage, disk usage, etc., for acquisition by promethaus (an open source service monitoring system and time series database). Deployment (a software publisher) was created in the Kubernetes cluster under the name "Node-Exporter" to deploy Node-Exporter. Service services are created to expose Node-Exporter to promethaus for collection.
Step two: monitoring tasks are configured on Prometheus. Prometheus is a highly scalable open-source monitoring system for collecting, storing and querying various metric data. The monitoring task may be configured for the nodes in the Kubernetes cluster as follows: open the Prometheus configuration file prometheus.yml; restart the Prometheus container to load the new configuration file. Prometheus will then periodically pull data from Node-Exporter and store it in its TSDB (Time Series Database). A minimal scrape-configuration sketch is given after these steps.
Step three: configure the alarm rules. The Prometheus alert rule file rules.yml is opened, and three alarms are defined in it. Node disk monitoring: an alarm is triggered when the disk usage of a node exceeds the specified threshold. Node memory usage too high: an alarm is triggered when the memory usage of a node exceeds the specified threshold. Node CPU usage too high: an alarm is triggered when the CPU usage of a node exceeds the specified threshold.
Step four: deploy Alertmanager (the alarm module). Create a namespace named alertmanager; create a configuration file alertmanager-config.yml defining the Alertmanager configuration items; create a Deployment and a Service named alertmanager; confirm whether Alertmanager is deployed correctly; and associate the Prometheus alert rules with Alertmanager.
Step five: verify the alarm function. Create a test Pod; confirm whether the Pod has started; simulate an alarm condition, for example by deleting the Pod; and check whether alarm information is received in the notification channel (an in-enterprise communication tool such as Slack). If everything is normal, a message containing detailed alarm information is received; confirm whether the alarm was triggered correctly; and delete the test Pod.
Through the steps, the alarm monitoring of the Kubernetes cluster can be realized, and the alarm notification can be timely received when the abnormality occurs.
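As referenced in step two above, a minimal Prometheus scrape-configuration sketch for discovering Node-Exporter through Kubernetes service discovery might look as follows; the service name used in the relabeling rule is an assumption matching the Service created in step one:

```yaml
# fragment of prometheus.yml - illustrative only
scrape_configs:
  - job_name: "node-exporter"
    kubernetes_sd_configs:
      - role: endpoints                 # discover the Node-Exporter endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: node-exporter            # keep only the Node-Exporter service
        action: keep
```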
In an alternative embodiment, when each target node indicated by the model training instructions performs the model training task, the method further comprises:
and collecting log data generated when each target node executes the model training task.
Specifically, the ELK technology stack deployed on Kubernetes is a popular set of log management solutions, consisting of the open-source tools Elasticsearch, Logstash and Kibana. Elasticsearch is a distributed search engine and analysis platform; Logstash is a tool for processing log data; Kibana is a visualization tool that helps users query and analyze the log information stored in Elasticsearch. Logstash is deployed in the Kubernetes cluster and the log data are sent to Elasticsearch for storage, which makes it convenient to track and debug the model training process. The specific implementation steps are as follows:
Step one: install Elasticsearch. Create a namespace named elasticsearch; install the Elasticsearch Operator; define the Elasticsearch configuration file es-config.yml; install Elasticsearch using the kubectl apply command; and confirm whether Elasticsearch is deployed correctly.
Step two: install Logstash. Create a namespace named logstash; create a configuration file logstash-config.yml defining the Logstash configuration items; and confirm whether Logstash is deployed correctly.
Step three: perform visual query and analysis using Kibana. Create a namespace named kibana; create a configuration file kibana-config.yml defining the Kibana configuration items; install Kibana using the kubectl apply command; and confirm whether Kibana is deployed correctly.
Step four: send the application's log data to Logstash.
To send the application's log data to Logstash, an open-source tool such as Fluentd (an open-source data collector) or Filebeat (a lightweight log collection tool) may be used. Taking Filebeat as an example: create a namespace named filebeat; create a configuration file filebeat-config.yml defining the Filebeat configuration items; install Filebeat using the kubectl apply command; and confirm whether Filebeat is deployed correctly.
At this point, the ELK technology stack deployment based on Kubernetes is complete. The model training log data may be queried and analyzed using the Kibana tool, or custom query and aggregation operations may be performed using the APIs provided by Elasticsearch. Meanwhile, Logstash can perform various filtering, parsing and transformation operations, making it convenient to process the log data.
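A minimal filebeat-config.yml sketch for shipping container logs to Logstash is shown below; the Logstash service address is a placeholder that must match the Service created in step two:

```yaml
# filebeat-config.yml - illustrative sketch
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log       # container logs on each node
output.logstash:
  hosts: ["logstash.logstash.svc.cluster.local:5044"]   # placeholder service address
```

Filebeat would typically be deployed as a DaemonSet so that the container logs on every node are collected.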
In an alternative implementation, referring to fig. 5, fig. 5 shows a flowchart of an anomaly log data sending method according to an embodiment of the present invention, where, when each target node indicated by the model training instruction performs the model training task, the method further includes steps S501 to S502:
S501: and screening out abnormal log data of ERROR level from log data generated when each target node executes the model training task at preset time intervals.
Specifically, in order to realize functions such as training abnormality log monitoring, scheduled resource cleaning and timeout cleaning, the application introduces the xxl-job (distributed task scheduling) framework for scheduled task scheduling. Containers are orchestrated using Kubernetes, and logs are collected and processed using the ELK technology stack.
The construction process of the xxl-job framework is as follows: create a namespace named xxl-job; install xxl-job using the kubectl apply command; and confirm whether xxl-job is deployed correctly.
After the xxl-job framework is constructed, abnormal log data are acquired. The specific acquisition process comprises: modifying the Logstash configuration file to define a mail output and restarting Logstash; creating a scheduled task for checking abnormal logs during training; creating a task named CheckLogJob in the xxl-job-admin page; setting the Cron expression to execute once per minute; and writing logic code in the JobHandler class that queries the log data through Elasticsearch and screens out log information of ERROR level.
S502: and sending the abnormal log data to a target mailbox.
Specifically, log information of ERROR level is sent to a designated mailbox, so that a user can know abnormal conditions in the model training process in time.
In an alternative embodiment, when each target node indicated by the model training instructions performs the model training task, the method further comprises:
and cleaning resources which are in a failure state and are overtime from the Kubernetes cluster every preset time period, wherein the resources comprise a Pod container group.
Specifically, the process of scheduled resource cleaning and timeout cleaning includes: creating a scheduled task for cleaning unused resources such as PVs (Persistent Volumes), PVCs and Services: a task named CleanUnusedResourcesJob is created in the xxl-job-admin page; the Cron expression is set to execute once a day at 1 a.m.; logic code is written in the JobHandler class that queries unused resources through the Kubernetes API and deletes them. A scheduled task is also created for automatically cleaning up resources such as Pods and Jobs that are in a failed state or have timed out: a task named CleanFailedResourcesJob is created in the xxl-job-admin page; the Cron expression is set to execute every 30 minutes; logic code is written in the JobHandler class that queries resources in a failed or timed-out state through the Kubernetes API and deletes them. The xxl-job-executor-springboot module is configured and integrated with the Kubernetes API; a configuration item is added specifying the Kubernetes API address and authentication information; and the resources are queried and operated on using the Kubernetes API in the JobHandler. When performing Pod cleaning, referring to fig. 6, fig. 6 shows a flowchart of Pod resource cleaning provided in an embodiment of the present invention, where Pod cleaning is performed once every hour. The Pod state is first checked periodically, and it is judged whether the Pod has been in a completed state for more than 2 hours; if not, the Pod resource is retained, and if so, the Pod resource is deleted.
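The application implements this cleanup with xxl-job handlers calling the Kubernetes API. Purely as an illustration of the cleanup rule itself, an equivalent native Kubernetes CronJob sketch that periodically deletes failed Pods might look like this; the namespace, names and schedule are hypothetical, and the two-hour age check from fig. 6 would need additional logic in the command:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: clean-failed-pods               # hypothetical name
spec:
  schedule: "0 * * * *"                 # hourly, matching the cleanup interval above
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cleanup-sa     # assumed account with RBAC permission to delete Pods
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest  # any image providing kubectl
              command: ["/bin/sh", "-c"]
              args:
                - kubectl delete pods -n training --field-selector=status.phase=Failed
```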
When cleaning resources that are in a failed state or have timed out, PVs and PVCs also need to be cleaned periodically in addition to Pod groups. When performing PV resource cleaning, referring to fig. 7, fig. 7 shows a flowchart of PV resource cleaning according to an embodiment of the invention, where PV cleaning is performed once every hour. The PVs are first checked periodically, and it is judged whether the PV has a PVC bound to it; if so, the PV resource is retained, and if not, the PV resource is deleted. When cleaning PVC resources, referring to fig. 8, fig. 8 shows a flowchart of PVC resource cleaning provided in the first embodiment of the present invention, where PVC cleaning is performed once every hour. The PVCs are first checked periodically, and it is judged whether the PVC is bound to a Pod; if so, the PVC resource is retained, and if not, the PVC resource is deleted.
In addition to the above scheduled cleaning tasks, a scheduled task for training abnormality log monitoring is included. Referring to fig. 9, fig. 9 shows a flowchart of training abnormality log monitoring according to an embodiment of the present invention, where training abnormality log monitoring is performed once every 5 minutes. The training run log is first checked periodically and it is judged whether an abnormal log exists; if not, the training run log is retained and training continues, and if so, training is ended.
The above steps realize functions such as training abnormality log monitoring, scheduled resource cleaning and timeout cleaning by introducing the xxl-job framework to schedule the scheduled tasks. In the specific implementation, the ELK technology stack is used to collect and process the logs, and the Kubernetes API is used to query and operate on the resources, so that query and management of the model training logs can be realized.
To facilitate understanding of the disclosed model training method, it is described here in several parts, including creating an image, creating a model, creating a data set, creating a training task and running training. In the image creation part, referring to fig. 10, fig. 10 shows a flowchart of creating an image provided by the first embodiment of the present application: the image is first created, and it is then determined whether the image source is an internal image or a third-party image; if it is an internal image, the Harbor images are queried, and if it is a third-party image, the Harbor images are likewise queried; after the image query is completed, the image environment variables are configured, then the image run command is configured, and after the image is obtained by executing the command, the image is saved. In the model creation part, referring to fig. 11, fig. 11 shows a flowchart of creating a model according to the first embodiment of the present application: a model is first created and the model version is set; it is determined whether the configured model source is a local upload or an import from training; if a local upload, the model file is uploaded, and if an import from training, the training record is imported by selecting from a list; after the model is imported, the model classification is selected, and after classification is completed, the imported model is saved. In the data set creation part, referring to fig. 12, fig. 12 shows a flowchart of creating a data set according to the first embodiment of the present application: a data set is first created, then the data set version is set, and the data set file is saved after it is uploaded. In the training creation part, referring to fig. 13, fig. 13 shows a flowchart of creating a training task provided in the first embodiment of the present application: a training task is first created, and it is determined whether the task type is normal training (single-node training) or distributed training; if normal training, a normal training image is selected, and if distributed training, a distributed training image is selected; after the image is selected, the model classification is selected, the categories including text, graphics and voice; after the model is selected, the data set is selected, the image run command is configured, and the image environment variables are edited according to the run command configuration; finally, node specification configuration (the node specification includes system information such as CPU, memory and GPU) is carried out based on the edited image environment variables to obtain the model training task, and the model training task is saved.
In the running-training part, referring to fig. 14, fig. 14 shows a flowchart of running training provided in the first embodiment of the present application: after the user issues a model training instruction indicating that training should run, the K8S cluster obtains the image, obtains the data set and obtains the model to perform model training. During model training, K8S resource verification is performed (the resources include system information such as the CPU, memory and GPU of each node) and it is judged whether the resources are sufficient; if the resources are insufficient, the model training task is ended; if the resources are sufficient, it is judged whether the PV mount path is correct; if the PV mount path is incorrect, the model training task is ended; if the PV mount path is correct, the PV and PVC are created through K8S, the Job and Pod are created through K8S, and Pod log monitoring is performed; if an error is found in the log information during log monitoring, the model training task is ended, and if there is no error in the log information, the model training task continues; after the run completes, the whole model training task has been executed.
Based on the above description of the model training method disclosed by the application, performing model training with this method can improve model training efficiency, achieve scalability, provide monitoring and debugging functions, and make the training process and training results safer and more reliable.
In terms of improving model training efficiency, the application uses Kubernetes for model training, which can fully utilize cluster resources and realize distributed computation and concurrent training, thereby greatly improving training efficiency. For example, resource allocation can be adjusted automatically by means such as dynamic scaling, avoiding resource waste and bottleneck problems.
In terms of scalability, the Kubernetes-based model training scheme can quickly adapt to business growth and increases in data volume, supports training on multiple nodes, and can expand the cluster by adding new nodes, making the scheme more flexible and scalable.
In terms of providing monitoring and debugging functions, the application can use tools such as Prometheus and the ELK stack to monitor and statistically analyze resource usage, log records and the like during training, thereby enabling problem localization and debugging. Meanwhile, the xxl-job framework is introduced for task scheduling and management, making the training tasks more stable and reliable.
In terms of improving the safety and reliability of training, when the Kubernetes cluster is deployed, Harbor can be used to manage images to ensure the security of the system, and using MinIO for data storage can improve the reliability and security of the data, ensuring that the data are not lost or illegally stolen.
Therefore, the model training method disclosed by the application can practically solve problems of model training efficiency, scalability, security and the like in the machine learning field, and has remarkable beneficial effects.
Example two
Referring to fig. 15, fig. 15 shows a schematic structural diagram of a model training apparatus according to a second embodiment of the present invention, where the apparatus includes:
A cluster building module 1501, configured to build a Kubernetes cluster, where the Kubernetes cluster includes at least one node;
An image uploading module 1502, configured to package model code and its dependencies into a Docker image, and upload the Docker image to a Harbor image repository;
A task construction module 1503, configured to construct a model training task based on the Kubernetes cluster and the Harbor image repository, where the Kubernetes cluster provides the container resources for the model training task, and the Harbor image repository provides the training image for the model training task;
And the task execution module 1504 is configured to respond to a model training instruction input by a user, and execute the model training task through each target node indicated by the model training instruction.
In an alternative embodiment, the model training task includes a single-node training task and a distributed training task; when the model training task is the single-node training task, there is one target node, and when the model training task is the distributed training task, there are a plurality of target nodes.
In an alternative embodiment, the task execution module is further configured to:
Creating a MinIO object storage server and a PVC persistent storage volume declaration before executing the model training task by each target node indicated by the model training instruction;
Connecting the MinIO object storage server with the PVC persistent storage volume declaration;
Mounting the PVC into the Kubernetes cluster;
the task execution module is further configured to: and when each target node indicated by the model training instruction executes the model training task, storing model training data generated when each target node in the Kubernetes cluster executes the model training task to the MinIO through the PVC.
In an alternative embodiment, the task execution module is further configured to:
Collecting system information of each node when each target node indicated by the model training instruction executes the model training task, wherein the system information comprises CPU (central processing unit) utilization rate, memory utilization rate and disk utilization rate;
determining the alarm strategy of each node according to the system information of each node and the preset alarm rule;
and alarming based on the alarming strategy of each node.
In an alternative embodiment, the task execution module is further configured to:
And collecting log data generated when each target node executes the model training task when each target node indicated by the model training instruction executes the model training task.
In an alternative embodiment, the task execution module is further configured to:
When each target node indicated by the model training instruction executes the model training task, abnormal log data of ERROR level is screened out from log data generated when each target node executes the model training task at intervals of preset time length;
and sending the abnormal log data to a target mailbox.
In an alternative embodiment, the task execution module is further configured to:
When each target node indicated by the model training instruction executes the model training task, cleaning up, at intervals of a preset duration, resources that are in a failed state and have timed out from the Kubernetes cluster, where the resources include Pod container groups (see the cleanup sketch below).
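The cleanup step can be sketched with the official kubernetes Python client as follows; the namespace and the age threshold used to decide that a failed resource has timed out are hypothetical.

```python
# Sketch: periodically delete Pods that are in the Failed phase and older than
# a timeout threshold. Namespace and the age threshold are placeholders.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

def clean_failed_pods(namespace: str = "model-training",
                      max_age: timedelta = timedelta(hours=6)) -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    now = datetime.now(timezone.utc)
    for pod in core.list_namespaced_pod(namespace).items:
        too_old = (now - pod.metadata.creation_timestamp) > max_age
        if pod.status.phase == "Failed" and too_old:
            core.delete_namespaced_pod(name=pod.metadata.name, namespace=namespace)
```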
Example III
Based on the same application concept, referring to fig. 16, fig. 16 shows a schematic structural diagram of a computer device provided in a third embodiment of the present application, where, as shown in fig. 16, a computer device 1600 provided in the third embodiment of the present application includes:
A processor 1601, a memory 1602, and a bus 1603, where the memory 1602 stores machine-readable instructions executable by the processor 1601; when the computer device 1600 is running, the processor 1601 and the memory 1602 communicate via the bus 1603, and the machine-readable instructions are executed by the processor 1601 to perform the steps of the model training method described in the first embodiment.
Example IV
Based on the same application concept, the embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the model training method in any one of the above embodiments are performed.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
The computer program product for performing model training provided by the embodiment of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to perform the method described in the foregoing method embodiment, and for the specific implementation, reference may be made to the method embodiment, which is not repeated here.
The model training device provided by the embodiment of the invention may be specific hardware on a piece of equipment, or software or firmware installed on the equipment. The device provided by the embodiment of the present invention has the same implementation principle and technical effects as the foregoing method embodiment; for brevity, where the device embodiment is silent, reference may be made to the corresponding content in the foregoing method embodiment. It will be clear to those skilled in the art that, for convenience and brevity, the specific operation of the system, apparatus, and units described above may refer to the corresponding processes in the above method embodiment and is not described in detail here.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments provided in the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that like reference numerals and letters in the figures denote like items; once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Furthermore, the terms "first," "second," "third," etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above examples are only specific embodiments of the present invention, used to illustrate the technical solution of the present invention rather than to limit it, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing examples, those skilled in the art should understand that any person familiar with the art may still modify the technical solutions described in the foregoing embodiments, easily conceive of changes to them, or substitute equivalents for some of their technical features within the technical scope of the present disclosure; such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from their spirit and scope, and are all intended to be encompassed within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of model training, the method comprising:
constructing a Kubernetes cluster, wherein the Kubernetes cluster comprises at least one node;
Packaging model code and its dependencies into a Docker image, and uploading the Docker image to a Harbor image repository;
Constructing a model training task based on the Kubernetes cluster and the Harbor image repository, wherein the Kubernetes cluster provides the container resources for the model training task and the Harbor image repository provides the training image for the model training task;
Responding to a model training instruction input by a user, and executing the model training task through each target node indicated by the model training instruction.
2. The method of claim 1, wherein the model training tasks comprise a single-node training task and a distributed training task, the number of target nodes being one when the model training task is the single-node training task, and there being a plurality of target nodes when the model training task is the distributed training task.
3. The method of claim 1, wherein prior to performing the model training task by each target node indicated by the model training instructions, the method further comprises:
Creating a MinIO object storage server and a PVC (Persistent Volume Claim);
Connecting the MinIO object storage server with the PVC;
Mounting the PVC into the Kubernetes cluster;
In performing the model training task by each target node indicated by the model training instructions, the method further comprises:
Storing the model training data generated by each target node in the Kubernetes cluster when executing the model training task to MinIO through the PVC.
4. The method of claim 1, wherein upon execution of the model training task by each target node indicated by the model training instructions, the method further comprises:
Collecting system information of each node, wherein the system information includes CPU utilization, memory utilization, and disk utilization;
Determining an alarm strategy for each node according to the system information of each node and a preset alarm rule;
Issuing alarms based on the alarm strategy of each node.
5. The method of claim 1, wherein upon execution of the model training task by each target node indicated by the model training instructions, the method further comprises:
Collecting log data generated when each target node executes the model training task.
6. The method of claim 5, wherein upon execution of the model training task by each target node indicated by the model training instructions, the method further comprises:
Screening out ERROR-level abnormal log data from the log data generated when each target node executes the model training task, at preset time intervals;
Sending the abnormal log data to a target mailbox.
7. The method of claim 5, wherein upon execution of the model training task by each target node indicated by the model training instructions, the method further comprises:
Cleaning up, every preset time period, resources that are in a failed state and have timed out from the Kubernetes cluster, wherein the resources include a Pod container group.
8. A model training apparatus, the apparatus comprising:
The cluster building module is used for building a Kubernetes cluster, wherein the Kubernetes cluster comprises at least one node;
The image uploading module is used for packaging the model code and its dependencies into a Docker image and uploading the Docker image to the Harbor image repository;
The task construction module is used for constructing a model training task based on the Kubernetes cluster and the Harbor image repository, wherein the Kubernetes cluster provides the container resources for the model training task and the Harbor image repository provides the training image for the model training task;
The task execution module is used for responding to a model training instruction input by a user and executing the model training task through each target node indicated by the model training instruction.
9. A computer device, comprising: a processor, a memory and a bus, said memory storing machine readable instructions executable by said processor, said processor and said memory communicating over the bus when the computer device is running, said machine readable instructions when executed by said processor performing the steps of the model training method according to any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the model training method according to any of claims 1 to 7.
CN202410365832.7A 2024-03-28 2024-03-28 Model training method, device, computer equipment and readable storage medium Pending CN118095494A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410365832.7A CN118095494A (en) 2024-03-28 2024-03-28 Model training method, device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410365832.7A CN118095494A (en) 2024-03-28 2024-03-28 Model training method, device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN118095494A true CN118095494A (en) 2024-05-28

Family

ID=91143849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410365832.7A Pending CN118095494A (en) 2024-03-28 2024-03-28 Model training method, device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN118095494A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569987A (en) * 2021-08-19 2021-10-29 北京沃东天骏信息技术有限公司 Model training method and device
CN114841298A (en) * 2022-07-06 2022-08-02 山东极视角科技有限公司 Method and device for training algorithm model, electronic equipment and storage medium
CN116089011A (en) * 2023-01-19 2023-05-09 阳光电源(上海)有限公司 Method and device for creating mirror warehouse, storage medium and electronic equipment
CN116450355A (en) * 2023-04-21 2023-07-18 重庆长安汽车股份有限公司 Multi-cluster model training method, device, equipment and medium
CN116467065A (en) * 2023-02-20 2023-07-21 重庆长安汽车股份有限公司 Algorithm model training method and device, electronic equipment and storage medium
CN117132890A (en) * 2023-08-23 2023-11-28 中国地质大学(武汉) Remote sensing image target detection method and system based on Kubernetes edge computing cluster


Similar Documents

Publication Publication Date Title
US11354131B2 (en) Determining problem dependencies in application dependency discovery, reporting, and management tool
US11379292B2 (en) Baseline modeling for application dependency discovery, reporting, and management tool
US10929278B2 (en) Intelligent services for application dependency discovery, reporting, and management tool
US11354222B2 (en) Discovery crawler for application dependency discovery, reporting, and management tool
US10747544B1 (en) Dependency analyzer in application dependency discovery, reporting, and management tool
US10915428B2 (en) Intelligent services and training agent for application dependency discovery, reporting, and management tool
US11093378B2 (en) Testing agent for application dependency discovery, reporting, and management tool
JP5148607B2 (en) Automation of standard operating procedures in database management
US10489232B1 (en) Data center diagnostic information
CN104731580A (en) Automation operation and maintenance system based on Karaf and ActiveMQ and implement method thereof
JP2009519544A (en) Automated software testing framework
US20190272224A1 (en) Establishing and monitoring programming environments
CN114356499A (en) Kubernetes cluster alarm root cause analysis method and device
EP4242849A2 (en) Determining problem dependencies in application dependency discovery, reporting, and management tool
CN117234660A (en) Method for deploying and operating software under micro-service architecture based on Docker container technology
CN114503132A (en) Debugging and profiling of machine learning model training
CN116627437A (en) Deployment method and device of Airflow service, storage medium and computer equipment
CN118095494A (en) Model training method, device, computer equipment and readable storage medium
CN115080309A (en) Data backup system, method, storage medium, and electronic device
CN113824601A (en) Electric power marketing monitored control system based on service log
CN112596750A (en) Application testing method and device, electronic equipment and computer readable storage medium
CN118245346A (en) Automatic operation and maintenance management system and method oriented to operating system
CN116048656A (en) Method, device and medium for deploying system components
CN117349137A (en) Automatic testing method, device, equipment and storage medium of distributed system
CN115794311A (en) Method and system for storing Docker container data based on Prometheus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination