CN112241368A - Kubernetes-based automatic model training method and device - Google Patents

Kubernetes-based automatic model training method and device

Info

Publication number
CN112241368A
CN112241368A · CN202011065445.XA · CN202011065445A
Authority
CN
China
Prior art keywords
training
data set
training data
script
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011065445.XA
Other languages
Chinese (zh)
Inventor
刘润芝 (Liu Runzhi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moviebook Technology Corp ltd
Original Assignee
Beijing Moviebook Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moviebook Technology Corp ltd filed Critical Beijing Moviebook Technology Corp ltd
Priority to CN202011065445.XA priority Critical patent/CN112241368A/en
Publication of CN112241368A publication Critical patent/CN112241368A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/36 - Preventing errors by testing or debugging software
    • G06F11/3664 - Environments for testing or debugging software
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/36 - Preventing errors by testing or debugging software
    • G06F11/3668 - Software testing
    • G06F11/3672 - Test management
    • G06F11/368 - Test management for test version control, e.g. updating test cases to a new software version
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/36 - Preventing errors by testing or debugging software
    • G06F11/3668 - Software testing
    • G06F11/3672 - Test management
    • G06F11/3692 - Test management for test results analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 - Arrangements for executing specific programs
    • G06F9/455 - Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 - Hypervisors; Virtual machine monitors
    • G06F9/45558 - Hypervisor-specific management and integration aspects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 - Arrangements for executing specific programs
    • G06F9/455 - Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 - Hypervisors; Virtual machine monitors
    • G06F9/45558 - Hypervisor-specific management and integration aspects
    • G06F2009/45591 - Monitoring or debugging support

Abstract

The application discloses a Kubernetes-based automated model training method and device, relating to the field of artificial intelligence. The method comprises the following steps: collating the training data set and the training script; creating a virtualized container using Kubernetes; installing and executing the required environment in the virtualized container; copying the training data set into the virtualized container; invoking the training script to automatically execute the training code using the training data set; and recording the log of the training and saving the training result. The device comprises: a collation module, a creation module, an installation module, a copy module, a training module and a storage module. The method and the device realize automated training of a deep learning network model based on the Kubernetes mechanism, reduce redundant and repetitive manual operations, and improve the efficiency of model training.

Description

Kubernetes-based automatic model training method and device
Technical Field
The application relates to the field of artificial intelligence, in particular to an automatic model training method and device based on Kubernetes.
Background
With the rapid development of machine learning and artificial intelligence, many open-source machine learning platforms have emerged in the industry. At present, in most scenarios one or more machines are still shared by multiple users. Before each round of data training, a user manually logs in to a remote server, downloads the code, and configures the environment and code packages as required, during which version conflicts between installation packages easily occur; the process is tedious and time-consuming. For example, some code depends on CUDA 9 while other code requires CUDA 10, and different versions of deep learning frameworks are also depended on, such as PyTorch 0.x, PyTorch 1.x, or TensorFlow 1.8.
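The version-conflict problem described here can be made concrete with a small helper that detects packages pinned to different versions across requirement sets; the dict-based requirement format is an illustrative assumption, not part of the application.

```python
def find_conflicts(*requirement_sets):
    """Return packages that are pinned to different versions across the given
    requirement sets (one set per user or task sharing a machine)."""
    pinned = {}
    for requirements in requirement_sets:
        for package, version in requirements.items():
            pinned.setdefault(package, set()).add(version)
    # keep only packages requested at more than one version
    return {pkg: sorted(versions)
            for pkg, versions in pinned.items() if len(versions) > 1}
```

On a shared machine, two tasks pinning `torch==0.4.1` and `torch==1.6.0` collide; with one virtualized container per training task, each container holds a single requirement set and such conflicts cannot arise.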
Training a deep learning network model refers to fitting the model on labeled data and then predicting the labels of unknown data. For multidimensional model training, if a traditional manual deployment mode is adopted, that is, building the environment and downloading the code package separately for each dimension, the training stage must rely on various manual SSH operations, and the entire training process requires manual participation.
Disclosure of Invention
It is an object of the present application to overcome the above problems, or at least partially solve or mitigate them.
According to one aspect of the application, an automated model training method based on Kubernetes is provided, and comprises the following steps:
collating the training data set and the training script;
creating a virtualized container using Kubernetes;
installing and executing a desired environment in the virtualized container;
copying the training data set into the virtualized container;
calling the training script to automatically execute training codes by using the training data set;
and recording the log of the training and storing the training result.
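As a sketch only, the six steps above can be wired together as follows; the Kubernetes container is simulated by a throwaway directory, and the file names and the in-process script execution are illustrative assumptions. A real deployment would call the Kubernetes API instead.

```python
# A minimal sketch of the six method steps, with the virtualized container
# simulated by a sandbox directory; all names and paths are assumptions.
import json
import runpy
import shutil
import tempfile
from pathlib import Path

def collate(shared: Path, dataset: dict, script: str) -> None:
    """Step 1: save the training data set and training script to shared storage."""
    (shared / "dataset.json").write_text(json.dumps(dataset))
    (shared / "train.py").write_text(script)

def create_container() -> Path:
    """Step 2: stand-in for creating a virtualized container via Kubernetes."""
    return Path(tempfile.mkdtemp(prefix="train-pod-"))

def install_env(container: Path, packages: list) -> None:
    """Step 3: pin the single environment required inside the container."""
    (container / "requirements.txt").write_text("\n".join(packages))

def copy_dataset(shared: Path, container: Path) -> None:
    """Step 4: copy the data set and script from shared storage into the container."""
    for name in ("dataset.json", "train.py"):
        shutil.copy(shared / name, container / name)

def run_training(container: Path) -> dict:
    """Step 5: invoke the training script (executed in-process in this sketch)."""
    globs = runpy.run_path(str(container / "train.py"),
                           init_globals={"WORKDIR": container})
    return globs["result"]

def record(shared: Path, container: Path, result: dict) -> None:
    """Step 6: record the log of this training and save the training result."""
    (shared / "train.log").write_text(f"trained in {container.name}: {result}\n")
    (shared / "result.json").write_text(json.dumps(result))

def automated_training(shared: Path, dataset: dict, script: str) -> dict:
    collate(shared, dataset, script)
    container = create_container()
    try:
        install_env(container, ["torch==1.6.0"])
        copy_dataset(shared, container)
        result = run_training(container)
        record(shared, container, result)
        return result
    finally:
        shutil.rmtree(container, ignore_errors=True)  # destroy the container
```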
Optionally, collating the training data set and the training script comprises:
the training data set and training script are saved in a designated location of the shared storage.
Optionally, copying the training data set into the virtualized container comprises:
copying the training data set from a specified location of the shared storage into the virtualized container.
Optionally, the method further comprises:
destroying the virtualized container using the Kubernets.
Optionally, the method further comprises:
and querying the logs of historical training, and comparing the differences between the current training and the historical training.
According to another aspect of the present application, there is provided a Kubernetes-based automated model training apparatus, including:
a collation module configured to collate the training data set and the training script;
a creation module configured to create a virtualized container using Kubernetes;
an installation module configured to install and execute a desired environment in the virtualized container;
a copy module configured to copy the training data set into the virtualized container;
a training module configured to invoke the training script to automatically execute training code using the training data set;
and a storage module configured to record the log of the training and store the training result.
Optionally, the collation module is specifically configured to:
the training data set and training script are saved in a designated location of the shared storage.
Optionally, the copy module is specifically configured to:
copying the training data set from a specified location of the shared storage into the virtualized container.
Optionally, the apparatus further comprises:
a destruction module configured to destroy the virtualized container using Kubernetes.
Optionally, the apparatus further comprises:
a comparison module configured to query logs of historical training and compare differences between the current training and the historical training.
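The modular decomposition above can be sketched as a class whose modules are injected callables; the Python wiring is an illustrative assumption and only mirrors the order in which the method invokes the modules.

```python
# Sketch of the apparatus: one attribute per module, all injected as callables.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TrainingDevice:
    collation_module: Callable[[], None]
    creation_module: Callable[[], Any]
    installation_module: Callable[[Any], None]
    copy_module: Callable[[Any], None]
    training_module: Callable[[Any], Any]
    storage_module: Callable[[Any], None]

    def run(self) -> Any:
        """Drive the modules in the order given by the method steps."""
        self.collation_module()
        container = self.creation_module()
        self.installation_module(container)
        self.copy_module(container)
        result = self.training_module(container)
        self.storage_module(result)
        return result
```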
According to yet another aspect of the application, there is provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method as described above when executing the computer program.
According to yet another aspect of the application, a computer-readable storage medium, preferably a non-volatile readable storage medium, is provided, having stored therein a computer program which, when executed by a processor, implements a method as described above.
According to yet another aspect of the application, there is provided a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method described above.
In the technical scheme provided by the application, the training data set and the training script are collated, a virtualized container is created using Kubernetes, the required environment is installed and executed in the virtualized container, the training data set is copied into the virtualized container, the training script is invoked to automatically execute the training code using the training data set, the log of the training is recorded, and the training result is saved. This realizes automated training of a deep learning network model based on the Kubernetes mechanism and reduces redundant and repetitive manual operations; compared with the prior art, in which every stage of training requires manual participation, the efficiency of model training is greatly improved.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram of a Kubernetes-based automated model training method according to one embodiment of the present application;
FIG. 2 is a flow diagram of a Kubernetes-based automated model training method according to another embodiment of the present application;
FIG. 3 is a block diagram of a Kubernetes-based automated model training apparatus according to another embodiment of the present application;
FIG. 4 is a block diagram of a computing device according to another embodiment of the present application;
fig. 5 is a diagram of a computer-readable storage medium structure according to another embodiment of the present application.
Detailed Description
The embodiments of the application are implemented based on Kubernetes, an open-source system for managing containerized applications across multiple hosts in a cloud platform. Its aim is to make deploying containerized applications simple and efficient, and it provides mechanisms for deploying, planning, updating and maintaining applications. Kubernetes exposes an API for life-cycle management; based on this API, a user can complete a model training task with one click, which greatly improves user experience and efficiency.
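For illustration, a training task submitted through that API could be described by a batch/v1 Job manifest such as the following, built here as a plain dict; the image, claim name, and mount paths are assumptions, not values from the application.

```python
def training_job_manifest(name, image, script_path, data_path):
    """Build a Kubernetes batch/v1 Job manifest (as a plain dict) that runs one
    training task to completion and mounts the shared training data."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",  # one-shot training run
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "command": ["python", script_path],
                        "volumeMounts": [{"name": "shared",
                                          "mountPath": data_path}],
                    }],
                    # shared storage holding the collated data set and script
                    "volumes": [{"name": "shared",
                                 "persistentVolumeClaim":
                                     {"claimName": "training-data"}}],
                }
            }
        },
    }
```

Submitting such a manifest (e.g. via `kubectl apply` or a Kubernetes client library) is what "one-click" training would amount to in practice.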
FIG. 1 is a flow chart of a Kubernetes-based automated model training method according to one embodiment of the present application. Referring to fig. 1, the method includes:
101: collating the training data set and the training script;
the training data set and the training script may be set as needed, which is not specifically limited in this embodiment.
102: creating a virtualized container using Kubernetes;
103: installing and executing a desired environment in the virtualized container;
104: copying the training data set into a virtualized container;
105: calling a training script to automatically execute the training code by using the training data set;
in the embodiment, the training process is automatically completed by the execution of the training script without manual operation, so that the dependence on manpower is reduced, and the efficiency is improved.
106: and recording the log of the training and storing the training result.
In this embodiment, optionally, collating the training data set and the training script includes:
the training data set and training script are saved in a designated location of the shared storage.
In this embodiment, optionally, copying the training data set into the virtualized container includes:
the training data set is copied from the specified location of the shared storage into the virtualized container.
In this embodiment, optionally, the method further includes:
the virtualized container is destroyed using Kubernetes.
In this embodiment, optionally, the method further includes:
and inquiring logs of historical training, and comparing the difference between the training and the historical training.
In the method provided by this embodiment, the training data set and the training script are collated, a virtualized container is created using Kubernetes, the required environment is installed and executed in the virtualized container, the training data set is copied into the virtualized container, the training script is invoked to automatically execute the training code using the training data set, the log of the training is recorded, and the training result is saved, thereby realizing automated training of a deep learning network model based on the Kubernetes mechanism.
FIG. 2 is a flow diagram of a Kubernetes-based automated model training method according to another embodiment of the present application. Referring to fig. 2, the method includes:
201: saving the training data set and the training script in a designated location of the shared storage;
the designated location of the shared storage may be set as needed, and is not limited specifically.
202: creating a virtualized container using Kubernetes;
203: installing and executing a desired environment in the virtualized container;
in this embodiment, because the required environment is installed in the virtualization container, which is a single environment installation, and does not involve installation of multiple environments, the problem of version conflict of multiple installation packages is avoided.
204: copying a training data set from a designated location of a shared storage into a virtualized container;
205: calling a training script to automatically execute the training code by using the training data set;
206: recording the log of the training and storing the training result;
after each training is finished, the log of the training is recorded, so that the training results are conveniently compared, and the difference of different training processes is determined for improvement and perfection.
207: destroying the virtualized container using Kubernetes;
208: and inquiring logs of historical training, and comparing the difference between the training and the historical training.
Further, on the basis of comparing the differences between the current training and the historical training, the training data set and the training script can be adjusted so as to obtain a better training effect.
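A comparison of the current log against a historical log could look like the following sketch; the key=value log format and the metric names are illustrative assumptions, not a format defined by the application.

```python
import re

def compare_runs(current_log, history_log):
    """Diff 'metric=value' entries between the current and a historical
    training log; positive deltas mean the current run's value is higher."""
    def parse(text):
        return {m.group(1): float(m.group(2))
                for m in re.finditer(r"(\w+)=([-\d.]+)", text)}
    cur, hist = parse(current_log), parse(history_log)
    # only metrics present in both runs are comparable
    return {k: cur[k] - hist[k] for k in cur.keys() & hist.keys()}
```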
In the method provided by this embodiment, the training data set and the training script are collated, a virtualized container is created using Kubernetes, the required environment is installed and executed in the virtualized container, the training data set is copied into the virtualized container, the training script is invoked to automatically execute the training code using the training data set, the log of the training is recorded, and the training result is saved, thereby realizing automated training of a deep learning network model based on the Kubernetes mechanism.
FIG. 3 is a block diagram of a Kubernetes-based automated model training apparatus according to another embodiment of the present application. Referring to fig. 3, the apparatus includes:
a collating module 301 configured to collate the training data set and the training script;
a creation module 302 configured to create a virtualized container using Kubernetes;
an installation module 303 configured to install and execute a desired environment in the virtualized container;
a copy module 304 configured to copy the training data set into a virtualized container;
a training module 305 configured to invoke a training script to automatically execute training code using a training data set;
and a storage module 306 configured to record a log of the current training and store a result of the current training.
In this embodiment, optionally, the collation module is specifically configured to:
the training data set and training script are saved in a designated location of the shared storage.
In this embodiment, optionally, the copy module is specifically configured to:
the training data set is copied from the specified location of the shared storage into the virtualized container.
In this embodiment, optionally, the apparatus further includes:
a destruction module configured to destroy the virtualized container using Kubernetes.
In this embodiment, optionally, the apparatus further includes:
a comparison module configured to query logs of historical training and compare differences between the current training and the historical training.
The apparatus provided in this embodiment may perform the method provided in any of the above method embodiments, and details of the process are described in the method embodiments and are not described herein again.
According to the device provided by this embodiment, the training data set and the training script are collated, a virtualized container is created using Kubernetes, the required environment is installed and executed in the virtualized container, the training data set is copied into the virtualized container, the training script is invoked to automatically execute the training code using the training data set, the log of the training is recorded, and the training result is saved. Automated training of a deep learning network model based on the Kubernetes mechanism is thus realized and redundant and repetitive manual operations are reduced; compared with the prior art, in which all stages of training require manual participation, the efficiency of model training is greatly improved.
The embodiments of the application also provide a computing device. Referring to fig. 4, the computing device comprises a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and executable by the processor 1110. The computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, performs the method steps 1131 of any of the methods described above.
The embodiments of the application also provide a computer-readable storage medium. Referring to fig. 5, the computer-readable storage medium comprises a storage unit for program code, provided with a program 1131' for performing the steps of the method according to the present application, which program is executed by a processor.
The embodiments of the application also provide a computer program product containing instructions. When run on a computer, the instructions cause the computer to carry out the steps of the method according to the present application.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, the computer instructions produce, in whole or in part, the procedures or functions described in the embodiments of the application. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that incorporates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An automated model training method based on Kubernetes comprises the following steps:
collating the training data set and the training script;
creating a virtualized container using Kubernetes;
installing and executing a desired environment in the virtualized container;
copying the training data set into the virtualized container;
calling the training script to automatically execute training codes by using the training data set;
and recording the log of the training and storing the training result.
2. The method of claim 1, wherein collating the training data set and the training script comprises:
the training data set and training script are saved in a designated location of the shared storage.
3. The method of claim 2, wherein copying the training data set into the virtualized container comprises:
copying the training data set from a specified location of the shared storage into the virtualized container.
4. The method of claim 1, further comprising:
destroying the virtualized container using Kubernetes.
5. The method according to any one of claims 1-4, further comprising:
and inquiring logs of historical training, and comparing the difference between the training and the historical training.
6. An automated Kubernetes-based model training device, comprising:
a collation module configured to collate the training data set and the training script;
a creation module configured to create a virtualized container using Kubernetes;
an installation module configured to install and execute a desired environment in the virtualized container;
a copy module configured to copy the training data set into the virtualized container;
a training module configured to invoke the training script to automatically execute training code using the training data set;
and a storage module configured to record the log of the training and store the training result.
7. The apparatus of claim 6, wherein the collation module is specifically configured to:
the training data set and training script are saved in a designated location of the shared storage.
8. The apparatus of claim 7, wherein the copy module is specifically configured to:
copying the training data set from a specified location of the shared storage into the virtualized container.
9. The apparatus of claim 6, further comprising:
a destruction module configured to destroy the virtualized container using Kubernetes.
10. The apparatus according to any one of claims 6-9, further comprising:
a comparison module configured to query logs of historical training and compare differences between the current training and the historical training.
CN202011065445.XA 2020-09-30 2020-09-30 Kubernetes-based automatic model training method and device Pending CN112241368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011065445.XA CN112241368A (en) 2020-09-30 2020-09-30 Kubernetes-based automatic model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011065445.XA CN112241368A (en) 2020-09-30 2020-09-30 Kubernetes-based automatic model training method and device

Publications (1)

Publication Number Publication Date
CN112241368A true CN112241368A (en) 2021-01-19

Family

ID=74168558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011065445.XA Pending CN112241368A (en) 2020-09-30 2020-09-30 Kubernetes-based automatic model training method and device

Country Status (1)

Country Link
CN (1) CN112241368A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408062A (en) * 2018-11-01 2019-03-01 郑州云海信息技术有限公司 A kind of method and apparatus of automatic deployment model training environment
US20190228303A1 (en) * 2018-01-25 2019-07-25 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for scheduling resource for deep learning framework
CN110378463A (en) * 2019-07-15 2019-10-25 北京智能工场科技有限公司 A kind of artificial intelligence model standardized training platform and automated system
CN110825705A (en) * 2019-11-22 2020-02-21 广东浪潮大数据研究有限公司 Data set caching method and related device
CN111090438A (en) * 2019-11-07 2020-05-01 苏州浪潮智能科技有限公司 Method, equipment and medium for FPGA virtualization training based on kubernets
CN111209077A (en) * 2019-12-26 2020-05-29 中科曙光国际信息产业有限公司 Deep learning framework design method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190228303A1 (en) * 2018-01-25 2019-07-25 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for scheduling resource for deep learning framework
CN109408062A (en) * 2018-11-01 2019-03-01 郑州云海信息技术有限公司 A kind of method and apparatus of automatic deployment model training environment
CN110378463A (en) * 2019-07-15 2019-10-25 北京智能工场科技有限公司 A kind of artificial intelligence model standardized training platform and automated system
CN111090438A (en) * 2019-11-07 2020-05-01 苏州浪潮智能科技有限公司 Method, equipment and medium for FPGA virtualization training based on kubernets
CN110825705A (en) * 2019-11-22 2020-02-21 广东浪潮大数据研究有限公司 Data set caching method and related device
CN111209077A (en) * 2019-12-26 2020-05-29 中科曙光国际信息产业有限公司 Deep learning framework design method

Similar Documents

Publication Publication Date Title
CN107766126B (en) Container mirror image construction method, system and device and storage medium
US11403094B2 (en) Software pipeline configuration
US10489591B2 (en) Detection system and method thereof
US10416979B2 (en) Package installation on a host file system using a container
CN108319858B (en) Data dependency graph construction method and device for unsafe function
US20140372998A1 (en) App package deployment
US10185648B2 (en) Preservation of modifications after overlay removal from a container
CN106371881B (en) Method and system for updating program version in server
US9542173B2 (en) Dependency handling for software extensions
US11144292B2 (en) Packaging support system and packaging support method
US20170371641A1 (en) Multi-tenant upgrading
CN110442371A (en) A kind of method, apparatus of release code, medium and computer equipment
CN112214221B (en) Method and equipment for constructing Linux system
CN112702195A (en) Gateway configuration method, electronic device and computer readable storage medium
CN114281653B (en) Application program monitoring method and device and computing equipment
CN111651352A (en) Warehouse code merging method and device
CN113743895A (en) Component management method and device, computer equipment and storage medium
WO2019041891A1 (en) Method and device for generating upgrade package
CN112241368A (en) Kubernetes-based automatic model training method and device
CN115145634A (en) System management software self-adaption method, device and medium
US11562105B2 (en) System and method for module engineering with sequence libraries
CN115964061A (en) Plug-in updating method and device, electronic equipment and computer readable storage medium
CN114358302A (en) Artificial intelligence AI training method, system and equipment
CN112181606A (en) Container configuration updating method, device and system, storage medium and electronic equipment
CN114253595A (en) Code warehouse management method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination