WO2020186899A1 - Method and apparatus for extracting metadata in machine learning training process - Google Patents

Method and apparatus for extracting metadata in machine learning training process Download PDF

Info

Publication number
WO2020186899A1
WO2020186899A1 · PCT/CN2020/070577
Authority
WO
WIPO (PCT)
Prior art keywords
metadata
machine learning
training
type
container
Prior art date
Application number
PCT/CN2020/070577
Other languages
French (fr)
Chinese (zh)
Inventor
Liu Yedong (刘烨东)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2020186899A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45562Creating, deleting, cloning virtual machine instances
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45583Memory management, e.g. access or allocation

Definitions

  • This application relates to the field of cloud computing, and more specifically, to a method, device, and computer-readable storage medium for extracting metadata in a machine learning training process.
  • Machine learning is a multidisciplinary subject that studies how computers can simulate or implement human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications cover all areas of artificial intelligence.
  • the workflow of a machine learning task can include environment construction, model training process, and model inference process.
  • The trained model will be provided to other developers. If other developers want to reproduce the training process, they need to fully reproduce the source development environment. However, reproducing the source development environment requires other developers to spend a lot of time building and debugging a training environment compatible with the target machine learning task, which greatly hinders the dissemination of the model.
  • This application provides a method and device for extracting metadata in the machine learning training process. During the training of a target machine learning task, the relevant metadata needed to reproduce the specific training environment can be extracted automatically. When other developers want to reproduce that training environment, they can do so according to the stored metadata, which speeds up the dissemination of the model.
  • a method for extracting metadata in a machine learning training process is provided.
  • The method is applied to a virtualized environment and includes: running a machine learning task in the virtualized environment according to machine learning program code input by a user; extracting metadata from the machine learning program code, the metadata being used to reproduce the operating environment of the machine learning task; and storing the metadata in a first storage space.
  • the metadata is extracted from the machine learning program code according to the type of the metadata by way of keyword search.
  • the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the first type of metadata.
  • the first type of metadata may be extracted from the input training container startup script according to the type of the first type of metadata, and the training container startup script is used to start the at least one training container.
  • the type of the first type of metadata includes any one or more of the following: the framework used by the machine learning task, the model used by the machine learning task, and the dataset used in the training process of the machine learning task.
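  • The keyword-search extraction of the first type of metadata can be sketched as follows. This is a minimal illustration, not the claimed implementation: the keyword patterns, the script format, and all names (`framework=`, `model=`, `dataset=`, the image names) are hypothetical stand-ins for whatever fields a real training container startup script would carry.

```python
import re

# Hypothetical keyword patterns for the first type of metadata; a real
# extractor would use whatever field names the startup script actually carries.
FIRST_TYPE_KEYWORDS = {
    "framework": re.compile(r"framework[:=]\s*(\S+)"),
    "model": re.compile(r"model[:=]\s*(\S+)"),
    "dataset": re.compile(r"dataset[:=]\s*(\S+)"),
}

def extract_first_type_metadata(startup_script: str) -> dict:
    """Scan a training-container startup script line by line and keep the
    first match found for each metadata type (keyword search)."""
    metadata = {}
    for line in startup_script.splitlines():
        for meta_type, pattern in FIRST_TYPE_KEYWORDS.items():
            if meta_type not in metadata:
                match = pattern.search(line)
                if match:
                    metadata[meta_type] = match.group(1)
    return metadata

# Hypothetical startup script for the training container.
script = """docker run --name train \\
    -e framework=tensorflow-1.13 \\
    -e model=resnet50 \\
    -e dataset=imagenet-2012 my-training-image"""
print(extract_first_type_metadata(script))
```

A real extractor would match the description standards of Table 1 to Table 3 and tolerate the startup script's actual syntax.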
  • the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the second type of metadata.
  • the second type of metadata may be extracted from the input training program code according to the type of the second type of metadata; the training program code is stored in a second storage space mounted on the at least one training container and is used to run the model training process of the machine learning task in the at least one training container.
  • the type of the second type of metadata includes any one or more of the following: the processing method of the dataset used in the training process of the machine learning task, the structure of the model used in the training process of the machine learning task, and the training parameters used in the training process of the machine learning task.
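  • The second type of metadata can be pulled out of the training program code in the same way. The sketch below assumes training parameters appear as simple `name = value` assignments and searches the code for a few hypothetical parameter names; the real fields are those of the description standards in Table 4 to Table 6.

```python
import re

# Hypothetical parameter names; a real extractor would follow the description
# standards for dataset processing, model structure, and training parameters.
TRAINING_PARAM_KEYWORDS = {
    "learning_rate": re.compile(r"learning_rate\s*=\s*([0-9.eE+-]+)"),
    "batch_size": re.compile(r"batch_size\s*=\s*(\d+)"),
    "epochs": re.compile(r"epochs\s*=\s*(\d+)"),
}

def extract_training_parameters(training_code: str) -> dict:
    """Keyword-search the training program code for training parameters."""
    params = {}
    for name, pattern in TRAINING_PARAM_KEYWORDS.items():
        match = pattern.search(training_code)
        if match:
            params[name] = match.group(1)
    return params

# Hypothetical fragment of training program code stored in the mounted storage.
code = """
learning_rate = 0.001
batch_size = 32
epochs = 10
"""
print(extract_training_parameters(code))
```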
  • a device for extracting metadata in a machine learning training process runs in a virtualized environment, and the device includes:
  • the running module is used to run the machine learning task in the virtualized environment according to the machine learning program code input by the user;
  • a metadata extraction module configured to extract metadata from the machine learning program code, the metadata being used to reproduce the operating environment of the machine learning task;
  • the metadata extraction module is further configured to store the metadata in the first storage space.
  • the metadata extraction module is specifically configured to extract the metadata from the machine learning program code according to the type of the metadata by way of keyword search.
  • the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the first type of metadata;
  • the metadata extraction module is specifically configured to: extract the first type of metadata from the input training container startup script according to the type of the first type of metadata, where the training container startup script is used to start the at least one training container.
  • the type of the first type of metadata includes any one or more of the following: the framework used by the machine learning task, the model used by the machine learning task, and the dataset used in the training process of the machine learning task.
  • the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the second type of metadata;
  • the metadata extraction module is specifically configured to: extract the metadata from the input training program code according to the type of the second type of metadata, and the training program code is stored in the at least one training container mounted In the second storage space, the training program code is used to run the model training process of the machine learning task in the at least one training container.
  • the type of the second type of metadata includes any one or more of the following: the processing method of the dataset used in the training process of the machine learning task, the structure of the model used in the training process of the machine learning task, and the training parameters used in the training process of the machine learning task.
  • A system for extracting metadata in a machine learning training process includes at least one server, where each server includes a memory and at least one processor, and the memory stores program instructions. When the system runs, the at least one processor executes the program instructions in the memory to perform the method in the first aspect or any possible implementation of the first aspect, or to implement the running module and metadata extraction module in the second aspect or any possible implementation of the second aspect.
  • the running module may run on the multiple servers, and the metadata extraction module may run on each of the multiple servers.
  • the metadata extraction module may run on a part of multiple servers.
  • the metadata extraction module may run on any server other than the above-mentioned multiple servers.
  • The processor may be implemented by hardware or by software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented by software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, where the memory may be integrated in the processor or may be located outside the processor and exist independently.
  • a non-transitory readable storage medium including program instructions.
  • When the program instructions are executed by a computer, the computer executes the method in the first aspect or any possible implementation of the first aspect.
  • a computer program product including program instructions.
  • When the program instructions are executed by a computer, the computer executes the method in the first aspect or any possible implementation of the first aspect.
  • Fig. 1 is a schematic block diagram of an apparatus 100 for running a machine learning task provided by an embodiment of the present application.
  • Fig. 2 is a schematic flowchart of a machine learning task execution provided by an embodiment of the present application.
  • FIG. 3 is a schematic block diagram of a container environment 300 provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for extracting metadata by a metadata extraction module according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a system 500 for extracting metadata in a machine learning training process provided by an embodiment of the present application.
  • Fig. 1 is a schematic block diagram of an apparatus 100 for running a machine learning task provided by an embodiment of the present application.
  • the device 100 may include an operating module 110, a metadata extraction module 120, and a second storage space mounted by the metadata extraction module 120.
  • the above modules are described in detail below.
  • the running module 110 may include multiple sub-modules, such as an environment construction sub-module 111, a training sub-module 112, an inference sub-module 113, and an environment destruction sub-module 114.
  • The running module 110 may be implemented by a container or, for example, by a virtual machine; the embodiments of this application do not specifically restrict this.
  • the environment building submodule 111 is used to build a training environment for machine learning tasks.
  • the construction of the machine learning task environment is actually the scheduling of computer hardware resources.
  • the hardware resources may include, but are not limited to: computing resources and storage resources.
  • Container technology, represented by docker, has gradually matured; it uses images to create a virtualized operating environment.
  • Related components can be deployed in the container.
  • Container technology provides computing and storage resources by directly calling the computing and storage resources of the physical machine, thereby providing hardware resources for machine learning tasks.
  • the open source container scheduling platform represented by kubernetes can effectively manage containers.
  • docker is an open source application container engine that allows source developers to package their applications and dependent packages into a portable container, and then publish it to any popular Linux machine, which can also be virtualized.
  • the following uses a container as an example to describe in detail the technical solutions provided by the embodiments of the present application.
  • If the virtualization environment where the running module 110 and the metadata extraction module 120 are located is a virtual machine, the running module 110, the metadata extraction module 120, and their submodules may be implemented by the virtual machine.
  • The embodiments of this application do not specifically limit the computing resources; they may be a central processing unit (CPU) or a graphics processing unit (GPU).
  • The source developer can pull the container image of a packaged related component, such as the container image of the training component, into the container environment. Through a command line or container startup script input by the source developer, the training container is created and started, and the model training process is performed in the training container.
  • the training sub-module 112 can run in the container environment built above, and perform the model training process according to the training program code input by the source developer.
  • Source developers can use network file system (NFS) shared storage or other storage products on the cloud platform, for example a distributed file system (DFS), to store the training program code in the first storage space 115.
  • the training sub-module 112 may train the model according to the stored training program code obtained in the first storage space 115.
  • the training sub-module 112 may also store the trained model in the first storage space 115 during the training process.
  • the inference sub-module 113 can access the first storage space 115, and can perform an inference process based on the trained model stored in the first storage space 115. Specifically, the inference sub-module 113 may determine the predicted output value according to the input training data and the trained model. And it can be determined whether the model trained by the training sub-module 112 is correct according to the error between the predicted output value and the prior knowledge of the training data.
  • prior knowledge is also called ground truth, and generally includes prediction results corresponding to training data provided by people.
  • the training data input to the model trained by the training sub-module 112 is the pixel information of the image, and the prior knowledge corresponding to the training data is that the label of the image is "dog".
  • the environment destruction sub-module 114 can destroy the created container environment.
  • the first storage space 115 will not be destroyed, and the trained model is stored in the first storage space 115, so that the inference sub-module 113 can perform the inference process according to the stored trained model.
  • Metadata extraction module 120
  • The metadata extraction module 120 can automatically extract metadata from the machine learning program code input by the source developer while the running module 110 performs the machine learning task, and the metadata can be used to reproduce the operating environment of the machine learning task.
  • The metadata extraction module 120 may also generate a description file from the extracted metadata and store it in the second storage space 121. When other developers want to reproduce the running environment of the above machine learning task, they can obtain the stored description file from the second storage space 121 and directly configure and debug the development environment according to the metadata included in the description file, thereby reproducing the target training environment and accelerating the dissemination of the model.
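  • A minimal sketch of generating such a description file, assuming a JSON serialization and hypothetical field values (the application does not fix the file format):

```python
import json

# Hypothetical extracted metadata; field names loosely follow Table 1 to Table 3.
metadata = {
    "framework": {"name": "tensorflow", "version": "1.13"},
    "model": {"name": "resnet50", "version": "v1", "source": "public"},
    "dataset": {"name": "imagenet", "version": "2012"},
}

# Serialize to a description file. In the described system this text would be
# written into the second storage space 121 mounted by the extraction container.
description = json.dumps(metadata, indent=2)
print(json.loads(description)["framework"]["name"])
```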
  • The source developer in the prior art usually provides one or more of the following three description standards of related metadata: the deep learning framework selected by the source developer, the model used by the source developer, and the dataset used by the source developer.
  • Table 1 — framework description standard:
      Name (string): the name of the deep learning framework chosen by the source developer
      Version (string): the version of the deep learning framework chosen by the source developer
  • The deep learning framework may include, but is not limited to: tensorflow, the convolutional neural network framework (CNNF), and the convolutional architecture for fast feature embedding (CAFFE).
  • Table 2 — model description standard:
      Name (string): the name of the model used by the source developer
      Version (string): the version of the model used by the source developer
      Source (string): the source of the model used by the source developer
      File name (file, object): the name of the model file used by the source developer
      Author (creator, string): the author of the model used by the source developer
      Time (ISO-8601): the creation time of the model used by the source developer
  • the models used by the source developers can include, but are not limited to: image recognition models, text recognition models, and so on.
  • The model used by the source developer can be a public model or a private model. If the source developer uses a public model, a uniform resource locator (URL) link to the public model is provided.
  • The file name of the model used by the source developer is not directly stored in the metadata description file; instead, the model file can be packaged together with the metadata description file under the described file name. If the model used by the source developer is a public model, the metadata description file records a URL link.
  • Table 3 — dataset description standard:
      Name (string): the name of the dataset used by the source developer
      Version (string): the version of the dataset used by the source developer
      Source (string): the source of the dataset used by the source developer
  • the URL link of the data set used by the source developer or the compressed file of the data set itself can be packaged with the metadata description file.
  • Metadata such as the aforementioned framework, model, and dataset is usually determined by the source developer when packaging the container image of the training component in the environment building submodule 111. Take as an example a source developer who writes a yet another markup language (YAML) file to package the container image of the training component.
  • The input program code includes key metadata such as the framework, model, and dataset selected and used by the source developer.
  • The embodiment of this application also provides one or more of the following three description standards of related metadata: the data processing method (data-process) used by the source developer, the structure of the model (model-architecture) used by the source developer, and the training parameters (training-parameters) used by the source developer during the training process.
  • the following describes the above-mentioned metadata in detail in conjunction with Table 4 to Table 6.
  • The dataset segmentation method defined by the source developer can be a process of processing the input dataset: for example, one part of the input dataset is used in the model training process, that is, as training data, and another part is used in the model inference process, that is, as test data.
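  • As a concrete illustration of such a segmentation (a sketch, with an assumed 80/20 split ratio):

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=0):
    """Split the input dataset into a training part and a test part,
    mirroring the segmentation described above."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_part, test_part = split_dataset(range(10))
print(len(train_part), len(test_part))  # 8 2
```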
  • One or more of the metadata items such as the above dataset processing method, model structure, and training parameters are usually hidden in the training program code stored by the source developer in the first storage space 115.
  • The metadata shown in Table 1 to Table 6 is automatically obtained while the running module 110 performs the machine learning task, so that the development environment can be directly configured and debugged according to the metadata, thereby reproducing the training environment of the target machine learning task.
  • the metadata extraction module 120 can obtain the metadata stored on the physical host computer as shown in Table 1 to Table 3 by sending a query command to the physical host computer.
  • the metadata extraction module 120 can extract the above-mentioned several types of metadata in a keyword search manner.
  • The complete flow of the machine learning task provided by the embodiment of the present application will be described in detail below with reference to FIG. 2 and FIG. 3.
  • the complete flow chart of the machine learning task may include the environment building process, the training process, and the inference process.
  • the above three processes will be described in detail below.
  • Step 210 The source developer packages the image of the training component and the image of the metadata extraction module.
  • When packaging the training component image, the source developer determines metadata such as the framework, model, and dataset shown in Table 1 to Table 3.
  • the training component can be jupyter notebook.
  • the jupyter notebook is an interactive web application. The source developer can use the jupyter notebook to input and adjust the model training program code online.
  • Step 215 The source developer starts the container image.
  • the source developer can store the training component image packaged in step 210 and the image of the metadata extraction module 120 in the container warehouse.
  • the container warehouse can manage, store and protect container images.
  • the container warehouse may be a container registry.
  • Source developers can enter container startup scripts or command lines to pull different versions of container images from the container warehouse to the container environment, and start the corresponding components in the container.
  • the training component is run in the training container
  • the metadata extraction module 120 is run in the extraction container.
  • container images may correspond to different metadata such as frameworks, models, and data sets.
  • the container startup script or command line may include information such as the name and version of the pulled container image, and the time when the container image was started.
  • the container group 310 providing training functions may include a training container and an extraction container.
  • The training container mounts the first storage space 115, and the extraction container mounts the second storage space 121.
  • the container group may be called a pod.
  • A pod is the smallest scheduling unit in kubernetes, and a pod can include multiple containers. A pod runs on a physical host; when scheduling is required, kubernetes schedules the pod as a whole.
  • the storage space mounted by the container can be a persistent volume (PV) in kubernetes.
  • A PV is a piece of network storage allocated by an administrator. A PV has a life cycle independent of any single pod: after the life cycle of a pod ends, the containers in the pod are destroyed, but the PVs mounted by those containers are not destroyed.
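  • The container-group layout described above can be sketched as the following pod specification, written here as the Python equivalent of the kubernetes YAML. All names (images, claim names, mount paths) are hypothetical; the point is that both storage spaces are backed by persistent volume claims, so they survive the pod's destruction.

```python
# Hypothetical pod spec for container group 310: a training container mounting
# the first storage space and an extraction container mounting the second,
# both backed by persistent volume claims (the PVs outlive the pod).
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "training-pod"},
    "spec": {
        "containers": [
            {"name": "training", "image": "train-image:v1",
             "volumeMounts": [{"name": "first-storage", "mountPath": "/data"}]},
            {"name": "extraction", "image": "extractor-image:v1",
             "volumeMounts": [{"name": "second-storage", "mountPath": "/meta"}]},
        ],
        "volumes": [
            {"name": "first-storage",
             "persistentVolumeClaim": {"claimName": "first-storage-pvc"}},
            {"name": "second-storage",
             "persistentVolumeClaim": {"claimName": "second-storage-pvc"}},
        ],
    },
}

# Deleting the pod removes both containers; the claimed PVs remain.
print([c["name"] for c in pod_spec["spec"]["containers"]])  # ['training', 'extraction']
```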
  • Step 220 The source developer inputs the training program code.
  • the source developer can input the training program code according to the metadata description standard shown in Table 1 to Table 6 through the training component (such as jupyter notebook) running in the training container.
  • the training program code includes metadata such as the data set processing mode, model structure, and training parameters shown in Table 4 to Table 6.
  • the input training program code may be stored in the first storage space 115 mounted on the training container.
  • the trained model will also be stored in the first storage space mounted on the training container.
  • Step 225 The metadata extraction module 120 extracts metadata and stores it in the second storage space 121 mounted on the extraction container.
  • The metadata extraction module 120 running in the extraction container extracts the above-mentioned metadata using the keyword extraction method and the metadata description standards shown in Table 1 to Table 6.
  • the metadata extraction module 120 extracts metadata such as frameworks, models, and data sets shown in Tables 1 to 3 from the container startup script and command line input by the source developer.
  • The metadata extraction module 120 extracts metadata such as the dataset processing method, model structure, and training parameters shown in Table 4 to Table 6 from the training program code stored by the source developer in the first storage space 115. For details, refer to the description of FIG. 4, which will not be repeated here.
  • the pod providing the training function will be destroyed, but the mounted first storage space 115 and the second storage space 121 will not be destroyed.
  • Step 230 Start the inference component image and the container image of the metadata extraction module 120.
  • The process of creating and starting the inference container image and the container image of the metadata extraction module 120 corresponds to step 215. For details, refer to the description of step 215, which will not be repeated here.
  • Step 235 The inference container performs inference services according to the trained model.
  • the container group 320 that provides inference functions may include inference containers and extraction containers.
  • The first storage space 115 mounted by the training container can be remounted to the inference container, and the second storage space 121 mounted by the extraction container in the training container group can be remounted to the extraction container in the container group providing the inference function.
  • the inference container can perform inferences based on the trained model stored in the mounted first storage space 115.
  • the extraction container in the container group that provides the inference function can also obtain the metadata that may be generated during the inference process and store it in the mounted second storage space 121.
  • FIG. 4 is a schematic flowchart of a method for extracting metadata by the metadata extraction module 120 according to an embodiment of the present application.
  • the method shown in FIG. 4 may include steps 410-420, and steps 410-420 will be described in detail below.
  • the metadata extraction module 120 shown in FIG. 1 can be divided into two parts, which are a first metadata extraction module and a second metadata extraction module.
  • the first metadata extraction module may be used to extract, from the physical host side, metadata such as the framework, model, and data set shown in Tables 1 to 3, which are determined by the source developer when packaging the container image of the training component.
  • the second metadata extraction module can be used to extract metadata such as the data set processing method, model structure, and training parameters shown in Table 4 to Table 6 from the training program code stored by the source developer in the storage space mounted on the training container.
  • For example, when the resource scheduling platform is kubernetes, the first metadata extraction module may be a job extractor, and the job extractor may be implemented through the kubectl command line; the second metadata extraction module is a code extractor.
  • The following description takes kubernetes as the resource scheduling platform as an example.
  • Step 410 The first metadata extraction module sends a query command to the physical host side to extract metadata such as the framework, model, and data set shown in Table 1 to Table 3.
  • Metadata such as the framework, model, and data set shown in Table 1 to Table 3 has been determined by the source developer when packaging the container image of the training component, and the container image is stored in the container warehouse.
  • the source developer will enter the container startup script or command line to pull different versions of the container image from the container warehouse.
  • Different versions of the container image correspond to different metadata such as frameworks, models, and datasets.
  • the gateway (for example, egress) can be configured so that the first metadata extraction module (for example, the job extractor) accesses the Internet protocol (IP) address of the physical host through the egress and obtains metadata such as the framework, model, and data set by sending query command lines.
  • When the resource scheduling platform is kubernetes, the "kubectl get" command line can be sent to dynamically extract, by keyword extraction, the relevant metadata information from the container startup script and command line on the physical host side, for example, the name and version of the container image, the start time of the container image, and metadata such as the framework, model, and data set. The extracted metadata is stored in the mounted second storage space 121 in JavaScript object notation (JSON) or another file format.
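As an illustrative sketch of how a job extractor of this kind might process the output of a kubectl query, the following Python fragment pulls the container image name and version, the start time, and the container command line out of a pod description in JSON form. The field paths follow the standard kubernetes pod schema; the mapping onto the Table 1 to Table 3 metadata, and the sample values, are assumptions for illustration only, not the patented implementation.

```python
import json

def extract_job_metadata(pod_json: dict) -> dict:
    """Pull Table 1-Table 3 style metadata out of a kubernetes pod
    description (the JSON returned by `kubectl get pod <name> -o json`)."""
    container = pod_json["spec"]["containers"][0]
    image = container["image"]              # e.g. "tensorflow/tensorflow:1.14.0"
    name, _, version = image.partition(":")
    return {
        "image_name": name,
        "image_version": version or "latest",
        "start_time": pod_json["status"].get("startTime", ""),
        # The framework, model, and data set are assumed to be encoded in the
        # image name or the container command line by the source developer.
        "command": container.get("command", []),
    }

# A minimal inline sample stands in for the real `kubectl get` output.
sample = {
    "spec": {"containers": [{"image": "tensorflow/tensorflow:1.14.0",
                             "command": ["python", "train.py"]}]},
    "status": {"startTime": "2019-03-15T08:00:00Z"},
}
meta = extract_job_metadata(sample)
print(json.dumps(meta))
```

In practice the resulting dictionary would be serialized to a JSON file in the mounted second storage space 121.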
  • Step 420 The second metadata extraction module extracts metadata such as the data set processing method, model structure, and training parameters shown in Table 4 to Table 6 from the first storage space 115 mounted on the training container.
  • Metadata such as the data set processing method, model structure, and training parameters shown in Table 4 to Table 6 has been stored by the source developer in the first storage space 115 mounted on the training container. Therefore, the second metadata extraction module (for example, the code extractor) can extract, by keyword search and in accordance with the metadata description standards shown in Table 4 to Table 6, metadata such as the data set processing method, model structure, and training parameters from the training program code stored in the first storage space 115, and store it in the mounted second storage space 121 in JSON or another file format.
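The keyword-search extraction performed by the code extractor can be sketched in Python as follows; the keyword list, the regular expression, and the inline training-code snippet are illustrative assumptions rather than the actual implementation.

```python
import json
import re

# Hypothetical keyword list covering Table 4-Table 6 metadata: data set
# processing, model structure, and training parameters.
KEYWORDS = ("batch_size", "learning_rate", "epochs", "optimizer", "num_layers")

def extract_code_metadata(training_code: str) -> dict:
    """Scan training program code for `keyword = value` assignments."""
    metadata = {}
    for key in KEYWORDS:
        match = re.search(rf"\b{key}\s*=\s*([^\s#]+)", training_code)
        if match:
            metadata[key] = match.group(1)
    return metadata

# The training code would normally be read from the first storage space 115
# mounted on the training container; an inline snippet stands in for it here.
training_code = """
batch_size = 32
learning_rate = 0.001
epochs = 10
"""
meta = extract_code_metadata(training_code)
print(json.dumps(meta))  # would then be written to the second storage space 121
```

Only keywords actually present in the code appear in the result, so the same extractor tolerates training programs that set different subsets of parameters.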
  • the extracted metadata can be integrated, and the integrated metadata can be stored in the second storage space 121 in the form of "metadata description file + model + data set".
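The integration step described above can be sketched as merging the two groups of extracted metadata into a single description file; the file name `metadata_description.json` and the directory layout are hypothetical choices made for this example only.

```python
import json
import os
import tempfile

def write_description_file(job_meta: dict, code_meta: dict, out_dir: str) -> str:
    """Merge job-level (Table 1-Table 3) and code-level (Table 4-Table 6)
    metadata into one description file, stored next to the model and the
    data set in the second storage space."""
    description = {"job": job_meta, "code": code_meta}
    path = os.path.join(out_dir, "metadata_description.json")
    with open(path, "w") as f:
        json.dump(description, f, indent=2)
    return path

out_dir = tempfile.mkdtemp()  # stands in for the mounted second storage space 121
path = write_description_file(
    {"image_name": "tensorflow/tensorflow", "image_version": "1.14.0"},
    {"batch_size": "32", "epochs": "10"},
    out_dir,
)
with open(path) as f:
    restored = json.load(f)
print(restored["job"]["image_version"])
```

Reading the file back, as shown at the end, is exactly what another developer would do when reproducing the source development environment.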
  • Through the above workflow of the machine learning task, the metadata shown in Table 1 to Table 6 that the source developer uses during environment building or model training can be automatically obtained and stored by the metadata extraction module. After the machine learning task ends, if the source developer or other developers need to reproduce the source development environment, they can use the saved metadata to rebuild the workflow of the entire life cycle of the machine learning task, thereby reproducing the source development environment.
  • FIG. 5 is a schematic structural diagram of a system 500 for extracting metadata in a machine learning training process provided by an embodiment of the present application.
  • the system 500 may include at least one server.
  • FIG. 5 takes the server 510 and the server 520 as examples for description.
  • the structures of the server 510 and the server 520 are similar.
  • the running module 110 shown in FIG. 1 may be run on at least one server, for example, the running module 110 is run on the server 510 and the server 520 respectively.
  • the metadata extraction module 120 may run on each of at least one server, for example, the metadata extraction module 120 runs on the server 510 and the server 520 respectively. As another example, the metadata extraction module 120 may also run on a part of at least one server. For example, the metadata extraction module 120 runs on the server 510 or runs on the server 520. As another example, the metadata extraction module 120 may also run on other servers besides the aforementioned at least one server. For example, the metadata extraction module 120 runs on the server 530.
  • the system 500 may execute the above-mentioned method for extracting metadata in the machine learning training process.
  • at least one server in the system 500 may include at least one processor and a memory.
  • the memory is used to store program instructions.
  • the processor included in the at least one server can execute the program instructions stored in the memory to implement the above-mentioned method for extracting metadata in the machine learning training process, or to implement the running module 110 and the metadata extraction module 120 shown in FIG. 1, which are in turn used to implement the above-mentioned method for extracting metadata in the machine learning training process.
  • the server 510 may include: at least one processor (for example, the processor 511 and the processor 516), a memory 512, a communication interface 513, and an input/output interface 514.
  • At least one processor may be connected to the memory 512.
  • the memory 512 can be used to store program instructions.
  • the memory 512 may be a storage unit inside the at least one processor, an external storage unit independent of the at least one processor, or a combination of an internal storage unit and an independent external storage unit.
  • the memory 512 can be a solid state drive (SSD), a hard disk drive (HDD), a read-only memory (ROM), a random access memory (RAM), etc.
  • the server 510 may further include a bus 515.
  • the memory 512, the input/output interface 514, and the communication interface 513 may be connected to at least one processor through a bus 515.
  • the bus 515 may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus 515 can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one line is used in FIG. 5, but this does not mean that there is only one bus or one type of bus.
  • the system 500 may further include a cloud storage 540.
  • the cloud storage 540 can be used as an external storage and connected to the system 500.
  • the above-mentioned program instructions may be stored in the memory 512 or the cloud storage 540.
  • At least one processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like; alternatively, one or more integrated circuits may be used to execute related programs to implement the technical solutions provided in the embodiments of the present application.
  • the processor 511 is taken as an example, and the processor 511 runs the running module 110.
  • the running module 110 may include multiple sub-modules, for example, the environment construction sub-module 111, the training sub-module 112, the inference sub-module 113, and the environment destruction sub-module 114 shown in FIG. 1.
  • the first storage space 115 of the memory 512 stores the training program code input by the source developer.
  • the training program code includes metadata such as the data set processing method, model structure, and training parameters described in Table 4 to Table 6.
  • the metadata extracted by the metadata extraction module 120 is stored in the second storage space 121.
  • the third storage space 5121 stores the training container startup script input by the source developer.
  • the training container startup script includes one or more of metadata such as the framework, model, and data set shown in Table 1 to Table 3.
  • the processor 511 obtains the stored program instructions from the memory 512 to run the above-mentioned machine learning tasks.
  • the environment building sub-module 111 in the running module 110 obtains the container startup script from the third storage space 5121 of the memory 512, and executes the above-mentioned container environment building process.
  • the training sub-module 112 in the running module 110 obtains the training program code from the first storage space 115 of the memory 512 to execute the training process of the above model, and can store the training result of the model in the first storage space 115.
  • the metadata extraction module 120 can extract one or more of metadata such as the data set processing method, model structure, and training parameters described in Table 4 to Table 6 from the training program code stored in the first storage space 115 of the memory 512.
  • the metadata extraction module 120 may also extract one or more of metadata such as frameworks, models, and data sets as shown in Tables 1 to 3 from the container startup script stored in the third storage space 5121.
  • the metadata extraction module 120 may also generate a description file from the extracted metadata, and store the generated description file in the second storage space 121 of the memory 512.
  • the size of the sequence numbers of the above-mentioned processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.


Abstract

A method for extracting metadata in a machine learning training process, applied to a virtualized environment. The method comprises: running a machine learning task in the virtualized environment according to machine learning program code input by a user; extracting metadata from the machine learning program code, the metadata being used for reproducing the running environment of the machine learning task; and storing the metadata in a first storage space. In this technical solution, the relevant metadata required for reproducing a specific training environment is automatically extracted during the training process of a target machine learning task; when other developers want to reproduce the specific training environment, it can be reproduced according to the stored metadata, so that the propagation of models is accelerated.

Description

Method and apparatus for extracting metadata in a machine learning training process

Technical Field

This application relates to the field of cloud computing, and more specifically, to a method, apparatus, and computer-readable storage medium for extracting metadata in a machine learning training process.

Background

Machine learning (ML) is a multi-field interdisciplinary subject that studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications cover all fields of artificial intelligence.

The workflow of a machine learning task can include environment construction, a model training process, and a model inference process. After a source developer trains a model through the above process, the trained model is provided to other developers. Other developers who want to reproduce the training process need to fully reproduce the source development environment. However, in the process of reproducing the source development environment, other developers need to spend a lot of time building and debugging a training environment compatible with the target machine learning task, which brings great inconvenience to the dissemination of models.
发明内容Summary of the invention
本申请提供一种提取机器学习训练过程中的元数据的方法、装置,开发者可以在目标机器学习任务的训练过程中,自动提取复现一个特定的训练环境时所需要的一些相关的元数据,在其他开发者想要复现一个特定的训练环境时,可以根据存储的相关的元数据对特定的训练环境进行复现,加快了模型的传播。This application provides a method and device for extracting metadata in the process of machine learning training. Developers can automatically extract some relevant metadata needed to reproduce a specific training environment during the training process of the target machine learning task. , When other developers want to reproduce a specific training environment, they can reproduce the specific training environment according to the stored related metadata, which speeds up the spread of the model.
第一方面,提供了一种提取机器学习训练过程中的元数据的方法,所述方法应用于虚拟化环境,所述方法包括:根据用户输入的机器学习程序代码在所述虚拟化环境中运行机器学习任务;从所述机器学习程序代码中提取元数据,所述元数据用于对所述机器学习任务的运行环境进行复现;将所述元数据存储在第一存储空间。In a first aspect, a method for extracting metadata in a machine learning training process is provided. The method is applied to a virtualized environment, and the method includes: running in the virtualized environment according to a machine learning program code input by a user Machine learning task; extracting metadata from the machine learning program code, the metadata being used to reproduce the operating environment of the machine learning task; storing the metadata in a first storage space.
在一种可能的实现方式中,通过关键字搜索的方式,按照所述元数据的类型从所述机器学习程序代码中提取出所述元数据。In a possible implementation manner, the metadata is extracted from the machine learning program code according to the type of the metadata by way of keyword search.
在另一种可能的实现方式中,所述虚拟化环境通过至少一个训练容器运行所述机器学习任务,所述元数据包括第一类元数据。可以按照所述第一类元数据的类型从输入的训练容器启动脚本中提取出所述第一类元数据,所述训练容器启动脚本用于启动所述至少一个训练容器。In another possible implementation manner, the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the first type of metadata. The first type of metadata may be extracted from the input training container startup script according to the type of the first type of metadata, and the training container startup script is used to start the at least one training container.
在另一种可能的实现方式中,所述第一类元数据的类型包括以下任何一个或多个:所述机器学习任务使用的框架、所述机器学习任务使用的模型、所述机器学习任务的训练过程中使用的数据集。In another possible implementation manner, the type of the first type of metadata includes any one or more of the following: a framework used by the machine learning task, a model used by the machine learning task, and the machine learning task The dataset used in the training process.
在另一种可能的实现方式中,所述虚拟化环境通过至少一个训练容器运行所述机器学习任务,所述元数据包括第二类元数据。可以按照所述第二类元数据的类型从输入的训练程序代码中提取出所述元数据,所述训练程序代码存储在所述至少一个训练 容器挂载的第二存储空间中,所述训练程序代码用于在所述至少一个训练容器中运行所述机器学习任务的模型训练过程。In another possible implementation manner, the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the second type of metadata. The metadata may be extracted from the input training program code according to the type of the second type of metadata, the training program code is stored in a second storage space mounted on the at least one training container, and the training The program code is used to run the model training process of the machine learning task in the at least one training container.
在另一种可能的实现方式中,所述第二类元数据的类型包括以下任何一个或多个:所述机器学习任务的训练过程中使用的数据集的处理方式、所述机器学习任务的训练过程中使用的模型的结构、所述机器学习任务的训练过程中使用的训练参数。In another possible implementation manner, the type of the second type of metadata includes any one or more of the following: the processing method of the data set used in the training process of the machine learning task, the processing method of the machine learning task The structure of the model used in the training process, and the training parameters used in the training process of the machine learning task.
According to a second aspect, an apparatus for extracting metadata in a machine learning training process is provided. The apparatus runs in a virtualized environment and includes:

a running module, configured to run a machine learning task in the virtualized environment according to machine learning program code input by a user;

a metadata extraction module, configured to extract metadata from the machine learning program code, the metadata being used to reproduce the running environment of the machine learning task;

the metadata extraction module is further configured to store the metadata in a first storage space.

In a possible implementation, the metadata extraction module is specifically configured to extract the metadata from the machine learning program code by keyword search according to the type of the metadata.

In another possible implementation, the virtualized environment runs the machine learning task through at least one training container, and the metadata includes first-type metadata; the metadata extraction module is specifically configured to extract the first-type metadata, according to its type, from an input training container startup script, where the training container startup script is used to start the at least one training container.

In another possible implementation, the type of the first-type metadata includes any one or more of the following: the framework used by the machine learning task, the model used by the machine learning task, and the data set used in the training process of the machine learning task.

In another possible implementation, the virtualized environment runs the machine learning task through at least one training container, and the metadata includes second-type metadata; the metadata extraction module is specifically configured to extract the metadata, according to the type of the second-type metadata, from input training program code, where the training program code is stored in a second storage space mounted on the at least one training container and is used to run the model training process of the machine learning task in the at least one training container.

In another possible implementation, the type of the second-type metadata includes any one or more of the following: the processing method of the data set used in the training process of the machine learning task, the structure of the model used in the training process of the machine learning task, and the training parameters used in the training process of the machine learning task.
According to a third aspect, a system for extracting metadata in a machine learning training process is provided. The system includes at least one server, and each server includes a memory and at least one processor, where the memory is used to store program instructions. When the at least one server runs, the at least one processor executes the program instructions in the memory to perform the method in the first aspect or any one of its possible implementations, or to implement the running module and the metadata extraction module in the second aspect or any one of its possible implementations.

In a possible implementation, the running module may run on the multiple servers, and the metadata extraction module may run on each of the multiple servers.

In another possible implementation, the metadata extraction module may run on a part of the multiple servers.

In another possible implementation, the metadata extraction module may run on any server other than the above-mentioned multiple servers.

Optionally, the processor may be a general-purpose processor, which may be implemented by hardware or by software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented by software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, where the memory may be integrated in the processor or located outside the processor and exist independently.

According to a fourth aspect, a non-transitory readable storage medium is provided, including program instructions. When the program instructions are run by a computer, the computer executes the method in the first aspect or any one of its possible implementations.

According to a fifth aspect, a computer program product is provided, including program instructions. When the program instructions are run by a computer, the computer executes the method in the first aspect or any one of its possible implementations.

On the basis of the implementations provided in the above aspects, this application can be further combined to provide more implementations.
Brief Description of the Drawings

FIG. 1 is a schematic block diagram of an apparatus 100 for running a machine learning task provided by an embodiment of the present application.

FIG. 2 is a schematic flowchart of executing a machine learning task provided by an embodiment of the present application.

FIG. 3 is a schematic block diagram of a container environment 300 provided by an embodiment of the present application.

FIG. 4 is a schematic flowchart of a method for extracting metadata by a metadata extraction module provided by an embodiment of the present application.

FIG. 5 is a schematic structural diagram of a system 500 for extracting metadata in a machine learning training process provided by an embodiment of the present application.
Detailed Description

The technical solutions in this application will be described below with reference to the drawings.

Machine learning (ML) is a multi-field interdisciplinary subject that studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications cover all fields of artificial intelligence. The workflow of a machine learning task can include environment construction, a model training process, and a model inference process.

FIG. 1 is a schematic block diagram of an apparatus 100 for running a machine learning task provided by an embodiment of the present application. The apparatus 100 may include a running module 110, a metadata extraction module 120, and a second storage space mounted by the metadata extraction module 120. These modules are described in detail below.

The running module 110 may include multiple sub-modules, for example: an environment construction sub-module 111, a training sub-module 112, an inference sub-module 113, and an environment destruction sub-module 114.

It should be understood that the running module 110, the metadata extraction module 120, and their sub-modules can run in a virtualized environment; for example, they can be implemented using containers, or alternatively using virtual machines, which is not specifically limited in the embodiments of this application.
(1)环境搭建子模块111:(1) Environment building sub-module 111:
The environment building sub-module 111 is used to build the training environment for a machine learning task. Building the machine learning task environment is essentially the scheduling of computer hardware resources, which may include but are not limited to computing resources and storage resources.
As machine learning tasks become more complex and computation-intensive, moving them to the cloud and into containers has become a development trend. Container technology, represented by docker, has gradually matured; it uses images to create a virtualized operating environment in which related components can be deployed. Container technology provides computing and storage resources by directly invoking the computing and storage resources of the physical machine, thereby supplying hardware resources for machine learning tasks. For example, the open-source container scheduling platform represented by kubernetes can effectively manage containers.
It should be understood that docker is an open-source application container engine that lets source developers package their applications and dependencies into a portable container and then publish it to any popular Linux machine; it can also implement virtualization. For ease of description, the following uses a container as an example to describe in detail the technical solutions provided by the embodiments of this application. When the virtualized environment in which the running module 110 and the metadata extraction module 120 reside is a virtual machine, the running module 110, the metadata extraction module 120, and their sub-modules may be implemented by a virtual machine.
The embodiments of this application do not specifically limit the computing resource. It may be a central processing unit (CPU) or a graphics processing unit (GPU).
Specifically, the source developer can pull the container images of the packaged related components, for example the container image of the training component, into the container environment by packaging the images. Through a command line or container startup script entered by the source developer, the training container is created and started, and the model training process is performed in that training container.
(2) Training sub-module 112:
The training sub-module 112 can run in the container environment built above and perform the model training process according to the training program code input by the source developer.
Specifically, the source developer can store the training program code in the first storage space 115 using network file system (NFS) shared storage or another storage product on the cloud platform, for example a distributed file system (DFS). The first storage space 115 can be mounted in the started training container. The training sub-module 112 can train the model according to the training program code obtained from the first storage space 115.
During training, the training sub-module 112 may also store the trained model in the first storage space 115.
(3) Inference sub-module 113:
The inference sub-module 113 can access the first storage space 115 and perform an inference process based on the trained model stored there. Specifically, the inference sub-module 113 may determine a predicted output value according to the input training data and the trained model, and may determine whether the model trained by the training sub-module 112 is correct according to the error between the predicted output value and the prior knowledge of the training data.
It should be understood that prior knowledge is also called the ground truth and generally includes the prediction results, provided by humans, corresponding to the training data.
For example, suppose the machine learning task is applied in the field of image recognition. The training data input to the model trained by the training sub-module 112 is the pixel information of an image, and the prior knowledge corresponding to the training data is that the label of the image is "dog". The training data of the image labeled "dog" is input into the trained model, and whether the predicted value output by the model is "dog" is judged. If the output of the model is "dog", it can be determined that the model predicts accurately.
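As an illustrative sketch (not part of the original disclosure), this correctness check can be expressed as comparing each predicted label against the ground truth; the toy model and samples below are hypothetical stand-ins for a trained classifier and labeled training data.

```python
# Sketch of the check the inference sub-module performs: compare the
# model's predicted label with the prior knowledge (ground truth).

def evaluate(model, samples):
    """Return the fraction of samples whose prediction matches the ground truth.

    `model` is any callable mapping input features to a predicted label;
    `samples` is a list of (features, ground_truth_label) pairs.
    """
    correct = 0
    for features, truth in samples:
        if model(features) == truth:
            correct += 1
    return correct / len(samples)

# Hypothetical stand-in for a trained image classifier.
toy_model = lambda pixels: "dog" if sum(pixels) > 10 else "cat"

samples = [([5, 6], "dog"), ([1, 2], "cat"), ([9, 9], "dog")]
accuracy = evaluate(toy_model, samples)  # 1.0 for this toy data
```

If the accuracy is high enough, the trained model stored in the first storage space 115 can be considered to predict accurately.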
(4) Environment destruction sub-module 114:
After the above training process ends, the environment destruction sub-module 114 can destroy the created container environment. However, the first storage space 115, which stores the trained model, is not destroyed, so that the inference sub-module 113 can perform the inference process according to the stored trained model.
(5) Metadata extraction module 120:
While the running module 110 performs the machine learning task, the metadata extraction module 120 can automatically extract metadata from the machine learning program code input by the source developer. This metadata can be used to reproduce the running environment of the machine learning task.
The metadata extraction module 120 can also generate a description file from the extracted metadata and store the generated description file in the second storage space 121. When other source developers want to reproduce the running environment of the above machine learning task, they can obtain the stored description file from the second storage space 121 and directly configure and debug the development environment according to the relevant metadata included in the description file, thereby reproducing the target training environment and accelerating the spread of the model.
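As an illustrative sketch of generating such a description file (the field values, file name, and use of a temporary directory are hypothetical stand-ins for the second storage space 121, not the exact schema of the disclosure):

```python
import json
import os
import tempfile

# Hypothetical extracted metadata; field names follow Tables 1-3 below.
metadata = {
    "framework": {"name": "tensorflow", "version": "1.14"},
    "model": {"name": "resnet50", "version": "v1", "source": "public"},
    "dataset": {"name": "cifar10", "version": "1.0",
                "source": "https://example.com/cifar10"},
}

out_dir = tempfile.mkdtemp()  # stand-in for the second storage space 121
path = os.path.join(out_dir, "description.json")
with open(path, "w") as f:
    json.dump(metadata, f, indent=2)

# Another developer can later read the description file back to
# configure and debug a matching development environment.
with open(path) as f:
    restored = json.load(f)
```

The round trip shows that the description file preserves exactly the metadata that was extracted.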
To reproduce the training environment of a specific machine learning task, in the prior art the source developer usually provides one or more of the following three description standards of related metadata: the deep learning framework selected by the source developer, the model used by the source developer, and the dataset used by the source developer. These kinds of metadata are described in detail below with reference to Tables 1-3.
Table 1 Framework (framework)

Attribute | Type   | Description
name      | string | Name of the deep learning framework selected by the source developer
version   | string | Version of the deep learning framework selected by the source developer
As shown in Table 1, deep learning frameworks can include, but are not limited to: tensorflow, the convolutional neural network framework (CNNF), and the convolutional architecture for fast feature embedding (CAFFE).
It should be understood that, in addition to supporting common network structures such as the convolutional neural network (CNN) and the recurrent neural network (RNN), tensorflow can also support deep reinforcement learning and other computation-intensive scientific computing (such as solving partial differential equations).
Table 2 Model (model)

Attribute | Type     | Description
name      | string   | Name of the model used by the source developer
version   | string   | Version of the model used by the source developer
source    | string   | Source of the model used by the source developer
file      | object   | File name of the model used by the source developer
creator   | string   | Author of the model used by the source developer
time      | ISO-8601 | Creation time of the model used by the source developer
As shown in Table 2, the models used by source developers can include, but are not limited to, image recognition models, text recognition models, and so on.
It should be noted that the model used by the source developer can be a public model or a private model. If the source developer uses a public model, the public model provides a uniform resource location (URL) link.
It should also be noted that the file name of the model used by the source developer is not stored directly in the metadata description file; the model file can be packaged together with the metadata description file, described by its file name. If the model used by the source developer is a public model, what accompanies the metadata description file is a URL link.
Table 3 Dataset (dataset)

Attribute | Type   | Description
name      | string | Name of the dataset used by the source developer
version   | string | Version of the dataset used by the source developer
source    | string | Source of the dataset used by the source developer
As shown in Table 3, the URL link of the dataset used by the source developer, or a compressed file of the dataset itself, can be packaged together with the metadata description file.
Referring to Tables 1-3, metadata such as the above framework, model, and dataset is usually determined in the environment building sub-module 111 by the source developer packaging the container image of the training component. Taking as an example a source developer writing and launching a yet another markup language (YAML) file to package the container image of the training component, the program code entered when writing and launching the YAML file includes key metadata such as the deep learning framework, model, and dataset selected and used by the source developer.
However, it is difficult to reproduce the training environment of a machine learning task relying only on the metadata provided in Tables 1-3. The embodiments of this application therefore further provide one or more of the following three description standards of related metadata: the processing of the dataset used by the source developer (data-process), the structure of the model used by the source developer (model-architecture), and the training parameters used by the source developer during training (training-parameters). These kinds of metadata are described in detail below with reference to Tables 4-6.
Table 4 Dataset processing (data-process)
[Table 4 is provided as an image in the original publication: PCTCN2020070577-appb-000001.]
Referring to Table 4, the dataset segmentation defined by the source developer can be a process of processing the input dataset. For example, one part of the input dataset is used in the model training process, that is, this part can serve as training data during model training; another part of the input dataset is used in the model inference process, that is, this part can serve as test data during model inference.
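As an illustrative sketch of such a data-process split (the 0.8 ratio and the toy dataset are hypothetical example values, not values mandated by Table 4):

```python
# Split an input dataset into a training portion and a test portion,
# as described for the data-process metadata.

def split_dataset(dataset, train_ratio=0.8):
    """Return (training_data, test_data) split at `train_ratio`."""
    cut = int(len(dataset) * train_ratio)
    return dataset[:cut], dataset[cut:]

data = list(range(10))        # hypothetical input dataset
train, test = split_dataset(data)
# train -> first 8 samples for the training process
# test  -> remaining 2 samples for the inference process
```

Recording the split ratio as metadata lets another developer reproduce exactly the same training/test partition.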
Table 5 Model structure (model-architecture)
[Table 5 is provided as images in the original publication: PCTCN2020070577-appb-000002 and PCTCN2020070577-appb-000003.]
Table 6 Training parameters (training-params)
[Table 6 is provided as an image in the original publication: PCTCN2020070577-appb-000004.]
Referring to Tables 4-6, one or more of the metadata items such as the above dataset processing, model structure, and training parameters are usually hidden in the training program code stored by the source developer in the first storage space 115.
In the embodiments of this application, the metadata shown in Tables 1-6 is automatically obtained while the above running module 110 performs the machine learning task, and the development environment is directly configured and debugged according to that metadata, thereby reproducing the training environment of the target machine learning task.
According to the metadata description standards shown in Tables 1-6 above, there are six major categories of metadata to extract: the deep learning framework selected by the source developer, the model, the dataset, the dataset processing (data-process), the model structure (model-architecture), and the training parameters (training-params) used during training. Since different metadata is determined in different ways, the specific implementations for extracting these six categories of metadata also differ.
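The six categories can be pictured as a single description-file skeleton. This sketch is illustrative only; the placeholder comments are assumptions rather than the exact schema of Tables 1-6.

```python
# Skeleton of a description covering the six metadata categories.
description = {
    "framework":          {"name": "", "version": ""},
    "model":              {"name": "", "version": "", "source": "",
                           "file": {}, "creator": "", "time": ""},
    "dataset":            {"name": "", "version": "", "source": ""},
    "data-process":       {},  # e.g. train/test split ratio
    "model-architecture": {},  # e.g. layers, activation functions
    "training-params":    {},  # e.g. learning rate, batch size, epochs
}
```

The first three categories are determined when the container image is packaged; the last three are carried in the training program code, which is why they are extracted by different mechanisms below.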
Taking the extraction of the framework, model, and dataset metadata shown in Tables 1-3 as an example: since this metadata is usually determined by the source developer when packaging the container image of the training component, it is stored on the physical host machine that starts the training container. Therefore, the metadata extraction module 120 can obtain the metadata shown in Tables 1-3 stored on the physical host by sending a query command to the physical host.
Taking the extraction of the dataset processing, model structure, and training parameter metadata shown in Tables 4-6 as an example: since this metadata is determined after the training container is created and started, and is included in the training program code the source developer stores in the storage space mounted to the training container, the metadata extraction module 120 can obtain the metadata shown in Tables 4-6 by accessing the training program code stored in that mounted storage space.
In the embodiments of this application, the metadata extraction module 120 can extract the above types of metadata by keyword search. The complete flow of the machine learning task provided by the embodiments of this application is described in detail below with reference to Figures 2-3.
Referring to Figure 2, the complete flow of the machine learning task may include an environment building process, a training process, and an inference process, which are described in detail below.
(1) Environment building process:
Step 210: The source developer packages the image of the training component and the image of the metadata extraction module.
When packaging the training component image, the source developer determines metadata such as the framework, model, and dataset shown in Tables 1-3.
In the case where the resource scheduling platform is kubernetes, the training component can be a jupyter notebook. The jupyter notebook is an interactive web application; the source developer can use it to input and adjust the model's training program code online.
Step 215: The source developer starts the container images.
The source developer can store the training component image packaged in step 210 and the image of the metadata extraction module 120 in a container repository. It should be understood that the container repository can manage, store, and protect container images. For example, the container repository can be a container registry.
The source developer can enter a container startup script or command line to pull different versions of container images from the container repository into the container environment and start the corresponding components in containers. For example, the training component runs in the training container, and the metadata extraction module 120 runs in the extraction container.
It should be understood that different versions of container images can correspond to different metadata such as frameworks, models, and datasets.
It should also be understood that the container startup script or command line can include information such as the name and version of the pulled container image and the time the container image was started.
Specifically, referring to the container environment 300 in Figure 3, the container group 310 providing training functions may include a training container and an extraction container; the training container mounts the first storage space 115, and the extraction container mounts the second storage space 121.
In the case where the resource scheduling platform is kubernetes, a container group may be called a pod. A pod is the smallest scheduling unit in kubernetes, and one pod can include multiple containers. A pod can run on a certain physical host; when scheduling is required, kubernetes schedules the pod as a whole.
In kubernetes, the storage space mounted by a container can be a persistent volume (PV), which is a section of network storage allocated by a network administrator. A PV has a life cycle independent of any single pod; that is, after the pod's life cycle ends, the containers in the pod are destroyed, but the PVs mounted by those containers are not.
(2) Training process:
Step 220: The source developer inputs the training program code.
The source developer can input the training program code according to the metadata description standards shown in Tables 1-6 through the training component (for example, a jupyter notebook) running in the training container. The training program code includes metadata such as the dataset processing, model structure, and training parameters shown in Tables 4-6.
The input training program code can be stored in the first storage space 115 mounted to the training container.
It should be noted that, during model training, if the input training program code needs to be modified, it can be input and adjusted online through the jupyter notebook.
After the model training process ends, the trained model is also stored in the first storage space mounted to the training container.
Step 225: The metadata extraction module 120 extracts metadata and stores it in the second storage space 121 mounted to the extraction container.
The metadata extraction module 120 running in the extraction container extracts the above metadata by keyword extraction, according to the metadata description standards shown in Tables 1-6.
Since different metadata is determined in different ways, the specific implementations for extracting the six categories of metadata also differ. One possible implementation is for the metadata extraction module 120 to extract the framework, model, and dataset metadata shown in Tables 1-3 from the container startup script or command line input by the source developer. Another possible implementation is for the metadata extraction module 120 to extract the dataset processing, model structure, and training parameter metadata shown in Tables 4-6 from the training program code the source developer stores in the first storage space 115. For details, see the description of Figure 4, which is not repeated here.
It should also be noted that when the model training task ends, the pod providing the training function is destroyed, but the mounted first storage space 115 and second storage space 121 are not.
(3) Inference process:
Step 230: Start the inference component image and the container image of the metadata extraction module 120.
The process of creating and starting the inference container image and the container image of the metadata extraction module 120 corresponds to step 215; for details, see the description of step 215, which is not repeated here.
Step 235: The inference container performs inference services according to the trained model.
Specifically, referring to the container environment 300 in Figure 3, the container group 320 providing inference functions may include an inference container and an extraction container. The first storage space 115 mounted by the training container can be remounted to the inference container, and the second storage space 121 mounted by the extraction container in the training container group can be remounted to the extraction container in the container group providing the inference function.
The inference container can perform inference according to the trained model stored in the mounted first storage space 115. The extraction container in the container group providing the inference function can also obtain metadata that may be generated during the inference process and store it in the mounted second storage space 121.
The process by which the metadata extraction module 120 running in the extraction container extracts metadata is described in detail below with reference to the example in Figure 4.
Figure 4 is a schematic flowchart of a method for the metadata extraction module 120 to extract metadata according to an embodiment of this application. The method shown in Figure 4 may include steps 410-420, which are described in detail below.
It should be understood that, according to the type of metadata extracted, the metadata extraction module 120 shown in Figure 1 can be divided into two parts: a first metadata extraction module and a second metadata extraction module.
The first metadata extraction module can be used to extract from the physical host the framework, model, and dataset metadata shown in Tables 1-3 that the source developer determined when packaging the container image of the training component. The second metadata extraction module can be used to extract the dataset processing, model structure, and training parameter metadata shown in Tables 4-6 from the training program code the source developer stores in the storage space mounted to the training container.
Optionally, in some embodiments, the resource scheduling platform is kubernetes, the first metadata extraction module can be a job extractor, and the job extractor is a kubectl command line. The second metadata extraction module is a code extractor. For ease of description, kubernetes is used as the resource scheduling platform in the following example.
Step 410: The first metadata extraction module sends a query command to the physical host side to extract metadata such as the framework, model, and dataset shown in Tables 1-3.
The framework, model, and dataset metadata shown in Tables 1-3 has already been determined by the source developer by packaging the container image of the training component, and the container image is stored in the container repository. The source developer enters a container startup script or command line to pull different versions of container images from the container repository; different versions of container images correspond to different metadata such as frameworks, models, and datasets.
Since metadata such as the framework, model, and dataset is stored on the physical host, the job extractor needs to access an external service to obtain it. In the embodiments of this application, a gateway (for example, an egress) can be configured so that the first metadata extraction module (for example, the job extractor) can access the internet protocol (IP) address of the physical host through the egress and obtain the framework, model, and dataset metadata by sending a query command line.
In the case where the resource scheduling platform is kubernetes, a "kubectl get" command line can be sent to dynamically extract relevant metadata information from the container startup script and command line on the physical host side by keyword extraction, for example, the name and version of the container image, the start time of the container image, and the framework, model, and dataset metadata. This metadata is stored in the mounted second storage space 121 in java script object notation (JSON) format or another file format.
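A minimal sketch of this extraction step, assuming the pod description has already been fetched as JSON (for example with `kubectl get pod <name> -o json`); the sample document and the output field names here are hypothetical, and only the parsing side is shown.

```python
import json

# Hypothetical sample of the JSON a `kubectl get pod <name> -o json`
# query returns; only the fields the job extractor needs are shown.
pod_json = json.dumps({
    "metadata": {"name": "train-pod",
                 "creationTimestamp": "2020-01-07T08:00:00Z"},
    "spec": {"containers": [
        {"name": "training",
         "image": "registry.example.com/train:1.2"}]},
})

def extract_job_metadata(raw):
    """Pull the container image name/version and start time from a pod description."""
    pod = json.loads(raw)
    image = pod["spec"]["containers"][0]["image"]
    name, _, version = image.rpartition(":")
    return {
        "image-name": name,
        "image-version": version,
        "start-time": pod["metadata"]["creationTimestamp"],
    }

meta = extract_job_metadata(pod_json)
```

The resulting dictionary can then be serialized to JSON and written into the second storage space 121.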
Step 420: The second metadata extraction module extracts metadata such as the dataset processing, model structure, and training parameters shown in Tables 4-6 from the first storage space 115 mounted to the training container.
The dataset processing, model structure, and training parameter metadata shown in Tables 4-6 has already been stored by the source developer in the first storage space 115 mounted to the training container. Therefore, the second metadata extraction module (for example, the code extractor) can extract this metadata from the training program code stored in the first storage space 115 by keyword search, according to the metadata description standards shown in Tables 4-6, and store it in the mounted second storage space 121 in JSON format or another file format.
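A minimal sketch of keyword-based extraction from stored training program code; the keyword list and the code fragment are hypothetical examples, not the exact format mandated by Tables 4-6.

```python
import re

# Hypothetical fragment of training program code stored in the first
# storage space 115; the variable names serve as example keywords only.
training_code = """
learning_rate = 0.001
batch_size = 64
epochs = 20
"""

# Keywords the code extractor searches for (Table 6 style training-params).
KEYWORDS = ("learning_rate", "batch_size", "epochs")

def extract_training_params(code):
    """Keyword-search the code text and return matched parameter values."""
    params = {}
    for key in KEYWORDS:
        m = re.search(rf"^{key}\s*=\s*(\S+)", code, re.MULTILINE)
        if m:
            params[key] = m.group(1)
    return params

params = extract_training_params(training_code)
```

The same pattern can be applied with other keyword sets to recover the data-process and model-architecture metadata.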
After the code extractor and the job extractor have extracted the corresponding metadata, they can integrate the extracted metadata and store the integrated metadata in the second storage space 121 in the form of "metadata description file + model + dataset".
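A minimal sketch of the integration step, assuming the two extractors each return a dictionary; all field values below are hypothetical, and the model file and dataset archive would be packaged alongside the serialized description rather than embedded in it.

```python
import json

# Hypothetical outputs of the two extractors.
job_metadata = {
    "framework": {"name": "tensorflow", "version": "1.14"},
    "model":     {"name": "resnet50", "version": "v1"},
    "dataset":   {"name": "cifar10", "version": "1.0"},
}
code_metadata = {
    "data-process":       {"train-ratio": 0.8},
    "model-architecture": {"layers": 50},
    "training-params":    {"learning_rate": 0.001},
}

# Merge the two sources into one metadata description, then serialize it.
description = {**job_metadata, **code_metadata}
description_json = json.dumps(description, indent=2)
```

The serialized description covers all six metadata categories, so a later developer can rebuild the full workflow from a single file.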
In the embodiments of this application, through the above machine learning task workflow, the metadata extraction module can automatically obtain and store the metadata shown in Tables 1-6 that the source developer uses in the machine learning task while the environment is being built or the model is being trained. After the machine learning task ends, if the source developer or other developers need to reproduce the source development environment, the saved metadata can be used to build the workflow of the entire machine learning task life cycle, thereby reproducing the source development environment.
FIG. 5 is a schematic structural diagram of a system 500 for extracting metadata in a machine learning training process according to an embodiment of this application. The system 500 may include at least one server.
For ease of description, FIG. 5 uses a server 510 and a server 520 as examples. The server 510 and the server 520 have similar structures.
The running module 110 shown in FIG. 1 may run on at least one server; for example, an instance of the running module 110 runs on each of the server 510 and the server 520.
The metadata extraction module 120 shown in FIG. 1 can be deployed in multiple forms, which are not specifically limited in the embodiments of this application. As one example, the metadata extraction module 120 may run on each of the at least one server; for example, an instance runs on each of the server 510 and the server 520. As another example, the metadata extraction module 120 may run on only some of the at least one server; for example, on the server 510 or on the server 520. As yet another example, the metadata extraction module 120 may run on a server other than the at least one server; for example, on a server 530.
The system 500 may perform the method for extracting metadata in a machine learning training process described above. Specifically, each of the at least one server in the system 500 may include at least one processor and a memory. The memory stores program instructions, and the processor included in the server may execute the program instructions stored in the memory to implement the method described above, or to implement the running module 110 and the metadata extraction module 120 shown in FIG. 1. The following uses the server 510 as an example to describe in detail the process by which the server 510 implements the method for extracting metadata in a machine learning training process.
The server 510 may include at least one processor (for example, a processor 511 and a processor 516), a memory 512, a communication interface 513, and an input/output interface 514.
The at least one processor may be connected to the memory 512, which is used to store program instructions. The memory 512 may be a storage unit inside the at least one processor, an external storage unit independent of the at least one processor, or a component that includes both a storage unit inside the at least one processor and an external storage unit independent of the at least one processor.
The memory 512 may be a solid state drive (SSD), a hard disk drive (HDD), a read-only memory (ROM), a random access memory (RAM), or the like.
Optionally, the server 510 may further include a bus 515, through which the memory 512, the input/output interface 514, and the communication interface 513 may be connected to the at least one processor. The bus 515 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one line is used to represent the bus in FIG. 5, but this does not mean that there is only one bus or one type of bus.
Optionally, in some embodiments, the system 500 may further include a cloud storage 540, which serves as external storage connected to the system 500. The program instructions described above may be stored in the memory 512 or in the cloud storage 540.
In the embodiments of this application, the at least one processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. Alternatively, one or more integrated circuits may be used to execute the related programs to implement the technical solutions provided in the embodiments of this application.
Referring to FIG. 5, in the server 510, taking the processor 511 as an example, the running module 110 runs on the processor 511. The running module 110 may include multiple sub-modules, for example, the environment setup sub-module 111, the training sub-module 112, the inference sub-module 113, and the environment teardown sub-module 114 shown in FIG. 1.
The first storage space 115 of the memory 512 stores the training program code input by the source developer; the training program code includes one or more of the metadata described in Tables 4 to 6, such as the dataset processing method, model structure, and training parameters. The second storage space 121 stores the metadata extracted by the metadata extraction module 120. The third storage space 5121 stores the training container startup script input by the source developer; the training container startup script includes one or more of the metadata shown in Tables 1 to 3, such as the framework, model, and dataset.
The processor 511 obtains the stored program instructions from the memory 512 to run the machine learning task described above. Specifically, the environment setup sub-module 111 of the running module 110 obtains the container startup script from the third storage space 5121 of the memory 512 and executes the container environment setup process described above. The training sub-module 112 of the running module 110 obtains the training program code from the first storage space 115 of the memory 512 to execute the model training process described above, and may store the training result of the model in the first storage space 115. For the specific implementation process in which the sub-modules of the running module 110 execute the machine learning task, refer to the description of FIG. 1; details are not repeated here.
While the machine learning task described above is running, the metadata extraction module 120 can extract, from the training program code stored in the first storage space 115 of the memory 512, one or more of the metadata described in Tables 4 to 6, such as the dataset processing method, model structure, and training parameters. The metadata extraction module 120 can also extract, from the container startup script stored in the third storage space 5121, one or more of the metadata shown in Tables 1 to 3, such as the framework, model, and dataset.
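Extraction from the container startup script, the Tables 1 to 3 kind of metadata (framework, dataset, model locations), can be sketched as follows. The script format here, a plain docker run line with -v volume mounts, is an assumption; a real platform's startup script would dictate its own patterns.

```python
import re

def extract_launch_metadata(startup_script: str) -> dict:
    """Read framework / dataset / model hints from a container startup script.
    The docker-run and mount conventions below are illustrative assumptions."""
    metadata = {}
    # An image reference such as "tensorflow/tensorflow:1.14.0" names the framework.
    image = re.search(r"docker\s+run\s+.*?\s([\w./-]+:[\w.-]+)\s", startup_script)
    if image:
        metadata["framework_image"] = image.group(1)
    # Volume mounts reveal which host paths hold the dataset and the model.
    for mount in re.finditer(r"-v\s+([\w./-]+):([\w./-]+)", startup_script):
        host, guest = mount.groups()
        if "data" in guest:
            metadata["dataset_path"] = host
        elif "model" in guest:
            metadata["model_path"] = host
    return metadata
```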
Optionally, in some embodiments, the metadata extraction module 120 may also generate a description file from the extracted metadata and store the generated description file in the second storage space 121 of the memory 512. For the specific process by which the metadata extraction module 120 extracts metadata, refer to the description above; details are not repeated here.
It should be understood that, in the various embodiments of this application, the sequence numbers of the foregoing processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a logical functional division; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces; indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceived by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (15)

  1. A method for extracting metadata in a machine learning training process, wherein the method is applied to a virtualized environment, and the method comprises:
    running a machine learning task in the virtualized environment according to machine learning program code input by a user;
    extracting metadata from the machine learning program code, wherein the metadata is used to reproduce the running environment of the machine learning task; and
    storing the metadata in a first storage space.
  2. The method according to claim 1, wherein the extracting metadata from the machine learning program code comprises:
    extracting the metadata from the machine learning program code by keyword search according to the type of the metadata.
  3. The method according to claim 2, wherein the virtualized environment runs the machine learning task through at least one training container, and the metadata comprises first-type metadata; and
    the extracting the metadata from the machine learning program code by keyword search according to the type of the metadata comprises:
    extracting the first-type metadata from an input training container startup script according to the type of the first-type metadata, wherein the training container startup script is used to start the at least one training container.
  4. The method according to claim 3, wherein the type of the first-type metadata comprises any one or more of the following: a framework used by the machine learning task, a model used by the machine learning task, and a dataset used in the training process of the machine learning task.
  5. The method according to claim 3 or 4, wherein the virtualized environment runs the machine learning task through at least one training container, and the metadata comprises second-type metadata; and
    the extracting the metadata from the machine learning program code by keyword search according to the type of the metadata comprises:
    extracting the second-type metadata from input training program code according to the type of the second-type metadata, wherein the training program code is stored in a second storage space mounted on the at least one training container, and the training program code is used to run a model training process of the machine learning task in the at least one training container.
  6. The method according to claim 5, wherein the type of the second-type metadata comprises any one or more of the following: a processing method of a dataset used in the training process of the machine learning task, a structure of a model used in the training process of the machine learning task, and training parameters used in the training process of the machine learning task.
  7. An apparatus for extracting metadata in a machine learning training process, wherein the apparatus runs in a virtualized environment, and the apparatus comprises:
    a running module, configured to run a machine learning task in the virtualized environment according to machine learning program code input by a user; and
    a metadata extraction module, configured to extract metadata from the machine learning program code, wherein the metadata is used to reproduce the running environment of the machine learning task;
    wherein the metadata extraction module is further configured to store the metadata in a first storage space.
  8. The apparatus according to claim 7, wherein the metadata extraction module is specifically configured to:
    extract the metadata from the machine learning program code by keyword search according to the type of the metadata.
  9. The apparatus according to claim 8, wherein the virtualized environment runs the machine learning task through at least one training container, and the metadata comprises first-type metadata; and
    the metadata extraction module is specifically configured to:
    extract the first-type metadata from an input training container startup script according to the type of the first-type metadata, wherein the training container startup script is used to start the at least one training container.
  10. The apparatus according to claim 9, wherein the type of the first-type metadata comprises any one or more of the following: a framework used by the machine learning task, a model used by the machine learning task, and a dataset used in the training process of the machine learning task.
  11. The apparatus according to claim 9 or 10, wherein the virtualized environment runs the machine learning task through at least one training container, and the metadata comprises second-type metadata; and
    the metadata extraction module is specifically configured to:
    extract the second-type metadata from input training program code according to the type of the second-type metadata, wherein the training program code is stored in a second storage space mounted on the at least one training container, and the training program code is used to run a model training process of the machine learning task in the at least one training container.
  12. The apparatus according to claim 11, wherein the type of the second-type metadata comprises any one or more of the following: a processing method of a dataset used in the training process of the machine learning task, a structure of a model used in the training process of the machine learning task, and training parameters used in the training process of the machine learning task.
  13. A system for extracting metadata in a machine learning training process, wherein the system comprises at least one server, each server comprises a memory and at least one processor, the memory is configured to store program instructions, and the at least one processor executes the program instructions in the memory to perform the method according to any one of claims 1 to 6.
  14. A non-transitory readable storage medium, comprising program instructions, wherein when the program instructions are run by a computer, the computer performs the method according to any one of claims 1 to 6.
  15. A computer program product, comprising program instructions, wherein when the program instructions are run by a computer, the computer performs the method according to any one of claims 1 to 6.
PCT/CN2020/070577 2019-03-19 2020-01-07 Method and apparatus for extracting metadata in machine learning training process WO2020186899A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910208590.XA CN110058922B (en) 2019-03-19 2019-03-19 Method and device for extracting metadata of machine learning task
CN201910208590.X 2019-03-19

Publications (1)

Publication Number Publication Date
WO2020186899A1 true WO2020186899A1 (en) 2020-09-24

Family

ID=67317220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/070577 WO2020186899A1 (en) 2019-03-19 2020-01-07 Method and apparatus for extracting metadata in machine learning training process

Country Status (2)

Country Link
CN (1) CN110058922B (en)
WO (1) WO2020186899A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110058922B (en) * 2019-03-19 2021-08-20 华为技术有限公司 Method and device for extracting metadata of machine learning task
CN112395039B (en) * 2019-08-16 2024-01-19 北京神州泰岳软件股份有限公司 Method and device for managing Kubernetes cluster
CN110532098B (en) * 2019-08-30 2022-03-08 广东星舆科技有限公司 Method and system for providing GPU (graphics processing Unit) service
CN110795141B (en) * 2019-10-12 2023-10-10 广东浪潮大数据研究有限公司 Training task submitting method, device, equipment and medium
CN110837896B (en) * 2019-11-22 2022-07-08 中国联合网络通信集团有限公司 Storage and calling method and device of machine learning model
CN111160569A (en) * 2019-12-30 2020-05-15 第四范式(北京)技术有限公司 Application development method and device based on machine learning model and electronic equipment
US11599357B2 (en) * 2020-01-31 2023-03-07 International Business Machines Corporation Schema-based machine-learning model task deduction
CN111629061B (en) * 2020-05-28 2023-01-24 苏州浪潮智能科技有限公司 Inference service system based on Kubernetes
CN111694641A (en) * 2020-06-16 2020-09-22 中电科华云信息技术有限公司 Storage management method and system for container application
TWI772884B (en) * 2020-09-11 2022-08-01 英屬維爾京群島商飛思捷投資股份有限公司 Positioning system and method integrating machine learning positioning model
CN112286682A (en) * 2020-10-27 2021-01-29 上海淇馥信息技术有限公司 Machine learning task processing method, device and equipment based on distributed cluster
CN112311605B (en) * 2020-11-06 2023-12-22 北京格灵深瞳信息技术股份有限公司 Cloud platform and method for providing machine learning service
CN112819176B (en) * 2021-01-22 2022-11-08 烽火通信科技股份有限公司 Data management method and data management device suitable for machine learning
US20230289276A1 (en) * 2022-03-14 2023-09-14 International Business Machines Corporation Intelligently optimized machine learning models

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2016159949A1 (en) * 2015-03-30 2016-10-06 Hewlett Packard Enterprise Development Lp Application analyzer for cloud computing
CN107451663A (en) * 2017-07-06 2017-12-08 阿里巴巴集团控股有限公司 Algorithm assembly, based on algorithm assembly modeling method, device and electronic equipment
CN109146084A (en) * 2018-09-06 2019-01-04 郑州云海信息技术有限公司 A kind of method and device of the machine learning based on cloud computing
CN109272116A (en) * 2018-09-05 2019-01-25 郑州云海信息技术有限公司 A kind of method and device of deep learning
CN110058922A (en) * 2019-03-19 2019-07-26 华为技术有限公司 A kind of method, apparatus of the metadata of extraction machine learning tasks

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US8296727B2 (en) * 2005-10-14 2012-10-23 Oracle Corporation Sub-task mechanism for development of task-based user interfaces
CN104899141B (en) * 2015-06-05 2017-08-04 北京航空航天大学 A kind of test cases selection and extending method of network-oriented application system
CN108805282A (en) * 2018-04-28 2018-11-13 福建天晴在线互动科技有限公司 Deep learning data sharing method, storage medium based on block chain mode
CN108665072A (en) * 2018-05-23 2018-10-16 中国电力科学研究院有限公司 A kind of machine learning algorithm overall process training method and system based on cloud framework

Also Published As

Publication number Publication date
CN110058922B (en) 2021-08-20
CN110058922A (en) 2019-07-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20774122

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20774122

Country of ref document: EP

Kind code of ref document: A1