WO2020186899A1 - Method and apparatus for extracting metadata in machine learning training process - Google Patents

Method and apparatus for extracting metadata in machine learning training process Download PDF

Info

Publication number
WO2020186899A1
WO2020186899A1 · PCT/CN2020/070577
Authority
WO
WIPO (PCT)
Prior art keywords
metadata
machine learning
training
type
container
Prior art date
Application number
PCT/CN2020/070577
Other languages
French (fr)
Chinese (zh)
Inventor
Liu Yedong (刘烨东)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2020186899A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45562Creating, deleting, cloning virtual machine instances
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45583Memory management, e.g. access or allocation

Definitions

  • This application relates to the field of cloud computing, and more specifically, to a method, device, and computer-readable storage medium for extracting metadata in a machine learning training process.
  • Machine learning is a multidisciplinary subject that studies how computers can simulate or implement human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications cover all areas of artificial intelligence.
  • the workflow of a machine learning task can include environment construction, model training process, and model inference process.
  • The trained model will be provided to other developers. If other developers want to reproduce the training process, they need to fully reproduce the source development environment. However, reproducing the source development environment requires other developers to spend a lot of time building and debugging a training environment compatible with the target machine learning task, which greatly hinders the dissemination of the model.
  • This application provides a method and device for extracting metadata in the machine learning training process. During the training of a target machine learning task, the relevant metadata needed to reproduce the specific training environment can be extracted automatically. When other developers want to reproduce that training environment, they can do so according to the stored metadata, which speeds up the dissemination of the model.
  • a method for extracting metadata in a machine learning training process is provided.
  • The method is applied to a virtualized environment and includes: running a machine learning task in the virtualized environment according to machine learning program code input by a user; extracting metadata from the machine learning program code, the metadata being used to reproduce the operating environment of the machine learning task; and storing the metadata in a first storage space.
  • the metadata is extracted from the machine learning program code according to the type of the metadata by way of keyword search.
  • the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the first type of metadata.
  • the first type of metadata may be extracted from the input training container startup script according to the type of the first type of metadata, and the training container startup script is used to start the at least one training container.
  • the type of the first type of metadata includes any one or more of the following: the framework used by the machine learning task, the model used by the machine learning task, and the dataset used in the training process of the machine learning task.
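  • The keyword-search extraction of the first type of metadata can be sketched as follows. This is a minimal illustration, not the claimed implementation: the keyword patterns, the script format, and all names (`framework=`, `model=`, `dataset=`, the image names) are hypothetical stand-ins for whatever fields a real training container startup script would carry.

```python
import re

# Hypothetical keyword patterns for the first type of metadata; a real
# extractor would use whatever field names the startup script actually carries.
FIRST_TYPE_KEYWORDS = {
    "framework": re.compile(r"framework[:=]\s*(\S+)"),
    "model": re.compile(r"model[:=]\s*(\S+)"),
    "dataset": re.compile(r"dataset[:=]\s*(\S+)"),
}

def extract_first_type_metadata(startup_script: str) -> dict:
    """Scan a training-container startup script line by line and keep the
    first match found for each metadata type (keyword search)."""
    metadata = {}
    for line in startup_script.splitlines():
        for meta_type, pattern in FIRST_TYPE_KEYWORDS.items():
            if meta_type not in metadata:
                match = pattern.search(line)
                if match:
                    metadata[meta_type] = match.group(1)
    return metadata

# Hypothetical startup script for the training container.
script = """docker run --name train \\
    -e framework=tensorflow-1.13 \\
    -e model=resnet50 \\
    -e dataset=imagenet-2012 my-training-image"""
print(extract_first_type_metadata(script))
```

A real extractor would match the description standards of Table 1 to Table 3 and tolerate the startup script's actual syntax.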
  • the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the second type of metadata.
  • the second type of metadata may be extracted from the input training program code according to the type of the second type of metadata; the training program code is stored in a second storage space mounted on the at least one training container and is used to run the model training process of the machine learning task in the at least one training container.
  • the type of the second type of metadata includes any one or more of the following: the processing method of the dataset used in the training process of the machine learning task, the structure of the model used in the training process of the machine learning task, and the training parameters used in the training process of the machine learning task.
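  • The second type of metadata can be pulled out of the training program code in the same way. The sketch below assumes training parameters appear as simple `name = value` assignments and searches the code for a few hypothetical parameter names; the real fields are those of the description standards in Table 4 to Table 6.

```python
import re

# Hypothetical parameter names; a real extractor would follow the description
# standards for dataset processing, model structure, and training parameters.
TRAINING_PARAM_KEYWORDS = {
    "learning_rate": re.compile(r"learning_rate\s*=\s*([0-9.eE+-]+)"),
    "batch_size": re.compile(r"batch_size\s*=\s*(\d+)"),
    "epochs": re.compile(r"epochs\s*=\s*(\d+)"),
}

def extract_training_parameters(training_code: str) -> dict:
    """Keyword-search the training program code for training parameters."""
    params = {}
    for name, pattern in TRAINING_PARAM_KEYWORDS.items():
        match = pattern.search(training_code)
        if match:
            params[name] = match.group(1)
    return params

# Hypothetical fragment of training program code stored in the mounted storage.
code = """
learning_rate = 0.001
batch_size = 32
epochs = 10
"""
print(extract_training_parameters(code))
```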
  • a device for extracting metadata in a machine learning training process runs in a virtualized environment, and the device includes:
  • the running module is used to run the machine learning task in the virtualized environment according to the machine learning program code input by the user;
  • a metadata extraction module configured to extract metadata from the machine learning program code, the metadata being used to reproduce the operating environment of the machine learning task;
  • the metadata extraction module is further configured to store the metadata in the first storage space.
  • the metadata extraction module is specifically configured to extract the metadata from the machine learning program code according to the type of the metadata by way of keyword search.
  • the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the first type of metadata;
  • the metadata extraction module is specifically configured to: extract the first type of metadata from the input training container startup script according to the type of the first type of metadata, where the training container startup script is used to start the at least one training container.
  • the type of the first type of metadata includes any one or more of the following: the framework used by the machine learning task, the model used by the machine learning task, and the dataset used in the training process of the machine learning task.
  • the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the second type of metadata;
  • the metadata extraction module is specifically configured to: extract the metadata from the input training program code according to the type of the second type of metadata, and the training program code is stored in the at least one training container mounted In the second storage space, the training program code is used to run the model training process of the machine learning task in the at least one training container.
  • the type of the second type of metadata includes any one or more of the following: the processing method of the dataset used in the training process of the machine learning task, the structure of the model used in the training process of the machine learning task, and the training parameters used in the training process of the machine learning task.
  • A system for extracting metadata in a machine learning training process includes at least one server, where each server includes a memory and at least one processor, and the memory stores program instructions. When the system runs, the at least one processor executes the program instructions in the memory to perform the method in the first aspect or any possible implementation of the first aspect, or to implement the running module and metadata extraction module in the second aspect or any possible implementation of the second aspect.
  • the running module may run on the multiple servers, and the metadata extraction module may run on each of the multiple servers.
  • the metadata extraction module may run on a part of multiple servers.
  • the metadata extraction module may run on any server other than the above-mentioned multiple servers.
  • The processor may be implemented by hardware or by software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented by software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, where the memory may be integrated in the processor or may be located outside the processor and exist independently.
  • a non-transitory readable storage medium including program instructions.
  • When the program instructions are executed by a computer, the computer executes the method in the first aspect or any possible implementation of the first aspect.
  • a computer program product including program instructions.
  • When the program instructions are executed by a computer, the computer executes the method in the first aspect or any possible implementation of the first aspect.
  • Fig. 1 is a schematic block diagram of an apparatus 100 for running a machine learning task provided by an embodiment of the present application.
  • Fig. 2 is a schematic flowchart of a machine learning task execution provided by an embodiment of the present application.
  • FIG. 3 is a schematic block diagram of a container environment 300 provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for extracting metadata by a metadata extraction module according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a system 500 for extracting metadata in a machine learning training process provided by an embodiment of the present application.
  • Fig. 1 is a schematic block diagram of an apparatus 100 for running a machine learning task provided by an embodiment of the present application.
  • the device 100 may include an operating module 110, a metadata extraction module 120, and a second storage space mounted by the metadata extraction module 120.
  • the above modules are described in detail below.
  • the running module 110 may include multiple sub-modules, such as an environment construction sub-module 111, a training sub-module 112, an inference sub-module 113, and an environment destruction sub-module 114.
  • The running module 110 may be implemented by a container or, for example, by a virtual machine; the embodiments of this application do not specifically restrict this.
  • the environment building submodule 111 is used to build a training environment for machine learning tasks.
  • the construction of the machine learning task environment is actually the scheduling of computer hardware resources.
  • the hardware resources may include, but are not limited to: computing resources and storage resources.
  • Container technology, represented by docker, has gradually matured; it uses images to create a virtualized operating environment.
  • Related components can be deployed in the container.
  • Container technology provides computing and storage resources by directly calling the computing and storage resources of the physical machine, thereby providing hardware resources for machine learning tasks.
  • the open source container scheduling platform represented by kubernetes can effectively manage containers.
  • docker is an open source application container engine that allows source developers to package their applications and dependent packages into a portable container, and then publish it to any popular Linux machine, which can also be virtualized.
  • the following uses a container as an example to describe in detail the technical solutions provided by the embodiments of the present application.
  • If the virtualization environment where the running module 110 and the metadata extraction module 120 are located is a virtual machine, the running module 110, the metadata extraction module 120, and their submodules may be implemented by the virtual machine.
  • The embodiments of this application do not specifically limit the computing resources; they may be a central processing unit (CPU) or a graphics processing unit (GPU).
  • The source developer can pull the container image of a packaged related component, such as the container image of the training component, into the container environment. Through a command line or container startup script input by the source developer, the training container is created and started, and the model training process is performed in the training container.
  • the training sub-module 112 can run in the container environment built above, and perform the model training process according to the training program code input by the source developer.
  • Source developers can use network file system (NFS) shared storage or other storage products on the cloud platform, for example a distributed file system (DFS), to store the training program code in the first storage space 115.
  • the training sub-module 112 may train the model according to the stored training program code obtained in the first storage space 115.
  • the training sub-module 112 may also store the trained model in the first storage space 115 during the training process.
  • the inference sub-module 113 can access the first storage space 115, and can perform an inference process based on the trained model stored in the first storage space 115. Specifically, the inference sub-module 113 may determine the predicted output value according to the input training data and the trained model. And it can be determined whether the model trained by the training sub-module 112 is correct according to the error between the predicted output value and the prior knowledge of the training data.
  • prior knowledge is also called ground truth, and generally includes prediction results corresponding to training data provided by people.
  • the training data input to the model trained by the training sub-module 112 is the pixel information of the image, and the prior knowledge corresponding to the training data is that the label of the image is "dog".
  • the environment destruction sub-module 114 can destroy the created container environment.
  • the first storage space 115 will not be destroyed, and the trained model is stored in the first storage space 115, so that the inference sub-module 113 can perform the inference process according to the stored trained model.
  • Metadata extraction module 120
  • The metadata extraction module 120 can automatically extract metadata from the machine learning program code input by the source developer while the running module 110 performs the machine learning task, and the metadata can be used to reproduce the operating environment of the machine learning task.
  • The metadata extraction module 120 may also generate a description file from the extracted metadata and store it in the second storage space 121. When other developers want to reproduce the running environment of the above machine learning task, they can obtain the stored description file from the second storage space 121 and directly configure and debug the development environment according to the metadata included in the description file, thereby reproducing the target training environment and accelerating the dissemination of the model.
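  • A minimal sketch of generating such a description file, assuming a JSON serialization and hypothetical field values (the application does not fix the file format):

```python
import json

# Hypothetical extracted metadata; field names loosely follow Table 1 to Table 3.
metadata = {
    "framework": {"name": "tensorflow", "version": "1.13"},
    "model": {"name": "resnet50", "version": "v1", "source": "public"},
    "dataset": {"name": "imagenet", "version": "2012"},
}

# Serialize to a description file. In the described system this text would be
# written into the second storage space 121 mounted by the extraction container.
description = json.dumps(metadata, indent=2)
print(json.loads(description)["framework"]["name"])
```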
  • The source developer in the prior art usually provides one or more of the following three description standards of related metadata: the deep learning framework selected by the source developer, the model used by the source developer, and the dataset used by the source developer.
  • Table 1 — framework description standard:
      Name (string): the name of the deep learning framework chosen by the source developer
      Version (string): the version of the deep learning framework chosen by the source developer
  • The deep learning framework may include, but is not limited to: tensorflow, the convolutional neural network framework (CNNF), and the convolutional architecture for fast feature embedding (CAFFE).
  • Table 2 — model description standard:
      Name (string): the name of the model used by the source developer
      Version (string): the version of the model used by the source developer
      Source (string): the source of the model used by the source developer
      File name (file, object): the name of the model file used by the source developer
      Author (creator, string): the author of the model used by the source developer
      Time (ISO-8601): the creation time of the model used by the source developer
  • the models used by the source developers can include, but are not limited to: image recognition models, text recognition models, and so on.
  • The model used by the source developer can be a public model or a private model. If the source developer uses a public model, a uniform resource locator (URL) link to the public model is provided.
  • The file name of the model used by the source developer is not directly stored in the metadata description file; instead, the model file can be packaged together with the metadata description file under the described file name. If the model used by the source developer is a public model, the metadata description file records a URL link.
  • Table 3 — dataset description standard:
      Name (string): the name of the dataset used by the source developer
      Version (string): the version of the dataset used by the source developer
      Source (string): the source of the dataset used by the source developer
  • the URL link of the data set used by the source developer or the compressed file of the data set itself can be packaged with the metadata description file.
  • Metadata such as the aforementioned framework, model, and dataset is usually determined by the source developer when packaging the container image of the training component in the environment building submodule 111. Take as an example a source developer who writes a yet another markup language (YAML) file to package the container image of the training component.
  • The input program code includes key metadata such as the framework, model, and dataset selected and used by the source developer.
  • The embodiment of this application also provides one or more of the following three description standards of related metadata: the data processing method (data-process) used by the source developer, the structure of the model (model-architecture) used by the source developer, and the training parameters (training-parameters) used by the source developer during the training process.
  • the following describes the above-mentioned metadata in detail in conjunction with Table 4 to Table 6.
  • The dataset segmentation method defined by the source developer can be a process of processing the input dataset: for example, one part of the input dataset is used in the model training process, that is, as training data, and another part is used in the model inference process, that is, as test data.
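  • As a concrete illustration of such a segmentation (a sketch, with an assumed 80/20 split ratio):

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=0):
    """Split the input dataset into a training part and a test part,
    mirroring the segmentation described above."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_part, test_part = split_dataset(range(10))
print(len(train_part), len(test_part))  # 8 2
```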
  • One or more of the metadata items such as the above dataset processing method, model structure, and training parameters are usually hidden in the training program code stored by the source developer in the first storage space 115.
  • The metadata shown in Table 1 to Table 6 is automatically obtained while the running module 110 performs the machine learning task, so that the development environment can be directly configured and debugged according to the metadata, thereby reproducing the training environment of the target machine learning task.
  • the metadata extraction module 120 can obtain the metadata stored on the physical host computer as shown in Table 1 to Table 3 by sending a query command to the physical host computer.
  • the metadata extraction module 120 can extract the above-mentioned several types of metadata in a keyword search manner.
  • The complete flow of the machine learning task provided by the embodiment of the present application will be described in detail below with reference to FIG. 2 and FIG. 3.
  • the complete flow chart of the machine learning task may include the environment building process, the training process, and the inference process.
  • the above three processes will be described in detail below.
  • Step 210 The source developer packages the image of the training component and the image of the metadata extraction module.
  • When packaging the training component image, the source developer determines metadata such as the framework, model, and dataset shown in Table 1 to Table 3.
  • the training component can be jupyter notebook.
  • the jupyter notebook is an interactive web application. The source developer can use the jupyter notebook to input and adjust the model training program code online.
  • Step 215 The source developer starts the container image.
  • the source developer can store the training component image packaged in step 210 and the image of the metadata extraction module 120 in the container warehouse.
  • the container warehouse can manage, store and protect container images.
  • the container warehouse may be a container registry.
  • Source developers can enter container startup scripts or command lines to pull different versions of container images from the container warehouse to the container environment, and start the corresponding components in the container.
  • the training component is run in the training container
  • the metadata extraction module 120 is run in the extraction container.
  • container images may correspond to different metadata such as frameworks, models, and data sets.
  • the container startup script or command line may include information such as the name and version of the pulled container image, and the time when the container image was started.
  • the container group 310 providing training functions may include a training container and an extraction container.
  • The training container mounts the first storage space 115, and the extraction container mounts the second storage space 121.
  • the container group may be called a pod.
  • A pod is the smallest scheduling unit in kubernetes, and a pod can include multiple containers. A pod runs on a physical host; when scheduling is required, kubernetes schedules the pod as a whole.
  • the storage space mounted by the container can be a persistent volume (PV) in kubernetes.
  • A PV is a piece of network storage allocated by an administrator. A PV has a life cycle independent of any single pod: after the life cycle of a pod ends, the containers in the pod are destroyed, but the PVs mounted by those containers are not destroyed.
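  • The container-group layout described above can be sketched as the following pod specification, written here as the Python equivalent of the kubernetes YAML. All names (images, claim names, mount paths) are hypothetical; the point is that both storage spaces are backed by persistent volume claims, so they survive the pod's destruction.

```python
# Hypothetical pod spec for container group 310: a training container mounting
# the first storage space and an extraction container mounting the second,
# both backed by persistent volume claims (the PVs outlive the pod).
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "training-pod"},
    "spec": {
        "containers": [
            {"name": "training", "image": "train-image:v1",
             "volumeMounts": [{"name": "first-storage", "mountPath": "/data"}]},
            {"name": "extraction", "image": "extractor-image:v1",
             "volumeMounts": [{"name": "second-storage", "mountPath": "/meta"}]},
        ],
        "volumes": [
            {"name": "first-storage",
             "persistentVolumeClaim": {"claimName": "first-storage-pvc"}},
            {"name": "second-storage",
             "persistentVolumeClaim": {"claimName": "second-storage-pvc"}},
        ],
    },
}

# Deleting the pod removes both containers; the claimed PVs remain.
print([c["name"] for c in pod_spec["spec"]["containers"]])  # ['training', 'extraction']
```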
  • Step 220 The source developer inputs the training program code.
  • the source developer can input the training program code according to the metadata description standard shown in Table 1 to Table 6 through the training component (such as jupyter notebook) running in the training container.
  • the training program code includes metadata such as the data set processing mode, model structure, and training parameters shown in Table 4 to Table 6.
  • the input training program code may be stored in the first storage space 115 mounted on the training container.
  • the trained model will also be stored in the first storage space mounted on the training container.
  • Step 225 The metadata extraction module 120 extracts metadata and stores it in the second storage space 121 mounted on the extraction container.
  • The metadata extraction module 120 running in the extraction container extracts the above-mentioned metadata using the keyword extraction method and the metadata description standards shown in Table 1 to Table 6.
  • the metadata extraction module 120 extracts metadata such as frameworks, models, and data sets shown in Tables 1 to 3 from the container startup script and command line input by the source developer.
  • The metadata extraction module 120 extracts metadata such as the dataset processing method, model structure, and training parameters shown in Table 4 to Table 6 from the training program code stored by the source developer in the first storage space 115. For details, refer to the description of FIG. 4, which will not be repeated here.
  • the pod providing the training function will be destroyed, but the mounted first storage space 115 and the second storage space 121 will not be destroyed.
  • Step 230 Start the inference component image and the container image of the metadata extraction module 120.
  • The process of creating and starting the inference container image and the container image of the metadata extraction module 120 corresponds to step 215. For details, refer to the description of step 215, which will not be repeated here.
  • Step 235 The inference container performs inference services according to the trained model.
  • the container group 320 that provides inference functions may include inference containers and extraction containers.
  • The first storage space 115 mounted by the training container can be remounted to the inference container, and the second storage space 121 mounted by the extraction container in the training container group can be remounted to the extraction container in the container group providing the inference function.
  • the inference container can perform inferences based on the trained model stored in the mounted first storage space 115.
  • the extraction container in the container group that provides the inference function can also obtain the metadata that may be generated during the inference process and store it in the mounted second storage space 121.
  • FIG. 4 is a schematic flowchart of a method for extracting metadata by the metadata extraction module 120 according to an embodiment of the present application.
  • the method shown in FIG. 4 may include steps 410-420, and steps 410-420 will be described in detail below.
  • the metadata extraction module 120 shown in FIG. 1 can be divided into two parts, which are a first metadata extraction module and a second metadata extraction module.
  • the first metadata extraction module may be used to extract, from the physical host side, metadata such as the framework, model, and data set shown in Tables 1 to 3, which are determined by the source developer when packaging the container image of the training component.
  • the second metadata extraction module can be used to extract metadata such as the data set processing method, model structure, and training parameters shown in Table 4 to Table 6 from the training program code stored by the source developer in the storage space mounted on the training container.
  • For example, when the resource scheduling platform is kubernetes, the first metadata extraction module may be a job extractor, and the job extractor may be implemented through the kubectl command line; the second metadata extraction module is a code extractor.
  • The following description takes kubernetes as the resource scheduling platform as an example.
  • Step 410 The first metadata extraction module sends a query command to the physical host side to extract metadata such as the framework, model, and data set shown in Table 1 to Table 3.
  • Metadata such as the framework, model, and data set shown in Table 1 to Table 3 has been determined by the source developer when packaging the container image of the training component, and the container image is stored in the container warehouse.
  • the source developer will enter the container startup script or command line to pull different versions of the container image from the container warehouse.
  • Different versions of the container image correspond to different metadata such as frameworks, models, and datasets.
  • the gateway (for example, egress) can be configured so that the first metadata extraction module (for example, the job extractor) accesses the Internet protocol (IP) address of the physical host through the egress and obtains metadata such as the framework, model, and data set by sending query command lines.
  • When the resource scheduling platform is kubernetes, the "kubectl get" command line can be sent to dynamically extract, by keyword extraction, the relevant metadata information from the container startup script and command line on the physical host side, for example, the name and version of the container image, the start time of the container image, and metadata such as the framework, model, and data set. The extracted metadata is stored in the mounted second storage space 121 in JavaScript object notation (JSON) or another file format.
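As an illustrative sketch of how a job extractor of this kind might process the output of a kubectl query, the following Python fragment pulls the container image name and version, the start time, and the container command line out of a pod description in JSON form. The field paths follow the standard kubernetes pod schema; the mapping onto the Table 1 to Table 3 metadata, and the sample values, are assumptions for illustration only, not the patented implementation.

```python
import json

def extract_job_metadata(pod_json: dict) -> dict:
    """Pull Table 1-Table 3 style metadata out of a kubernetes pod
    description (the JSON returned by `kubectl get pod <name> -o json`)."""
    container = pod_json["spec"]["containers"][0]
    image = container["image"]              # e.g. "tensorflow/tensorflow:1.14.0"
    name, _, version = image.partition(":")
    return {
        "image_name": name,
        "image_version": version or "latest",
        "start_time": pod_json["status"].get("startTime", ""),
        # The framework, model, and data set are assumed to be encoded in the
        # image name or the container command line by the source developer.
        "command": container.get("command", []),
    }

# A minimal inline sample stands in for the real `kubectl get` output.
sample = {
    "spec": {"containers": [{"image": "tensorflow/tensorflow:1.14.0",
                             "command": ["python", "train.py"]}]},
    "status": {"startTime": "2019-03-15T08:00:00Z"},
}
meta = extract_job_metadata(sample)
print(json.dumps(meta))
```

In practice the resulting dictionary would be serialized to a JSON file in the mounted second storage space 121.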
  • Step 420 The second metadata extraction module extracts metadata such as the data set processing method, model structure, and training parameters shown in Table 4 to Table 6 from the first storage space 115 mounted on the training container.
  • Metadata such as the data set processing method, model structure, and training parameters shown in Table 4 to Table 6 has been stored by the source developer in the first storage space 115 mounted on the training container. Therefore, the second metadata extraction module (for example, the code extractor) can extract, by keyword search and in accordance with the metadata description standards shown in Table 4 to Table 6, metadata such as the data set processing method, model structure, and training parameters from the training program code stored in the first storage space 115, and store it in the mounted second storage space 121 in JSON or another file format.
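The keyword-search extraction performed by the code extractor can be sketched in Python as follows; the keyword list, the regular expression, and the inline training-code snippet are illustrative assumptions rather than the actual implementation.

```python
import json
import re

# Hypothetical keyword list covering Table 4-Table 6 metadata: data set
# processing, model structure, and training parameters.
KEYWORDS = ("batch_size", "learning_rate", "epochs", "optimizer", "num_layers")

def extract_code_metadata(training_code: str) -> dict:
    """Scan training program code for `keyword = value` assignments."""
    metadata = {}
    for key in KEYWORDS:
        match = re.search(rf"\b{key}\s*=\s*([^\s#]+)", training_code)
        if match:
            metadata[key] = match.group(1)
    return metadata

# The training code would normally be read from the first storage space 115
# mounted on the training container; an inline snippet stands in for it here.
training_code = """
batch_size = 32
learning_rate = 0.001
epochs = 10
"""
meta = extract_code_metadata(training_code)
print(json.dumps(meta))  # would then be written to the second storage space 121
```

Only keywords actually present in the code appear in the result, so the same extractor tolerates training programs that set different subsets of parameters.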
  • the extracted metadata can be integrated, and the integrated metadata can be stored in the second storage space 121 in the form of "metadata description file + model + data set".
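The integration step described above can be sketched as merging the two groups of extracted metadata into a single description file; the file name `metadata_description.json` and the directory layout are hypothetical choices made for this example only.

```python
import json
import os
import tempfile

def write_description_file(job_meta: dict, code_meta: dict, out_dir: str) -> str:
    """Merge job-level (Table 1-Table 3) and code-level (Table 4-Table 6)
    metadata into one description file, stored next to the model and the
    data set in the second storage space."""
    description = {"job": job_meta, "code": code_meta}
    path = os.path.join(out_dir, "metadata_description.json")
    with open(path, "w") as f:
        json.dump(description, f, indent=2)
    return path

out_dir = tempfile.mkdtemp()  # stands in for the mounted second storage space 121
path = write_description_file(
    {"image_name": "tensorflow/tensorflow", "image_version": "1.14.0"},
    {"batch_size": "32", "epochs": "10"},
    out_dir,
)
with open(path) as f:
    restored = json.load(f)
print(restored["job"]["image_version"])
```

Reading the file back, as shown at the end, is exactly what another developer would do when reproducing the source development environment.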
  • Through the above workflow of the machine learning task, the metadata shown in Table 1 to Table 6 that the source developer uses during environment building or model training can be automatically obtained and stored by the metadata extraction module. After the machine learning task ends, if the source developer or other developers need to reproduce the source development environment, they can use the saved metadata to rebuild the workflow of the entire life cycle of the machine learning task, thereby reproducing the source development environment.
  • FIG. 5 is a schematic structural diagram of a system 500 for extracting metadata in a machine learning training process provided by an embodiment of the present application.
  • the system 500 may include at least one server.
  • FIG. 5 takes the server 510 and the server 520 as examples for description.
  • the structures of the server 510 and the server 520 are similar.
  • the running module 110 shown in FIG. 1 may be run on at least one server, for example, the running module 110 is run on the server 510 and the server 520 respectively.
  • the metadata extraction module 120 may run on each of at least one server, for example, the metadata extraction module 120 runs on the server 510 and the server 520 respectively. As another example, the metadata extraction module 120 may also run on a part of at least one server. For example, the metadata extraction module 120 runs on the server 510 or runs on the server 520. As another example, the metadata extraction module 120 may also run on other servers besides the aforementioned at least one server. For example, the metadata extraction module 120 runs on the server 530.
  • the system 500 may execute the above-mentioned method for extracting metadata in the machine learning training process.
  • at least one server in the system 500 may include at least one processor and a memory.
  • the memory is used to store program instructions.
  • the processor included in the at least one server can execute the program instructions stored in the memory to implement the above-mentioned method for extracting metadata in the machine learning training process, or to implement the running module 110 and the metadata extraction module 120 shown in FIG. 1, which are in turn used to implement the above-mentioned method for extracting metadata in the machine learning training process.
  • the server 510 may include: at least one processor (for example, the processor 511 and the processor 516), a memory 512, a communication interface 513, and an input/output interface 514.
  • At least one processor may be connected to the memory 512.
  • the memory 512 can be used to store program instructions.
  • the memory 512 may be a storage unit inside the at least one processor, an external storage unit independent of the at least one processor, or a combination of an internal storage unit and an independent external storage unit.
  • the memory 512 can be a solid state drive (SSD), a hard disk drive (HDD), a read-only memory (ROM), a random access memory (RAM), etc.
  • the server 510 may further include a bus 515.
  • the memory 512, the input/output interface 514, and the communication interface 513 may be connected to at least one processor through a bus 515.
  • the bus 515 may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus 515 can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one line is used in FIG. 5, but this does not mean that there is only one bus or one type of bus.
  • the system 500 may further include a cloud storage 540.
  • the cloud storage 540 can be used as an external storage and connected to the system 500.
  • the above-mentioned program instructions may be stored in the memory 512 or the cloud storage 540.
  • At least one processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like; alternatively, one or more integrated circuits may be used to execute related programs to implement the technical solutions provided in the embodiments of the present application.
  • the processor 511 is taken as an example, and the processor 511 runs the running module 110.
  • the running module 110 may include multiple sub-modules, for example, the environment construction sub-module 111, the training sub-module 112, the inference sub-module 113, and the environment destruction sub-module 114 shown in FIG. 1.
  • the first storage space 115 of the memory 512 stores the training program code input by the source developer.
  • the training program code includes metadata such as the data set processing method, model structure, and training parameters described in Table 4 to Table 6.
  • the metadata extracted by the metadata extraction module 120 is stored in the second storage space 121.
  • the third storage space 5121 stores the training container startup script input by the source developer.
  • the training container startup script includes one or more of metadata such as the framework, model, and data set shown in Table 1 to Table 3.
  • the processor 511 obtains the stored program instructions from the memory 512 to run the above-mentioned machine learning tasks.
  • the environment building sub-module 111 in the running module 110 obtains the container startup script from the third storage space 5121 of the memory 512, and executes the above-mentioned container environment building process.
  • the training sub-module 112 in the running module 110 obtains the training program code from the first storage space 115 of the memory 512 to execute the training process of the above model, and can store the training result of the model in the first storage space 115.
  • the metadata extraction module 120 can extract one or more of metadata such as the data set processing method, model structure, and training parameters described in Table 4 to Table 6 from the training program code stored in the first storage space 115 of the memory 512.
  • the metadata extraction module 120 may also extract one or more of metadata such as frameworks, models, and data sets as shown in Tables 1 to 3 from the container startup script stored in the third storage space 5121.
  • the metadata extraction module 120 may also generate a description file from the extracted metadata, and store the generated description file in the second storage space 121 of the memory 512.
  • the size of the sequence numbers of the above-mentioned processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.


Abstract

A method for extracting metadata in a machine learning training process, applied to a virtualized environment. The method comprises: running a machine learning task in the virtualized environment according to machine learning program code input by a user; extracting metadata from the machine learning program code, the metadata being used for reproducing the running environment of the machine learning task; and storing the metadata in a first storage space. In this technical solution, the relevant metadata required for reproducing a specific training environment is automatically extracted during the training process of a target machine learning task; when other developers want to reproduce the specific training environment, it can be reproduced according to the stored metadata, so that the propagation of models is accelerated.

Description

Method and apparatus for extracting metadata in a machine learning training process

Technical Field

This application relates to the field of cloud computing, and more specifically, to a method, apparatus, and computer-readable storage medium for extracting metadata in a machine learning training process.

Background

Machine learning (ML) is a multi-field interdisciplinary subject that studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications cover all fields of artificial intelligence.

The workflow of a machine learning task can include environment construction, a model training process, and a model inference process. After a source developer trains a model through the above process, the trained model is provided to other developers. Other developers who want to reproduce the training process need to fully reproduce the source development environment. However, in the process of reproducing the source development environment, other developers need to spend a lot of time building and debugging a training environment compatible with the target machine learning task, which brings great inconvenience to the dissemination of models.
发明内容Summary of the invention
本申请提供一种提取机器学习训练过程中的元数据的方法、装置,开发者可以在目标机器学习任务的训练过程中,自动提取复现一个特定的训练环境时所需要的一些相关的元数据,在其他开发者想要复现一个特定的训练环境时,可以根据存储的相关的元数据对特定的训练环境进行复现,加快了模型的传播。This application provides a method and device for extracting metadata in the process of machine learning training. Developers can automatically extract some relevant metadata needed to reproduce a specific training environment during the training process of the target machine learning task. , When other developers want to reproduce a specific training environment, they can reproduce the specific training environment according to the stored related metadata, which speeds up the spread of the model.
第一方面,提供了一种提取机器学习训练过程中的元数据的方法,所述方法应用于虚拟化环境,所述方法包括:根据用户输入的机器学习程序代码在所述虚拟化环境中运行机器学习任务;从所述机器学习程序代码中提取元数据,所述元数据用于对所述机器学习任务的运行环境进行复现;将所述元数据存储在第一存储空间。In a first aspect, a method for extracting metadata in a machine learning training process is provided. The method is applied to a virtualized environment, and the method includes: running in the virtualized environment according to a machine learning program code input by a user Machine learning task; extracting metadata from the machine learning program code, the metadata being used to reproduce the operating environment of the machine learning task; storing the metadata in a first storage space.
在一种可能的实现方式中,通过关键字搜索的方式,按照所述元数据的类型从所述机器学习程序代码中提取出所述元数据。In a possible implementation manner, the metadata is extracted from the machine learning program code according to the type of the metadata by way of keyword search.
在另一种可能的实现方式中,所述虚拟化环境通过至少一个训练容器运行所述机器学习任务,所述元数据包括第一类元数据。可以按照所述第一类元数据的类型从输入的训练容器启动脚本中提取出所述第一类元数据,所述训练容器启动脚本用于启动所述至少一个训练容器。In another possible implementation manner, the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the first type of metadata. The first type of metadata may be extracted from the input training container startup script according to the type of the first type of metadata, and the training container startup script is used to start the at least one training container.
在另一种可能的实现方式中,所述第一类元数据的类型包括以下任何一个或多个:所述机器学习任务使用的框架、所述机器学习任务使用的模型、所述机器学习任务的训练过程中使用的数据集。In another possible implementation manner, the type of the first type of metadata includes any one or more of the following: a framework used by the machine learning task, a model used by the machine learning task, and the machine learning task The dataset used in the training process.
在另一种可能的实现方式中,所述虚拟化环境通过至少一个训练容器运行所述机器学习任务,所述元数据包括第二类元数据。可以按照所述第二类元数据的类型从输入的训练程序代码中提取出所述元数据,所述训练程序代码存储在所述至少一个训练 容器挂载的第二存储空间中,所述训练程序代码用于在所述至少一个训练容器中运行所述机器学习任务的模型训练过程。In another possible implementation manner, the virtualized environment runs the machine learning task through at least one training container, and the metadata includes the second type of metadata. The metadata may be extracted from the input training program code according to the type of the second type of metadata, the training program code is stored in a second storage space mounted on the at least one training container, and the training The program code is used to run the model training process of the machine learning task in the at least one training container.
在另一种可能的实现方式中,所述第二类元数据的类型包括以下任何一个或多个:所述机器学习任务的训练过程中使用的数据集的处理方式、所述机器学习任务的训练过程中使用的模型的结构、所述机器学习任务的训练过程中使用的训练参数。In another possible implementation manner, the type of the second type of metadata includes any one or more of the following: the processing method of the data set used in the training process of the machine learning task, the processing method of the machine learning task The structure of the model used in the training process, and the training parameters used in the training process of the machine learning task.
According to a second aspect, an apparatus for extracting metadata in a machine learning training process is provided. The apparatus runs in a virtualized environment and includes:

a running module, configured to run a machine learning task in the virtualized environment according to machine learning program code input by a user;

a metadata extraction module, configured to extract metadata from the machine learning program code, the metadata being used to reproduce the running environment of the machine learning task;

the metadata extraction module is further configured to store the metadata in a first storage space.

In a possible implementation, the metadata extraction module is specifically configured to extract the metadata from the machine learning program code by keyword search according to the type of the metadata.

In another possible implementation, the virtualized environment runs the machine learning task through at least one training container, and the metadata includes first-type metadata; the metadata extraction module is specifically configured to extract the first-type metadata, according to its type, from an input training container startup script, where the training container startup script is used to start the at least one training container.

In another possible implementation, the type of the first-type metadata includes any one or more of the following: the framework used by the machine learning task, the model used by the machine learning task, and the data set used in the training process of the machine learning task.

In another possible implementation, the virtualized environment runs the machine learning task through at least one training container, and the metadata includes second-type metadata; the metadata extraction module is specifically configured to extract the metadata, according to the type of the second-type metadata, from input training program code, where the training program code is stored in a second storage space mounted on the at least one training container and is used to run the model training process of the machine learning task in the at least one training container.

In another possible implementation, the type of the second-type metadata includes any one or more of the following: the processing method of the data set used in the training process of the machine learning task, the structure of the model used in the training process of the machine learning task, and the training parameters used in the training process of the machine learning task.
According to a third aspect, a system for extracting metadata in a machine learning training process is provided. The system includes at least one server, and each server includes a memory and at least one processor, where the memory is used to store program instructions. When the at least one server runs, the at least one processor executes the program instructions in the memory to perform the method in the first aspect or any one of its possible implementations, or to implement the running module and the metadata extraction module in the second aspect or any one of its possible implementations.

In a possible implementation, the running module may run on the multiple servers, and the metadata extraction module may run on each of the multiple servers.

In another possible implementation, the metadata extraction module may run on a part of the multiple servers.

In another possible implementation, the metadata extraction module may run on any server other than the above-mentioned multiple servers.

Optionally, the processor may be a general-purpose processor, which may be implemented by hardware or by software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented by software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, where the memory may be integrated in the processor or located outside the processor and exist independently.

According to a fourth aspect, a non-transitory readable storage medium is provided, including program instructions. When the program instructions are run by a computer, the computer executes the method in the first aspect or any one of its possible implementations.

According to a fifth aspect, a computer program product is provided, including program instructions. When the program instructions are run by a computer, the computer executes the method in the first aspect or any one of its possible implementations.

On the basis of the implementations provided in the above aspects, this application can be further combined to provide more implementations.
Brief Description of the Drawings

FIG. 1 is a schematic block diagram of an apparatus 100 for running a machine learning task provided by an embodiment of the present application.

FIG. 2 is a schematic flowchart of executing a machine learning task provided by an embodiment of the present application.

FIG. 3 is a schematic block diagram of a container environment 300 provided by an embodiment of the present application.

FIG. 4 is a schematic flowchart of a method for extracting metadata by a metadata extraction module provided by an embodiment of the present application.

FIG. 5 is a schematic structural diagram of a system 500 for extracting metadata in a machine learning training process provided by an embodiment of the present application.
Detailed Description

The technical solutions in this application will be described below with reference to the drawings.

Machine learning (ML) is a multi-field interdisciplinary subject that studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications cover all fields of artificial intelligence. The workflow of a machine learning task can include environment construction, a model training process, and a model inference process.

FIG. 1 is a schematic block diagram of an apparatus 100 for running a machine learning task provided by an embodiment of the present application. The apparatus 100 may include a running module 110, a metadata extraction module 120, and a second storage space mounted by the metadata extraction module 120. These modules are described in detail below.

The running module 110 may include multiple sub-modules, for example: an environment construction sub-module 111, a training sub-module 112, an inference sub-module 113, and an environment destruction sub-module 114.

It should be understood that the running module 110, the metadata extraction module 120, and their sub-modules can run in a virtualized environment; for example, they can be implemented using containers, or alternatively using virtual machines, which is not specifically limited in the embodiments of this application.
(1)环境搭建子模块111:(1) Environment building sub-module 111:
The environment building sub-module 111 is used to build the training environment for a machine learning task. Building the machine learning task environment is essentially the scheduling of computer hardware resources, which may include but are not limited to computing resources and storage resources.
As machine learning tasks become more complex and computation-intensive, moving them to the cloud and into containers has become a development trend. Container technology, represented by docker, has gradually matured; it uses images to create a virtualized operating environment in which related components can be deployed. Container technology provides computing and storage resources by directly invoking the computing and storage resources of the physical machine, thereby supplying hardware resources for machine learning tasks. For example, the open-source container scheduling platform represented by kubernetes can effectively manage containers.
It should be understood that docker is an open-source application container engine that lets source developers package their applications and dependencies into a portable container and then publish it to any popular Linux machine; it can also implement virtualization. For ease of description, the following uses a container as an example to describe in detail the technical solutions provided by the embodiments of this application. When the virtualized environment in which the running module 110 and the metadata extraction module 120 reside is a virtual machine, the running module 110, the metadata extraction module 120, and their sub-modules may be implemented by a virtual machine.
The embodiments of this application do not specifically limit the computing resource. It may be a central processing unit (CPU) or a graphics processing unit (GPU).
Specifically, the source developer can pull the container images of the packaged related components, for example the container image of the training component, into the container environment by packaging the images. Through a command line or container startup script entered by the source developer, the training container is created and started, and the model training process is performed in that training container.
(2) Training sub-module 112:
The training sub-module 112 can run in the container environment built above and perform the model training process according to the training program code input by the source developer.
Specifically, the source developer can store the training program code in the first storage space 115 using network file system (NFS) shared storage or another storage product on the cloud platform, for example a distributed file system (DFS). The first storage space 115 can be mounted in the started training container. The training sub-module 112 can train the model according to the training program code obtained from the first storage space 115.
During training, the training sub-module 112 may also store the trained model in the first storage space 115.
(3) Inference sub-module 113:
The inference sub-module 113 can access the first storage space 115 and perform an inference process based on the trained model stored there. Specifically, the inference sub-module 113 may determine a predicted output value according to the input training data and the trained model, and may determine whether the model trained by the training sub-module 112 is correct according to the error between the predicted output value and the prior knowledge of the training data.
It should be understood that prior knowledge is also called the ground truth and generally includes the prediction results, provided by humans, corresponding to the training data.
For example, suppose the machine learning task is applied in the field of image recognition. The training data input to the model trained by the training sub-module 112 is the pixel information of an image, and the prior knowledge corresponding to the training data is that the label of the image is "dog". The training data of the image labeled "dog" is input into the trained model, and whether the predicted value output by the model is "dog" is judged. If the output of the model is "dog", it can be determined that the model predicts accurately.
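As an illustrative sketch (not part of the original disclosure), this correctness check can be expressed as comparing each predicted label against the ground truth; the toy model and samples below are hypothetical stand-ins for a trained classifier and labeled training data.

```python
# Sketch of the check the inference sub-module performs: compare the
# model's predicted label with the prior knowledge (ground truth).

def evaluate(model, samples):
    """Return the fraction of samples whose prediction matches the ground truth.

    `model` is any callable mapping input features to a predicted label;
    `samples` is a list of (features, ground_truth_label) pairs.
    """
    correct = 0
    for features, truth in samples:
        if model(features) == truth:
            correct += 1
    return correct / len(samples)

# Hypothetical stand-in for a trained image classifier.
toy_model = lambda pixels: "dog" if sum(pixels) > 10 else "cat"

samples = [([5, 6], "dog"), ([1, 2], "cat"), ([9, 9], "dog")]
accuracy = evaluate(toy_model, samples)  # 1.0 for this toy data
```

If the accuracy is high enough, the trained model stored in the first storage space 115 can be considered to predict accurately.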
(4) Environment destruction sub-module 114:
After the above training process ends, the environment destruction sub-module 114 can destroy the created container environment. However, the first storage space 115, which stores the trained model, is not destroyed, so that the inference sub-module 113 can perform the inference process according to the stored trained model.
(5) Metadata extraction module 120:
While the running module 110 performs the machine learning task, the metadata extraction module 120 can automatically extract metadata from the machine learning program code input by the source developer. This metadata can be used to reproduce the running environment of the machine learning task.
The metadata extraction module 120 can also generate a description file from the extracted metadata and store the generated description file in the second storage space 121. When other source developers want to reproduce the running environment of the above machine learning task, they can obtain the stored description file from the second storage space 121 and directly configure and debug the development environment according to the relevant metadata included in the description file, thereby reproducing the target training environment and accelerating the spread of the model.
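As an illustrative sketch of generating such a description file (the field values, file name, and use of a temporary directory are hypothetical stand-ins for the second storage space 121, not the exact schema of the disclosure):

```python
import json
import os
import tempfile

# Hypothetical extracted metadata; field names follow Tables 1-3 below.
metadata = {
    "framework": {"name": "tensorflow", "version": "1.14"},
    "model": {"name": "resnet50", "version": "v1", "source": "public"},
    "dataset": {"name": "cifar10", "version": "1.0",
                "source": "https://example.com/cifar10"},
}

out_dir = tempfile.mkdtemp()  # stand-in for the second storage space 121
path = os.path.join(out_dir, "description.json")
with open(path, "w") as f:
    json.dump(metadata, f, indent=2)

# Another developer can later read the description file back to
# configure and debug a matching development environment.
with open(path) as f:
    restored = json.load(f)
```

The round trip shows that the description file preserves exactly the metadata that was extracted.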
To reproduce the training environment of a specific machine learning task, in the prior art the source developer usually provides one or more of the following three description standards of related metadata: the deep learning framework selected by the source developer, the model used by the source developer, and the dataset used by the source developer. These kinds of metadata are described in detail below with reference to Tables 1-3.
Table 1 Framework (framework)

Attribute | Type   | Description
name      | string | Name of the deep learning framework selected by the source developer
version   | string | Version of the deep learning framework selected by the source developer
As shown in Table 1, deep learning frameworks can include, but are not limited to: tensorflow, the convolutional neural network framework (CNNF), and the convolutional architecture for fast feature embedding (CAFFE).
It should be understood that, in addition to supporting common network structures such as the convolutional neural network (CNN) and the recurrent neural network (RNN), tensorflow can also support deep reinforcement learning and other computation-intensive scientific computing (such as solving partial differential equations).
Table 2 Model (model)

Attribute | Type     | Description
name      | string   | Name of the model used by the source developer
version   | string   | Version of the model used by the source developer
source    | string   | Source of the model used by the source developer
file      | object   | File name of the model used by the source developer
creator   | string   | Author of the model used by the source developer
time      | ISO-8601 | Creation time of the model used by the source developer
As shown in Table 2, the models used by source developers can include, but are not limited to, image recognition models, text recognition models, and so on.
It should be noted that the model used by the source developer can be a public model or a private model. If the source developer uses a public model, the public model provides a uniform resource location (URL) link.
It should also be noted that the file name of the model used by the source developer is not stored directly in the metadata description file; the model file can be packaged together with the metadata description file, described by its file name. If the model used by the source developer is a public model, what accompanies the metadata description file is a URL link.
Table 3 Dataset (dataset)

Attribute | Type   | Description
name      | string | Name of the dataset used by the source developer
version   | string | Version of the dataset used by the source developer
source    | string | Source of the dataset used by the source developer
As shown in Table 3, the URL link of the dataset used by the source developer, or a compressed file of the dataset itself, can be packaged together with the metadata description file.
Referring to Tables 1-3, metadata such as the above framework, model, and dataset is usually determined in the environment building sub-module 111 by the source developer packaging the container image of the training component. Taking as an example a source developer writing and launching a yet another markup language (YAML) file to package the container image of the training component, the program code entered when writing and launching the YAML file includes key metadata such as the deep learning framework, model, and dataset selected and used by the source developer.
However, it is difficult to reproduce the training environment of a machine learning task relying only on the metadata provided in Tables 1-3. The embodiments of this application therefore further provide one or more of the following three description standards of related metadata: the processing of the dataset used by the source developer (data-process), the structure of the model used by the source developer (model-architecture), and the training parameters used by the source developer during training (training-parameters). These kinds of metadata are described in detail below with reference to Tables 4-6.
Table 4 Dataset processing (data-process)
[Table 4 is provided as an image in the original publication: PCTCN2020070577-appb-000001.]
Referring to Table 4, the dataset segmentation defined by the source developer can be a process of processing the input dataset. For example, one part of the input dataset is used in the model training process, that is, this part can serve as training data during model training; another part of the input dataset is used in the model inference process, that is, this part can serve as test data during model inference.
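As an illustrative sketch of such a data-process split (the 0.8 ratio and the toy dataset are hypothetical example values, not values mandated by Table 4):

```python
# Split an input dataset into a training portion and a test portion,
# as described for the data-process metadata.

def split_dataset(dataset, train_ratio=0.8):
    """Return (training_data, test_data) split at `train_ratio`."""
    cut = int(len(dataset) * train_ratio)
    return dataset[:cut], dataset[cut:]

data = list(range(10))        # hypothetical input dataset
train, test = split_dataset(data)
# train -> first 8 samples for the training process
# test  -> remaining 2 samples for the inference process
```

Recording the split ratio as metadata lets another developer reproduce exactly the same training/test partition.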
Table 5 Model structure (model-architecture)
[Table 5 is provided as images in the original publication: PCTCN2020070577-appb-000002 and PCTCN2020070577-appb-000003.]
Table 6 Training parameters (training-params)
[Table 6 is provided as an image in the original publication: PCTCN2020070577-appb-000004.]
Referring to Tables 4-6, one or more of the metadata items such as the above dataset processing, model structure, and training parameters are usually hidden in the training program code stored by the source developer in the first storage space 115.
In the embodiments of this application, the metadata shown in Tables 1-6 is automatically obtained while the above running module 110 performs the machine learning task, and the development environment is directly configured and debugged according to that metadata, thereby reproducing the training environment of the target machine learning task.
According to the metadata description standards shown in Tables 1-6 above, there are six major categories of metadata to extract: the deep learning framework selected by the source developer, the model, the dataset, the dataset processing (data-process), the model structure (model-architecture), and the training parameters (training-params) used during training. Since different metadata is determined in different ways, the specific implementations for extracting these six categories of metadata also differ.
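The six categories can be pictured as a single description-file skeleton. This sketch is illustrative only; the placeholder comments are assumptions rather than the exact schema of Tables 1-6.

```python
# Skeleton of a description covering the six metadata categories.
description = {
    "framework":          {"name": "", "version": ""},
    "model":              {"name": "", "version": "", "source": "",
                           "file": {}, "creator": "", "time": ""},
    "dataset":            {"name": "", "version": "", "source": ""},
    "data-process":       {},  # e.g. train/test split ratio
    "model-architecture": {},  # e.g. layers, activation functions
    "training-params":    {},  # e.g. learning rate, batch size, epochs
}
```

The first three categories are determined when the container image is packaged; the last three are carried in the training program code, which is why they are extracted by different mechanisms below.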
Taking the extraction of the framework, model, and dataset metadata shown in Tables 1-3 as an example: since this metadata is usually determined by the source developer when packaging the container image of the training component, it is stored on the physical host machine that starts the training container. Therefore, the metadata extraction module 120 can obtain the metadata shown in Tables 1-3 stored on the physical host by sending a query command to the physical host.
Taking the extraction of the dataset processing, model structure, and training parameter metadata shown in Tables 4-6 as an example: since this metadata is determined after the training container is created and started, and is included in the training program code the source developer stores in the storage space mounted to the training container, the metadata extraction module 120 can obtain the metadata shown in Tables 4-6 by accessing the training program code stored in that mounted storage space.
In the embodiments of this application, the metadata extraction module 120 can extract the above types of metadata by keyword search. The complete flow of the machine learning task provided by the embodiments of this application is described in detail below with reference to Figures 2-3.
Referring to Figure 2, the complete flow of the machine learning task may include an environment building process, a training process, and an inference process, which are described in detail below.
(1) Environment building process:
Step 210: The source developer packages the image of the training component and the image of the metadata extraction module.
When packaging the training component image, the source developer determines metadata such as the framework, model, and dataset shown in Tables 1-3.
In the case where the resource scheduling platform is kubernetes, the training component can be a jupyter notebook. The jupyter notebook is an interactive web application; the source developer can use it to input and adjust the model's training program code online.
Step 215: The source developer starts the container images.
The source developer can store the training component image packaged in step 210 and the image of the metadata extraction module 120 in a container repository. It should be understood that the container repository can manage, store, and protect container images. For example, the container repository can be a container registry.
The source developer can enter a container startup script or command line to pull different versions of container images from the container repository into the container environment and start the corresponding components in containers. For example, the training component runs in the training container, and the metadata extraction module 120 runs in the extraction container.
It should be understood that different versions of container images can correspond to different metadata such as frameworks, models, and datasets.
It should also be understood that the container startup script or command line can include information such as the name and version of the pulled container image and the time the container image was started.
Specifically, referring to the container environment 300 in Figure 3, the container group 310 providing training functions may include a training container and an extraction container; the training container mounts the first storage space 115, and the extraction container mounts the second storage space 121.
In the case where the resource scheduling platform is kubernetes, a container group may be called a pod. A pod is the smallest scheduling unit in kubernetes, and one pod can include multiple containers. A pod can run on a certain physical host; when scheduling is required, kubernetes schedules the pod as a whole.
In kubernetes, the storage space mounted by a container can be a persistent volume (PV), which is a section of network storage allocated by a network administrator. A PV has a life cycle independent of any single pod; that is, after the pod's life cycle ends, the containers in the pod are destroyed, but the PVs mounted by those containers are not.
(2) Training process:
Step 220: The source developer inputs the training program code.
The source developer can input the training program code according to the metadata description standards shown in Tables 1-6 through the training component (for example, a jupyter notebook) running in the training container. The training program code includes metadata such as the dataset processing, model structure, and training parameters shown in Tables 4-6.
The input training program code can be stored in the first storage space 115 mounted to the training container.
It should be noted that, during model training, if the input training program code needs to be modified, it can be input and adjusted online through the jupyter notebook.
After the model training process ends, the trained model is also stored in the first storage space mounted to the training container.
Step 225: The metadata extraction module 120 extracts metadata and stores it in the second storage space 121 mounted to the extraction container.
The metadata extraction module 120 running in the extraction container extracts the above metadata by keyword extraction, according to the metadata description standards shown in Tables 1-6.
Since different metadata is determined in different ways, the specific implementations for extracting the six categories of metadata also differ. One possible implementation is for the metadata extraction module 120 to extract the framework, model, and dataset metadata shown in Tables 1-3 from the container startup script or command line input by the source developer. Another possible implementation is for the metadata extraction module 120 to extract the dataset processing, model structure, and training parameter metadata shown in Tables 4-6 from the training program code the source developer stores in the first storage space 115. For details, see the description of Figure 4, which is not repeated here.
It should also be noted that when the model training task ends, the pod providing the training function is destroyed, but the mounted first storage space 115 and second storage space 121 are not.
(3) Inference process:
Step 230: Start the inference component image and the container image of the metadata extraction module 120.
The process of creating and starting the inference container image and the container image of the metadata extraction module 120 corresponds to step 215; for details, see the description of step 215, which is not repeated here.
Step 235: The inference container performs inference services according to the trained model.
Specifically, referring to the container environment 300 in Figure 3, the container group 320 providing inference functions may include an inference container and an extraction container. The first storage space 115 mounted by the training container can be remounted to the inference container, and the second storage space 121 mounted by the extraction container in the training container group can be remounted to the extraction container in the container group providing the inference function.
The inference container can perform inference according to the trained model stored in the mounted first storage space 115. The extraction container in the container group providing the inference function can also obtain metadata that may be generated during the inference process and store it in the mounted second storage space 121.
The process by which the metadata extraction module 120 running in the extraction container extracts metadata is described in detail below with reference to the example in Figure 4.
Figure 4 is a schematic flowchart of a method for the metadata extraction module 120 to extract metadata according to an embodiment of this application. The method shown in Figure 4 may include steps 410-420, which are described in detail below.
It should be understood that, according to the type of metadata extracted, the metadata extraction module 120 shown in Figure 1 can be divided into two parts: a first metadata extraction module and a second metadata extraction module.
The first metadata extraction module can be used to extract from the physical host the framework, model, and dataset metadata shown in Tables 1-3 that the source developer determined when packaging the container image of the training component. The second metadata extraction module can be used to extract the dataset processing, model structure, and training parameter metadata shown in Tables 4-6 from the training program code the source developer stores in the storage space mounted to the training container.
Optionally, in some embodiments, the resource scheduling platform is kubernetes, the first metadata extraction module can be a job extractor, and the job extractor is a kubectl command line. The second metadata extraction module is a code extractor. For ease of description, kubernetes is used as the resource scheduling platform in the following example.
Step 410: The first metadata extraction module sends a query command to the physical host side to extract metadata such as the framework, model, and dataset shown in Tables 1-3.
The framework, model, and dataset metadata shown in Tables 1-3 has already been determined by the source developer by packaging the container image of the training component, and the container image is stored in the container repository. The source developer enters a container startup script or command line to pull different versions of container images from the container repository; different versions of container images correspond to different metadata such as frameworks, models, and datasets.
Since metadata such as the framework, model, and dataset is stored on the physical host, the job extractor needs to access an external service to obtain it. In the embodiments of this application, a gateway (for example, an egress) can be configured so that the first metadata extraction module (for example, the job extractor) can access the internet protocol (IP) address of the physical host through the egress and obtain the framework, model, and dataset metadata by sending a query command line.
In the case where the resource scheduling platform is kubernetes, a "kubectl get" command line can be sent to dynamically extract relevant metadata information from the container startup script and command line on the physical host side by keyword extraction, for example, the name and version of the container image, the start time of the container image, and the framework, model, and dataset metadata. This metadata is stored in the mounted second storage space 121 in java script object notation (JSON) format or another file format.
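A minimal sketch of this extraction step, assuming the pod description has already been fetched as JSON (for example with `kubectl get pod <name> -o json`); the sample document and the output field names here are hypothetical, and only the parsing side is shown.

```python
import json

# Hypothetical sample of the JSON a `kubectl get pod <name> -o json`
# query returns; only the fields the job extractor needs are shown.
pod_json = json.dumps({
    "metadata": {"name": "train-pod",
                 "creationTimestamp": "2020-01-07T08:00:00Z"},
    "spec": {"containers": [
        {"name": "training",
         "image": "registry.example.com/train:1.2"}]},
})

def extract_job_metadata(raw):
    """Pull the container image name/version and start time from a pod description."""
    pod = json.loads(raw)
    image = pod["spec"]["containers"][0]["image"]
    name, _, version = image.rpartition(":")
    return {
        "image-name": name,
        "image-version": version,
        "start-time": pod["metadata"]["creationTimestamp"],
    }

meta = extract_job_metadata(pod_json)
```

The resulting dictionary can then be serialized to JSON and written into the second storage space 121.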
Step 420: The second metadata extraction module extracts metadata such as the dataset processing, model structure, and training parameters shown in Tables 4-6 from the first storage space 115 mounted to the training container.
The dataset processing, model structure, and training parameter metadata shown in Tables 4-6 has already been stored by the source developer in the first storage space 115 mounted to the training container. Therefore, the second metadata extraction module (for example, the code extractor) can extract this metadata from the training program code stored in the first storage space 115 by keyword search, according to the metadata description standards shown in Tables 4-6, and store it in the mounted second storage space 121 in JSON format or another file format.
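A minimal sketch of keyword-based extraction from stored training program code; the keyword list and the code fragment are hypothetical examples, not the exact format mandated by Tables 4-6.

```python
import re

# Hypothetical fragment of training program code stored in the first
# storage space 115; the variable names serve as example keywords only.
training_code = """
learning_rate = 0.001
batch_size = 64
epochs = 20
"""

# Keywords the code extractor searches for (Table 6 style training-params).
KEYWORDS = ("learning_rate", "batch_size", "epochs")

def extract_training_params(code):
    """Keyword-search the code text and return matched parameter values."""
    params = {}
    for key in KEYWORDS:
        m = re.search(rf"^{key}\s*=\s*(\S+)", code, re.MULTILINE)
        if m:
            params[key] = m.group(1)
    return params

params = extract_training_params(training_code)
```

The same pattern can be applied with other keyword sets to recover the data-process and model-architecture metadata.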
After the code extractor and the job extractor have extracted the corresponding metadata, they can integrate the extracted metadata and store the integrated metadata in the second storage space 121 in the form of "metadata description file + model + dataset".
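A minimal sketch of the integration step, assuming the two extractors each return a dictionary; all field values below are hypothetical, and the model file and dataset archive would be packaged alongside the serialized description rather than embedded in it.

```python
import json

# Hypothetical outputs of the two extractors.
job_metadata = {
    "framework": {"name": "tensorflow", "version": "1.14"},
    "model":     {"name": "resnet50", "version": "v1"},
    "dataset":   {"name": "cifar10", "version": "1.0"},
}
code_metadata = {
    "data-process":       {"train-ratio": 0.8},
    "model-architecture": {"layers": 50},
    "training-params":    {"learning_rate": 0.001},
}

# Merge the two sources into one metadata description, then serialize it.
description = {**job_metadata, **code_metadata}
description_json = json.dumps(description, indent=2)
```

The serialized description covers all six metadata categories, so a later developer can rebuild the full workflow from a single file.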
In the embodiments of this application, through the above machine learning task workflow, the metadata extraction module can automatically obtain and store the metadata shown in Tables 1-6 that the source developer uses in the machine learning task while the environment is being built or the model is being trained. After the machine learning task ends, if the source developer or other developers need to reproduce the source development environment, the saved metadata can be used to build the workflow of the entire machine learning task life cycle, thereby reproducing the source development environment.
FIG. 5 is a schematic structural diagram of a system 500 for extracting metadata in a machine learning training process according to an embodiment of this application. The system 500 may include at least one server.
For ease of description, FIG. 5 uses a server 510 and a server 520 as examples. The server 510 and the server 520 have similar structures.
The running module 110 shown in FIG. 1 may run on at least one server; for example, an instance of the running module 110 runs on each of the server 510 and the server 520.
The metadata extraction module 120 shown in FIG. 1 can be deployed in multiple forms, which are not specifically limited in the embodiments of this application. As one example, the metadata extraction module 120 may run on each of the at least one server; for example, an instance runs on each of the server 510 and the server 520. As another example, the metadata extraction module 120 may run on only some of the at least one server; for example, on the server 510 or on the server 520. As yet another example, the metadata extraction module 120 may run on a server other than the at least one server; for example, on a server 530.
The system 500 may perform the method for extracting metadata in a machine learning training process described above. Specifically, each of the at least one server in the system 500 may include at least one processor and a memory. The memory stores program instructions, and the processor included in the server may execute the program instructions stored in the memory to implement the method described above, or to implement the running module 110 and the metadata extraction module 120 shown in FIG. 1. The following uses the server 510 as an example to describe in detail the process by which the server 510 implements the method for extracting metadata in a machine learning training process.
The server 510 may include at least one processor (for example, a processor 511 and a processor 516), a memory 512, a communication interface 513, and an input/output interface 514.
The at least one processor may be connected to the memory 512, which is used to store program instructions. The memory 512 may be a storage unit inside the at least one processor, an external storage unit independent of the at least one processor, or a component that includes both a storage unit inside the at least one processor and an external storage unit independent of the at least one processor.
The memory 512 may be a solid state drive (SSD), a hard disk drive (HDD), a read-only memory (ROM), a random access memory (RAM), or the like.
Optionally, the server 510 may further include a bus 515, through which the memory 512, the input/output interface 514, and the communication interface 513 may be connected to the at least one processor. The bus 515 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one line is used to represent the bus in FIG. 5, but this does not mean that there is only one bus or one type of bus.
Optionally, in some embodiments, the system 500 may further include a cloud storage 540, which serves as external storage connected to the system 500. The program instructions described above may be stored in the memory 512 or in the cloud storage 540.
In the embodiments of this application, the at least one processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. Alternatively, one or more integrated circuits may be used to execute the related programs to implement the technical solutions provided in the embodiments of this application.
Referring to FIG. 5, in the server 510, taking the processor 511 as an example, the running module 110 runs on the processor 511. The running module 110 may include multiple sub-modules, for example, the environment setup sub-module 111, the training sub-module 112, the inference sub-module 113, and the environment teardown sub-module 114 shown in FIG. 1.
The first storage space 115 of the memory 512 stores the training program code input by the source developer; the training program code includes one or more of the metadata described in Tables 4 to 6, such as the dataset processing method, model structure, and training parameters. The second storage space 121 stores the metadata extracted by the metadata extraction module 120. The third storage space 5121 stores the training container startup script input by the source developer; the training container startup script includes one or more of the metadata shown in Tables 1 to 3, such as the framework, model, and dataset.
The processor 511 obtains the stored program instructions from the memory 512 to run the machine learning task described above. Specifically, the environment setup sub-module 111 of the running module 110 obtains the container startup script from the third storage space 5121 of the memory 512 and executes the container environment setup process described above. The training sub-module 112 of the running module 110 obtains the training program code from the first storage space 115 of the memory 512 to execute the model training process described above, and may store the training result of the model in the first storage space 115. For the specific implementation process in which the sub-modules of the running module 110 execute the machine learning task, refer to the description of FIG. 1; details are not repeated here.
While the machine learning task described above is running, the metadata extraction module 120 can extract, from the training program code stored in the first storage space 115 of the memory 512, one or more of the metadata described in Tables 4 to 6, such as the dataset processing method, model structure, and training parameters. The metadata extraction module 120 can also extract, from the container startup script stored in the third storage space 5121, one or more of the metadata shown in Tables 1 to 3, such as the framework, model, and dataset.
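Extraction from the container startup script, the Tables 1 to 3 kind of metadata (framework, dataset, model locations), can be sketched as follows. The script format here, a plain docker run line with -v volume mounts, is an assumption; a real platform's startup script would dictate its own patterns.

```python
import re

def extract_launch_metadata(startup_script: str) -> dict:
    """Read framework / dataset / model hints from a container startup script.
    The docker-run and mount conventions below are illustrative assumptions."""
    metadata = {}
    # An image reference such as "tensorflow/tensorflow:1.14.0" names the framework.
    image = re.search(r"docker\s+run\s+.*?\s([\w./-]+:[\w.-]+)\s", startup_script)
    if image:
        metadata["framework_image"] = image.group(1)
    # Volume mounts reveal which host paths hold the dataset and the model.
    for mount in re.finditer(r"-v\s+([\w./-]+):([\w./-]+)", startup_script):
        host, guest = mount.groups()
        if "data" in guest:
            metadata["dataset_path"] = host
        elif "model" in guest:
            metadata["model_path"] = host
    return metadata
```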
Optionally, in some embodiments, the metadata extraction module 120 may also generate a description file from the extracted metadata and store the generated description file in the second storage space 121 of the memory 512. For the specific process by which the metadata extraction module 120 extracts metadata, refer to the description above; details are not repeated here.
It should be understood that, in the various embodiments of this application, the sequence numbers of the foregoing processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a logical functional division; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces; indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceived by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (15)

  1. A method for extracting metadata in a machine learning training process, wherein the method is applied to a virtualized environment, and the method comprises:
    running a machine learning task in the virtualized environment according to machine learning program code input by a user;
    extracting metadata from the machine learning program code, wherein the metadata is used to reproduce the running environment of the machine learning task; and
    storing the metadata in a first storage space.
  2. The method according to claim 1, wherein the extracting metadata from the machine learning program code comprises:
    extracting the metadata from the machine learning program code by keyword search according to the type of the metadata.
  3. The method according to claim 2, wherein the virtualized environment runs the machine learning task through at least one training container, and the metadata comprises first-type metadata; and
    the extracting the metadata from the machine learning program code by keyword search according to the type of the metadata comprises:
    extracting the first-type metadata from an input training container startup script according to the type of the first-type metadata, wherein the training container startup script is used to start the at least one training container.
  4. The method according to claim 3, wherein the type of the first-type metadata comprises any one or more of the following: a framework used by the machine learning task, a model used by the machine learning task, and a dataset used in the training process of the machine learning task.
  5. The method according to claim 3 or 4, wherein the virtualized environment runs the machine learning task through at least one training container, and the metadata comprises second-type metadata; and
    the extracting the metadata from the machine learning program code by keyword search according to the type of the metadata comprises:
    extracting the second-type metadata from input training program code according to the type of the second-type metadata, wherein the training program code is stored in a second storage space mounted on the at least one training container, and the training program code is used to run a model training process of the machine learning task in the at least one training container.
  6. The method according to claim 5, wherein the type of the second-type metadata comprises any one or more of the following: a processing method of a dataset used in the training process of the machine learning task, a structure of a model used in the training process of the machine learning task, and training parameters used in the training process of the machine learning task.
  7. An apparatus for extracting metadata in a machine learning training process, wherein the apparatus runs in a virtualized environment, and the apparatus comprises:
    a running module, configured to run a machine learning task in the virtualized environment according to machine learning program code input by a user; and
    a metadata extraction module, configured to extract metadata from the machine learning program code, wherein the metadata is used to reproduce the running environment of the machine learning task;
    wherein the metadata extraction module is further configured to store the metadata in a first storage space.
  8. The apparatus according to claim 7, wherein the metadata extraction module is specifically configured to:
    extract the metadata from the machine learning program code by keyword search according to the type of the metadata.
  9. The apparatus according to claim 8, wherein the virtualized environment runs the machine learning task through at least one training container, and the metadata comprises first-type metadata; and
    the metadata extraction module is specifically configured to:
    extract the first-type metadata from an input training container startup script according to the type of the first-type metadata, wherein the training container startup script is used to start the at least one training container.
  10. The apparatus according to claim 9, wherein the type of the first-type metadata comprises any one or more of the following: a framework used by the machine learning task, a model used by the machine learning task, and a dataset used in the training process of the machine learning task.
  11. The apparatus according to claim 9 or 10, wherein the virtualized environment runs the machine learning task through at least one training container, and the metadata comprises second-type metadata; and
    the metadata extraction module is specifically configured to:
    extract the second-type metadata from input training program code according to the type of the second-type metadata, wherein the training program code is stored in a second storage space mounted on the at least one training container, and the training program code is used to run a model training process of the machine learning task in the at least one training container.
  12. The apparatus according to claim 11, wherein the type of the second-type metadata comprises any one or more of the following: a processing method of a dataset used in the training process of the machine learning task, a structure of a model used in the training process of the machine learning task, and training parameters used in the training process of the machine learning task.
  13. A system for extracting metadata in a machine learning training process, wherein the system comprises at least one server, each server comprises a memory and at least one processor, the memory is configured to store program instructions, and the at least one processor executes the program instructions in the memory to perform the method according to any one of claims 1 to 6.
  14. A non-transitory readable storage medium, comprising program instructions, wherein when the program instructions are run by a computer, the computer performs the method according to any one of claims 1 to 6.
  15. A computer program product, comprising program instructions, wherein when the program instructions are run by a computer, the computer performs the method according to any one of claims 1 to 6.
PCT/CN2020/070577 2019-03-19 2020-01-07 Method and apparatus for extracting metadata in machine learning training process WO2020186899A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910208590.XA CN110058922B (en) 2019-03-19 2019-03-19 Method and device for extracting metadata of machine learning task
CN201910208590.X 2019-03-19

Publications (1)

Publication Number Publication Date
WO2020186899A1 true WO2020186899A1 (en) 2020-09-24

Family

ID=67317220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/070577 WO2020186899A1 (en) 2019-03-19 2020-01-07 Method and apparatus for extracting metadata in machine learning training process

Country Status (2)

Country Link
CN (1) CN110058922B (en)
WO (1) WO2020186899A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110058922B (en) * 2019-03-19 2021-08-20 华为技术有限公司 Method and device for extracting metadata of machine learning task
CN112395039B (en) * 2019-08-16 2024-01-19 北京神州泰岳软件股份有限公司 Method and device for managing Kubernetes cluster
CN110532098B (en) * 2019-08-30 2022-03-08 广东星舆科技有限公司 Method and system for providing GPU (graphics processing Unit) service
CN110795141B (en) * 2019-10-12 2023-10-10 广东浪潮大数据研究有限公司 Training task submitting method, device, equipment and medium
CN110837896B (en) * 2019-11-22 2022-07-08 中国联合网络通信集团有限公司 Storage and calling method and device of machine learning model
CN111160569A (en) * 2019-12-30 2020-05-15 第四范式(北京)技术有限公司 Application development method and device based on machine learning model and electronic equipment
US11599357B2 (en) * 2020-01-31 2023-03-07 International Business Machines Corporation Schema-based machine-learning model task deduction
CN111629061B (en) * 2020-05-28 2023-01-24 苏州浪潮智能科技有限公司 Inference service system based on Kubernetes
CN111694641A (en) * 2020-06-16 2020-09-22 中电科华云信息技术有限公司 Storage management method and system for container application
TWI772884B (en) * 2020-09-11 2022-08-01 英屬維爾京群島商飛思捷投資股份有限公司 Positioning system and method integrating machine learning positioning model
CN112286682A (en) * 2020-10-27 2021-01-29 上海淇馥信息技术有限公司 Machine learning task processing method, device and equipment based on distributed cluster
CN112311605B (en) * 2020-11-06 2023-12-22 北京格灵深瞳信息技术股份有限公司 Cloud platform and method for providing machine learning service
CN112819176B (en) * 2021-01-22 2022-11-08 烽火通信科技股份有限公司 Data management method and data management device suitable for machine learning
US20230289276A1 (en) * 2022-03-14 2023-09-14 International Business Machines Corporation Intelligently optimized machine learning models

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2016159949A1 (en) * 2015-03-30 2016-10-06 Hewlett Packard Enterprise Development Lp Application analyzer for cloud computing
CN107451663A (en) * 2017-07-06 2017-12-08 阿里巴巴集团控股有限公司 Algorithm assembly, based on algorithm assembly modeling method, device and electronic equipment
CN109146084A (en) * 2018-09-06 2019-01-04 郑州云海信息技术有限公司 A kind of method and device of the machine learning based on cloud computing
CN109272116A (en) * 2018-09-05 2019-01-25 郑州云海信息技术有限公司 A kind of method and device of deep learning
CN110058922A (en) * 2019-03-19 2019-07-26 华为技术有限公司 A kind of method, apparatus of the metadata of extraction machine learning tasks

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US8296727B2 (en) * 2005-10-14 2012-10-23 Oracle Corporation Sub-task mechanism for development of task-based user interfaces
CN104899141B (en) * 2015-06-05 2017-08-04 北京航空航天大学 A kind of test cases selection and extending method of network-oriented application system
CN108805282A (en) * 2018-04-28 2018-11-13 福建天晴在线互动科技有限公司 Deep learning data sharing method, storage medium based on block chain mode
CN108665072A (en) * 2018-05-23 2018-10-16 中国电力科学研究院有限公司 A kind of machine learning algorithm overall process training method and system based on cloud framework

Also Published As

Publication number Publication date
CN110058922B (en) 2021-08-20
CN110058922A (en) 2019-07-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20774122

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20774122

Country of ref document: EP

Kind code of ref document: A1