CN111310936A

CN111310936A - Machine learning training construction method, platform, device, equipment and storage medium

Info

Publication number: CN111310936A
Application number: CN202010293953.7A
Authority: CN
Inventors: 王恬宇; 张志鹏
Original assignee: Guangji Technology Shanghai Co Ltd
Current assignee: Guangji Technology Shanghai Co Ltd
Priority date: 2020-04-15
Filing date: 2020-04-15
Publication date: 2020-06-19
Anticipated expiration: 2040-04-15
Also published as: CN111310936B

Abstract

The invention discloses a method, a platform, a device, equipment and a storage medium for constructing machine learning training, wherein the method comprises the following steps: receiving a directed acyclic graph input by a user in a training interactive interface in an operation component mode, wherein the directed acyclic graph is used for indicating each component included in a target training process created by the user and data flow directions among the components; generating a workflow according to the directed acyclic graph, wherein the workflow comprises each component in the directed acyclic graph and the running sequence of each component; calling the pre-stored encapsulated code segments corresponding to the components according to the operation sequence of the components in the workflow, building an operation environment corresponding to each code segment in the server node, and operating the corresponding code segments in the operation environment. The method reduces the threshold for constructing the machine learning training task, and simultaneously improves the efficiency for constructing the machine learning training task.

Description

Machine learning training construction method, platform, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of big data processing, in particular to a method, a platform, a device, equipment and a storage medium for constructing machine learning training.

Background

Machine learning refers to a process in which a machine trains (learns) a large amount of historical data by a statistical algorithm to generate a model (experience), and predicts the output of a relevant problem using the model. Machine learning is a branch of artificial intelligence. Machine learning is a multi-field cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. The deep learning is the most popular machine learning technology at present, the machine learning is realized by a method of constructing a multilayer neural network, and the final aim of the deep learning is to enable a machine to have the analysis and learning capability like a human and process data such as characters, images, videos, sounds and the like. The deep learning frameworks that are currently popular are TensorFlow, Keras, PyTorch, Caffe, CNTK, MXnet, PaddlePaddle, and others.

At present, a machine learning task needs to be constructed, an artificial intelligence knowledge background needs to be provided, and meanwhile, a machine learning environment needs to be built by self to compile a large number of computer program code segments.

However, the above-mentioned method of constructing a machine learning task by itself has a high professional requirement for users, and at the same time, requires a great deal of time and effort for the users to write codes. In addition, a large number of code segments that need to be written for each machine learning task are specific to a specific deep learning framework (such as one of TensorFlow, Keras, PyTorch, Caffe, CNTK, MXnet, paddlepaddlel, etc.), which is also constrained by the limitations of the framework, the code segments are not reusable, and different tasks need to be written for different code segment implementations. When a large number of different machine learning tasks need to be performed, the amount of work involved is proportionately increased.

Disclosure of Invention

The invention provides a construction method, a platform, a device, equipment and a storage medium for machine learning training, and aims to solve the technical problems that the professional requirements of users are high and the users need to invest a lot of time and energy in the conventional construction mode for machine learning training.

In a first aspect, an embodiment of the present invention provides a method for constructing machine learning training, including:

receiving a directed acyclic graph input by a user in a training interactive interface in an operation component mode; the directed acyclic graph is used for indicating each component included in the target training process created by the user and the data flow direction among the components;

generating a workflow according to the directed acyclic graph; wherein the workflow comprises each component in the directed acyclic graph and the running sequence of each component;

and calling the pre-stored encapsulated code segments corresponding to the components according to the operation sequence of the components in the workflow, building an operation environment corresponding to each code segment in a server node, and operating the corresponding code segments in the operation environment.

In a second aspect, an embodiment of the present invention provides a machine learning training building platform, which is used for executing the machine learning training building method provided in the first aspect.

In a third aspect, an embodiment of the present invention provides a device for constructing machine learning training, including:

the receiving module is used for receiving the directed acyclic graph input by the user in the mode of operating the components on the training interactive interface; the directed acyclic graph is used for indicating each component included in the target training process created by the user and the data flow direction among the components;

the generating module is used for generating a workflow according to the directed acyclic graph; wherein the workflow comprises each component in the directed acyclic graph and the running sequence of each component;

and the calling operation module is used for calling the pre-stored encapsulated code segments corresponding to the components according to the operation sequence of the components in the workflow, building an operation environment corresponding to each code segment in the server node, and operating the corresponding code segments in the operation environment.

In a fourth aspect, an embodiment of the present invention further provides a computer device, including:

one or more processors;

a memory for storing one or more programs;

when executed by the one or more processors, cause the one or more processors to implement a method of constructing machine learning training as provided in the first aspect.

In a fifth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for constructing machine learning training as provided in the first aspect.

The embodiment of the invention provides a method, a platform, a device, equipment and a storage medium for constructing machine learning training, wherein the method comprises the following steps: receiving a directed acyclic graph input by a user in a training interactive interface in an operation component mode, wherein the directed acyclic graph is used for indicating each component included in a target training process created by the user and data flow directions among the components; generating a workflow according to the directed acyclic graph, wherein the workflow comprises each component in the directed acyclic graph and the running sequence of each component; calling the pre-stored encapsulated code segments corresponding to the components according to the operation sequence of the components in the workflow, building an operation environment corresponding to each code segment in the server node, and operating the corresponding code segments in the operation environment. It has the following technical effects: on the one hand, the directed acyclic graph input by a user in an operation component mode on a training interactive interface can be received, and then a machine learning training process is carried out according to the directed acyclic graph, on the premise that the user does not have an artificial intelligence basis, operation modes such as dragging and clicking can be carried out on the component, a machine learning training task is simply and quickly established, the threshold for establishing the machine learning training task is reduced, on the other hand, in different machine learning training tasks, the user does not need to write different machine learning tasks, and only the directed acyclic graph needs to be established again, so that the efficiency for establishing the machine learning training task is improved.

Drawings

FIG. 1 is a schematic flow chart of a method for constructing machine learning training according to the present invention;

FIG. 2 is an architecture diagram of a build platform for machine learning training provided by the present invention;

FIG. 3 is a schematic diagram of a training interaction interface in the method for constructing machine learning training provided by the present invention;

FIG. 4 is a schematic diagram of a directed acyclic graph in the method for constructing machine learning training provided by the present invention;

FIG. 5 is a schematic diagram of a workflow provided by the present invention;

FIG. 6 is a schematic diagram of a code segment after encapsulation in the method for constructing machine learning training provided by the present invention;

FIG. 7 is a schematic process diagram of running code segments corresponding to training components in the method for constructing machine learning training provided by the present invention;

fig. 8 is a schematic structural diagram of a machine learning training building apparatus according to an embodiment of the present invention;

fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Fig. 1 is a schematic flow chart of a method for constructing machine learning training according to the present invention. The method and the device are suitable for scenes of quickly constructing the machine learning training task by the user without the artificial intelligence knowledge background. The machine learning training building method can be executed by a machine learning training building device, which can be implemented by software and/or hardware, and can be integrated into a computer device. As shown in fig. 1, the method for constructing machine learning training provided by this embodiment includes the following steps:

step 101: and receiving the directed acyclic graph input by the user in the mode of operating the components in the training interactive interface.

Wherein, the directed acyclic graph is used for indicating each Component (Component) included in the target training process created by the user and the data flow direction between the components.

Specifically, the training interactive interface in the present embodiment refers to an interface that provides the user with the system operation. FIG. 3 is a schematic diagram of a training interaction interface in the method for constructing machine learning training provided by the present invention. As shown in fig. 3, the training interactive interface 31 illustratively includes: a component list area 311 and a canvas area 312 for forming a directed acyclic graph. Optionally, the training interactive interface 31 may further include a component usage instruction area. In this embodiment, the user inputs the directed acyclic graph by operating the component, which means that the user may input the directed acyclic graph by dragging, connecting, clicking a button, and the like on the component.

When an Experiment of statistical analysis, machine learning and deep learning is initiated, an Experiment (Experiment) needs to be created, namely, a directed acyclic graph. And initiating a running task in the Experiment. Experiment is an abstraction used for packet management execution tasks.

The directed acyclic graph in this embodiment includes various components included in the target training process created by the user and data flow between the components. A component in this embodiment refers to a data operation. Each component may define an output (output) or a product (artifact). Wherein, the operation result of the endmost component in the directed acyclic graph is a product, and the operation results of other components are output. The output of each Component can be used as the input of the next step by setting the environment variables of the next step. The product is actually a model generated by algorithm training, exists in an agreed format file written after the operation of the component is finished, and can be directly applied to prediction, detection and the like.

Fig. 4 is a schematic diagram of a directed acyclic graph in the method for constructing machine learning training provided by the present invention. As shown in FIG. 4, a directed acyclic graph can include the following components: data source component → data preprocessing component → feature engineering component → algorithm training component → model evaluation component, wherein the arrow represents the data flow direction.

A user can construct a directed acyclic graph through simple operations of dragging components, connecting lines, clicking buttons and the like on a training interactive interface. The threshold for constructing the machine learning training task is reduced, and the efficiency for constructing the machine learning training task is improved.

The user may initiate a run task after creating the directed acyclic graph.

The embodiment also provides a machine learning training construction platform for executing the machine learning training construction method provided by the embodiment. The construction platform for machine learning training in the embodiment is a coding-free one-stop artificial intelligence platform covering the use of non-artificial intelligence professionals. The user does not need to have related professional knowledge of artificial intelligence, and only needs to upload a certain amount of training data and marking results, a deep learning training task can be constructed on the platform through simple operations such as dragging the component and clicking the button, and therefore the model required by the user can be trained conveniently. In the specific implementation, a layered architecture is adopted, each layer is independent, and communication is performed through a standard interface, so that the adjustment of each layer does not affect other layers, and the overall expandability is ensured.

For convenience of description, the method for constructing the machine learning training will be described below in conjunction with the platform for constructing the machine learning training. Fig. 2 is an architecture diagram of a machine learning training platform provided in the present invention. As shown in fig. 2, the interface interaction layer in the machine learning trained building platform provided by the present embodiment is used to perform step 101.

Step 102: and generating a workflow (workflow) according to the directed acyclic graph.

The workflow comprises each component in the directed acyclic graph and the running sequence of each component.

Specifically, a user drags a component in a training interactive interface at the front end to construct a directed acyclic graph, but the interface at the front end only operates the directed acyclic graph, and the directed acyclic graph needs to be converted into a workflow. The workflow comprises each component in the directed acyclic graph and the running sequence of each component.

As shown in FIG. 2, the build platform for machine learning training further comprises a business logic layer, a capability scheduling layer, and a capability providing layer. The business logic layer in the build platform of machine learning training is used to perform step 102.

And the business logic layer is used for realizing access control, is in charge of data source management, generates a workflow according to the directed acyclic graph, manages the workflow, and submits the workflow to the capability scheduling layer in the machine learning training construction platform for execution. The business logic layer manages the workflow, namely, the current construction state of the workflow is saved, the workflow is edited, and the like.

The data sources in this embodiment are of two types: structured data and unstructured data. Structured data is stored in Comma-Separated Values (CSV) files in the file system, while unstructured data is stored directly in directories of the file system. The file system here may be a distributed file system. The data source in this embodiment may be training data and a labeling result in a construction process of machine learning training.

The building device of the machine learning training or the building platform of the machine learning training can receive the training data uploaded by the user and the labeling result in advance.

Fig. 5 is a schematic diagram of a workflow provided by the present invention. As shown in FIG. 5, each link in the workflow provides a component 51 consisting of a layer definition for a capability in a build platform of machine learning training. Behind the component is a code fragment (CodeSnippet) of the capability provision layer. It should be noted that the components in the workflow are substantially the same as those in the directed acyclic graph. Since the capability scheduling layer in the machine learning trained construction platform cannot directly schedule and run according to the sequence of the components in the front-end directed acyclic graph, a business logic layer is required to convert the directed acyclic graph into a workflow of the capability scheduling layer.

Step 103: calling the pre-stored encapsulated code segments corresponding to the components according to the operation sequence of the components in the workflow, building an operation environment corresponding to each code segment in the server node, and operating the corresponding code segments in the operation environment.

Specifically, after generating the workflow, the business logic layer may send the workflow to the capability scheduling layer. The capability scheduling layer can call the pre-stored encapsulated code segments corresponding to the components according to the running sequence of the components in the workflow, build the running environments corresponding to the code segments in the server nodes, and run the corresponding code segments in the running environments.

Referring to fig. 2, the capability scheduling layer schedules each component in a workflow manner, and provides an interface for uniformly calling an algorithm to the service logic layer, while shielding differences between different algorithms of the capability providing layer.

The capability providing layer integrates various data preprocessing, characteristic engineering, data analysis, machine learning algorithms and deep learning algorithms to form a large number of assemblies which can be dragged by users, and the differences of different machine learning frames at the bottom layer are shielded through the packaging of the assemblies. The capability providing layer also provides a scheduled interface to the capability scheduling layer.

The scheduling granularity provided by the capability providing layer to the capability scheduling layer is an algorithm training code segment. Whether the algorithm is implemented based on Pandas, scikitlern, TensorFlow, Keras, PyTorch, Caffe, CNTK, MXnet, paddlefold, or other libraries, it can be implemented or invoked by code segments and encapsulated in a unified standard. More specifically, these code segments may be Python-based code segments.

Each component corresponds to a packaged code segment. Fig. 6 is a schematic diagram of a code segment after encapsulation in the method for constructing machine learning training provided by the present invention. As shown in fig. 6, the encapsulated code segment includes an algorithm code segment, an input data stream interface, an output data stream interface, and a parameter interface. Any encapsulated code segment is associated by a plurality of input data streams and zero or one output data stream, and can carry out debugging optimization algorithm code segment performance through a plurality of parameters. The code segment shown in fig. 6 comprises 2 input data stream interfaces: input data stream interface 1 and input data stream interface 2, include 3 parameter interfaces: parameter interface 1, parameter interface 2 and parameter interface 3, including 1 output data stream interface.

The code segment is packaged, so that the method has the advantages that the expansibility of the platform is greatly improved, a plurality of advanced algorithms can be integrated into the platform at extremely low time cost, the scheduling interface is provided for the capability scheduling layer through the method, the capability scheduling layer can call the algorithms without any modification, and the actual statistical analysis or training task is carried out by combining a data source.

In one implementation, the packaged code segments corresponding to the components are called by the workflow task scheduling tool according to the running sequence of the components in the workflow.

The orderly operation of each component in the workflow is realized through the workflow task scheduling tool, so that the normal and effective operation of the constructed machine learning training task is ensured.

In an implementation manner, the server node in this embodiment is a server node in a private cloud. A private cloud is a cloud platform built using infrastructure inside an enterprise.

Cloud computing is a novel computing resource utilization mode, and distributes computing tasks on a resource pool formed by a large number of computers, so that various application systems can acquire computing power, storage space and information service according to needs, namely, a technology for performing integrated management and redistribution on computer hardware resources. Cloud services include public clouds, which generally refer to usable clouds that third party providers provide for users, and private clouds. Public clouds are typically available over the Internet (Internet), and the core attribute of public clouds is shared resource services. There are many instances of such a cloud that can provide services throughout the open public network today. Currently, enterprises build machine learning public clouds to provide services to users in a Web manner. However, uploading over the external internet will cause a significant delay when encountering the need to prepare large amounts of training data for machine learning training tasks. For some organizations, the training data is generated by the business of the organization, certain privacy exists, and the user is prohibited by uploading and processing the training data through the external Internet, wherein the uploading and processing may generate data leakage risks.

Therefore, in the embodiment, the machine learning training task is run in the server node of the private cloud built by the infrastructure inside the enterprise, so that on one hand, the transmission rate of the training data is high, and on the other hand, the safety is high.

In this embodiment, the container technology is used to build the operating environment corresponding to each code segment in the server node. The specific process is as follows: determining a server node to which a mirror image corresponding to each code segment is to be loaded through a container arrangement tool; and loading the mirror image corresponding to each code segment into the corresponding server node to form each container. The container orchestration tool may determine the server node to which the image of each code segment is to be loaded, and then record the image of each code segment into the corresponding server node to form each container. The container is the operating environment.

Correspondingly, each code segment is mounted in a corresponding container to be run, so that the corresponding code segment can be run in a running environment.

Alternatively, the container in this embodiment may be an application container engine (Docker) -based container. The container arrangement tool may be kubernets. The workflow task scheduling tool is Argo based on Kubernetes.

In one scenario, when the component is a training component or an evaluation component, the mirror image of the training component and the evaluation component includes a mirror image of a machine learning framework to which the training component or the evaluation component corresponds. The training component in the present embodiment refers to a component that performs a training task. An evaluation component refers to a component that performs an evaluation task. In this scenario, the images of the machine learning framework corresponding to the training component or the evaluation component are loaded into the respective server nodes, forming a container. And mounting the code segments corresponding to the training components or the evaluation components to the corresponding containers for running.

Based on the above scenario, please continue to refer to fig. 2, the platform for constructing machine learning training provided in this embodiment further includes: a machine learning framework layer. Machine learning frameworks that integrate the current mainstream, including scikitleren, TensorFlow, Keras, PyTorch, Caffe2, CNTK, MXnet, paddlepaddley, etc., can satisfy most current machine learning application scenarios. Each machine learning framework has a corresponding mirror image in a Kubernet local mirror image warehouse, the running environment of the corresponding machine learning framework is provided by directly pulling the corresponding mirror image, and then the Kubernet creates a code segment for providing layer transfer of Job resource running capability according to the mirror image.

Further, when the component is a training component, the efficiency of running the code segments corresponding to the training component can be improved through distributed training.

Referring to fig. 2, the platform for machine learning training provided in this embodiment further includes a container private cloud layer and a server cluster layer. Illustratively, a container-based private cloud is built by kubernets and Docker technology and enables distributed training of each specific machine learning framework. The server cluster layer is constructed by common server nodes in the local area network. Each server node is used as a node in the whole cluster system, and each server node can be provided with 0, 1 or a plurality of Graphic Processing Units (GPUs). Resources such as a GPU (graphics Processing Unit), a Central Processing Unit (CPU), a memory, a hard disk, and storage, which are possessed by each node may be different, and the kubernets at the upper layer mark corresponding tags on the resources possessed by different server nodes, and then schedule the resources according to the tags. The tags here may be type tags.

According to the invention, a private cloud is constructed by a Docker container technology and Kubernetes, so that support is provided for an upper-layer machine learning training task. The Docker container technology can process differences among different platforms, provide a standardized delivery mode, unified configuration and unified environment, ensure efficiency and effectively realize resource limitation. In addition, the container can be rapidly migrated and highly available in seconds. The container service can be configured according to needs, the second-level elastic expansion and contraction are realized, the service time for environment building and application creation of development, testing and operation and maintenance personnel is greatly reduced, the working efficiency is improved, the resource utilization rate of infrastructure is improved, and the hardware, software and labor cost is reduced. Kubernetes is a container cluster management system of Google (Google) open source, is a scheduling service for constructing a container based on Docker, and provides functional kits such as resource scheduling, balanced disaster tolerance, service registration, dynamic expansion and contraction capacity and the like. Kubernetes integrates, manages and redistributes computer hardware resources through a software means, achieves containerization management and intelligent resource allocation of a user private cluster, and provides full-flow standardized host management, application continuous integration, mirror image construction, deployment management, container operation and maintenance and multi-level monitoring services. The use of kubernets facilitates the management of cross-machine running containerized applications. A container private cloud platform can be constructed based on server infrastructure in a local area network through the Docker container and the Kubernetes, and the privacy requirement of data in enterprises or organizations is met.

Corresponding custom resources are built in the container private cloud layer aiming at different deep learning frameworks of the upper layer to support distributed training of the corresponding frameworks in Kubernets. These custom resources are for different deep learning frameworks, although the implementation is different, the basic principle is similar: and marking type labels on different server nodes, and scheduling different tasks to the nodes of corresponding types to execute during specific training. Nodes are generally divided into two categories: a called parameter server node (parameter server) for managing the storage and update work of parameters; one type is a compute server node (worker) that performs specific computations. And an asynchronous parallel mode is adopted in distributed training. In the asynchronous training, after each computation server node completes the training of one sample (mini-batch), the parameters of the model are directly updated without waiting for the completion of the training of other computation server nodes.

And the distributed processing of the whole machine learning training task is completed through the coordination work of the computing server node and the parameter service node. The user uses different training components to correspond to different algorithms, and the algorithms are associated with a specific deep learning framework, so that corresponding self-defined resources are called, the bottom-layer distributed processing is transparent to the user, the training speed can be obviously improved, and the user does not need to pay attention to specific details of bottom-layer distributed training.

When the component is a training component, determining a server node to which a mirror image corresponding to each code segment is to be loaded through a container arrangement tool, specifically comprising: and determining parameter server nodes participating in the parameter updating task of the training task corresponding to the training component and determining a plurality of computing server nodes participating in the computing task of the training task corresponding to the training component through a container arrangement tool. The determination may be based on a pre-configured mapping table. For example, the mapping relationships of the training components, the parameter server nodes, and the compute server nodes are defined in the mapping table.

Correspondingly, loading the mirror image of the machine learning framework corresponding to the training component into the corresponding server node, and forming a container, including: and loading the mirror image of the machine learning framework corresponding to the training component into the parameter server node and the plurality of calculation server nodes to respectively form containers.

Correspondingly, the method for hanging the code segments corresponding to the training components into the corresponding containers to run comprises the following steps: in a container for controlling the parameter server nodes, a code segment of an operating training component is used for receiving and storing current parameters of training tasks corresponding to the training component from a plurality of computing server nodes, controlling a code segment corresponding to the training component operating in the container of the computing server nodes, asynchronously and parallelly acquiring the current parameters from the parameter server nodes, controlling each computing server node to forward propagate according to the current parameters and training data to acquire a current predicted value, backward propagate according to the current predicted value to acquire an updated current parameter, using the updated current parameter as a new current parameter, sending the new current parameter to the parameter server nodes, repeating the step until the error between the current predicted value and a marked value is smaller than a preset error threshold value, and stopping executing the step.

Fig. 7 is a schematic process diagram of running code segments corresponding to training components in the machine learning training construction method provided by the present invention. As shown in fig. 7, different computing server nodes all execute a step of randomly selecting part of training data, obtaining current parameters from the parameter server nodes, obtaining predicted values through forward propagation, updating parameters through backward propagation, taking the updated parameters as new current parameters, and sending the new current parameters to the parameter server nodes.

As can be seen from fig. 7, in each iteration, different computing server nodes read the latest values of the parameters, but the values obtained may be different because the time for reading the parameter values by different computing server nodes is different. According to the value of the current parameter and a small part of randomly acquired training data, different computing server nodes respectively run a back propagation process and independently update the parameter. Asynchronous mode can be simply considered as a single-machine mode where multiple copies are replicated, each copy being trained using different training data. Asynchronous training the population of asynchronous training is much faster since it is not necessary to wait for all compute server nodes to average out the parameters after computation is complete. Therefore, by adopting the method to operate the code segment corresponding to the training component, the efficiency of machine learning training can be improved.

According to the invention, the bottom frames of ScikitLearn, TensorFlow, Keras, PyTorch, Caffe, CNTK, MXnet, PaddlePaddle and the like and the distributed algorithm built on the frames are packaged, the front end provides a visual operation environment for dragging and pulling, a user builds a machine learning training task through simple operations of dragging components, clicking buttons and the like, and conveniently trains out a model required by the user, so that the creation process of the machine learning training task is as simple as building blocks, and the use threshold of machine learning is greatly reduced.

Furthermore, in the construction method of machine learning training provided by this embodiment, the training progress and state of the task can be viewed in real time on the training interactive interface. For a deep learning training task, data needs to be repeatedly loaded and used, and the prediction value of the model on the training data and the actual value of the training data are closer and closer through a random gradient descent algorithm. The training data is divided into a plurality of batches, each batch contains a plurality of data, and the training times of the whole data set are preset through parameters, so that the training progress of the whole training task can be converted from the number of the currently trained data to the number of the batches.

In the training interactive interface, the user can also pause, restart and other operations on the training task at any time. The system provides optimized default hyper-parameters, and the user can train a good model even if the default parameters of the component are used.

The machine learning training construction platform provided by the embodiment can also provide a user-defined assembly, and a user can upload own codes in the user-defined assembly, so that a machine learning training task is constructed in a more flexible mode. Furthermore, the platform can integrate mature algorithms such as data preprocessing, feature engineering, statistical analysis, machine learning and deep learning, and can integrate intelligent algorithms such as advanced target detection, target tracking, scene recognition, computer vision navigation, image semantic segmentation, instance segmentation, super-resolution, 3D reconstruction, voice recognition and semantic analysis.

The method for constructing machine learning training provided by the embodiment comprises the following steps: receiving a directed acyclic graph input by a user in a training interactive interface in an operation component mode, wherein the directed acyclic graph is used for indicating each component included in a target training process created by the user and data flow directions among the components; generating a workflow according to the directed acyclic graph, wherein the workflow comprises each component in the directed acyclic graph and the running sequence of each component; calling the pre-stored encapsulated code segments corresponding to the components according to the operation sequence of the components in the workflow, building an operation environment corresponding to each code segment in the server node, and operating the corresponding code segments in the operation environment. It has the following technical effects: on the one hand, the directed acyclic graph input by a user in an operation component mode on a training interactive interface can be received, and then a machine learning training process is carried out according to the directed acyclic graph, on the premise that the user does not have an artificial intelligence basis, operation modes such as dragging and clicking can be carried out on the component, a machine learning training task is simply and quickly established, the threshold for establishing the machine learning training task is reduced, on the other hand, in different machine learning training tasks, the user does not need to write different machine learning tasks, and only the directed acyclic graph needs to be established again, so that the efficiency for establishing the machine learning training task is improved.

Fig. 8 is a schematic structural diagram of a machine learning training building apparatus according to an embodiment of the present invention. As shown in fig. 8, the machine learning training construction apparatus provided in this embodiment includes the following modules: a receiving module 81, a generating module 82 and a call running module 83.

And the receiving module 81 is used for receiving the directed acyclic graph input by the user in the training interactive interface in the operating component mode.

Wherein the directed acyclic graph is used for indicating each component included in the user-created target training process and data flow direction among the components.

And the generating module 82 is used for generating a workflow according to the directed acyclic graph.

Wherein the workflow comprises each component in the directed acyclic graph and the running sequence of each component.

And the calling and running module 83 is configured to call the pre-stored encapsulated code segments corresponding to the components according to the running sequence of the components in the workflow, establish a running environment corresponding to each code segment in the server node, and run the corresponding code segment in the running environment.

Optionally, in the aspect of calling the pre-stored encapsulated code segments corresponding to the components according to the running sequence of the components in the workflow, the calling and running module 83 is specifically configured to: and calling the pre-stored encapsulated code segments corresponding to the components according to the running sequence of the components in the workflow by a workflow task scheduling tool.

Optionally, the server node is a server node in a private cloud. In the aspect of building the running environment corresponding to each code segment in the server node, the call running module 83 is specifically configured to: determining a server node to which a mirror image corresponding to each code segment is to be loaded through a container arrangement tool; and loading the mirror image corresponding to each code segment into the corresponding server node to form each container. Correspondingly, in the aspect of running a corresponding code segment in the running environment, the call running module 83 is specifically configured to: and mounting each code segment into a corresponding container for operation.

Optionally, when the component is a training component or an evaluation component, in the aspect of loading the image corresponding to each code segment into a corresponding server node to form each container, the call running module 83 is specifically configured to: and loading the mirror image of the machine learning framework corresponding to the training component or the evaluation component into the corresponding server node to form a container. Correspondingly, in terms of mounting and running each code segment into a corresponding container, the call running module 83 is specifically configured to: and mounting the code segments corresponding to the training components or the evaluation components into corresponding containers for running.

Optionally, when the component is a training component, in the aspect of determining, by the container arrangement tool, a server node to which the image corresponding to each code segment is to be loaded, the call running module 83 is specifically configured to: and determining parameter server nodes participating in the parameter updating task of the training task corresponding to the training component and determining a plurality of computing server nodes participating in the computing task of the training task corresponding to the training component by the container arrangement tool. Correspondingly, in the aspect of loading the mirror image of the machine learning framework corresponding to the training component into the corresponding server node to form a container, the call running module 83 is specifically configured to: and loading the mirror image of the machine learning framework corresponding to the training component into the parameter server node and the plurality of calculation server nodes to respectively form containers. Correspondingly, in terms of mounting and running the code segments corresponding to the training components in the corresponding containers, the call running module 83 is specifically configured to: in a container for controlling the parameter server nodes, a code segment of an operating training component is used for receiving and storing current parameters of training tasks corresponding to the training component from a plurality of the computing server nodes, controlling a code segment corresponding to the training component operating in the container of the computing server nodes, asynchronously and parallelly acquiring the current parameters from the parameter server nodes, controlling each computing server node to forward propagate according to the current parameters and training data to acquire a current predicted value, backward propagate according to the current predicted value to acquire an updated current parameter, using the updated current parameter as a new current parameter, sending the new current parameter to the parameter server nodes, and repeating the steps until the error between the current predicted value and a labeled value is smaller than a preset error threshold value, the execution of this step is stopped.

Optionally, the container is a Docker-based container, the container orchestration tool is kubernets, and the workflow task scheduling tool is Argo based on kubernets.

Optionally, the training interactive interface includes: a component list area and a canvas area for forming a directed acyclic graph.

Optionally, the encapsulated code segment includes an algorithm code segment, an input data stream interface, an output data stream interface, and a parameter interface.

The machine learning training construction device provided by the embodiment of the invention can execute the machine learning training construction method provided by the embodiment shown in fig. 1 and various optional modes, and has corresponding functional modules and beneficial effects of the execution method.

Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 9, the computer device includes a processor 90 and a memory 91. The number of the processors 90 in the computer device may be one or more, and one processor 90 is taken as an example in fig. 9; the processor 90 and the memory 91 of the computer device may be connected by a bus or other means, as exemplified by the bus connection in fig. 9.

The memory 91 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions and modules corresponding to the machine learning training construction method in the embodiment of the present invention (for example, the receiving module 81, the generating module 82, and the invoking and running module 83 in the machine learning training construction apparatus). The processor 90 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the memory 91, namely, implements the above-described method for constructing machine learning training.

The memory 91 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 91 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 91 may further include memory located remotely from processor 90, which may be connected to a computer device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The present invention also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a method of constructing machine learning training, the method comprising:

Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the machine learning training construction method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk, or an optical disk of a computer, and includes several instructions to enable a computer device (which may be a personal computer, a vehicle, or a network device) to execute the method for constructing the machine learning training according to the embodiments of the present invention.

It should be noted that, in the embodiment of the building apparatus for machine learning training, the included units and modules are only divided according to the functional logic, but are not limited to the above division as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method for constructing machine learning training, comprising:

2. The method of claim 1, wherein the server node is a server node in a private cloud;

the calling the pre-stored encapsulated code segments corresponding to the components according to the running sequence of the components in the workflow comprises:

calling the pre-stored encapsulated code segments corresponding to each component according to the running sequence of each component in the workflow by a workflow task scheduling tool;

the establishing of the running environment corresponding to each code segment in the server node comprises the following steps:

determining a server node to which a mirror image corresponding to each code segment is to be loaded through a container arrangement tool;

loading the mirror image corresponding to each code segment into a corresponding server node to form each container;

correspondingly, in the running environment, running a corresponding code segment;

and mounting each code segment into a corresponding container for operation.

3. The method of claim 2, wherein when the component is a training component or an evaluation component, the loading the image corresponding to each code segment into the corresponding server node to form each container comprises:

loading the mirror image of the machine learning framework corresponding to the training component or the evaluation component into a corresponding server node to form a container;

correspondingly, each code segment is mounted in a corresponding container to run, and the method comprises the following steps:

and mounting the code segments corresponding to the training components or the evaluation components into corresponding containers for running.

4. The method of claim 3, wherein when the component is a training component, the determining, by the container orchestration tool, a server node to which the image corresponding to each of the code segments is to be loaded comprises:

determining, by the container arrangement tool, parameter server nodes participating in a parameter update task of a training task corresponding to the training component, and determining a plurality of computing server nodes participating in a computing task of the training task corresponding to the training component;

correspondingly, loading the mirror image of the machine learning framework corresponding to the training component into the corresponding server node to form a container, including:

loading the mirror image of the machine learning framework corresponding to the training component into the parameter server node and the plurality of computing server nodes to respectively form containers;

correspondingly, the step of mounting the code segments corresponding to the training components into corresponding containers for running comprises the following steps:

in a container for controlling the parameter server nodes, a code segment of an operating training component is used for receiving and storing current parameters of training tasks corresponding to the training component from a plurality of the computing server nodes, controlling a code segment corresponding to the training component operating in the container of the computing server nodes, asynchronously and parallelly acquiring the current parameters from the parameter server nodes, controlling each computing server node to forward propagate according to the current parameters and training data to acquire a current predicted value, backward propagate according to the current predicted value to acquire an updated current parameter, using the updated current parameter as a new current parameter, sending the new current parameter to the parameter server nodes, and repeating the steps until the error between the current predicted value and a labeled value is smaller than a preset error threshold value, the execution of this step is stopped.

5. The method of any one of claims 2 to 4, wherein the container is an application container engine Docker based container, the container orchestration tool is Kubernets, and the workflow task scheduling tool is Argo based on Kubernets.

6. The method of any of claims 1-4, wherein the training interactive interface comprises: a component list area and a canvas area for forming a directed acyclic graph;

the encapsulated code segment comprises an algorithm code segment, an input data stream interface, an output data stream interface and a parameter interface.

7. A machine learning trained build platform for performing the method of machine learning trained build of any one of claims 1-6.

8. A building device for machine learning training, comprising:

9. A computer device, characterized in that the computer device comprises:

one or more processors;

a memory for storing one or more programs;

when executed by the one or more processors, cause the one or more processors to implement a method of constructing machine learning training as recited in any of claims 1-6.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of constructing machine learning training according to any one of claims 1 to 6.