CN111310936B - Construction method, platform, device, equipment and storage medium for machine learning training - Google Patents


Info

Publication number
CN111310936B
CN111310936B (application CN202010293953.7A)
Authority
CN
China
Prior art keywords
training
component
machine learning
code segments
components
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010293953.7A
Other languages
Chinese (zh)
Other versions
CN111310936A (en)
Inventor
王恬宇
张志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangji Technology Shanghai Co ltd
Original Assignee
Guangji Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangji Technology Shanghai Co ltd
Priority to CN202010293953.7A
Publication of CN111310936A
Application granted
Publication of CN111310936B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a construction method, platform, device, equipment and storage medium for machine learning training. The method comprises: receiving a directed acyclic graph that a user inputs on a training interaction interface by operating components, the directed acyclic graph indicating the components included in a target training process created by the user and the data flow between those components; generating a workflow according to the directed acyclic graph, the workflow comprising the components in the directed acyclic graph and their running order; and calling pre-stored encapsulated code segments corresponding to the components according to the running order of the components in the workflow, building a running environment corresponding to each code segment on a server node, and running the corresponding code segment in that running environment. The method lowers the threshold for constructing a machine learning training task and improves the efficiency of constructing machine learning training tasks.

Description

Construction method, platform, device, equipment and storage medium for machine learning training
Technical Field
Embodiments of the invention relate to the technical field of big data processing, and in particular to a construction method, platform, device, equipment and storage medium for machine learning training.
Background
Machine learning refers to a machine using statistical algorithms to train (learn) on a large amount of historical data to generate a model (experience), and then using the model to predict the output of related problems. Machine learning is a branch of artificial intelligence. It is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines, and it studies how a computer can simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Deep learning is currently the most popular machine learning technique: it realizes machine learning by constructing multi-layer neural networks, and its ultimate goal is to give machines the same ability to analyze and learn as a person and to process data such as text, images, video and sound. Currently popular deep learning frameworks include TensorFlow, Keras, PyTorch, Caffe, CNTK, MXNet, PaddlePaddle and the like.
At present, constructing a machine learning task requires a background in artificial intelligence, and also requires the user to build a machine learning environment personally and write a large amount of computer program code.
However, this way of constructing a machine learning task by oneself places relatively high professional requirements on the user and requires the user to spend a great deal of time and effort writing code. In addition, the large amount of code written for each machine learning task is bound to a particular deep learning framework (e.g., one of TensorFlow, Keras, PyTorch, Caffe, CNTK, MXNet, PaddlePaddle, etc.); that is, the code is not reusable, and different tasks require different code to implement them. When a large number of different machine learning tasks are to be implemented, the workload increases proportionally.
Disclosure of Invention
The invention provides a construction method, platform, device, equipment and storage medium for machine learning training, to solve the technical problem that the existing way of constructing machine learning training places high professional requirements on users and requires users to invest a great deal of time and effort.
In a first aspect, an embodiment of the present invention provides a method for constructing machine learning training, including:
receiving a directed acyclic graph input by a user on a training interaction interface by operating components, the directed acyclic graph indicating the components included in a target training process created by the user and the data flow between the components;
generating a workflow according to the directed acyclic graph, the workflow comprising the components in the directed acyclic graph and their running order;
and calling pre-stored encapsulated code segments corresponding to the components according to the running order of the components in the workflow, building a running environment corresponding to each code segment on a server node, and running the corresponding code segment in the running environment.
In a second aspect, an embodiment of the present invention provides a construction platform for machine learning training, where the platform is configured to perform the construction method for machine learning training as provided in the first aspect.
In a third aspect, an embodiment of the present invention provides a construction apparatus for machine learning training, including:
a receiving module, configured to receive a directed acyclic graph input by a user on a training interaction interface by operating components, the directed acyclic graph indicating the components included in a target training process created by the user and the data flow between the components;
a generation module, configured to generate a workflow according to the directed acyclic graph, the workflow comprising the components in the directed acyclic graph and their running order;
and a calling and running module, configured to call pre-stored encapsulated code segments corresponding to the components according to the running order of the components in the workflow, build a running environment corresponding to each code segment on a server node, and run the corresponding code segment in the running environment.
In a fourth aspect, an embodiment of the present invention further provides a computer device, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the construction method for machine learning training provided in the first aspect.
In a fifth aspect, embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the construction method for machine learning training provided in the first aspect.
The embodiments of the invention provide a construction method, platform, device, equipment and storage medium for machine learning training. The method comprises: receiving a directed acyclic graph that a user inputs on a training interaction interface by operating components, the directed acyclic graph indicating the components included in a target training process created by the user and the data flow between the components; generating a workflow according to the directed acyclic graph, the workflow comprising the components in the directed acyclic graph and their running order; and calling pre-stored encapsulated code segments corresponding to the components according to the running order of the components in the workflow, building a running environment corresponding to each code segment on a server node, and running the corresponding code segment in the running environment. The method has the following technical effects. On the one hand, the directed acyclic graph that the user inputs by operating components on the training interaction interface can be received, and the machine learning training process is then carried out according to that graph; even without an artificial intelligence background, the user can build a machine learning training task simply and quickly through operations such as dragging and clicking components, which lowers the threshold for building machine learning training tasks. On the other hand, for different machine learning training tasks the user does not need to write different machine learning code; only the directed acyclic graph needs to be rebuilt, which improves the efficiency of building machine learning training tasks.
Drawings
FIG. 1 is a flow chart of a method of constructing machine learning training provided by the present invention;
FIG. 2 is a schematic diagram of a build platform for machine learning training provided by the present invention;
FIG. 3 is a schematic diagram of a training interactive interface in the method for constructing machine learning training provided by the invention;
FIG. 4 is a schematic diagram of a directed acyclic graph in a method of constructing machine learning training provided by the present invention;
FIG. 5 is a schematic diagram of a workflow provided by the present invention;
FIG. 6 is a schematic diagram of a code segment encapsulated in the method for constructing machine learning training provided by the present invention;
FIG. 7 is a schematic diagram of a process of running a code segment corresponding to a training component in the method for constructing machine learning training provided by the present invention;
FIG. 8 is a schematic structural diagram of a machine learning training construction device according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
FIG. 1 is a flow chart of the method for constructing machine learning training provided by the invention. The method is suitable for scenarios in which a user without an artificial intelligence background needs to build a machine learning training task quickly. The method may be performed by a machine learning training construction device, which may be implemented in software and/or hardware and may be integrated in a computer device. As shown in FIG. 1, the method for constructing machine learning training provided in this embodiment includes the following steps:
step 101: and receiving the directed acyclic graph input by the user in the training interaction interface in a mode of operating the components.
Wherein the directed acyclic graph is used to indicate individual components (components) included in a user-created target training process and a data flow between the individual components.
Specifically, the training interactive interface in the present embodiment refers to an interface that provides a user with system operation. Fig. 3 is a schematic diagram of a training interactive interface in the method for constructing machine learning training provided by the invention. As shown in fig. 3, the training interactive interface 31 includes, illustratively: the component list area 311 and the canvas area 312 for forming the directed acyclic graph. Optionally, the training interactive interface 31 may further include a component usage instruction area. In this embodiment, the user inputs the directed acyclic graph in a manner of operating the component, which means that the user can input the directed acyclic graph by dragging, connecting wires, clicking a button, and the like on the component.
Before a statistical analysis, machine learning or deep learning task is started, an Experiment, i.e., a directed acyclic graph, needs to be created, and the running task is then initiated within the Experiment. An Experiment is an abstract concept used to group and manage running tasks.
The directed acyclic graph in this embodiment includes the components of the user-created target training process and the data flow between the components. A component in this embodiment represents one data operation. Each component may define an output or a product (artifact). The running result of the last component in the directed acyclic graph is a product; the running results of the other components are outputs. The output of each component can be used as the input of the next step by setting the environment variables of the next step. The product is the model actually generated by algorithm training; after the component finishes running it exists as a file in a predefined default format and can be directly applied to prediction, detection and the like.
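By way of illustration only, the component abstraction described above can be sketched in Python as follows; the class and field names (Component, output, artifact, env_for_next_step) are hypothetical and are not taken from the platform's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Component:
    """One data operation in the directed acyclic graph (illustrative sketch)."""
    name: str                                                  # e.g. "data_preprocessing"
    params: Dict[str, str] = field(default_factory=dict)       # tunable parameters
    upstream: List["Component"] = field(default_factory=list)  # components whose data flows in
    output: Optional[str] = None                               # intermediate result of a non-final component
    artifact: Optional[str] = None                             # model file produced by the last component

    def env_for_next_step(self) -> Dict[str, str]:
        # The output of this component becomes the input of the next step
        # by being written into the next step's environment variables.
        return {f"{self.name.upper()}_OUTPUT": self.output or ""}
```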
FIG. 4 is a schematic diagram of a directed acyclic graph in the method for constructing machine learning training provided by the present invention. As shown in FIG. 4, the directed acyclic graph may include the following components: a data source component, a data preprocessing component, a feature engineering component, an algorithm training component and a model evaluation component, where the arrows represent the data flow.
The user can construct the directed acyclic graph on the training interaction interface through simple operations such as dragging components, connecting lines and clicking buttons. This lowers the threshold for constructing a machine learning training task and improves the efficiency of constructing it.
After the directed acyclic graph is created, the user may initiate a running task.
This embodiment also provides a construction platform for machine learning training, which is used to execute the construction method for machine learning training provided by this embodiment. The construction platform of this embodiment is a no-code, one-stop artificial intelligence platform for users who are not artificial intelligence professionals. The user does not need any specialized knowledge of artificial intelligence: the user only needs to upload a certain amount of training data and labeling results and can build a deep learning training task on the platform through simple operations such as dragging components and clicking buttons, thereby conveniently training the required model. In the concrete implementation a layered architecture is adopted; the layers are independent of each other and communicate through standard interfaces, so adjustments within one layer do not affect the other layers, which ensures overall extensibility.
For convenience of description, the construction method of machine learning training is described below in connection with the construction platform of machine learning training. FIG. 2 is a schematic diagram of the machine learning training construction platform provided by the invention. As shown in FIG. 2, the interface interaction layer in the construction platform is used to perform step 101.
Step 102: generate a workflow from the directed acyclic graph.
The workflow comprises the components in the directed acyclic graph and their running order.
Specifically, the user drags components in the front-end training interaction interface to construct a directed acyclic graph, but this is merely the interface-level directed acyclic graph of the front end; it needs to be converted into a workflow. The workflow comprises the components in the directed acyclic graph and their running order.
As shown in FIG. 2, the construction platform of machine learning training further includes a business logic layer, a capability scheduling layer and a capability providing layer. The business logic layer in the construction platform is used to perform step 102.
The business logic layer implements access control, is responsible for data source management, generates the workflow according to the directed acyclic graph, manages the workflow, and submits the workflow to the capability scheduling layer of the construction platform for execution. Managing the workflow means saving its current construction state, editing it, and so on.
The data sources in this embodiment are of two types: structured data and unstructured data. Structured data is stored as comma-separated values (CSV) files in a file system, while unstructured data is stored directly in a directory of the file system. The file system here may be a distributed file system. The data source in this embodiment may be the training data and labeling results used in the construction of machine learning training.
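By way of example and not limitation, a structured data source stored as a CSV file could be read with Pandas as follows; the file path and column name are assumptions made only for this sketch.

```python
import pandas as pd

# Hypothetical path on the (possibly distributed) file system.
CSV_PATH = "/data/datasets/train_labels.csv"

df = pd.read_csv(CSV_PATH)             # structured data: one row per labeled sample
features = df.drop(columns=["label"])  # "label" is an assumed column name
labels = df["label"]
print(features.shape, labels.shape)
```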
The machine learning training construction device or the machine learning training construction platform can receive training data and labeling results uploaded by a user in advance.
FIG. 5 is a schematic diagram of a workflow provided by the present invention. As shown in FIG. 5, each link in the workflow is a component 51 defined by the capability providing layer of the machine learning training construction platform. Behind each component is a code snippet (CodeSnippet) of the capability providing layer. It should be noted that the components in the workflow are substantially the same as the components in the directed acyclic graph. Because the capability scheduling layer of the construction platform cannot directly schedule and run components in the order given by the front-end directed acyclic graph, the business logic layer is required to convert the directed acyclic graph into a workflow for the capability scheduling layer.
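By way of illustration only, the conversion from the front-end directed acyclic graph to a runnable order can be sketched as a topological sort; the function name and the adjacency-list input format are assumptions for this sketch and do not describe the platform's internal data structures.

```python
from collections import deque
from typing import Dict, List

def dag_to_workflow(edges: Dict[str, List[str]]) -> List[str]:
    """Return the components in an order compatible with the data flow (topological order)."""
    indegree = {node: 0 for node in edges}
    for targets in edges.values():
        for t in targets:
            indegree[t] = indegree.get(t, 0) + 1
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for t in edges.get(node, []):
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle; not a valid DAG")
    return order

# Example using the components of FIG. 4:
edges = {
    "data_source": ["data_preprocessing"],
    "data_preprocessing": ["feature_engineering"],
    "feature_engineering": ["algorithm_training"],
    "algorithm_training": ["model_evaluation"],
    "model_evaluation": [],
}
print(dag_to_workflow(edges))
```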
Step 103: and calling the pre-stored encapsulated code segments corresponding to the components according to the operation sequence of the components in the workflow, building an operation environment corresponding to the code segments in the server node, and operating the corresponding code segments in the operation environment.
Specifically, after generating the workflow, the business logic layer may send the workflow to the capability scheduling layer. The capability scheduling layer can call pre-stored encapsulated code segments corresponding to the components according to the operation sequence of the components in the workflow, build operation environments corresponding to the code segments in the server nodes, and operate the corresponding code segments in the operation environments.
With continued reference to fig. 2, the capability scheduling layer schedules each component in a workflow manner, and provides an interface for unified call algorithms to the service logic layer, while masking differences between different algorithms of the capability providing layer.
The capability providing layer integrates various data preprocessing, feature engineering, data analysis, machine learning algorithms and deep learning algorithms to form a large number of components which can be dragged by a user, and the difference of different machine learning frames at the bottom layer is shielded through the encapsulation of the components. The capability providing layer also provides a scheduled interface for the capability scheduling layer.
The capability providing layer provides the capability scheduling layer with scheduling granularity of algorithm training code segments. Whether the algorithm is based on Pandas, scikitLearn, tensorFlow, keras, pyTorch, caffe, CNTK, MXnet, paddlePaddle or other library implementations, it may be implemented or invoked by a code segment and packaged in a unified standard. More specifically, the code segments may be Python-based code segments.
Each component corresponds to a packaged code segment. Fig. 6 is a schematic diagram of a code segment encapsulated in the method for constructing machine learning training provided by the present invention. As shown in fig. 6, the encapsulated code segments include an algorithm code segment, an input data stream interface, an output data stream interface, and a parameter interface. Any encapsulated code segment is associated with zero or one output data stream from multiple input data streams and can be debugged by multiple parameters to optimize the performance of the algorithm code segment. The code segment shown in fig. 6 includes 2 input data stream interfaces: an input data stream interface 1 and an input data stream interface 2, comprising 3 parameter interfaces: parameter interface 1, parameter interface 2 and parameter interface 3, comprising 1 output data stream interface.
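By way of illustration only, the encapsulation pattern of FIG. 6 can be sketched in Python as follows; the name CodeSnippet appears in the description above, but the constructor, interfaces and example algorithm shown here are hypothetical.

```python
from typing import Callable, Dict, List, Optional

class CodeSnippet:
    """Encapsulated code segment: algorithm code plus input/output/parameter interfaces."""

    def __init__(self, algorithm: Callable[..., Optional[object]],
                 input_streams: List[str], output_stream: Optional[str],
                 parameters: Dict[str, object]):
        self.algorithm = algorithm          # the wrapped algorithm code segment
        self.input_streams = input_streams  # e.g. ["input_1", "input_2"]
        self.output_stream = output_stream  # zero or one output data stream
        self.parameters = parameters        # tunable parameters, e.g. {"lr": 0.01}

    def run(self, inputs: Dict[str, object]) -> Optional[object]:
        data = [inputs[name] for name in self.input_streams]
        return self.algorithm(*data, **self.parameters)

# Example: wrapping an algorithm with 2 input streams and 3 parameters, as in FIG. 6.
snippet = CodeSnippet(
    algorithm=lambda a, b, lr, epochs, seed: {"model": "trained", "lr": lr},
    input_streams=["input_1", "input_2"],
    output_stream="output_1",
    parameters={"lr": 0.01, "epochs": 10, "seed": 42},
)
result = snippet.run({"input_1": [1, 2], "input_2": [3, 4]})
```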
Encapsulating the code segments greatly improves the extensibility of the platform: many advanced algorithms can be integrated into the platform at very low time cost. Through this mechanism a scheduling interface is provided to the capability scheduling layer, so the capability scheduling layer can call the algorithms without any modification and combine them with a data source to carry out actual statistical analysis or training tasks.
In one implementation, the pre-stored encapsulated code segments corresponding to the components are called by a workflow task scheduling tool according to the running order of the components in the workflow.
The orderly running of the components in the workflow is realized through the workflow task scheduling tool, which ensures that the constructed machine learning training task runs normally and effectively.
In one implementation, the server node in this embodiment is a server node in a private cloud. A private cloud is a cloud platform built on infrastructure inside an enterprise.
Cloud computing is a new way of using computing resources: computing tasks are distributed over a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space and information services on demand. In other words, it is a technology for integrating, managing and reallocating computer hardware resources. Cloud services include public clouds and private clouds. A public cloud generally refers to a cloud that a third-party provider makes available to users, usually over the Internet, and its core attribute is shared resource services. Many such clouds provide services over the open public network today. At present, enterprises build machine learning public clouds to provide services to users via the Web. However, when a large amount of training data has to be prepared for a machine learning training task, uploading it over the external Internet causes large delays. For some organizations, the training data is generated by their own business and therefore has a degree of privacy; uploading and processing it over the external Internet may create a risk of data leakage, which makes users wary.
Therefore, in this embodiment the machine learning training task runs on server nodes of a private cloud built on infrastructure inside the enterprise: on the one hand, the training data transfer rate is higher; on the other hand, security is better.
In this embodiment, the running environment corresponding to each code segment is built on the server node through container technology. The specific process is as follows: determine, through a container orchestration tool, the server node onto which the image corresponding to each code segment is to be loaded; then load the image corresponding to each code segment onto the corresponding server node to form a container. The container orchestration tool determines the server node onto which the image of each code segment is to be loaded and then loads the image of each code segment onto the corresponding server node, forming the containers. The container here is the running environment.
Correspondingly, each code segment is mounted into the corresponding container to run, so the corresponding code segment runs in the running environment.
Alternatively, the container in this embodiment may be a container based on an application container engine (Docker), the container orchestration tool may be Kubernetes, and the workflow task scheduling tool may be Argo, which is based on Kubernetes.
In one scenario, when the component is a training component or an evaluation component, the image of the training component or evaluation component includes the image of the machine learning framework corresponding to that component. A training component in this embodiment is a component that performs a training task; an evaluation component is a component that performs an evaluation task. In this scenario, the image of the machine learning framework corresponding to the training component or evaluation component is loaded onto the corresponding server node, forming a container, and the code segment corresponding to the training component or evaluation component is mounted into the corresponding container to run.
Based on the above scenario, with continued reference to FIG. 2, the machine learning training construction platform provided in this embodiment further includes a machine learning framework layer. It integrates the current mainstream machine learning frameworks, including scikit-learn, TensorFlow, Keras, PyTorch, Caffe, CNTK, MXNet, PaddlePaddle and the like, which cover most current machine learning application scenarios. Each machine learning framework has a corresponding image in the local Kubernetes image registry; the corresponding image is pulled directly to obtain the running environment of the corresponding machine learning framework, and Kubernetes then creates a Job resource from that image to run the code segment handed over by the capability providing layer.
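By way of illustration only, the following sketch shows how a code segment could be run from a framework image as a Kubernetes Job using the official Kubernetes Python client; the image name, node label, mount paths and namespace are assumptions made for this sketch and are not values given in this description.

```python
from kubernetes import client, config

config.load_kube_config()  # inside the cluster, load_incluster_config() would be used

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="training-component"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                node_selector={"gpu": "true"},  # schedule by node type label
                containers=[client.V1Container(
                    name="trainer",
                    image="registry.local/tensorflow:latest",  # framework image from the local registry
                    command=["python", "/snippets/train.py"],  # mounted code segment
                    volume_mounts=[client.V1VolumeMount(name="snippets",
                                                        mount_path="/snippets")],
                )],
                volumes=[client.V1Volume(
                    name="snippets",
                    host_path=client.V1HostPathVolumeSource(path="/opt/snippets"),
                )],
            )
        )
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```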
Further, when the component is a training component, the efficiency of running the code segment corresponding to the training component can be improved through distributed training.
With continued reference to FIG. 2, the machine learning training construction platform provided in this embodiment further includes a container private cloud layer and a server cluster layer. Illustratively, the container-based private cloud is built with Kubernetes and Docker technology and enables distributed training for each specific machine learning framework. The server cluster layer is built from ordinary server nodes in the local area network. Each server node acts as a node in the overall cluster, and each server node may have 0, 1 or more graphics processing units (GPUs) installed. The GPU, central processing unit (CPU), memory, hard disk, storage and other resources of each node may differ; the resources of different server nodes are given corresponding labels by the upper-layer Kubernetes, and scheduling is then performed according to these labels. The labels here may be type labels.
The method provided by the invention supports the upper-layer machine learning training tasks by building a private cloud with Docker container technology and Kubernetes. Docker container technology can handle the differences between platforms, provides a standardized delivery method, unifies configuration and environment, guarantees efficiency, and effectively enforces resource limits. In addition, containers can be migrated quickly and are available within seconds. The container service allows applications to be configured on demand with second-level elasticity, greatly reduces the time that development, test and operations staff spend building environments and creating applications, improves work efficiency, raises the utilization of infrastructure resources, and reduces hardware, software and labor costs. Kubernetes is a container cluster management system open-sourced by Google; it is a scheduling service for containers built on Docker and provides a functional suite of resource scheduling, load balancing and disaster recovery, service registration, dynamic scaling and so on. Kubernetes integrates, manages and reallocates computer hardware resources by software means, achieves containerized management and intelligent resource allocation of private user clusters, and provides standardized, whole-process host management, continuous application integration, image building, deployment management, container operation and multi-level monitoring services. With Kubernetes, containerized applications running across machines can be managed conveniently. A container private cloud platform can be built on the server infrastructure in a local area network with Docker containers and Kubernetes, which meets the data privacy requirements inside an enterprise or organization.
In the container private cloud layer, corresponding custom resources are built at the upper layer for the different deep learning frameworks to support distributed training of the corresponding framework in Kubernetes. Although the specific implementation of these custom resources differs between deep learning frameworks, the basic principle is similar: different server nodes are given type labels, and during training the different tasks are scheduled onto nodes of the corresponding type for execution. Nodes are generally divided into two categories: parameter server nodes (parameter servers), which manage the storage and updating of parameters, and compute server nodes (workers), which perform the specific computations. An asynchronous parallel mode is adopted for the distributed training: in asynchronous training, after a compute server node finishes training on one mini-batch, it does not wait for the other compute server nodes to finish, but directly updates the parameters of the model.
The distributed processing of the whole machine learning training task is completed through the coordination of the compute server nodes and the parameter server node. The different training components used by the user correspond to different algorithms; the algorithms are associated with specific deep learning frameworks and call the corresponding custom resources, so the underlying distributed processing is transparent to the user. The user only perceives a clear improvement in training speed and does not need to pay attention to the specific details of the underlying distributed training.
When the component is a training component, determining through the container orchestration tool the server nodes onto which the image corresponding to each code segment is to be loaded specifically includes: determining, through the container orchestration tool, the parameter server node that participates in the parameter update task of the training task corresponding to the training component, and determining the plurality of compute server nodes that participate in the computation task of that training task. The determination may be based on a pre-configured mapping table that defines, for example, the mapping between training components, parameter server nodes and compute server nodes.
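By way of illustration only, such a pre-configured mapping table could be expressed as a simple dictionary; the node names and the helper function are hypothetical.

```python
# Hypothetical mapping: training component -> nodes taking part in its distributed training.
NODE_MAPPING = {
    "algorithm_training": {
        "parameter_server": "node-ps-01",
        "workers": ["node-gpu-01", "node-gpu-02", "node-gpu-03"],
    },
}

def nodes_for(component_name: str):
    entry = NODE_MAPPING[component_name]
    return entry["parameter_server"], entry["workers"]
```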
Correspondingly, loading the image of the machine learning framework corresponding to the training component onto the corresponding server nodes to form containers includes: loading the image of the machine learning framework corresponding to the training component onto the parameter server node and the plurality of compute server nodes, forming a container on each.
Correspondingly, mounting the code segment corresponding to the training component into the corresponding containers to run includes: controlling the code segment of the training component running in the container of the parameter server node to receive and store, from the plurality of compute server nodes, the current parameters of the training task corresponding to the training component; and controlling the code segments corresponding to the training component in the containers of the plurality of compute server nodes to obtain the current parameters from the parameter server node asynchronously and in parallel, with each compute server node performing forward propagation according to the current parameters and the training data to obtain a current predicted value, performing backward propagation according to the current predicted value to obtain updated current parameters, taking the updated parameters as the new current parameters and sending them to the parameter server node. These steps are repeated until the error between the current predicted value and the labeled value is smaller than a preset error threshold, at which point execution stops.
FIG. 7 is a schematic diagram of the process of running the code segment corresponding to a training component in the method for constructing machine learning training provided by the invention. As shown in FIG. 7, the different compute server nodes each randomly select part of the training data, obtain the current parameters from the parameter server node, perform forward propagation to obtain predicted values, perform backward propagation to update the parameters, take the updated parameters as the new current parameters, and send them to the parameter server node.
It can be seen from FIG. 7 that at each iteration the different compute server nodes read the latest values of the parameters, but because they read the parameter values at different times, the values they obtain may differ. Based on the current parameter values and a small, randomly selected portion of the training data, the different compute server nodes each run a backward propagation process and update the parameters independently. The asynchronous mode can be regarded simply as multiple copies of the single-machine mode, with each copy trained on different training data. Because asynchronous training does not have to wait for all compute server nodes to finish before averaging the parameters, it is much faster overall. Therefore, running the code segments corresponding to the training components in this way can improve the efficiency of machine learning training.
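By way of illustration only, the asynchronous parameter-server pattern described above can be sketched as the loop run on each compute server node; pull_params, push_params, forward and backward are hypothetical helpers standing in for the framework-specific custom resources.

```python
import random

def worker_loop(pull_params, push_params, forward, backward,
                training_data, labels, error_threshold, batch_size=32):
    """One compute server node: runs independently, without waiting for other workers."""
    while True:
        # Randomly select a mini-batch of training data.
        idx = random.sample(range(len(training_data)), batch_size)
        batch = [training_data[i] for i in idx]
        batch_labels = [labels[i] for i in idx]

        params = pull_params()          # read the latest parameters (may differ per worker)
        preds = forward(params, batch)  # forward propagation -> current predicted values
        error, new_params = backward(params, preds, batch_labels)  # backward propagation -> updated parameters
        push_params(new_params)         # send the updated parameters to the parameter server

        if error < error_threshold:     # stop once the prediction error is small enough
            break
```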
By encapsulating the underlying frameworks such as scikit-learn, TensorFlow, Keras, PyTorch, Caffe, CNTK, MXNet and PaddlePaddle, together with the distributed algorithms built on top of them, and by providing a visual drag-and-drop operating environment at the front end, the invention lets the user construct a machine learning training task with simple operations such as dragging components and clicking buttons and conveniently train the required model. The creation of a machine learning training task becomes as simple as stacking building blocks, which greatly lowers the threshold for using machine learning.
Furthermore, in the method for constructing machine learning training provided by this embodiment, the training progress and status of a task can also be viewed in real time on the training interaction interface. A deep learning training task needs to load and use the data repeatedly; through a stochastic gradient descent algorithm, the model's predictions on the training data come closer and closer to the actual values of the training data. The training data is divided into a number of batches, each batch containing several samples, and the whole data set is trained for a number of epochs that is preset through parameters, so the training progress of the whole task can be derived from the epoch and batch currently being trained.
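By way of illustration only, the progress shown on the interface can be derived from the current epoch and batch as follows; the function and parameter names are chosen only for this sketch.

```python
def training_progress(current_epoch: int, current_batch: int,
                      total_epochs: int, batches_per_epoch: int) -> float:
    """Fraction of the whole training task completed, in [0, 1]."""
    done = current_epoch * batches_per_epoch + current_batch
    total = total_epochs * batches_per_epoch
    return done / total

# e.g. epoch 3 of 10, batch 50 of 200 within the current epoch:
print(training_progress(3, 50, 10, 200))  # -> 0.325
```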
On the training interaction interface, the user can pause, restart or otherwise control the training task at any time. The system provides optimized default hyper-parameters, so the user can train a good model even with the default parameters of a component.
The machine learning training construction platform provided by this embodiment can also provide custom components: the user can upload his or her own code in a custom component, so that a machine learning training task can be built in a more flexible way. Furthermore, the platform can integrate not only mature algorithms such as data preprocessing, feature engineering, statistical analysis, machine learning and deep learning, but also cutting-edge intelligent algorithms such as object detection, object tracking, scene recognition, computer vision navigation, image semantic segmentation, instance segmentation, super-resolution, 3D reconstruction, speech recognition and semantic analysis.
The method for constructing machine learning training provided by this embodiment comprises: receiving a directed acyclic graph that a user inputs on a training interaction interface by operating components, the directed acyclic graph indicating the components included in a target training process created by the user and the data flow between the components; generating a workflow according to the directed acyclic graph, the workflow comprising the components in the directed acyclic graph and their running order; and calling pre-stored encapsulated code segments corresponding to the components according to the running order of the components in the workflow, building a running environment corresponding to each code segment on a server node, and running the corresponding code segment in the running environment. The method has the following technical effects. On the one hand, the directed acyclic graph that the user inputs by operating components on the training interaction interface can be received, and the machine learning training process is then carried out according to that graph; even without an artificial intelligence background, the user can build a machine learning training task simply and quickly through operations such as dragging and clicking components, which lowers the threshold for building machine learning training tasks. On the other hand, for different machine learning training tasks the user does not need to write different machine learning code; only the directed acyclic graph needs to be rebuilt, which improves the efficiency of building machine learning training tasks.
FIG. 8 is a schematic structural diagram of a machine learning training construction device according to an embodiment of the present invention. As shown in FIG. 8, the machine learning training construction device provided in this embodiment includes the following modules: a receiving module 81, a generation module 82 and a calling and running module 83.
The receiving module 81 is configured to receive a directed acyclic graph input by a user on a training interaction interface by operating components.
The directed acyclic graph is used to indicate the components included in the target training process created by the user and the data flow between the components.
The generation module 82 is configured to generate a workflow according to the directed acyclic graph.
The workflow comprises the components in the directed acyclic graph and their running order.
The calling and running module 83 is configured to call pre-stored encapsulated code segments corresponding to the components according to the running order of the components in the workflow, build a running environment corresponding to each code segment on a server node, and run the corresponding code segment in the running environment.
Optionally, with respect to calling the pre-stored encapsulated code segments corresponding to the components according to the running order of the components in the workflow, the calling and running module 83 is specifically configured to: call, through a workflow task scheduling tool, the pre-stored encapsulated code segments corresponding to the components in the workflow according to the running order of the components in the workflow.
Optionally, the server node is a server node in a private cloud. With respect to building the running environment corresponding to each code segment on the server node, the calling and running module 83 is specifically configured to: determine, through a container orchestration tool, the server node onto which the image corresponding to each code segment is to be loaded; and load the image corresponding to each code segment onto the corresponding server node to form a container. Correspondingly, with respect to running the corresponding code segment in the running environment, the calling and running module 83 is specifically configured to: mount each code segment into the corresponding container to run.
Optionally, when the component is a training component or an evaluation component, with respect to loading the images corresponding to the code segments onto the corresponding server nodes to form containers, the calling and running module 83 is specifically configured to: load the image of the machine learning framework corresponding to the training component or the evaluation component onto the corresponding server node to form a container. Correspondingly, with respect to mounting each code segment into the corresponding container to run, the calling and running module 83 is specifically configured to: mount the code segment corresponding to the training component or the evaluation component into the corresponding container to run.
Optionally, when the component is a training component, with respect to determining, through the container orchestration tool, the server nodes onto which the image corresponding to each code segment is to be loaded, the calling and running module 83 is specifically configured to: determine, through the container orchestration tool, the parameter server node that participates in the parameter update task of the training task corresponding to the training component, and determine the plurality of compute server nodes that participate in the computation task of that training task. Correspondingly, with respect to loading the image of the machine learning framework corresponding to the training component onto the corresponding server nodes to form containers, the calling and running module 83 is specifically configured to: load the image of the machine learning framework corresponding to the training component onto the parameter server node and the plurality of compute server nodes, forming a container on each. Correspondingly, with respect to mounting the code segment corresponding to the training component into the corresponding containers to run, the calling and running module 83 is specifically configured to: control the code segment of the training component running in the container of the parameter server node to receive and store, from the plurality of compute server nodes, the current parameters of the training task corresponding to the training component; control the code segments corresponding to the training component in the containers of the plurality of compute server nodes to obtain the current parameters from the parameter server node asynchronously and in parallel; control each compute server node to perform forward propagation according to the current parameters and the training data to obtain a current predicted value, perform backward propagation according to the current predicted value to obtain updated current parameters, take the updated parameters as the new current parameters and send them to the parameter server node; and repeat these steps until the error between the current predicted value and the labeled value is smaller than a preset error threshold, at which point execution stops.
Optionally, the container is a Docker-based container, the container orchestration tool is Kubernetes, and the workflow task scheduling tool is Argo, which is based on Kubernetes.
Optionally, the training interaction interface includes: a component list area and a canvas area for forming the directed acyclic graph.
Optionally, an encapsulated code segment includes an algorithm code segment, input data stream interfaces, an output data stream interface and parameter interfaces.
The machine learning training construction device provided by the embodiment of the invention can execute the method for constructing machine learning training provided by the embodiment shown in FIG. 1 and its optional variants, and has the corresponding functional modules and beneficial effects of the executed method.
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in FIG. 9, the computer device includes a processor 90 and a memory 91. The number of processors 90 in the computer device may be one or more; one processor 90 is taken as an example in FIG. 9. The processor 90 and the memory 91 of the computer device may be connected by a bus or in another way; connection by a bus is taken as an example in FIG. 9.
The memory 91, as a computer-readable storage medium, may be used to store software programs, computer-executable programs and modules, such as the program instructions and modules corresponding to the method for constructing machine learning training in the embodiments of the present invention (for example, the receiving module 81, the generation module 82 and the calling and running module 83 in the machine learning training construction device). The processor 90 executes the various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the memory 91, that is, implements the method for constructing machine learning training described above.
The memory 91 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and at least one application program required for functions, and the data storage area may store data created according to the use of the computer device, and so on. In addition, the memory 91 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory 91 may further include memory remote from the processor 90, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The present invention also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a method of constructing machine learning training, the method comprising:
receiving a directed acyclic graph input by a user on a training interaction interface by operating components, the directed acyclic graph indicating the components included in a target training process created by the user and the data flow between the components;
generating a workflow according to the directed acyclic graph, the workflow comprising the components in the directed acyclic graph and their running order;
and calling pre-stored encapsulated code segments corresponding to the components according to the running order of the components in the workflow, building a running environment corresponding to each code segment on a server node, and running the corresponding code segment in the running environment.
Of course, the storage medium containing computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above; it may also perform related operations in the method for constructing machine learning training provided by any embodiment of the present invention.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by means of software and the necessary general-purpose hardware, or of course by hardware alone, although in many cases the former is preferred. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk or an optical disk of a computer, and which includes several instructions for causing a computer device (which may be a personal computer, a vehicle, or a network device, etc.) to perform the method of constructing machine learning training according to the embodiments of the present invention.
It should be noted that, in the embodiment of the machine learning training construction device, the units and modules included are only divided according to functional logic, but the division is not limited to the above as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only used to distinguish them from each other and are not used to limit the protection scope of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to those embodiments; it may also be embodied in many other equivalent forms without departing from the concept of the invention, and its scope is determined by the appended claims.

Claims (9)

1. A method of constructing machine learning training, comprising:
receiving a directed acyclic graph input by a user on a training interaction interface by operating components; the directed acyclic graph is used for indicating the components included in a target training process created by the user and the data flow between the components;
generating a workflow according to the directed acyclic graph; wherein the workflow comprises the components in the directed acyclic graph and the running order of the components;
calling, through a workflow task scheduling tool, pre-stored encapsulated code segments corresponding to the components according to the running order of the components in the workflow, determining, through a container orchestration tool, the server nodes onto which the images corresponding to the code segments are to be loaded, loading the images corresponding to the code segments onto the corresponding server nodes to form containers, and mounting the code segments into the corresponding containers to run; wherein the server nodes are server nodes in a private cloud.
2. The method of claim 1, wherein when the component is a training component or an evaluation component, loading the images corresponding to the code segments onto the corresponding server nodes to form containers comprises:
loading the image of the machine learning framework corresponding to the training component or the evaluation component onto the corresponding server node to form a container;
correspondingly, mounting the code segments into the corresponding containers to run comprises:
mounting the code segment corresponding to the training component or the evaluation component into the corresponding container to run.
3. The method of claim 2, wherein when the component is a training component, determining, through the container orchestration tool, the server nodes to which the images corresponding to the code segments are to be loaded comprises:
determining, through the container orchestration tool, a parameter server node that participates in a parameter update task of the training task corresponding to the training component, and a plurality of computing server nodes that participate in a computing task of the training task corresponding to the training component;
correspondingly, loading the image of the machine learning framework corresponding to the training component into the corresponding server node to form a container comprises:
loading the image of the machine learning framework corresponding to the training component into the parameter server node and each of the plurality of computing server nodes to form containers respectively;
correspondingly, mounting the code segment corresponding to the training component into the corresponding container for operation comprises:
controlling the code segment of the training component running in the container of the parameter server node to receive and store the current parameters of the training task corresponding to the training component from the plurality of computing server nodes; controlling the code segments of the training component running in the containers of the computing server nodes to acquire the current parameters from the parameter server node in an asynchronous parallel mode; controlling each computing server node to perform forward propagation according to the current parameters and training data to obtain current predicted values, perform backward propagation according to the current predicted values to obtain updated current parameters, and send the updated current parameters to the parameter server node as the new current parameters; and repeating these steps until the error between the current predicted values and the label values is smaller than a preset error threshold, at which point execution stops.
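The update loop recited in claim 3 can be pictured with the following single-process Python sketch, an assumption-laden simplification: one in-memory parameter server, a single pseudo-worker, a linear model and a made-up learning rate, rather than the claimed asynchronous parallel execution across server nodes.

    # Minimal single-process sketch of the parameter-server loop of claim 3:
    # pull current parameters, forward propagation, backward propagation,
    # push updated parameters, stop once the error is below a preset threshold.
    # Data, model and learning rate are illustrative assumptions.
    import numpy as np

    class ParameterServer:
        def __init__(self, dim):
            self.params = np.zeros(dim)       # current parameters held by the server
        def pull(self):
            return self.params.copy()         # a computing node fetches the current parameters
        def push(self, new_params):
            self.params = new_params          # a computing node sends back updated parameters

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 3))             # training data
    y = X @ np.array([1.5, -2.0, 0.5])        # label values
    server, lr, threshold = ParameterServer(3), 0.1, 1e-4

    while True:
        w = server.pull()                     # acquire current parameters
        pred = X @ w                          # forward propagation -> current predicted values
        error = np.mean((pred - y) ** 2)
        if error < threshold:                 # stop when the error is below the threshold
            break
        grad = 2 * X.T @ (pred - y) / len(y)  # backward propagation -> gradient
        server.push(w - lr * grad)            # updated parameters become the new current ones

    print("converged parameters:", server.params)
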
4. The method according to any one of claims 1-3, wherein the containers are containers based on the application container engine Docker, the container orchestration tool is Kubernetes, and the workflow task scheduling tool is Argo, which runs on Kubernetes.
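To make claim 4 concrete, the sketch below assembles, as a plain Python dictionary, a manifest in the DAG-template format used by Argo Workflows, which Kubernetes would then schedule as Docker-based containers. The task names, image tag, mount path and persistent volume claim are assumptions, and parameter passing between tasks is omitted.

    # Minimal sketch of an Argo Workflows manifest (DAG template) assembled in
    # Python. Everything named here is an illustrative assumption; in practice
    # the manifest would be serialized to YAML and submitted to Kubernetes.
    workflow = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "ml-training-"},
        "spec": {
            "entrypoint": "pipeline",
            "volumes": [{"name": "code-segments",
                         "persistentVolumeClaim": {"claimName": "code-segments-pvc"}}],
            "templates": [
                {"name": "pipeline",
                 "dag": {"tasks": [
                     {"name": "preprocess", "template": "run-component"},
                     {"name": "train", "template": "run-component",
                      "dependencies": ["preprocess"]},
                     {"name": "evaluate", "template": "run-component",
                      "dependencies": ["train"]},
                 ]}},
                {"name": "run-component",
                 "container": {
                     "image": "tensorflow/tensorflow:2.4.1",
                     "command": ["python", "/workspace/component.py"],
                     "volumeMounts": [{"name": "code-segments",
                                       "mountPath": "/workspace"}]}},
            ],
        },
    }
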
5. The method according to any one of claims 1-3, wherein the training interaction interface comprises a component list area and a canvas area for forming the directed acyclic graph;
and each encapsulated code segment comprises an algorithm code segment, an input data stream interface, an output data stream interface, and a parameter interface.
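One possible (hypothetical) shape for the encapsulated code segments of claim 5 is sketched below: an algorithm code segment wrapped with an input data stream interface, an output data stream interface and a parameter interface. None of the names are taken from the patent.

    # Minimal sketch of an encapsulated component: algorithm code segment plus
    # input/output data stream interfaces and a parameter interface. All names
    # are illustrative assumptions.
    from dataclasses import dataclass, field
    from typing import Any, Callable, Dict, Iterable, Iterator

    @dataclass
    class EncapsulatedComponent:
        name: str
        algorithm: Callable[[Iterable[Any], Dict[str, Any]], Iterator[Any]]  # algorithm code segment
        parameters: Dict[str, Any] = field(default_factory=dict)             # parameter interface

        def run(self, input_stream: Iterable[Any]) -> Iterator[Any]:
            # read from the input data stream, apply the algorithm,
            # and yield results on the output data stream
            return self.algorithm(input_stream, self.parameters)

    # Example: a normalisation component consuming one stream and producing another.
    normalize = EncapsulatedComponent(
        name="normalize",
        algorithm=lambda stream, p: (x / p["scale"] for x in stream),
        parameters={"scale": 255.0},
    )
    print(list(normalize.run([0, 51, 255])))  # [0.0, 0.2, 1.0]
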
6. A construction platform for machine learning training, configured to perform the construction method for machine learning training according to any one of claims 1-5.
7. A construction apparatus for machine learning training, comprising:
a receiving module, configured to receive a directed acyclic graph input by a user in a training interaction interface in a component operation mode, wherein the directed acyclic graph is used for indicating each component included in a target training process created by the user and the data flow direction among the components;
a generation module, configured to generate a workflow according to the directed acyclic graph, wherein the workflow comprises each component in the directed acyclic graph and the operation sequence of each component; and
an invoking and running module, configured to invoke, through a workflow task scheduling tool, the pre-stored encapsulated code segments corresponding to the components in the workflow according to the operation sequence of the components, determine, through a container orchestration tool, the server nodes to which the images corresponding to the code segments are to be loaded, load the images corresponding to the code segments into the corresponding server nodes to form containers, and mount the code segments into the corresponding containers for operation, wherein the server nodes are server nodes in a private cloud.
8. A computer device, the computer device comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the construction method for machine learning training according to any one of claims 1-5.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the construction method for machine learning training according to any one of claims 1-5.
CN202010293953.7A 2020-04-15 2020-04-15 Construction method, platform, device, equipment and storage medium for machine learning training Active CN111310936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010293953.7A CN111310936B (en) 2020-04-15 2020-04-15 Construction method, platform, device, equipment and storage medium for machine learning training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010293953.7A CN111310936B (en) 2020-04-15 2020-04-15 Construction method, platform, device, equipment and storage medium for machine learning training

Publications (2)

Publication Number Publication Date
CN111310936A CN111310936A (en) 2020-06-19
CN111310936B true CN111310936B (en) 2023-06-20

Family

ID=71157635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010293953.7A Active CN111310936B (en) 2020-04-15 2020-04-15 Construction method, platform, device, equipment and storage medium for machine learning training

Country Status (1)

Country Link
CN (1) CN111310936B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112130851B (en) * 2020-08-04 2022-04-15 中科天玑数据科技股份有限公司 Modeling method for artificial intelligence, electronic equipment and storage medium
CN112269567B (en) * 2020-11-03 2022-08-09 税友软件集团股份有限公司 Cross-language machine learning method and system
CN112256537B (en) * 2020-11-12 2024-03-29 腾讯科技(深圳)有限公司 Model running state display method and device, computer equipment and storage medium
CN112418438B (en) * 2020-11-24 2022-08-26 国电南瑞科技股份有限公司 Container-based machine learning procedural training task execution method and system
CN112558938B (en) * 2020-12-16 2021-11-09 中国科学院空天信息创新研究院 Machine learning workflow scheduling method and system based on directed acyclic graph
CN112700004A (en) * 2020-12-25 2021-04-23 南方电网深圳数字电网研究院有限公司 Deep learning model training method and device based on container technology and storage medium
CN112698827A (en) * 2020-12-25 2021-04-23 厦门渊亭信息科技有限公司 Distributed visual modeling platform and method
CN114764296A (en) * 2021-01-12 2022-07-19 京东科技信息技术有限公司 Machine learning model training method and device, electronic equipment and storage medium
CN113157183B (en) * 2021-04-15 2022-12-16 成都新希望金融信息有限公司 Deep learning model construction method and device, electronic equipment and storage medium
CN113110833A (en) * 2021-04-15 2021-07-13 成都新希望金融信息有限公司 Machine learning model visual modeling method, device, equipment and storage medium
CN113342488A (en) * 2021-05-25 2021-09-03 上海商汤智能科技有限公司 Task processing method and device, electronic equipment and storage medium
CN113342489A (en) * 2021-05-25 2021-09-03 上海商汤智能科技有限公司 Task processing method and device, electronic equipment and storage medium
CN114594893A (en) * 2022-01-17 2022-06-07 阿里巴巴(中国)有限公司 Performance analysis method and device, electronic equipment and computer readable storage medium
CN114461183B (en) * 2022-04-11 2023-03-14 北京瑞莱智慧科技有限公司 AI model rapid combination method, device and storage medium based on user label
CN114997414B (en) * 2022-05-25 2024-03-08 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893591A (en) * 2016-04-10 2016-08-24 广州金越软件技术有限公司 Intelligent compiling technology of data sharing service
CN108932588A (en) * 2018-06-29 2018-12-04 华中科技大学 A kind of the GROUP OF HYDROPOWER STATIONS Optimal Scheduling and method of front and back end separation
WO2018236674A1 (en) * 2017-06-23 2018-12-27 Bonsai Al, Inc. For hiearchical decomposition deep reinforcement learning for an artificial intelligence model
US10262271B1 (en) * 2018-02-14 2019-04-16 DataTron Technologies Inc. Systems and methods for modeling machine learning and data analytics
CN110069465A (en) * 2019-03-16 2019-07-30 平安城市建设科技(深圳)有限公司 HDFS data managing method, device, equipment and medium based on workflow
CN110119271A (en) * 2018-12-19 2019-08-13 厦门渊亭信息科技有限公司 A kind of model across machine learning platform defines agreement and adaption system
CN110704178A (en) * 2019-09-04 2020-01-17 北京三快在线科技有限公司 Machine learning model training method, platform, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN111310936A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111310936B (en) Construction method, platform, device, equipment and storage medium for machine learning training
US10963313B2 (en) Automated reinforcement-learning-based application manager that learns and improves a reward function
US20200042856A1 (en) Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit
US11074107B1 (en) Data processing system and method for managing AI solutions development lifecycle
US11171825B2 (en) Context-based resource allocation with extended user concepts
Hiessl et al. Optimal placement of stream processing operators in the fog
US11301226B2 (en) Enterprise deployment framework with artificial intelligence/machine learning
Cicirelli et al. Modelling and simulation of complex manufacturing systems using statechart-based actors
Khaldi et al. Fault tolerance for a scientific workflow system in a cloud computing environment
US20200089524A1 (en) Wait a duration timer action and flow engine for building automated flows within a cloud based development platform
Sun et al. Building a fault tolerant framework with deadline guarantee in big data stream computing environments
US11663461B2 (en) Instruction distribution in an array of neural network cores
US11042640B2 (en) Safe-operation-constrained reinforcement-learning-based application manager
US11782770B2 (en) Resource allocation based on a contextual scenario
Alarifi et al. A fault-tolerant aware scheduling method for fog-cloud environments
Tuli et al. AI augmented Edge and Fog computing: Trends and challenges
Ujjwal et al. An efficient framework for ensemble of natural disaster simulations as a service
Incerto et al. Symbolic performance adaptation
Biran et al. Federated cloud computing as system of systems
Jrad et al. STRATFram: A framework for describing and evaluating elasticity strategies for service-based business processes in the cloud
US20190026410A1 (en) Strategic improvisation design for adaptive resilience
Tuli et al. Optimizing the performance of fog computing environments using ai and co-simulation
WO2021208808A1 (en) Cooperative neural networks with spatial containment constraints
Mytilinis et al. The vision of a heterogenerous scheduler
Hernández et al. A Simulation-based Scheduling Strategy for Scientific Workflows.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant