CN115334152A - Method for submitting structured machine learning calculation task to calculation cluster

Info

Publication number: CN115334152A
Application number: CN202211125556.4A
Authority: CN
Inventors: 王明亮; 肖宇轩
Original assignee: Beijing Vector Stack Technology Co ltd
Current assignee: Beijing Vector Stack Technology Co ltd
Priority date: 2022-09-16
Filing date: 2022-09-16
Publication date: 2022-11-11
Anticipated expiration: 2042-09-16
Also published as: CN115334152B

Abstract

The invention discloses a method for submitting a structured machine learning computing task to a computing cluster, belonging to the technical field of cluster computing, large-scale computing and machine learning, and in particular a system for submitting a structured machine learning computing task to a computing cluster. The system comprises a code package, a client agent program and in-cluster components. The code package comprises at least a file-system directory containing project files and a definition file; the definition file defines the details and the operation mode of the code package, the details comprising at least a version, a name and a description, and the operation mode comprising at least the definitions of targets and actions. The client agent program executes the action specified by the user by reading the definition file of the code package, and establishes a connection with the in-cluster components for data communication. By providing the target and action abstractions and a definition-file protocol for the code package, the invention makes submitting a computing task to a computing cluster simple and convenient.

Description

Method for submitting structured machine learning calculation task to calculation cluster

Technical Field

The invention belongs to the technical field of cluster computing, large-scale computing and machine learning, and particularly relates to a method, a device and a system for submitting a structured machine learning computing task to a computing cluster.

Background

In computing, clustering is the use of multiple computers (typically personal computers or UNIX workstations), multiple storage devices and redundant interconnections to present users with a single, highly available system. Cluster computing can be used to implement load balancing.

Cluster computing can also be used for cost-effective parallel computing, typically in the service of scientific computing, machine learning, data analysis or other applications that require parallel computation.

The inventors have found that, in the prior art, using a cluster management tool such as kubectl involves many manual operations: extra commands must be run to obtain computation results and to monitor state, and the user must make judgments by hand. Implementing all the actions, control logic and workflow in program scripts instead requires considerable software engineering ability, is difficult, and takes additional time.

Disclosure of Invention

In order to solve at least the above technical problem, the invention provides a method and an apparatus for submitting a structured machine learning computing task to a computing cluster.

According to a first aspect of the present invention, there is provided a method of submitting a structured machine learning computing task to a computing cluster, comprising:

acquiring a code package created by a user;

when the client agent program is invoked, running the code package;

and establishing a connection between the client agent program and the in-cluster components, and performing data communication.

Further, acquiring the code package created by the user comprises: acquiring a file-system directory, created by the user, that contains project files and a definition file, and using that directory as the code package.

Further, the definition file defines the details and the operation mode of the code package;

wherein the details of the code package comprise at least a version, a name and a description;

the operation mode comprises at least the definitions of targets and actions.

Further, running the code package comprises: the client agent program executes the action specified by the user by reading the definition file of the code package.

Further, establishing a connection between the client agent program and the in-cluster components and performing data communication comprises:

establishing a connection between the client agent program and a public interface exposed by the in-cluster components, and performing data communication between the client agent program and the in-cluster components using any computer-network application-layer protocol.

Further, the method further comprises: the in-cluster components, upon receiving a request from the client agent program, first check whether the user has the right to run the computing task.

Further, the method further comprises: the client agent program is provided with a workflow mechanism;

when a target is run, the client agent program recursively resolves all targets it depends on, dynamically constructs a directed acyclic graph, and runs, serially or in parallel, those targets in the graph that have no dependencies or whose dependencies have already been run.

Further, the method comprises:

acquiring, as a code package, a file-system directory created by a user that contains project files and a definition file;

the definition file defines the details and the operation mode of the code package, the details comprising at least a version, a name and a description, and the operation mode comprising at least the definitions of targets and actions;

when the client agent program is invoked, the client agent program executes the action specified by the user by reading the definition file of the code package;

for those actions whose execution requires support from the in-cluster components located in the computing cluster, the client agent program sends a request to the in-cluster components;

when a target is run, the client agent program recursively resolves all targets it depends on, dynamically constructs a directed acyclic graph, and runs, serially or in parallel, those targets in the graph that have no dependencies or whose dependencies have already been run;

while an action is being executed, the client agent program continuously communicates with the cluster to monitor the current state of the action, so as to handle each situation according to the strategy preset by the user;

a connection is established between the client agent program and the in-cluster components, the client agent program sending requests to the in-cluster components and receiving the results they return;

the client agent program can also establish connections with other servers, send requests to those servers, and receive the returned results.

According to a second aspect of the present invention, an apparatus for submitting a structured machine learning computing task to a computing cluster, comprises:

an acquisition module, configured to acquire a code package created by a user;

a calling module, configured to run the code package when the client agent program is invoked;

and a communication module, configured to establish a connection between the client agent program and the in-cluster components and to perform data communication.

According to a third aspect of the invention, a system for submitting a structured machine learning computing task to a computing cluster, comprises:

a code package, a client agent program and in-cluster components;

the code package comprises at least a file-system directory containing project files and a definition file, wherein the definition file defines the details and the operation mode of the code package, the details comprising at least a version, a name and a description, and the operation mode comprising at least the definitions of targets and actions;

the client agent program is used to execute the action specified by the user by reading the definition file of the code package;

and the client agent program establishes a connection with an interface of the in-cluster components to perform data communication.

According to a fourth aspect of the invention, a computer device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to any one of the first aspect of the invention when executing the program.

According to a fifth aspect of the present invention, a computer readable storage medium stores a program which, when executed, is capable of implementing the method according to any one of the first aspect of the present invention.

The beneficial effects of the invention are as follows. The invention decomposes, abstracts and structures a machine learning computing task: it provides the target and action abstractions and a specification for the definition file of a code package, so that the operation mode of the code package is declared simply and clearly, and creating a new code package, or converting an existing machine learning project into one, is very easy. Furthermore, the client agent program automatically resolves the dependencies between targets and executes the workflow, monitors the execution state of actions on its own, and determines the next step according to the strategy preset by the user. This replaces manual judgment and operation, saves working time, reduces difficulty, and keeps operation simple and convenient.

In addition, the invention introduces in-cluster components, which can verify that the submitter of a computing task holds the corresponding authority, thereby ensuring the security of the computing cluster, and which provide support for running the computing task.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flow chart of a method for submitting a structured machine learning computing task to a computing cluster according to the present invention;

FIG. 2 is a schematic structural diagram of a system for submitting a structured machine learning computing task to a computing cluster according to the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.

In order to more clearly illustrate the present invention, the present invention is further described below with reference to preferred embodiments and the accompanying drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.

In a first aspect of the present invention, there is provided a method of submitting a structured machine learning computing task to a computing cluster, as shown in FIG. 1, comprising:

Step 101: acquiring a code package created by a user;

In the invention, the code package created by the user comprises at least a file-system directory containing project files and a definition file. The definition file defines the details and the operation mode of the code package; the details comprise at least a version, a name and a description, and the operation mode comprises at least the definitions of targets and actions.

That is, the code package is an abstraction of the complete machine learning project that the user wants to run in the computing cluster. A code package is a way of organizing all the files required to run a machine learning application, and is usually a file-system directory in which a code package definition file (codepack.yaml) defines the details and the operation mode of the code package, alongside code, data files, resource configuration files and the like. Code packages are typically versioned and distributed through a source code version control system (e.g., Git).
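As a rough illustration of the structure just described, the following Python sketch models a code package, its targets and its actions as plain data classes and checks that the details required of a definition file (version, name, description) are present. The class names, field names and the `load_codepack` helper are assumptions made for this sketch; the description only fixes that a definition file carries these details together with target and action definitions. The input is taken to be an already-parsed definition file (a dictionary), so no YAML library is needed.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One executable operation of a target (field names are assumptions)."""
    name: str
    verb: str                                    # e.g. "apply", "copy", "create"
    params: dict = field(default_factory=dict)   # verb-specific settings


@dataclass
class Target:
    """One runnable task of the code package."""
    name: str
    deps: list = field(default_factory=list)     # names of targets it depends on
    actions: list = field(default_factory=list)


@dataclass
class CodePack:
    """Details and operation mode read from the definition file."""
    name: str
    version: str
    description: str
    targets: list = field(default_factory=list)


def load_codepack(definition: dict) -> CodePack:
    """Build a CodePack from an already-parsed definition file (a dict)."""
    missing = [k for k in ("name", "version", "description") if k not in definition]
    if missing:
        raise ValueError(f"definition file lacks required fields: {missing}")
    targets = []
    for t in definition.get("targets", []):
        actions = []
        for a in t.get("actions", []):
            # Everything except name and verb is kept as verb-specific settings.
            extra = {k: v for k, v in a.items() if k not in ("name", "verb")}
            actions.append(Action(a["name"], a["verb"], extra))
        targets.append(Target(t["name"], list(t.get("deps", [])), actions))
    return CodePack(definition["name"], definition["version"],
                    definition["description"], targets)
```

Keeping the verb-specific settings in a generic dictionary is one way to leave room for user-defined action types without changing the data model.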

Still further, a target is an abstraction of a concrete task of the code package. A target belongs to a code package and defines one specific runnable task of that package. A target has a complete, real-world meaning that the user can understand, such as performing model training in the cluster or deploying a model as an inference service. Each target can specify other targets on which it depends; for example, creating an inference service requires model training to be completed first. When running a target, the client agent program recursively resolves its dependencies and then runs the workflow.

An action is an abstraction of a concrete executable operation of a target. An action belongs to a target and defines one specific executable operation of that target. Actions follow standardized, user-configurable conventions; examples are creating a storage volume in the cluster or creating a job (Job) for model training. Actions are defined in multiple types, each type covering one particular kind of operation. In one embodiment of the invention, an action is implemented jointly by the code package and the client agent program; in another embodiment, it is implemented jointly by the code package, the client agent program and the in-cluster components. The user can easily define new types to extend the functionality. An action has a state during execution, and the user can set a strategy that determines the next step according to that state, such as whether to execute the next action and when to execute it.
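The relationship between an action's state and the user's strategy can be sketched as a small decision function. This is only an illustration: the state names, the `next_step` function and the default strategy are assumptions, while the `continue` and `abort` values mirror the strategy entries of the example definition file given later in this description.

```python
from enum import Enum


class ActionState(Enum):
    """Possible states of an action during execution (names are assumptions)."""
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"


# Used when the user sets no strategy for a state (an assumption of this sketch).
DEFAULT_STRATEGY = {"success": "continue", "failure": "abort"}


def next_step(state, strategy=None):
    """Return "wait", "continue" or "abort" for the given action state."""
    if state is ActionState.RUNNING:
        return "wait"                      # no decision until the action finishes
    # User-provided entries override the defaults key by key.
    merged = {**DEFAULT_STRATEGY, **(strategy or {})}
    return merged[state.value]
```

For instance, a user who wants the workflow to proceed even when a copy fails could pass `{"failure": "continue"}` for that action.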

A workflow consists of a set of targets in a code package that have dependency relationships among them, and takes the form of a directed acyclic graph (DAG). When a workflow is run, its dependencies are resolved, and targets that have no dependencies, or whose dependencies have already been run, can be run serially or in parallel.

In the technical solution of the invention, all information about the code package is maintained in a simple, clear definition file, so a user can easily create a new code package or convert an existing machine learning project into one, and can use a common source code version control system (such as Git) to version and distribute the code package.

Step 102: when the client agent program is invoked, running the code package;

In the invention, the client agent program executes the action specified by the user by reading the definition file of the code package. The execution of some actions requires necessary support from the in-cluster components in the computing cluster; in that case the client agent program sends a corresponding request to the in-cluster components to complete the action.

Further, the client agent program has a workflow mechanism. When a target is run, the client agent program recursively resolves all the targets it depends on, dynamically constructs a directed acyclic graph, and then runs, serially or in parallel, the targets that currently have no dependencies or whose dependencies have already been run.
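One possible reading of this workflow mechanism is sketched below in Python. The function names `resolve` and `run_batches` and the dictionary representation of dependencies are assumptions; the target names and their dependencies are taken from the example definition file given later in this description. `resolve` recursively collects the requested target and everything it depends on (rejecting cycles, since the graph must be acyclic), and `run_batches` groups the targets so that every target in a batch has all of its dependencies in earlier batches and could therefore run in parallel with the others in that batch.

```python
def resolve(target, deps):
    """Recursively collect `target` and everything it depends on.

    `deps` maps a target name to the list of target names it depends on.
    Returns the sub-graph reachable from `target`; raises on a cycle.
    """
    graph = {}
    visiting = set()

    def visit(name):
        if name in graph:
            return                          # already resolved
        if name in visiting:
            raise ValueError(f"dependency cycle through {name!r}")
        visiting.add(name)
        for dep in deps.get(name, []):
            visit(dep)
        visiting.discard(name)
        graph[name] = list(deps.get(name, []))

    visit(target)
    return graph


def run_batches(graph):
    """Group targets into batches; targets within one batch are independent."""
    done, batches = set(), []
    while len(done) < len(graph):
        ready = sorted(name for name, needed in graph.items()
                       if name not in done and all(d in done for d in needed))
        batches.append(ready)
        done.update(ready)
    return batches


DEPS = {
    "prepare-env": [],
    "copy-file": ["prepare-env"],
    "run-distributed-training": ["prepare-env", "copy-file"],
}
print(run_batches(resolve("run-distributed-training", DEPS)))
# prints [['prepare-env'], ['copy-file'], ['run-distributed-training']]
```

Because each target here depends on the previous one, every batch holds a single target and the run is effectively serial; two targets with no mutual dependency would land in the same batch.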

The client agent is configured with a monitoring mechanism. When a certain action is executed, the client agent program will continuously communicate with the cluster to monitor the current action state, so as to handle various situations according to the policy preset by the user.
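The monitoring mechanism can be pictured as a polling loop. In the Python sketch below, `get_state` stands in for whatever status query the client agent program sends to the cluster; that callable, the state strings, the polling interval and the timeout are all assumptions of this sketch and are not specified by the description.

```python
import time


def monitor(get_state, strategy, interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll `get_state()` until the action finishes, then apply the strategy.

    get_state: callable returning "running", "success" or "failure"
               (a stand-in for a status query sent to the cluster).
    strategy:  dict such as {"success": "continue", "failure": "abort"}.
    Returns the strategy's decision, or "abort" if the timeout elapses
    or the strategy has no entry for the final state.
    """
    waited = 0.0
    while waited <= timeout:
        state = get_state()
        if state in ("success", "failure"):
            return strategy.get(state, "abort")
        sleep(interval)                     # wait before asking the cluster again
        waited += interval
    return "abort"
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a real client agent program might equally rely on pushed status updates instead of polling.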

In the present invention, the user-side agent program is implemented flexibly, including but not limited to command line tools (CLI), SDKs in various programming languages, application programs with user interfaces, etc.

Step 103: establishing a connection between the client agent program and the in-cluster components, and performing data communication.

In the invention, the client agent program connects to a public interface exposed by the in-cluster components, and data communication between the two may use any computer-network application-layer protocol; for example, WebSocket can be used.

In another embodiment of the present invention, the client agent communicating data with the components in the cluster comprises: the user terminal agent program sends a request to the components in the cluster and receives the result returned by the components in the cluster. The user terminal agent program can also establish connection with other servers, send requests to other servers and receive the results returned by the servers.

Further, upon receiving a request from the client agent program, the in-cluster components first check whether the user has the right to run the computing task. This check can be implemented in various ways, for example by using the cluster's own identity authentication mechanism, or by using security services, deployed by the cluster administrator, that are based on specifications such as OIDC and UMA. The check may also be skipped, depending on the security policy of the cluster.
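The order of operations described here (check rights first, act second, with the check optionally skipped) can be sketched as follows. The request shape, the `handle_request` function and its parameters are assumptions made for illustration; in a real deployment `is_authorized` might delegate to the cluster's own authentication mechanism or to an OIDC- or UMA-based service.

```python
def handle_request(request, is_authorized, execute, enforce=True):
    """Entry point of an in-cluster component: check rights, then act.

    request:       dict with "user" and "action" keys (an assumed shape).
    is_authorized: callable (user, action) -> bool.
    execute:       callable (action) -> result, performing the supported action.
    enforce:       False models a cluster whose security policy skips the check.
    """
    user, action = request["user"], request["action"]
    if enforce and not is_authorized(user, action):
        return {"ok": False, "error": f"user {user!r} may not run {action!r}"}
    return {"ok": True, "result": execute(action)}
```

Rejecting the request before `execute` is ever called is what keeps an unauthorized user from consuming cluster resources.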

The intra-cluster components provide the necessary support for the user agent to perform actions, such as sending a request to create a resource to a resource server within the cluster, port forwarding containers in the cluster, etc.

In another embodiment of the present invention, when neither the above security mechanism nor the support for action execution is required, the system can consist of only the code package and the client agent program, in which case the client agent program accesses the cluster's servers directly.

In another embodiment of the present invention, an example code package comprises the definition file codepack.yaml, code files and resource configuration files for the cluster, laid out as follows:

    mnist-keras
    ├── codepack.yaml
    ├── download_dataset.py
    ├── main.py
    ├── pvc.yaml
    ├── secret.yaml
    └── trainingjob.yaml

codepack.yaml is as follows:

    name: mnist-keras
    description: A simple image classifier based on CNN using tf2.
    targets:
      - name: prepare-env
        actions:
          - name: workspace-for-training
            verb: apply
            files: [pvc.yaml]
          - name: secret-for-s3
            verb: apply
            files: [secret.yaml]
      - name: copy-file
        deps: ["prepare-env"]
        actions:
          - name: copy-code
            verb: copy
            src: .
            dst: training-pvc:mnist-keras/
            strategy:
              success: continue
              failure: abort
          - name: copy-dataset
            verb: copy
            src: s3://data/
            dst: training-pvc:mnist-keras/data/
      - name: run-distributed-training
        deps: ["prepare-env", "copy-file"]
        actions:
          - name: job
            verb: create
            files: [job.yaml]

The definition file contains three targets: prepare-env, copy-file and run-distributed-training, where copy-file depends on the first target and run-distributed-training depends on the first two. Each target contains one or more actions that carry out concrete operations. For example, copy-file contains the two actions copy-code and copy-dataset, both of type copy, which copy files from the local machine and from S3, respectively, to a PVC. The strategy of the copy-code action is to continue if the action succeeds and to abort if it fails. Here the actions correspond to the entries under actions and the targets to the entries under targets.

The run-distributed-training target of this code package is run using a command-line tool (one implementation of the client agent program), which resolves the workflow, sends requests to the in-cluster components to execute the actions, and monitors their execution. The command-line output is as follows:

    $ codepack run --target run-distributed-training
    Running sequence: prepare-env -> copy-file -> run-distributed-training
    Target 1/3: prepare-env
    APPLY by files ['pvc.yaml']
    PersistentVolumeClaim training-pvc created
    APPLY by files ['secret.yaml']
    Secret training-secret created
    Target 2/3: copy-file
    COPY from . to training-pvc:mnist-keras/
    monitoring copy action ...
    succeeded
    COPY from s3://data/ to training-pvc:mnist-keras/data/
    Target 3/3: run-distributed-training
    CREATE by files ['trainingjob.yaml']
    TrainingJob mnist-keras created

In a second aspect of the present invention, there is provided an apparatus for submitting a structured machine learning computing task to a computing cluster, comprising:

an acquisition module, configured to acquire at least a file-system directory, created by a user, that contains project files and a definition file, and to use that directory as the code package. The definition file defines the details and the operation mode of the code package; the details comprise at least a version, a name and a description, and the operation mode comprises at least the definitions of targets and actions.

That is, the code package is an abstraction of the complete machine learning project that the user wants to run in the computing cluster. A code package is a way of organizing all the files required to run a machine learning application, and is usually a file-system directory in which a code package definition file (codepack.yaml) defines the details and the operation mode of the code package, alongside code, data files, resource configuration files and the like. Code packages are typically versioned and distributed through a source code version control system (e.g., Git).

An action is an abstraction of a concrete executable operation of a target. An action belongs to a target and defines one specific executable operation of that target. Actions follow standardized, user-configurable conventions; examples are creating a storage volume in the cluster or creating a job (Job) for model training. Actions are defined in multiple types, each type covering one particular kind of operation. For those actions whose execution requires support from the in-cluster components, the client agent program sends requests to the in-cluster components. The user can easily define new types to extend the functionality. An action has a state during execution, and the user can set a strategy that determines the next step according to that state, such as whether to execute the next action and when to execute it.

A workflow consists of a set of targets in a code package that have dependency relationships among them, and takes the form of a directed acyclic graph (DAG). When a workflow is run, its dependencies are resolved, and targets that have no dependencies, or whose dependencies have already been run, can be run serially or in parallel.

In the technical scheme of the invention, because all the information of the code package is maintained in a simple and clear definition file, it is very easy for a user to create a new code package or modify an existing machine learning project into the code package, and the user can use a common source code version control system (such as Git) to perform version management and distribution on the code package.

In the invention, the calling module is specifically configured so that the client agent program executes the action specified by the user by reading the definition file of the code package. The execution of some actions requires necessary support from the in-cluster components in the computing cluster; in that case the client agent program sends a corresponding request to the in-cluster components to complete the action.

In the present invention, the user agent is implemented flexibly, including but not limited to command line tools (CLI), SDKs in various programming languages, applications with user interfaces, etc.

In the present invention, the communication module is specifically configured to establish a connection between the client agent program and a public interface exposed by the in-cluster components, and to perform data communication between the two using any computer-network application-layer protocol; for example, WebSocket can be used.

Further, the components within the cluster, upon receiving a request from a client agent, first check whether the user has the right to run the computing task. The process can be implemented in various ways, such as using an identity authentication mechanism of the cluster itself, using security services deployed by a cluster administrator based on specifications such as OIDC and UMA; this process may also be skipped depending on the security policy of the cluster.

When neither the above security mechanism nor the support for action execution is required, the system may consist of only the code package and the client agent program, in which case the client agent program accesses the cluster's servers directly.

According to a third aspect of the present invention, there is provided a system for submitting a structured machine learning computing task to a computing cluster, comprising:

a code package, a client agent program and in-cluster components;

Further, the code package is an abstraction of the complete machine learning project that the user wants to run in the computing cluster. A code package is a way of organizing all the files required to run a machine learning application, and is usually a file-system directory in which a code package definition file (codepack.yaml) defines the details and the operation mode of the code package, alongside code, data files, resource configuration files and the like. Code packages are typically versioned and distributed through a source code version control system (e.g., Git).

An action is an abstraction of a concrete executable operation of a target. An action belongs to a target and defines one specific executable operation of that target. Actions follow standardized, user-configurable conventions; examples are creating a storage volume in the cluster or creating a job for model training. Actions are defined in multiple types, each type covering one particular kind of operation. For those actions whose execution requires support from the in-cluster components in the computing cluster, the client agent program sends requests to the in-cluster components. The user can also easily define new types to extend the functionality. An action has a state during execution, and the user can set a strategy that determines the next step according to that state, such as whether to execute the next action and when to execute it.

In the present invention, the execution of some actions by the client agent program requires necessary support from the in-cluster components in the computing cluster; in that case the client agent program sends a corresponding request to the in-cluster components to complete the action.

The client agent is configured with a monitoring mechanism. When a certain action is executed, the client agent program will continuously communicate with the cluster to monitor the current action state, so as to process various situations according to the strategy preset by the user.

The client agent program establishes a connection with the in-cluster components to perform data communication.

The components in the cluster provide the necessary support for the client agent to perform actions, such as sending a request to create a resource to a resource server in the cluster, port forwarding containers in the cluster, and the like.

As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of one or more of, the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

It should be understood that the above detailed description of the technical solution of the present invention with the help of preferred embodiments is illustrative and not restrictive. After reading the description of the present invention, those skilled in the art may modify the technical solutions described in the embodiments, or may substitute part of the technical features of the embodiments; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method of submitting a structured machine learning computing task to a computing cluster, comprising:

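acquiring a code package created by a user;

when the client agent program is invoked, running the code package;

and establishing a connection between the client agent program and the in-cluster components, and performing data communication.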

2. The method of claim 1,

acquiring the code package created by the user comprises: acquiring at least a file-system directory, created by the user, that contains project files and a definition file, and using that directory as the code package.

3. The method of claim 2,

the definition file is used for defining the detailed information and the operation mode of the code package;

wherein, the detailed information of the code package at least comprises a version, a name and a description;

the mode of operation includes at least the definition of the target and the action.

4. The method of claim 1,

running the code package comprises: the client agent program executes the action specified by the user by reading the definition file of the code package.

5. The method of claim 1,

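establishing the connection between the client agent program and the in-cluster components and performing data communication comprises:

establishing a connection between the client agent program and a public interface exposed by the in-cluster components, and performing data communication between the client agent program and the in-cluster components using any computer-network application-layer protocol.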

6. The method of claim 5,

the method further comprises the following steps: the components within the cluster, upon receiving a request from a client agent, first check whether the user has the right to run the computing task.

7. The method of claim 1,

the method further comprises: the client agent program is provided with a workflow mechanism;

when a target is run, the client agent program recursively resolves all the targets it depends on, dynamically constructs a directed acyclic graph, and runs, serially or in parallel, those targets in the graph that have no dependencies or whose dependencies have already been run.

8. The method of claim 1, wherein the method comprises:

when a target is run, the client agent program recursively resolves all targets it depends on, dynamically constructs a directed acyclic graph, and runs, serially or in parallel, those targets that currently have no dependencies or whose dependencies have already been run;

while an action is being executed, the client agent program continuously communicates with the cluster to monitor the current state of the action, so as to handle each situation according to the strategy preset by the user;

a connection is established between the client agent program and the in-cluster components, the client agent program sending requests to the in-cluster components and receiving the results they return;

the client agent program can also directly establish connections with other servers, send requests to those servers, and receive the returned results.

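9. An apparatus for submitting a structured machine learning computing task to a computing cluster, comprising:

an acquisition module, configured to acquire a code package created by a user;

a calling module, configured to run the code package when the client agent program is invoked;

and a communication module, configured to establish a connection between the client agent program and the in-cluster components and to perform data communication.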

10. A system for submitting a structured machine learning computing task to a computing cluster, comprising:

a code package, a client agent program and in-cluster components;

the code package comprises at least a file-system directory containing project files and a definition file, wherein the definition file defines the details and the operation mode of the code package, the details comprising at least a version, a name and a description, and the operation mode comprising at least the definitions of targets and actions;

the client agent program is used to execute the action specified by the user by reading the definition file of the code package;

and the client agent program establishes a connection with the in-cluster components to perform data communication.