CN115334152A - Method for submitting structured machine learning calculation task to calculation cluster - Google Patents

Publication number
CN115334152A
Authority
CN
China
Prior art keywords
cluster
user
agent program
code
components
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211125556.4A
Other languages
Chinese (zh)
Other versions
CN115334152B (en)
Inventor
王明亮
肖宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Vector Stack Technology Co ltd
Original Assignee
Beijing Vector Stack Technology Co ltd
Application filed by Beijing Vector Stack Technology Co ltd
Priority to CN202211125556.4A
Publication of CN115334152A
Application granted
Publication of CN115334152B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5072 Grid computing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/547 Remote procedure calls [RPC]; Web services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer And Data Communications (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a method for submitting a structured machine learning computing task to a computing cluster, belonging to the technical fields of cluster computing, large-scale computing, and machine learning. In particular, it discloses a system for submitting a structured machine learning computing task to a computing cluster, comprising a code package, a client agent program, and intra-cluster components. The code package includes at least a file system directory containing project files and a definition file. The definition file defines the details and the running mode of the code package; the details include at least a version, a name, and a description, and the running mode includes at least the definitions of targets and actions. The client agent program executes the action specified by the user by reading the definition file of the code package, and establishes a connection with the intra-cluster components for data communication. By providing the target and action abstractions and a definition file protocol for the code package, the invention makes submitting a computing task to a computing cluster simple and convenient.

Description

Method for submitting structured machine learning calculation task to calculation cluster
Technical Field
The invention belongs to the technical field of cluster computing, large-scale computing and machine learning, and particularly relates to a method, a device and a system for submitting a structured machine learning computing task to a computing cluster.
Background
In computing, a cluster combines multiple computers (such as ordinary personal computers or UNIX workstations), multiple storage devices, and redundant interconnects so that they appear to users as a single, highly available system. Cluster computing can be used to implement load balancing.
Cluster computing can also be used for cost-effective parallel computing, typically in support of scientific computing, machine learning, data analysis, or other applications that require parallel computation.
The inventors have found that, in the prior art, using a cluster management tool such as kubectl involves many manual operations: the user must run extra commands to obtain calculation results and monitor state, and must judge the outcome manually. Implementing all of the actions, control logic, and workflow in program scripts instead demands substantial software engineering skill, is difficult, and takes additional time.
Disclosure of Invention
To solve at least the above technical problem, the invention provides a method and an apparatus for submitting a structured machine learning computing task to a computing cluster.
According to a first aspect of the present invention, there is provided a method of submitting a structured machine learning computing task to a computing cluster, comprising:
acquiring a code package created by a user;
running the code package when the client agent program is invoked;
and establishing a connection between the client agent program and the intra-cluster components, and performing data communication.
Further, acquiring the code package created by the user includes: acquiring a file system directory, created by the user, that contains project files and a definition file, and taking that directory as the code package.
Further, the definition file defines the details and the running mode of the code package;
wherein the details of the code package include at least a version, a name, and a description;
and the running mode includes at least the definitions of targets and actions.
Further, running the code package includes: the client agent program executes the user-specified action by reading the definition file of the code package.
Further, establishing the connection between the client agent program and the intra-cluster components and performing data communication includes:
establishing a connection between the client agent program and the public interface exposed by the intra-cluster components, and performing data communication between them using any computer network application-layer protocol.
Further, the method further comprises: upon receiving a request from the client agent program, the intra-cluster components first check whether the user has permission to run the computing task.
Further, the method further comprises: the client agent program is provided with a workflow mechanism;
when a target is run, the client agent program recursively resolves all targets it depends on, dynamically constructs a directed acyclic graph, and runs, serially or in parallel, the targets in the graph that have no dependencies or whose dependencies have already run.
Further, the method comprises:
acquiring, as a code package, a file system directory created by the user that contains project files and a definition file;
the definition file defines the details and the running mode of the code package; the details include at least a version, a name, and a description, and the running mode includes at least the definitions of targets and actions;
when the client agent program is invoked, it executes the action specified by the user by reading the definition file of the code package;
for those actions that require support from the intra-cluster components located in the computing cluster, the client agent program sends a request to the intra-cluster components;
when a target is run, the client agent program recursively resolves all targets it depends on, dynamically constructs a directed acyclic graph, and runs, serially or in parallel, the targets in the graph that have no dependencies or whose dependencies have already run;
while executing an action, the client agent program communicates continuously with the cluster to monitor the current state of the action, so that each situation is handled according to the policy preset by the user;
a connection is established between the client agent program and the intra-cluster components, and the client agent program sends requests to the intra-cluster components and receives the results they return;
the client agent program can also establish connections with other servers, send requests to them, and receive the results they return.
According to a second aspect of the present invention, there is provided an apparatus for submitting a structured machine learning computing task to a computing cluster, comprising:
an acquisition module, used for acquiring the code package created by the user;
a calling module, used for running the code package when the client agent program is invoked;
and a communication module, used for establishing a connection between the client agent program and the intra-cluster components and performing data communication.
According to a third aspect of the invention, there is provided a system for submitting a structured machine learning computing task to a computing cluster, comprising:
a code package, a client agent program, and intra-cluster components;
the code package includes at least a file system directory containing project files and a definition file, wherein the definition file defines the details and the running mode of the code package, the details include at least a version, a name, and a description, and the running mode includes at least the definitions of targets and actions;
the client agent program is used for executing the action specified by the user by reading the definition file of the code package;
and the client agent program establishes a connection with the interface of the intra-cluster components for data communication.
According to a fourth aspect of the invention, a computer device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to any one of the first aspect of the invention when executing the program.
According to a fifth aspect of the present invention, a computer readable storage medium stores a program which, when executed, is capable of implementing the method according to any one of the first aspect of the present invention.
The invention has the following beneficial effects. The method decomposes, abstracts, and structures machine learning computing tasks; it provides the target and action abstractions and a specification for the code package's definition file, and declares the running mode of a code package in a simple and clear way, so that creating a new code package, or converting an existing machine learning project into one, is very easy. Furthermore, the client agent program automatically resolves the dependencies between targets and executes the workflow, monitors the execution state of actions on its own, and decides the next step according to the policy preset by the user. This replaces manual judgment and operation, saving working time, reducing difficulty, and keeping operation simple.
In addition, the invention introduces intra-cluster components, which can verify that the submitter of a computing task has the corresponding permission, ensure the security of the computing cluster, and support the running of computing tasks.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for submitting a structured machine learning computing task to a computing cluster according to the present invention;
FIG. 2 is a schematic structural diagram of a system for submitting a structured machine learning computing task to a computing cluster according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
In order to more clearly illustrate the present invention, the present invention is further described below with reference to preferred embodiments and the accompanying drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.
In a first aspect of the present invention, there is provided a method of submitting a structured machine learning computing task to a computing cluster, as shown in fig. 1, comprising:
Step 101: acquiring a code package created by a user;
In the invention, the code package created by the user includes at least a file system directory containing project files and a definition file. The definition file defines the details and the running mode of the code package; the details include at least a version, a name, and a description, and the running mode includes at least the definitions of targets and actions.
That is, the code package is an abstraction of the complete machine learning project that the user wants to run in the computing cluster. A code package is a way of organizing all the files required to run a machine learning application. It is usually a file system directory in which a code package definition file (codepack.yaml) defines the details and the running mode of the code package; the directory also contains code, data files, resource configuration files, and so on. Code packages are typically versioned and distributed through a source code version control system (e.g., Git).
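The structure of a code package described above can be sketched as a small in-memory data model. This is only an illustration under stated assumptions: the class names, the load_code_package helper, and the version key are hypothetical, and the definition file is assumed to have already been parsed into a dictionary. The keys name, description, targets, deps, actions, and verb follow the example definition file given in this description.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Action:
    name: str
    verb: str  # action type, e.g. "apply", "copy", "create"
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class Target:
    name: str
    deps: list[str] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)


@dataclass
class CodePackage:
    name: str
    version: str
    description: str
    targets: dict[str, Target] = field(default_factory=dict)


def load_code_package(definition: dict[str, Any]) -> CodePackage:
    """Build a CodePackage from an already-parsed definition file."""
    targets = {}
    for raw in definition.get("targets", []):
        actions = [
            Action(
                name=a["name"],
                verb=a["verb"],
                # Everything except name and verb is kept as type-specific parameters.
                params={k: v for k, v in a.items() if k not in ("name", "verb")},
            )
            for a in raw.get("actions", [])
        ]
        targets[raw["name"]] = Target(raw["name"], list(raw.get("deps", [])), actions)
    return CodePackage(
        name=definition["name"],
        version=str(definition.get("version", "")),  # key name is an assumption
        description=definition.get("description", ""),
        targets=targets,
    )
```

Keeping the type-specific fields in a generic params dictionary is one possible design choice; it lets new action types carry their own fields without changing the model.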
Further, a target is an abstraction of a concrete task of the code package. A target belongs to a code package and defines one specific runnable task of that package. A target has a complete, real-world meaning that the user can understand, such as performing model training in the cluster or deploying a model as an inference service. Each target can specify other targets on which it depends; for example, creating an inference service requires model training to be completed first. When running a target, the client agent program recursively resolves the dependencies and then runs the workflow.
An action is an abstraction of a concrete executable operation of a target. An action belongs to a target and defines one specific executable operation of that target. Actions follow standardized, user-configurable conventions; examples include creating a storage volume in the cluster or creating a Job for model training. Multiple action types are defined, each covering one specific kind of operation. In one embodiment of the invention, an action is implemented jointly by the code package and the client agent program; in another embodiment, it is implemented jointly by the code package, the client agent program, and the intra-cluster components. The user can easily define new types to extend the functionality. An action has a state during execution, and the user can set a policy that determines what happens next according to that state, such as whether to execute the next action and when.
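One way to picture typed, user-extensible actions is a registry that maps each action type to a handler. The sketch below is an assumption made for illustration, not the mechanism disclosed here: ACTION_HANDLERS, action_type, execute, and the notify type are hypothetical names, and the handlers only return a description of what a real implementation would do.

```python
from typing import Callable

# Maps an action type (the "verb" field) to the function that executes it.
ACTION_HANDLERS: dict[str, Callable[[dict], str]] = {}


def action_type(verb: str):
    """Decorator that registers a handler for one action type."""
    def register(handler):
        ACTION_HANDLERS[verb] = handler
        return handler
    return register


@action_type("apply")
def apply_files(action: dict) -> str:
    # Placeholder: a real handler would ask the cluster to apply these files.
    return f"APPLY by files {action['files']}"


@action_type("copy")
def copy_files(action: dict) -> str:
    # Placeholder: a real handler would transfer data to the destination.
    return f"COPY from {action['src']} to {action['dst']}"


def execute(action: dict) -> str:
    """Dispatch an action to the handler registered for its verb."""
    try:
        handler = ACTION_HANDLERS[action["verb"]]
    except KeyError:
        raise ValueError(f"unknown action type: {action['verb']!r}") from None
    return handler(action)


# A user-defined type extends the tool without touching the dispatcher.
@action_type("notify")  # hypothetical type, not taken from the source
def notify(action: dict) -> str:
    return f"NOTIFY {action['message']}"
```

With this layout, adding a new action type is a matter of registering one more function, which mirrors the statement that users can define new types to extend the functionality.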
A workflow consists of a set of interdependent targets in a code package and takes the form of a directed acyclic graph (DAG). When the workflow is run, its dependencies are resolved, and targets that have no dependencies, or whose dependencies have already run, can be run serially or in parallel.
In the technical solution of the invention, all information about the code package is maintained in a simple and clear definition file, so the user can easily create a new code package or convert an existing machine learning project into one, and can use a common source code version control system (such as Git) to version and distribute the code package.
Step 102: running the code package when the client agent program is invoked;
In the invention, the client agent program executes the action specified by the user by reading the definition file of the code package. Some actions require support from the intra-cluster components in the computing cluster; for these, the client agent program sends the corresponding request to the intra-cluster components, which complete the operation.
Further, the client agent program has a workflow mechanism. When a target is run, the client agent program recursively resolves all the targets it depends on, dynamically constructs a directed acyclic graph, and then runs, serially or in parallel, the targets that currently have no dependencies or whose dependencies have already run.
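The workflow mechanism can be sketched in two steps: recursive resolution of the requested target's dependencies into a graph, followed by grouping the targets into batches whose members could run in parallel. This is a minimal sketch under assumptions: the function names resolve and schedule are hypothetical, targets are represented as a plain mapping from target name to its list of dependencies, and cycle detection is added here as a safeguard.

```python
def resolve(targets: dict[str, list[str]], goal: str) -> dict[str, set[str]]:
    """Recursively collect `goal` and every target it depends on."""
    graph: dict[str, set[str]] = {}
    visiting: set[str] = set()

    def visit(name: str) -> None:
        if name in graph:
            return  # already resolved
        if name in visiting:
            raise ValueError(f"dependency cycle through {name!r}")
        if name not in targets:
            raise KeyError(f"undefined target {name!r}")
        visiting.add(name)
        for dep in targets[name]:
            visit(dep)
        visiting.discard(name)
        graph[name] = set(targets[name])

    visit(goal)
    return graph


def schedule(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group targets into batches; targets in one batch may run in parallel."""
    done: set[str] = set()
    batches: list[list[str]] = []
    while len(done) < len(graph):
        # A target is ready when all of its dependencies have already run.
        ready = sorted(n for n, deps in graph.items() if n not in done and deps <= done)
        batches.append(ready)
        done.update(ready)
    return batches
```

For three targets in which copy-file depends on prepare-env and run-distributed-training depends on both, schedule(resolve(targets, "run-distributed-training")) yields three single-target batches in that order. Targets that are defined but not reachable from the requested target are never included in the graph.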
The client agent program also has a monitoring mechanism. While an action is executing, the client agent program communicates continuously with the cluster to monitor the current state of the action, so that each situation is handled according to the policy preset by the user.
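The monitoring mechanism can be illustrated as a polling loop that waits for each action to leave its running state and then consults the user's policy. The sketch makes several assumptions that are not taken from the source: the function name run_with_strategy, the state names "running" and "succeeded", the callbacks start and poll_state standing in for communication with the cluster, and the default policy of continuing on success and aborting on failure. The strategy keys success and failure follow the example definition file.

```python
import time
from typing import Callable


def run_with_strategy(
    actions: list[dict],
    start: Callable[[dict], None],
    poll_state: Callable[[dict], str],
    interval: float = 1.0,
) -> list[str]:
    """Run actions in order, polling each until it finishes, then apply its strategy."""
    log = []
    # Default policy (an assumption): continue on success, abort on failure.
    default = {"success": "continue", "failure": "abort"}
    for action in actions:
        start(action)
        state = poll_state(action)
        while state == "running":
            time.sleep(interval)
            state = poll_state(action)
        strategy = {**default, **action.get("strategy", {})}
        outcome = "success" if state == "succeeded" else "failure"
        decision = strategy[outcome]
        log.append(f"{action['name']}: {state} -> {decision}")
        if decision == "abort":
            break  # remaining actions are not started
    return log
```

Passing the cluster communication in as callbacks keeps the loop independent of the transport, so the same logic could sit behind a command-line tool or an SDK.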
In the invention, the client agent program can be implemented in a variety of forms, including but not limited to command-line interface (CLI) tools, SDKs in various programming languages, and applications with user interfaces.
Step 103: establishing a connection between the client agent program and the intra-cluster components, and performing data communication.
In the invention, the client agent program connects to the public interface exposed by the intra-cluster components, and any computer network application-layer protocol may be used for data communication between them; for example, WebSocket can be used.
In another embodiment of the invention, data communication between the client agent program and the intra-cluster components comprises: the client agent program sends a request to the intra-cluster components and receives the result they return. The client agent program can also establish connections with other servers, send requests to them, and receive the results they return.
Further, upon receiving a request from the client agent program, the intra-cluster components first check whether the user has permission to run the computing task. This check can be implemented in various ways, for example by using the cluster's own authentication mechanism, or by using security services deployed by the cluster administrator that are based on specifications such as OIDC and UMA. The check may also be skipped, depending on the security policy of the cluster.
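The order of operations on the cluster side, checking permission first and only then carrying out the request, can be sketched as follows. All names here (Request, PermissionDenied, handle_request, is_authorized, perform) are hypothetical. The authorization callback stands in for whichever mechanism the cluster uses, and passing None models a cluster whose security policy skips the check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    user: str
    token: str
    action: dict


class PermissionDenied(Exception):
    """Raised when the user may not run the requested computing task."""


def handle_request(
    request: Request,
    is_authorized: Optional[Callable[[Request], bool]],
    perform: Callable[[dict], str],
) -> str:
    """Check the caller's rights first, then carry out the requested action."""
    # is_authorized=None models a security policy that skips the check.
    if is_authorized is not None and not is_authorized(request):
        raise PermissionDenied(f"user {request.user!r} may not run this task")
    return perform(request.action)
```

Because the check runs before perform is called, a rejected request never reaches the code that creates resources in the cluster.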
The intra-cluster components provide the support that the client agent program needs to execute actions, such as sending resource-creation requests to the resource server within the cluster or port-forwarding to containers in the cluster.
In another embodiment of the invention, when neither the security mechanism nor the action-execution support described above is needed, the system can consist of only the code package and the client agent program; in that case the client agent program accesses the cluster's servers directly.
In another embodiment of the invention, an example code package contains codepack.yaml, code files, and cluster resource configuration files, laid out as follows:
mnist-keras
├── codepack.yaml
├── download_dataset.py
├── main.py
├── pvc.yaml
├── secret.yaml
└── trainingjob.yaml
The content of codepack.yaml is as follows:
name: mnist-keras
description: A simple image classifier based on CNN using tf2.
targets:
  - name: prepare-env
    actions:
      - name: workspace-for-training
        verb: apply
        files: [pvc.yaml]
      - name: secret-for-s3
        verb: apply
        files: [secret.yaml]
  - name: copy-file
    deps: ["prepare-env"]
    actions:
      - name: copy-code
        verb: copy
        src: .
        dst: training-pvc:mnist-keras/
        strategy:
          success: continue
          failure: abort
      - name: copy-dataset
        verb: copy
        src: s3://data/
        dst: training-pvc:mnist-keras/data/
  - name: run-distributed-training
    deps: ["prepare-env", "copy-file"]
    actions:
      - name: job
        verb: create
        files: [trainingjob.yaml]
The definition file contains three targets: prepare-env, copy-file, and run-distributed-training. The copy-file target depends on the first target, and run-distributed-training depends on the first two. Each target contains one or more actions that carry out concrete operations. For example, copy-file contains two actions, copy-code and copy-dataset, both of type copy, which copy files to a PVC from the local machine and from S3 storage respectively. The strategy of the copy-code action is to continue if it succeeds and to abort if it fails. Here, actions are denoted by action and targets by target.
The run-distributed-training target of this code package is run with a command-line tool (one implementation of the client agent program), which resolves the workflow, sends requests to the intra-cluster components to execute the actions, and monitors their execution. The command-line output is as follows:
$ codepack run --target run-distributed-training
Running sequence: prepare-env -> copy-file -> run-distributed-training
Target 1/3: prepare-env
APPLY by files ['pvc.yaml']
PersistentVolumeClaim training-pvc created
APPLY by files ['secret.yaml']
Secret training-secret created
Target 2/3: copy-file
COPY from . to training-pvc:mnist-keras/
monitoring copy action ...
succeeded
COPY from s3://data/ to training-pvc:mnist-keras/data/
Target 3/3: run-distributed-training
CREATE by files ['trainingjob.yaml']
TrainingJob mnist-keras created
In a second aspect of the present invention, there is provided an apparatus for submitting a structured machine learning computing task to a computing cluster, comprising:
an acquisition module, used for acquiring the code package created by the user;
In the invention, the acquisition module is specifically used for acquiring at least a file system directory, created by the user, that contains project files and a definition file, and for taking that directory as the code package. The definition file defines the details and the running mode of the code package; the details include at least a version, a name, and a description, and the running mode includes at least the definitions of targets and actions.
That is, the code package is an abstraction of the complete machine learning project that the user wants to run in the computing cluster. A code package is a way of organizing all the files required to run a machine learning application. It is usually a file system directory in which a code package definition file (codepack.yaml) defines the details and the running mode of the code package; the directory also contains code, data files, resource configuration files, and so on. Code packages are typically versioned and distributed through a source code version control system (e.g., Git).
Further, a target is an abstraction of a concrete task of the code package. A target belongs to a code package and defines one specific runnable task of that package. A target has a complete, real-world meaning that the user can understand, such as performing model training in the cluster or deploying a model as an inference service. Each target can specify other targets on which it depends; for example, creating an inference service requires model training to be completed first. When running a target, the client agent program recursively resolves the dependencies and then runs the workflow.
An action is an abstraction of a concrete executable operation of a target. An action belongs to a target and defines one specific executable operation of that target. Actions follow standardized, user-configurable conventions; examples include creating a storage volume in the cluster or creating a Job for model training. Multiple action types are defined, each covering one specific kind of operation; for those actions that require support from the intra-cluster components, the client agent program sends requests to the intra-cluster components. The user can easily define new types to extend the functionality. An action has a state during execution, and the user can set a policy that determines what happens next according to that state, such as whether to execute the next action and when.
A workflow consists of a set of interdependent targets in a code package and takes the form of a directed acyclic graph (DAG). When the workflow is run, its dependencies are resolved, and targets that have no dependencies, or whose dependencies have already run, can be run serially or in parallel.
In the technical solution of the invention, all information about the code package is maintained in a simple and clear definition file, so the user can easily create a new code package or convert an existing machine learning project into one, and can use a common source code version control system (such as Git) to version and distribute the code package.
a calling module, used for running the code package when the client agent program is invoked;
In the invention, the calling module is specifically used for having the client agent program execute the action specified by the user by reading the definition file of the code package. Some actions require support from the intra-cluster components in the computing cluster; for these, the client agent program sends the corresponding request to the intra-cluster components, which complete the operation.
Further, the client agent program has a workflow mechanism. When a target is run, the client agent program recursively resolves all the targets it depends on, dynamically constructs a directed acyclic graph, and then runs, serially or in parallel, the targets that currently have no dependencies or whose dependencies have already run.
The client agent program also has a monitoring mechanism. While an action is executing, the client agent program communicates continuously with the cluster to monitor the current state of the action, so that each situation is handled according to the policy preset by the user.
In the invention, the client agent program can be implemented in a variety of forms, including but not limited to command-line interface (CLI) tools, SDKs in various programming languages, and applications with user interfaces.
and a communication module, used for establishing a connection between the client agent program and the intra-cluster components and performing data communication.
In the invention, the communication module is specifically used for establishing a connection between the client agent program and the public interface exposed by the intra-cluster components, and for performing data communication between them using any computer network application-layer protocol; for example, WebSocket can be used.
Further, upon receiving a request from the client agent program, the intra-cluster components first check whether the user has permission to run the computing task. This check can be implemented in various ways, for example by using the cluster's own authentication mechanism, or by using security services deployed by the cluster administrator that are based on specifications such as OIDC and UMA. The check may also be skipped, depending on the security policy of the cluster.
The intra-cluster components provide the support that the client agent program needs to execute actions, such as sending resource-creation requests to the resource server within the cluster or port-forwarding to containers in the cluster.
When neither the security mechanism nor the action-execution support described above is needed, the system can consist of only the code package and the client agent program; in that case the client agent program accesses the cluster's servers directly.
According to a third aspect of the present invention, there is provided a system for submitting a structured machine learning computing task to a computing cluster, comprising:
a code package, a user-side agent program, and intra-cluster components;
the code package includes at least: a file-system directory containing project files and a definition file, wherein the definition file defines the detailed information and the operation mode of the code package, the detailed information includes at least a version, a name and a description, and the operation mode includes at least definitions of targets and actions;
Further, the code package is an abstraction of the complete machine learning project that the user wants to run in the computing cluster. A code package is a way of organizing all the files needed to run a machine learning application, usually a file-system directory, in which a code package definition file (codepack.yaml) defines the detailed information and operation mode of the code package; the directory also contains code, data files, resource configuration files, and so on. Code packages are typically versioned and distributed through a source code version control system (e.g., Git).
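As an illustration of what a definition file carries, the sketch below models its content in memory. It is not the patent's schema: the key names (`targets`, `depends_on`, `actions`, `type`, `params`) and the class names are assumptions chosen for the example, and the input is taken to be a dictionary already parsed from the definition file by some YAML parser. Only the required items named in the text (version, name, description, targets and actions) are grounded in the source.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Action:
    name: str
    type: str  # action type name; the concrete names are hypothetical
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Target:
    name: str
    depends_on: List[str] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)


@dataclass
class CodePackage:
    name: str
    version: str
    description: str
    targets: Dict[str, Target]


def load_definition(data: Dict[str, Any]) -> CodePackage:
    """Build a CodePackage from an already-parsed definition file."""
    for key in ("name", "version", "description", "targets"):
        if key not in data:
            raise ValueError(
                f"definition file is missing required field {key!r}")
    targets: Dict[str, Target] = {}
    for raw in data["targets"]:
        actions = [Action(a["name"], a["type"], dict(a.get("params", {})))
                   for a in raw.get("actions", [])]
        targets[raw["name"]] = Target(
            raw["name"], list(raw.get("depends_on", [])), actions)
    return CodePackage(data["name"], str(data["version"]),
                       data["description"], targets)
```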
Still further, a target is an abstraction of a concrete task of the code package. A target belongs to a code package and defines one specific runnable task of that package. A target has a complete, real-world meaning that the user can understand, such as performing model training in the cluster or deploying a model as an inference service. Each target can specify other targets on which it depends; for example, creating an inference service requires model training to be completed first. When running a target, the user-side agent program recursively resolves its dependencies and then runs the resulting workflow.
An action is an abstraction of a concrete executable operation of a target; it belongs to a target and defines one concrete executable operation of that target. Actions follow standardized conventions and can be configured by the user. Examples include creating a storage volume in the cluster or creating a model training task. Multiple action types are defined, each covering one specific kind of operation. For those actions that need necessary support from the intra-cluster components located in the computing cluster, the user-side agent program sends requests to the intra-cluster components. The user can also easily define new types to extend the functionality. An action has a state during execution, and the user can set a policy that determines the next step according to that state, such as whether to execute the next action and when.
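One way to make action types extensible, as the paragraph above suggests, is a registry that maps a type name to a handler. The sketch below is an assumption about how this could be done, not the patent's implementation: the decorator `action_type`, the function `run_action`, and the type name "create-volume" are all hypothetical, and the handler only returns a string where a real one would send a request to an intra-cluster component.

```python
from typing import Any, Callable, Dict

ActionHandler = Callable[[Dict[str, Any]], str]

# Registry of known action types, keyed by type name.
_ACTION_TYPES: Dict[str, ActionHandler] = {}


def action_type(name: str) -> Callable[[ActionHandler], ActionHandler]:
    """Decorator that registers a handler under an action type name."""
    def register(handler: ActionHandler) -> ActionHandler:
        if name in _ACTION_TYPES:
            raise ValueError(f"action type {name!r} is already defined")
        _ACTION_TYPES[name] = handler
        return handler
    return register


@action_type("create-volume")  # hypothetical built-in type
def create_volume(params: Dict[str, Any]) -> str:
    # A real handler would ask an intra-cluster component to create it.
    return f"volume {params['name']} requested"


def run_action(type_name: str, params: Dict[str, Any]) -> str:
    """Look up the handler for an action type and execute it."""
    try:
        handler = _ACTION_TYPES[type_name]
    except KeyError:
        raise ValueError(f"unknown action type {type_name!r}") from None
    return handler(params)
```

A user-defined type is then just another decorated function, which matches the text's point that new types can be added without changing the agent itself.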
A workflow consists of a set of targets in a code package that have dependency relationships, embodied as a directed acyclic graph (DAG). When the workflow is run, its dependency relationships are resolved, and targets that have no dependencies, or whose dependencies have already been run, can be run serially or in parallel.
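The dependency resolution described above can be sketched as follows. The function names `resolve` and `batches` and the target names in the usage example are hypothetical. The sketch first collects, recursively, every target the requested target depends on (rejecting cycles, since the graph must be acyclic), then groups the targets into batches: every target in a batch has all of its dependencies in earlier batches, so the members of one batch could be run serially or in parallel.

```python
from typing import Dict, List, Set


def resolve(target: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Recursively collect a target and everything it depends on."""
    needed: Set[str] = set()
    visiting: Set[str] = set()

    def visit(name: str) -> None:
        if name in needed:
            return
        if name in visiting:
            raise ValueError(f"dependency cycle through {name!r}")
        if name not in deps:
            raise KeyError(f"undefined target {name!r}")
        visiting.add(name)
        for dependency in deps[name]:
            visit(dependency)
        visiting.discard(name)
        needed.add(name)

    visit(target)
    return needed


def batches(target: str, deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group the needed targets into batches that are ready to run."""
    remaining = resolve(target, deps)
    done: Set[str] = set()
    plan: List[List[str]] = []
    while remaining:
        # Ready: no dependencies, or all dependencies already run.
        ready = sorted(name for name in remaining
                       if all(d in done for d in deps[name]))
        plan.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return plan
```

For example, with `deps = {"prepare-data": [], "create-volume": [], "train": ["prepare-data", "create-volume"], "serve": ["train"]}`, calling `batches("serve", deps)` yields the two independent targets first, then "train", then "serve".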
In the technical solution of the invention, because all the information of a code package is maintained in a simple and clear definition file, it is easy for a user to create a new code package or to convert an existing machine learning project into a code package, and the user can use a common source code version control system (such as Git) to version and distribute the code package.
The user-side agent program is used for executing the action specified by the user by reading the definition file of the code package;
In the present invention, when the user-side agent program executes some actions, it needs necessary support from the intra-cluster components located in the computing cluster; in that case the user-side agent program sends a corresponding request to the intra-cluster components to complete the process.
Further, the user-side agent program has a workflow mechanism. When a target is run, the user-side agent program recursively resolves all targets it depends on, dynamically constructs a directed acyclic graph, and then runs, serially or in parallel, the targets that currently have no dependencies or whose dependencies have already been run.
The client agent is configured with a monitoring mechanism. When a certain action is executed, the client agent program will continuously communicate with the cluster to monitor the current action state, so as to process various situations according to the strategy preset by the user.
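A minimal sketch of such a monitoring loop is given below, assuming a simple polling model. The state names, the function `monitor`, and the representation of the user's policy as a mapping from state to callback are all assumptions made for the example; the patent only states that an action has a state and that the user presets a policy for handling it.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical state names; the source does not enumerate them.
TERMINAL_STATES = {"succeeded", "failed"}


def monitor(get_state: Callable[[], str],
            policy: Dict[str, Callable[[], None]],
            interval: float = 1.0,
            max_polls: int = 60,
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll an action's state until it reaches a terminal state.

    get_state stands in for a query sent to the cluster. Each time the
    observed state changes, the callback the user configured for that
    state (if any) is invoked once.
    """
    last: Optional[str] = None
    for _ in range(max_polls):
        state = get_state()
        if state != last:
            handler = policy.get(state)
            if handler is not None:
                handler()
            last = state
        if state in TERMINAL_STATES:
            return state
        sleep(interval)
    raise TimeoutError(
        f"action still in state {last!r} after {max_polls} polls")
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays; an agent that holds a persistent connection to the cluster could receive state changes as pushed events instead of polling.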
In the present invention, the user-side agent program is implemented flexibly, including but not limited to command line tools (CLI), SDKs in various programming languages, application programs with user interfaces, etc.
The user-side agent program establishes a connection with the intra-cluster components to carry out data communication;
In the invention, the user-side agent program connects to the public interface exposed by the intra-cluster components, and data communication between the user-side agent program and the intra-cluster components uses any application-layer protocol of a computer network; for example, WebSocket may be used.
Further, the components within the cluster, upon receiving a request from the user-side agent program, first check whether the user has the right to run the computing task. The process can be implemented in various ways, for example by using the identity authentication mechanism of the cluster itself, or by using security services deployed by a cluster administrator based on specifications such as OIDC and UMA; this process may also be skipped depending on the security policy of the cluster.
The components in the cluster provide the necessary support for the client agent to perform actions, such as sending a request to create a resource to a resource server in the cluster, port forwarding containers in the cluster, and the like.
When neither the above security mechanism nor the support for action execution is required, the system may consist of only the code package and the user-side agent program; in that case the user-side agent program accesses the servers of the cluster directly.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be understood that the above detailed description of the technical solution of the present invention with the help of preferred embodiments is illustrative and not restrictive. After reading the description of the present invention, those skilled in the art may modify the technical solutions described in the embodiments, or may substitute part of the technical features of the embodiments; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of submitting a structured machine learning computing task to a computing cluster, comprising:
acquiring a code package created by a user;
running the code package when the user-side agent program is invoked;
and establishing a connection between the user-side agent program and the intra-cluster components, and carrying out data communication.
2. The method of claim 1,
acquiring the code package created by the user comprises: acquiring at least a file-system directory, created by the user, that contains project files and a definition file, and using that directory as the code package.
3. The method of claim 2,
the definition file is used for defining the detailed information and the operation mode of the code package;
wherein the detailed information of the code package includes at least a version, a name and a description;
and the operation mode includes at least definitions of targets and actions.
4. The method of claim 1,
running the code package comprises: the user-side agent program executes the user-specified action by reading the definition file of the code package.
5. The method of claim 1,
establishing a connection between the user-side agent program and the intra-cluster components and carrying out data communication comprises:
establishing a connection between the user-side agent program and the public interface exposed by the intra-cluster components, and carrying out data communication between the user-side agent program and the intra-cluster components using any application-layer protocol of a computer network.
6. The method of claim 5,
the method further comprises: the intra-cluster components, upon receiving a request from the user-side agent program, first check whether the user has the right to run the computing task.
7. The method of claim 1,
the method further comprises: the user-side agent program is provided with a workflow mechanism;
when a target is run, the user-side agent program recursively resolves all targets it depends on, dynamically constructs a directed acyclic graph, and runs, serially or in parallel, the targets in the graph that have no dependencies or whose dependencies have already been run.
8. The method of claim 1, wherein the method comprises:
acquiring, as a code package, a file-system directory created by a user that contains project files and a definition file;
the definition file is used for defining the detailed information and the operation mode of the code package, the detailed information including at least a version, a name and a description, and the operation mode including at least definitions of targets and actions;
when the user-side agent program is invoked, the user-side agent program executes the action specified by the user by reading the definition file of the code package;
for those actions that need necessary support from the intra-cluster components located in the computing cluster, the user-side agent program sends a request to the intra-cluster components;
when a target is run, the user-side agent program recursively resolves all targets it depends on, dynamically constructs a directed acyclic graph, and runs, serially or in parallel, the targets that currently have no dependencies or whose dependencies have already been run;
when an action is executed, the user-side agent program continuously communicates with the cluster to monitor the current state of the action, so as to handle various situations according to the policy preset by the user;
a connection is established between the user-side agent program and the intra-cluster components, the user-side agent program sending requests to the intra-cluster components and receiving the results they return;
the user-side agent program can also establish connections directly with other servers, send requests to those servers, and receive the returned results.
9. An apparatus for submitting a structured machine learning computing task to a computing cluster, comprising:
an acquisition module, used for acquiring the code package created by the user;
a calling module, used for running the code package when the user-side agent program is invoked;
and a communication module, used for establishing a connection between the user-side agent program and the intra-cluster components and carrying out data communication.
10. A system for submitting a structured machine learning computing task to a computing cluster, comprising:
a code package, a user-side agent program, and intra-cluster components;
the code package includes at least: a file-system directory containing project files and a definition file, wherein the definition file is used for defining the detailed information and the operation mode of the code package, the detailed information includes at least a version, a name and a description, and the operation mode includes at least definitions of targets and actions;
the user-side agent program is used for executing the action specified by the user by reading the definition file of the code package;
and the user-side agent program establishes a connection with the intra-cluster components to carry out data communication.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211125556.4A CN115334152B (en) 2022-09-16 2022-09-16 Method for submitting structured machine learning calculation task to calculation cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211125556.4A CN115334152B (en) 2022-09-16 2022-09-16 Method for submitting structured machine learning calculation task to calculation cluster

Publications (2)

Publication Number Publication Date
CN115334152A true CN115334152A (en) 2022-11-11
CN115334152B CN115334152B (en) 2023-03-28

Family

ID=83929185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211125556.4A Active CN115334152B (en) 2022-09-16 2022-09-16 Method for submitting structured machine learning calculation task to calculation cluster

Country Status (1)

Country Link
CN (1) CN115334152B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110333941A (en) * 2019-06-28 2019-10-15 苏宁消费金融有限公司 A kind of real-time computing platform of big data based on sql and method
CN110609732A (en) * 2019-08-13 2019-12-24 平安普惠企业管理有限公司 Application program deployment method and device, computer equipment and storage medium
CN111198691A (en) * 2019-12-23 2020-05-26 杭州云徙科技有限公司 Application multi-runtime configuration and deployment method based on cloud platform
CN111679831A (en) * 2020-06-04 2020-09-18 同盾控股有限公司 Software development kit processing method, operation monitoring method, device and storage medium
CN111901294A (en) * 2020-06-09 2020-11-06 北京迈格威科技有限公司 Method for constructing online machine learning project and machine learning system
CN112925619A (en) * 2021-02-24 2021-06-08 深圳依时货拉拉科技有限公司 Big data real-time computing method and platform
WO2021185206A1 (en) * 2020-03-16 2021-09-23 第四范式(北京)技术有限公司 Resource allocation method and apparatus for cluster task, and computer apparatus and storage medium
US20210406068A1 (en) * 2020-06-30 2021-12-30 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and system for stream computation based on directed acyclic graph (dag) interaction

Also Published As

Publication number Publication date
CN115334152B (en) 2023-03-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant