WO2021237829A1

WO2021237829A1 - Method and system for integrating code repository with computing service

Info

Publication number: WO2021237829A1
Application number: PCT/CN2020/096730
Authority: WO
Inventors: 俞扬; 秦熔均; 沈雷彦; 冷俊杰; 管延明; 李济君
Original assignee: 南栖仙策(南京)科技有限公司
Priority date: 2020-05-25
Filing date: 2020-06-18
Publication date: 2021-12-02
Also published as: CN111338784A; CN111338784B

Abstract

Disclosed are a method and a system for integrating a code repository with a computing service. Gitea is embedded as a code repository module, an expandable computing resource is managed and provided in the form of a k8s cluster, distributed machine learning is supported by using a ray framework, and distributed storage is provided by means of ceph, so as to achieve management of a code repository, computing resources, and result storage on a unified platform. By means of the present invention, a user can directly initiate an artificial intelligent computing task in a code repository or a computing management module, and codes and computing resources used for a computing task are directly configured on an initiation page, without the need of performing code migration.

Description

A method and system for integrating a code repository with computing services

Technical field

The invention relates to a method and system for integrating a code repository with computing services. Through a single computing platform, code repository management and artificial intelligence computing can be carried out in the same system. The invention belongs to the field of artificial intelligence technology.

Background art

Generally, artificial intelligence algorithm research experiments mainly include the following processes:

(1) Write the test code and prepare the experimental data; (2) Prepare the experimental environment and actually conduct the experiment.

Therefore, the researcher's code repository and the experimental environment are prepared separately.

For code hosting, the usual options are an online code hosting platform or local management. Mainstream online code hosting services include GitHub and GitLab. After creating an account on a code hosting platform such as GitHub and creating a new code repository, users can write code and push changes to the corresponding branch and version on GitHub over HTTPS or SSH. In actual experiments, however, the code must be migrated to the computing platform every time it is adjusted. This platform switching adds procedures and costs that should not demand the experimenter's attention.

In terms of computing platforms, the threshold for building software and hardware environments suitable for large-scale machine learning is relatively high, and high-performance computing platforms are usually required to be paired with specific software environments.

The current mainstream solution is to rent virtual hosts from a cloud service provider, build the experimental environment oneself, and then run training. With this approach, on the one hand, computing resources keep incurring costs for as long as they are rented. On the other hand, the software environment must be installed on the rented virtual host before an experiment can start. Depending on the network environment and the software being installed, this preparation may take several hours. It consumes the experimenter's time and money, raises the cost of each experiment, and reduces the share of effort spent on the experimental steps that actually generate value, which lowers efficiency.

Another option for the computing platform is to purchase hardware directly and build the computing environment on it. This requires a large one-time hardware investment, the owner must handle operation and maintenance, and the cost of idle capacity is also significant. For small and medium-sized research institutions and individual researchers, the cost-effectiveness is low.

Summary of the invention

Purpose of the invention: To overcome the problem of switching between code and computing platforms in artificial intelligence research in the prior art, the present invention proposes a new method and system that combines code hosting and computing resources in the same system, which reduces meaningless platform switching and, through pay-as-you-go billing, reduces the idle cost of computing resources.

Technical solution: A method for integrating a code repository with computing services, in which gitea is embedded as the code repository module, expandable computing resources are managed and provided in the form of a k8s cluster, the ray framework is used to support distributed machine learning, and distributed storage is provided through ceph, so that the code repository, computing resources, and result storage are managed on a unified platform. The method specifically includes the following steps:

When the user initiates a computing task, the user's new task information is obtained and checked for errors. If the verification passes, the task is created successfully; otherwise the user is prompted with an error message. After the task is created successfully, the list of existing cluster resources is queried to determine whether there are computing resources that satisfy those specified by the created task. If not, the new task enters a delayed queue state and is retried automatically when cluster resources become sufficient. If the computing resources are sufficient, the corresponding computing node is allocated; the task-related code is pulled from the code repository to the computing node, the computing node is started, and storage resources are bound to it. The computing task is then started through the system's built-in distributed computing framework, and the task execution log and task execution output data are saved to the storage address in real time. The task list is displayed through the interface, and the user can enter the task details interface. The system displays the task list in the computing management interface together with the current task execution status and statistical data, so that the user can monitor computing tasks and manage them at the same time.
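The resource check and delayed queue described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Task` and `Scheduler` names are hypothetical, resources are reduced to a single CPU count, and delayed tasks are retried whenever another task releases its resources.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpus: int  # CPUs requested by the task (hypothetical resource unit)


class Scheduler:
    """Toy admission logic: allocate if resources suffice, otherwise delay."""

    def __init__(self, free_cpus: int) -> None:
        self.free_cpus = free_cpus
        self.delayed = deque()  # tasks waiting for resources, in arrival order
        self.running = {}       # task name -> CPUs currently held

    def submit(self, task: Task) -> str:
        """Allocate resources if possible, otherwise put the task in the delayed queue."""
        if task.cpus <= self.free_cpus:
            self.free_cpus -= task.cpus
            self.running[task.name] = task.cpus
            return "allocated"
        self.delayed.append(task)
        return "delayed"

    def release(self, name: str) -> list:
        """Free a task's CPUs, then retry each delayed task once."""
        self.free_cpus += self.running.pop(name)
        started = []
        for _ in range(len(self.delayed)):
            task = self.delayed.popleft()
            # submit() re-queues the task if it still does not fit.
            if self.submit(task) == "allocated":
                started.append(task.name)
        return started
```

In a real deployment the retry would more likely be driven by cluster events or a timer than by an explicit `release` call; the sketch only shows the queueing behavior.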

When users monitor and manage computing tasks, they send network requests, and the execution status of the computing tasks and the occupancy of computing resources are fed back in response. A line graph is drawn to show the occupancy of computing resources over time, and the execution status of the computing tasks is displayed through the monitoring interface, which realizes the user's monitoring function. After the user clicks the monitoring link of a task, the user is taken to the monitoring page, which uses an embedded monitoring tool to refresh and display the task's running data in real time.

A computing task mainly has the following execution states, which are displayed to the user through the task details page: created, waiting, build, running, paused, stopped, and ended. (1) Created state: after the user's new task operation is received and the verification passes, the task is created successfully and is in the "created" state. (2) Waiting state: the k8s cluster has received the resource allocation notification but has not yet completed the resource allocation. (3) Build state: the resource allocation in the k8s cluster is complete and the container image is being built. (4) Running state: the resource allocation and container build described above are complete and the user's task code is actually running. (5) Paused state: the computing task is suspended, its resources are reserved rather than released, and execution can continue at any time. (6) Stopped state: a task stop function is provided; after the user triggers it, the system saves the current results of the task, then stops the task and releases the corresponding resources, and the task cannot be resumed. (7) Ended state: the state after the task has finished executing.
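The execution states listed above map naturally onto an enumeration. The sketch below is illustrative only: the `TaskState` name and the `NEXT_STATE` table are assumptions, and the table encodes one plausible reading of the normal progression (created, waiting, build, running, ended) when the user does not intervene.

```python
from enum import Enum


class TaskState(Enum):
    CREATED = "created"
    WAITING = "waiting"
    BUILD = "build"
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"
    ENDED = "ended"


# Assumed normal progression when the user does not stop or pause the task.
NEXT_STATE = {
    TaskState.CREATED: TaskState.WAITING,
    TaskState.WAITING: TaskState.BUILD,
    TaskState.BUILD: TaskState.RUNNING,
    TaskState.RUNNING: TaskState.ENDED,
}

# Stopped and ended tasks cannot be resumed.
TERMINAL_STATES = {TaskState.STOPPED, TaskState.ENDED}


def advance(state: TaskState) -> TaskState:
    """Return the next state in the assumed normal progression."""
    if state not in NEXT_STATE:
        raise ValueError(f"no automatic transition from {state.value!r}")
    return NEXT_STATE[state]
```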

Through the monitoring interface, the user can monitor and manage the task status, and the functions of stopping, pausing, and resuming tasks are provided. For a running task, after the stop operation submitted by the user is received, the following operations are performed according to the current execution state of the task: (1) When the task is in the "created" state, the task state is changed to "stopped" and the resource allocation of the k8s cluster is suspended. (2) When the task is in the "waiting" state, the task state is changed to "stopped" and the task is removed from the resource waiting queue. (3) When the task is in the "build" state, the task state is changed to "stopped", the docker image build process is notified to stop building, and the resource allocation in the k8s cluster is cancelled. (4) When the task is in the "running" state, the task state is changed to "stopped", the k8s cluster is notified to save the current results of the user task to the storage address, and the corresponding task node container is then deleted to release computing resources.

For a running task, after the pause operation submitted by the user is received, the following operations are performed according to the current execution state of the task: (1) When the task is in the "created" state, the system directly changes the task state to "paused" and suspends the resource allocation of the k8s cluster. (2) When the task is in the "waiting" state, the task state is changed to "paused" and the task is removed from the resource waiting queue. (3) When the task is in the "build" state, the task state is changed to "paused" and the docker image build process is notified to stop building. (4) When the task is in the "running" state, the task state is changed to "paused" and the k8s cluster is notified to suspend the execution of the user code; the computing resources are not released, so execution can continue at any time.

For a paused task, after the user's resume operation is received, the following operations are performed according to the execution state the task was in when it was paused: (1) If the task was in the "created" state when paused, the task state is changed to "created" and the resource allocation of the k8s cluster continues. (2) If the task was in the "waiting" state when paused, the task state is changed to "waiting" and the task is restored to the resource waiting queue. (3) If the task was in the "build" state when paused, the task state is changed to "build" and the docker image build process is notified to rebuild the image. (4) If the task was in the "running" state when paused, the system changes the task state to "running" and notifies the k8s cluster to resume the execution of the user code.
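Because resuming depends on the state the task was in when it was paused, the state before the pause has to be remembered. The following sketch shows one way to do that; the `TaskRecord` type, the `paused_from` field, and the action strings are hypothetical and only summarize the operations listed above.

```python
from dataclasses import dataclass
from typing import Optional

# Side effect to trigger when pausing from / resuming to each state.
PAUSE_ACTION = {
    "created": "suspend k8s resource allocation",
    "waiting": "remove task from resource waiting queue",
    "build": "stop docker image build",
    "running": "suspend user code, keep resources",
}
RESUME_ACTION = {
    "created": "continue k8s resource allocation",
    "waiting": "restore task to resource waiting queue",
    "build": "rebuild docker image",
    "running": "resume user code",
}


@dataclass
class TaskRecord:
    state: str
    paused_from: Optional[str] = None  # state held at the moment of pausing


def pause(task: TaskRecord) -> str:
    """Mark the task paused, remember its prior state, and return the action to run."""
    if task.state not in PAUSE_ACTION:
        raise ValueError(f"cannot pause a task in state {task.state!r}")
    task.paused_from = task.state
    task.state = "paused"
    return PAUSE_ACTION[task.paused_from]


def resume(task: TaskRecord) -> str:
    """Restore the remembered state and return the action to run."""
    if task.state != "paused" or task.paused_from is None:
        raise ValueError("only a paused task can be resumed")
    task.state, task.paused_from = task.paused_from, None
    return RESUME_ACTION[task.state]
```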

A system for implementing the above-mentioned code warehouse and computing service integration method, including a code warehouse module, a computing node building module, a computing task monitoring and management module, and a storage module;

The code warehouse module is used to store the code executed by the computing task;

The computing task monitoring and management module handles the user's creation of new computing tasks through the new task interface. The user enters the new task information through the new task interface, and the computing task monitoring and management module obtains this information and checks it for errors. If the verification passes, the module reports to the user that the task was created successfully; otherwise it prompts the user with an error message. After the task is created successfully, the module queries the list of existing cluster resources to determine whether there are computing resources that satisfy those specified by the created task. If not, the new task enters a delayed queue state and is retried automatically when cluster resources become sufficient. If the computing resources are available, the computing node building module is triggered: it allocates the corresponding computing node through k8s, pulls the task-related code from the code repository to the computing node, starts the computing node, and binds storage resources to the computing node as the storage module, at which point the computing node has been built successfully. The computing node starts executing the computing task through the system's built-in distributed computing framework and saves the task execution log and task execution output data to the storage module in real time. The computing task monitoring and management module obtains the task execution log and task execution output data from the storage module in real time and displays the task list to the user through the interface. When the user enters the task details interface, the module displays the task list in the computing management interface together with the execution status and statistical data of the current task, so that the user can monitor computing tasks and perform management operations on them.

When users monitor and manage computing tasks through the computing monitoring and management module, they use the operation interface to send network requests for task monitoring and management. After receiving a user's network request, the computing monitoring and management module feeds back to the user the computing task execution status stored on the storage module, which realizes the user's monitoring function. After the user clicks the task monitoring link, the embedded monitoring tool refreshes the task's running data in real time and displays it to the user.

When users monitor and manage computing tasks through the computing monitoring and management module, the computing monitoring and management module also displays the occupation of computing resources over time by drawing a line graph to the user.

The computing monitoring and management module lets the user manage the task status through the monitoring interface and provides the functions of stopping, pausing, and resuming tasks. For a running task, after the stop operation submitted by the user is received, the task execution status is obtained and the following operations are performed: (1) When the task is in the "created" state, the task state is changed to "stopped" and the computing node building module is notified to terminate the resource allocation of the k8s cluster. (2) When the task is in the "waiting" state, the task state is changed to "stopped" and the task is removed from the resource waiting queue. (3) When the task is in the "build" state, the task state is changed to "stopped", the docker image build process of the computing node building module is notified to stop building, and the resource allocation in the k8s cluster is cancelled. (4) When the task is in the "running" state, the task state is changed to "stopped", the k8s cluster is notified to save the current results of the user task to the storage module, and the corresponding task node container is then destroyed to release computing resources.

For a running task, after the pause operation submitted by the user is received, the computing monitoring and management module obtains the task execution status information from the storage module and performs the following operations according to the current execution state of the task: (1) When the task is in the "created" state, the computing monitoring and management module directly changes the task state to "paused" and gives notice to suspend the resource allocation of the k8s cluster. (2) When the task is in the "waiting" state, the task state is changed to "paused" and the task is removed from the resource waiting queue. (3) When the task is in the "build" state, the task state is changed to "paused" and the docker image build process of the computing node building module is notified to stop building. (4) When the task is in the "running" state, the task state is changed to "paused" and the k8s cluster is notified to suspend the execution of the user code; the computing resources are not released, so execution can continue at any time.

For a paused task, after the computing monitoring and management module receives the user's resume operation, it obtains the task execution status information from the storage module and performs the following operations according to the execution state the task was in when it was paused: (1) If the task was in the "created" state when paused, the task state is changed to "created" and the computing node building module is notified to continue the resource allocation of the k8s cluster. (2) If the task was in the "waiting" state when paused, the task state is changed to "waiting" and the task is restored to the resource waiting queue. (3) If the task was in the "build" state when paused, the task state is changed to "build" and the docker image build process of the computing node building module is notified to rebuild the image. (4) If the task was in the "running" state when paused, the system changes the task state to "running" and notifies the k8s cluster to resume the execution of the user code.

The computing monitoring and management module stores the above status changes in the storage module.

Beneficial effects: Compared with the prior art, the present invention provides a method and system for integrating a code repository with computing services. Users can directly initiate artificial intelligence computing tasks in the code repository or in the computing management module. The code and computing resources used by a computing task are configured directly on the initiation page, without code migration.

Description of the drawings

Figure 1 is a flow chart of the method of the present invention.

Detailed description

The present invention is further explained below in conjunction with specific embodiments. It should be understood that these embodiments are only used to illustrate the present invention and not to limit its scope. After reading the present disclosure, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of this application.

A method for integrating a code repository with computing services embeds gitea as the code repository module, manages and provides expandable computing resources in the form of a k8s cluster, uses the ray framework to support distributed machine learning, and provides distributed storage through ceph, so that the code repository, computing resources, and result storage are managed on a unified platform. As shown in Figure 1, the specific steps are as follows:

The user initiates a computing task and provides the new task information, including the task name, task description, code branch, code version (the latest version by default), task entry file, and computing resources to be used. The new task information is obtained through the version control system or the https protocol and checked for errors, including whether the task name duplicates an existing one, whether the code branch exists, and whether the code version exists. If the verification passes, the task is created successfully; otherwise the user is prompted with an error message. After the task is created successfully, the list of existing cluster resources is queried to determine whether the computing resources specified by the created task can be satisfied. If not, the new task enters a delayed queue state and is retried automatically when cluster resources become sufficient. If the computing resources can be satisfied, the corresponding computing node is allocated through k8s; the task-related code is pulled from the code repository to the computing node, the computing node is started, and storage resources are bound to it. The computing task is started through the distributed computing framework built into the computing node system, and the task execution log and task execution output data are saved to the storage address in real time. The task list is displayed through the interface, and the user can enter the task details interface. The system displays the task list in the computing management interface together with the current task execution status and statistical data, so that users can monitor computing tasks and manage them at the same time.
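The three verification checks named above (task name, code branch, code version) can be sketched as a small function. This is an illustration under assumptions: the field names in `info`, the `"latest"` sentinel for the default version, and the shape of the `branches` mapping (branch name to known versions) are all hypothetical.

```python
def validate_new_task(info: dict, existing_task_names: set, branches: dict) -> list:
    """Return a list of error messages; an empty list means verification passed."""
    errors = []
    if info["name"] in existing_task_names:
        errors.append("task name already exists")

    versions = branches.get(info["branch"])
    if versions is None:
        errors.append("code branch does not exist")
    else:
        # The version defaults to the latest one, which always exists on a branch.
        version = info.get("version", "latest")
        if version != "latest" and version not in versions:
            errors.append("code version does not exist")
    return errors
```

For example, with `branches = {"main": {"a1b2c3"}}`, a request for branch `"dev"` is rejected with a branch error, while a request for `"main"` without an explicit version passes.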

When users monitor and manage computing tasks, they can send network requests in real time. After receiving a user's network request, the computing node feeds back the execution status of the computing task and the occupancy of computing resources. A line graph is drawn to show the occupancy of computing resources over time, and the execution status of the computing task is displayed through the monitoring interface, which realizes the user's monitoring function. After the user clicks the task monitoring link, the user is taken to the monitoring page, which embeds tensorboard and other monitoring tools commonly used for artificial intelligence computing tasks and refreshes the task's running data in real time for display.
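The data behind a resource occupancy line graph is a time series of samples. A minimal sketch of such a series follows, assuming a hypothetical `UsageHistory` class that keeps a bounded window of (timestamp, percent used) points; how the samples are collected and plotted is outside its scope.

```python
from collections import deque


class UsageHistory:
    """Bounded history of resource occupancy samples for a line graph."""

    def __init__(self, max_points: int = 300) -> None:
        # A deque with maxlen silently drops the oldest point when full.
        self.points = deque(maxlen=max_points)

    def record(self, timestamp: float, used: float, total: float) -> None:
        """Store one sample as a percentage of the total resource."""
        self.points.append((timestamp, 100.0 * used / total))

    def series(self) -> tuple:
        """Return (x values, y values) ready to hand to a plotting library."""
        xs = [t for t, _ in self.points]
        ys = [pct for _, pct in self.points]
        return xs, ys
```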

A computing task mainly has the following execution states, which are displayed to the user through the task details page: created, waiting, build, running, paused, stopped, and ended. (1) Created state: after the user's new task operation is received and the verification passes, the k8s cluster is notified to start allocating resources and a message that the task has been created is returned to the user. (2) Waiting state: the k8s cluster has received the resource allocation notification but has not yet completed the resource allocation. (3) Build state: the resource allocation in the k8s cluster is complete and the container image is being built. (4) Running state: the resource allocation and container build described above are complete and the user code is actually running. (5) Paused state: the computing task is suspended, its resources are reserved rather than released, and execution can continue at any time. (6) Stopped state: a task stop function is provided; after the user triggers it, the system saves the current results of the task, then stops the task and releases all resources, and the task cannot be resumed. (7) Ended state: the state after the task has finished executing.

Through the monitoring interface, the user can monitor and manage the task status, and the functions of stopping, pausing, and resuming tasks are provided. For a running task, after the stop operation submitted by the user is received, the following operations are performed according to the current execution state of the task: (1) When the task is in the "created" state, the task state is changed to "stopped" and the resource allocation of the k8s cluster is suspended. (2) When the task is in the "waiting" state, the task state is changed to "stopped" and the task is removed from the resource waiting queue. (3) When the task is in the "build" state, the task state is changed to "stopped", the docker image build process is notified to stop building, and the resource allocation in the k8s cluster is cancelled. (4) When the task is in the "running" state, the task state is changed to "stopped", the k8s cluster is notified to save the current results of the user task to the storage address, and the corresponding task node container is then destroyed to release computing resources.
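The stop handling above is a dispatch on the task's current state. The sketch below expresses it as a lookup table; the `STOP_ACTIONS` name and the action strings are hypothetical labels for the operations in the text, and a real system would call the k8s cluster and the image build process instead of returning strings.

```python
# Ordered cleanup actions to perform when stopping a task from each state.
STOP_ACTIONS = {
    "created": ["suspend k8s resource allocation"],
    "waiting": ["remove task from resource waiting queue"],
    "build": ["stop docker image build", "cancel k8s resource allocation"],
    "running": ["save current results to storage", "destroy task node container"],
}


def stop(state: str) -> tuple:
    """Return the new state and the cleanup actions for a stop request."""
    if state not in STOP_ACTIONS:
        raise ValueError(f"cannot stop a task in state {state!r}")
    # Copy the list so callers cannot mutate the shared table.
    return "stopped", list(STOP_ACTIONS[state])
```

For a running task the order matters: results are saved before the container is destroyed, which is why the actions are kept as an ordered list.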

For a running task, after the pause operation submitted by the user is received, the following operations are performed according to the current execution state of the task: (1) When the task is in the "created" state, the system directly changes the task state to "paused" and suspends the resource allocation of the k8s cluster. (2) When the task is in the "waiting" state, the task state is changed to "paused" and the task is removed from the resource waiting queue. (3) When the task is in the "build" state, the task state is changed to "paused" and the docker image build process is notified through the message middleware to stop building. (4) When the task is in the "running" state, the task state is changed to "paused" and the k8s cluster is notified to suspend the execution of the user code; the computing resources are not released, so execution can continue at any time.

For a paused task, after the user's resume operation is received, the following operations are performed according to the execution state the task was in when it was paused: (1) If the task was in the "created" state when paused, the task state is changed to "created" and the resource allocation of the k8s cluster continues. (2) If the task was in the "waiting" state when paused, the task state is changed to "waiting" and the task is restored to the resource waiting queue. (3) If the task was in the "build" state when paused, the task state is changed to "build" and the docker image build process is notified through the message middleware to rebuild the image. (4) If the task was in the "running" state when paused, the system changes the task state to "running" and notifies the k8s cluster to resume the execution of the user code.

The task execution logs and task execution output data saved at the storage address are provided to users through http requests and displayed on the page, and file download links are provided so that users can download and browse them.

Multiple containers are run as the computing nodes that execute tasks, and the user code is imported from the code repository into the containers for later task execution. The object storage and file storage resources, obtained by generating virtual paths, are bound to the computing nodes and serve as the storage address for the computing task's data input, monitoring data, and result storage. The task is registered with the monitoring process, a monitoring link is generated, and task execution begins. After execution, the logs and results are saved to the storage address.
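One way to picture a container computing node with bound storage is as a Kubernetes Pod manifest. The sketch below builds such a manifest as a plain dictionary using standard Pod fields. It rests on assumptions not stated in the source: that storage is exposed to the pod through a PersistentVolumeClaim, that the entry file is run with `python`, and that the mount path is `/data`. The function name and all argument values are hypothetical.

```python
def build_pod_manifest(task_name: str, image: str, entry_file: str,
                       cpus: int, claim_name: str) -> dict:
    """Build a Pod manifest for one computing node (illustrative only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": task_name, "labels": {"task": task_name}},
        "spec": {
            # A batch task should not be restarted automatically when it exits.
            "restartPolicy": "Never",
            "containers": [{
                "name": "worker",
                "image": image,
                "command": ["python", entry_file],
                "resources": {"limits": {"cpu": str(cpus)}},
                # Logs, monitoring data and results are written under /data.
                "volumeMounts": [{"name": "task-storage", "mountPath": "/data"}],
            }],
            "volumes": [{
                "name": "task-storage",
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }
```

The resulting dictionary can be serialized to YAML or JSON and submitted to a cluster; how the user code reaches the container (baked into the image or cloned at start-up) is left open here.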

A system that realizes the integration of code warehouses and computing services, including code warehouse modules, computing node building modules, computing task monitoring and management modules, and storage modules;

The computing task monitoring and management module provides the new task interface through which the user creates new computing tasks. Through this interface the user enters the new task information, such as the task name, task description, code branch, code version (the latest version by default), task entry file, and computing resources. The computing task monitoring and management module obtains the user's new task information through the version control system or the https protocol and checks it for errors, including whether the task name duplicates an existing one, whether the code branch exists, and whether the code version exists. If the verification passes, the module reports to the user that the task was created successfully; otherwise it prompts the user with an error message. After the task is created successfully, the module queries the list of existing cluster resources to determine whether the computing resources specified by the task can be satisfied. If not, the new task enters a delayed queue state and is retried automatically when cluster resources become sufficient. If the computing resources are sufficient, the computing node building module is triggered: it allocates the corresponding computing node through k8s, pulls the task-related code from the code repository to the computing node, starts the computing node, and binds storage resources to it, at which point the computing node has been built successfully. The computing node starts executing the computing task through the system's built-in distributed computing framework and saves the task execution log and task execution output data to the storage module in real time. The computing task monitoring and management module obtains the task execution log and task execution output data from the storage module in real time and displays the task list to the user through the interface. When the user enters the task details interface, the module displays the task list in the computing management interface together with the execution status and statistical data of the current task, so that the user can monitor and manage computing tasks.

When the user monitors and manages computing tasks through the computing monitoring and management module, the user can send network requests in real time through the operation interface. After receiving the user's network request, the computing monitoring and management module asks the computing node to feed back the execution status of the computing task and the occupancy of computing resources. It draws a line graph to show the occupancy of computing resources over time and displays the execution status of the computing task through the monitoring interface, which realizes the user's monitoring function. After the user clicks the task monitoring link, the user is taken to the monitoring page, which embeds tensorboard and other monitoring tools commonly used for artificial intelligence computing tasks and refreshes the task's running data in real time for display.

The computing monitoring and management module lets the user manage the task status through the monitoring interface and provides the functions of stopping, pausing, and resuming tasks. For a running task, after the stop operation submitted by the user is received, the task execution status is obtained and the following operations are performed: (1) When the task is in the "created" state, the task state is changed to "stopped", the computing node building module is notified to suspend the resource allocation of the k8s cluster, and the status change is stored in the storage module. The status changes in the following cases are likewise stored in the storage module. (2) When the task is in the "waiting" state, the task state is changed to "stopped" and the task is removed from the resource waiting queue. (3) When the task is in the "build" state, the task state is changed to "stopped", the docker image build process of the computing node building module is notified through the message middleware to stop building, and the resource allocation in the k8s cluster is cancelled. (4) When the task is in the "running" state, the task state is changed to "stopped", the k8s cluster is notified to save the current results of the user task to the storage module, and the corresponding task node container is then destroyed to release computing resources.

For a running task, after the pause operation submitted by the user is received, the computing monitoring and management module performs the following operations according to the current execution state of the task: (1) When the task is in the "created" state, the computing monitoring and management module directly changes the task state to "paused" and gives notice to suspend the resource allocation of the k8s cluster. (2) When the task is in the "waiting" state, the task state is changed to "paused" and the task is removed from the resource waiting queue. (3) When the task is in the "build" state, the task state is changed to "paused" and the docker image build process of the computing node building module is notified through the message middleware to stop building. (4) When the task is in the "running" state, the task state is changed to "paused" and the k8s cluster is notified to suspend the execution of the user code; the computing resources are not released, so execution can continue at any time.

For a suspended task, after the computing monitoring and management module receives the user's resume operation, it performs the following operations according to the execution state the task was in when it was suspended: (1) If the task was in the "created" state when it was suspended, the task state is changed to "created" and the k8s cluster is notified to continue resource allocation. (2) If the task was in the "waiting" state when it was suspended, the task state is changed to "waiting" and the task is restored to the resource waiting queue. (3) If the task was in the "build" state when it was suspended, the task state is changed to "build" and the docker image build process of the computing node building module is notified through the message middleware to rebuild the image. (4) If the task was in the "running" state when it was suspended, the task state is changed to "running" and at the same time the k8s cluster is notified to resume execution of the user code.
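As a non-limiting illustration, the resume handling described above amounts to restoring the state the task held when it was suspended and issuing one matching notification. The function name `resume_task` and the action labels are hypothetical.

```python
# Hypothetical action label for each state a task may have been suspended from.
RESUME_ACTIONS = {
    "created": "continue_k8s_resource_allocation",
    "waiting": "restore_to_resource_waiting_queue",
    "build": "rebuild_docker_image",
    "running": "resume_user_code",
}


def resume_task(current_state, paused_from):
    """Return (restored_state, action) for a resume request.

    `paused_from` is the state the task was in when it was suspended.
    """
    if current_state != "suspended":
        raise ValueError("only a suspended task can be resumed")
    if paused_from not in RESUME_ACTIONS:
        raise ValueError(f"unknown pre-suspension state {paused_from!r}")
    return paused_from, RESUME_ACTIONS[paused_from]


if __name__ == "__main__":
    print(resume_task("suspended", "build"))
```

Note that a task suspended during the "build" state restarts the image build rather than continuing it, since the build was stopped when the task was suspended.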

The storage module provides the user, through HTTP requests, with the task execution logs and task execution output data stored at the storage address. These are displayed on the page by the computing monitoring and management module, which also provides file download links so that the user can download and browse them.
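As a non-limiting illustration, file download links for a task's logs and output data could be generated as follows. The URL layout (`tasks/<task id>/files/<file name>`), the base address, and the function name `build_download_links` are assumptions made for this sketch only; the actual HTTP interface of the storage module is not specified here.

```python
from urllib.parse import quote


def build_download_links(base_url, task_id, file_names):
    """Map each stored file name to a download URL (hypothetical layout)."""
    root = base_url.rstrip("/")
    links = {}
    for name in file_names:
        # Percent-encode both parts so spaces and slashes stay inside one path segment.
        links[name] = f"{root}/tasks/{quote(task_id, safe='')}/files/{quote(name, safe='')}"
    return links


if __name__ == "__main__":
    for name, url in build_download_links(
        "http://storage.example/api/", "task-42", ["train.log", "run 1.csv"]
    ).items():
        print(name, "->", url)
```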

Claims

A method for implementing code warehouse and computing service integration, characterized in that: when a user initiates a computing task, the user's newly created computing task information is obtained and verified for errors; if the verification passes, the task is created successfully, otherwise the user is prompted with an error message; after the task is created successfully, the list of existing cluster resources is queried to determine whether the computing resources specified by the created computing task can be satisfied; if not, the new computing task enters the delayed queue state and is automatically retried when the cluster resources are sufficient; if the computing resources can support execution of the computing task, a computing node is allocated to the computing task for its execution; the code related to the computing task is fetched from the code warehouse to the computing node, the computing node is started, and storage resources are bound to the corresponding computing node; execution of the computing task is started through the distributed computing framework built into the computing node system, and the task execution log and task execution output data are saved to the storage address in real time; the list of computing tasks is displayed through the interface, from which the task details interface can be entered; and the task list is displayed in the computing management interface, enabling the user to monitor computing tasks and supporting the user in managing computing tasks.
The method for implementing code warehouse and computing service integration according to claim 1, characterized in that: when a user monitors and manages a computing task, after a user request is received, the execution status of the computing task and the occupation of computing resources are fed back, and the occupation of computing resources over time is shown by drawing a line chart; the monitoring interface is used to display the execution status of the computing task, realizing the user's monitoring function; a monitoring link is provided, and after the user clicks it, the embedded monitoring tool refreshes the running data of the computing task in real time for display.
The method for implementing code warehouse and computing service integration according to claim 1, characterized in that: the execution status of the computing task displayed to the user includes six states: created, waiting, build, running, suspended, and stopped;

Created state: the state after the user's new task operation has been received, the verification has passed, and the task has been created successfully;

Waiting state: the state, during resource allocation by the k8s cluster, in which the k8s cluster has received the resource allocation notification but has not yet completed the resource allocation;

Build state: the state in which resource allocation in the k8s cluster is complete and the container image is being built;

Running state: the state in which the computing task code is running after resource allocation and the container image build have been completed;

Suspended state: the state in which the computing task is suspended, its reserved resources are not released, and execution can be continued at any time;

Stopped state: the state entered when the user triggers the stop function for the computing task; the current result of the computing task is saved, execution is then stopped and the corresponding resources are released, and execution cannot be resumed;

End state: the state after execution of the computing task has finished.
The method for implementing code warehouse and computing service integration according to claim 3, characterized in that: the user monitors and manages the task execution status through the monitoring interface, which provides the functions of stopping, suspending, and resuming tasks; for a computing task that is executing, after the stop task operation submitted by the user is received, the following operations are performed according to the current execution state of the computing task: 1. when the computing task is in the created state, the computing task state is changed to stopped and the resource allocation work of the k8s cluster is terminated; 2. when the computing task is in the waiting state, the computing task state is changed to stopped and the computing task is removed from the resource waiting queue; 3. when the computing task is in the build state, the computing task state is changed to stopped, the docker image build process is notified to stop the build, and the resource allocation in the k8s cluster is cancelled; 4. when the computing task is in the running state, the task state is changed to stopped, and at the same time the k8s cluster is notified to save the current results of the user's computing task to the storage address and then delete the corresponding task node container to release computing resources.
The method for implementing code warehouse and computing service integration according to claim 4, characterized in that: for a computing task that is executing, after the suspend task operation submitted by the user is received, the following operations are performed according to the current execution state of the computing task: 1. when the computing task is in the created state, the task state is changed to suspended and the resource allocation work of the k8s cluster is suspended; 2. when the computing task is in the waiting state, the task state is changed to suspended and the task is removed from the resource waiting queue; 3. when the computing task is in the build state, the task state is changed to suspended and the docker image build process is notified to stop the build; 4. when the computing task is in the running state, the task state is changed to suspended, and at the same time the k8s cluster is notified to suspend execution of the user's computing task code without releasing computing resources, so that execution can be continued at any time.
The method for implementing code warehouse and computing service integration according to claim 4, characterized in that: for a computing task in the suspended state, after the user's resume task operation is received, the following operations are performed according to the execution state the computing task was in when the suspend operation was executed: 1. if the computing task was in the created state when the suspend operation was executed, the computing task state is changed to created and the resource allocation work of the k8s cluster is continued; 2. if the computing task was in the waiting state when the suspend operation was executed, the computing task state is changed to waiting and the computing task is restored to the resource waiting queue; 3. if the computing task was in the build state when the suspend operation was executed, the computing task state is changed to build and the docker image build process is notified to rebuild the image; 4. if the computing task was in the running state when the suspend operation was executed, the computing task state is changed to running and the k8s cluster is notified to resume execution of the user's computing task code.
The method for implementing code warehouse and computing service integration according to claim 1, characterized in that: when a user initiates a computing task, the task is transmitted to the computing environment through a version control system or the HTTPS protocol, and the user's newly created task information is obtained; the task execution log and task execution output data saved at the storage address are provided to the user and displayed on the page, and a file download link is provided for the user to download and browse them.
The method for realizing the integration of code warehouse and computing service according to claim 1, characterized in that: the new task information of the user includes: task name, task description, code branch, code version, task entry file, and computing resources used; when a computing task is newly created, if the computing resources can satisfy the computing task, the corresponding computing node is allocated through k8s; multiple containers are run as computing nodes to perform the computing task, and the user's computing task code is imported from the code warehouse into the containers.
A system for integrating code warehouse and computing service, which is characterized in that it includes a code warehouse module, a computing node building module, a computing task monitoring and management module, and a storage module;

The code warehouse module is used to store the code executed by the computing task;

The computing task monitoring and management module implements the user interaction for creating a new computing task through the new task interface; the user inputs the new task information through the new task interface, and the computing task monitoring and management module obtains the user's new task information and verifies whether the new task information submitted by the user contains errors; if the verification passes, the computing task monitoring and management module reports to the user that the task was created successfully, otherwise the user is prompted with an error message; after the task is created successfully, the computing task monitoring and management module queries the list of existing cluster resources to determine whether there are computing resources that satisfy those specified by the created task; if not, the new task enters the delayed queue state and is automatically retried when the cluster resources are sufficient; if the computing resources are available, the computing node building module is triggered, and the computing node building module allocates the corresponding computing node through k8s, fetches the task-related code from the code warehouse to the computing node, starts the computing node, and binds storage resources to the corresponding computing node as the storage module, whereupon the computing node is successfully built; the computing node starts executing the computing task through the distributed computing framework built into the system; the computing node saves the task execution logs and task execution output data to the storage module in real time; the computing task monitoring and management module obtains the task execution logs and task execution output data on the storage module in real time and displays them to the user through the interface in the form of a task list, from which the user enters the task details interface; and the computing task monitoring and management module displays the task list in the computing management interface, showing the execution status and statistical data of the current task, enabling the user to monitor computing tasks, and supporting the user in performing management operations on computing tasks.
The code warehouse and computing service integration system according to claim 9, characterized in that: when a user monitors and manages computing tasks through the computing monitoring and management module, the user sends network requests for task monitoring and management through the operation interface; after the computing monitoring and management module receives the user's network request, it feeds back to the user the execution status of the computing task stored on the storage module, realizing the user's monitoring function; after the user clicks the task's monitoring link, the embedded monitoring tool refreshes the task's running data in real time and shows it to the user; when the user monitors and manages the computing task through the computing monitoring and management module, the computing monitoring and management module also shows the user the occupation of computing resources over time by drawing a line chart; and the computing monitoring and management module lets the user manage task status through the monitoring interface and provides the functions of stopping, suspending, and resuming tasks.