WO2021237829A1 - Method and system for integrating code repository with computing service - Google Patents

Info

Publication number
WO2021237829A1
WO2021237829A1 PCT/CN2020/096730 CN2020096730W WO2021237829A1 WO 2021237829 A1 WO2021237829 A1 WO 2021237829A1 CN 2020096730 W CN2020096730 W CN 2020096730W WO 2021237829 A1 WO2021237829 A1 WO 2021237829A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
computing
user
state
monitoring
Prior art date
Application number
PCT/CN2020/096730
Other languages
French (fr)
Chinese (zh)
Inventor
俞扬
秦熔均
沈雷彦
冷俊杰
管延明
李济君
Original Assignee
南栖仙策(南京)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南栖仙策(南京)科技有限公司
Publication of WO2021237829A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/485 Task life-cycle, e.g. stopping, restarting, resuming execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022 Mechanisms to release resources

Definitions

  • The invention relates to a method and system for integrating a code repository with computing services.
  • Through a computing platform, a code repository and artificial intelligence computing can be operated in the same system; the invention belongs to the field of artificial intelligence technology.
  • Artificial intelligence algorithm research experiments mainly involve two stages: (1) writing test code and preparing experimental data; (2) preparing the experimental environment and running the experiment.
  • Mainstream online code hosting services include GitHub and GitLab.
  • After users create an account on a code hosting platform such as GitHub and create a new code repository, they can write code remotely and push code changes to the corresponding branch and version on GitHub via HTTPS or SSH.
  • Each time the code is adjusted, it has to be migrated to the computing platform.
  • The barrier to building software and hardware environments suitable for large-scale machine learning is relatively high; a high-performance computing platform usually has to be paired with a specific software environment.
  • The current mainstream solution is to rent a virtual host from a cloud service provider, build the experimental environment oneself, and then run training. With this approach, on the one hand, computing resources keep incurring costs once they are rented.
  • On the other hand, before an experiment can start, the software environment has to be installed on the virtual host provided by the cloud service provider. Depending on the network environment and the software being installed, this preparation can take several hours. It consumes much of the experimenters' time, raises the cost of each experiment, and lowers the share of effort spent on the experimental steps that actually generate value.
  • Another approach is to purchase hardware directly and build the computing environment from the hardware up.
  • This approach requires a high one-time hardware investment, the owner is responsible for operation and maintenance, and the cost of idle hardware is significant. For small and medium research institutions and individual researchers, the cost-effectiveness is even lower.
  • The present invention proposes a new method and system that combines code hosting and computing resources in the same system, reducing meaningless platform switching and lowering the idle cost of computing resources through pay-as-you-go billing.
  • When the user initiates a computing task, the system obtains the user's new task information and verifies whether it contains errors. If verification passes, the task is created successfully; otherwise the user is shown an error message. After the task is created, the system queries the list of existing cluster resources to determine whether computing resources meeting the task's specified requirements are available. If not, the new task enters the delayed queue state and is retried automatically when cluster resources become sufficient.
  • If the computing resources are sufficient, the corresponding computing node is assigned; the task-related code is pulled from the code repository to the computing node, the computing node is started, and storage resources are bound to it. The computing task is started through the system's built-in distributed computing framework, and the task execution log and task execution output data are saved to the storage address in real time. The task list is displayed through the interface; on entering the task details interface, the system shows the task list in the computing management interface together with the current task's execution status and statistics, so that the user can monitor and manage computing tasks.
  • Computing tasks mainly have the following execution states, which are shown to the user on the task details page: created, waiting, build, running, paused, stopped, and ended. (1) Created: after the user's new task operation is received and verification passes, the task is successfully created and is in the "created" state. (2) Waiting: the k8s cluster has received the resource allocation notification but has not yet completed resource allocation. (3) Build: resource allocation in the k8s cluster is complete and the container image is being built. (4) Running: resource allocation and container construction are complete and the user's task code is actually running. (5) Paused: the computing task is suspended while its resources stay reserved and are not released, so execution can continue at any time.
  • (6) Stopped: the system provides a task stop function. After the user triggers it, the system saves the task's current results, stops execution, and releases the corresponding resources; the task cannot be resumed. (7) Ended: the state after the task has finished executing.
  • The user can monitor and manage the task status; the system provides functions for stopping, suspending, and resuming tasks.
  • For a running task, after receiving the stop operation submitted by the user, the system performs the following operations according to the task's current execution state: (1) When the task is in the "created" state, change the task state to "stopped" and halt the resource allocation of the k8s cluster. (2) When the task is in the "waiting" state, change the task state to "stopped" and remove the task from the resource waiting queue. (3) When the task is in the "build" state, change the task state to "stopped", notify the docker image build process to stop, and cancel the resource allocation in the k8s cluster. (4) When the task is in the "running" state, change the task state to "stopped", notify the k8s cluster to save the current result of the user task to the storage address, and then delete the corresponding task node container to release computing resources.
  • For a suspended task, after receiving the user's resume operation, the system performs the following operations according to the execution state the task was in when it was suspended: (1) If the task was in the "created" state, change the task state to "created" and continue the resource allocation work of the k8s cluster. (2) If the task was in the "waiting" state, change the task state to "waiting" and restore the task to the resource waiting queue. (3) If the task was in the "build" state, change the task state to "build" and notify the docker image build process to rebuild the image. (4) If the task was in the "running" state, change the task state to "running" and notify the k8s cluster to resume execution of the user code.
  • A system for implementing the above method of integrating a code repository with computing services comprises a code repository module, a computing node building module, a computing task monitoring and management module, and a storage module.
  • The code repository module stores the code executed by computing tasks.
  • The computing task monitoring and management module handles the creation of new computing tasks through the new task interface. The user enters the new task information through this interface; the module obtains the information and verifies whether it contains errors. If verification passes, the module reports that the task was created successfully; otherwise it shows the user an error message. After the task is created, the module queries the list of existing cluster resources to determine whether computing resources meeting the task's specified requirements are available.
  • If not, the new task enters the delayed queue state and is retried automatically when cluster resources become sufficient. If the computing resources are available, the computing node building module is triggered: it allocates the corresponding computing node through k8s, pulls the task-related code from the code repository to the computing node, starts the computing node, and binds storage resources to it as the storage module, at which point the computing node has been built successfully.
  • The computing node starts executing the computing task through the system's built-in distributed computing framework and saves the task execution log and task execution output data to the storage module in real time; the computing task monitoring and management module obtains the log and output data from the storage module in real time.
  • The task list is displayed to the user through the interface. When the user enters the task details interface, the computing task monitoring and management module shows the task list in the computing management interface together with the execution status and statistics of the current task, so that the user can monitor and manage computing tasks.
  • When users monitor and manage computing tasks through the computing monitoring and management module, they use the operation interface to send network requests for task monitoring and management. After the module receives a user's network request, it returns the computing task execution status stored on the storage module to the user, realizing the monitoring function. After the user clicks the task monitoring link, the embedded monitoring tool refreshes the task's running data in real time and displays it to the user.
  • When users monitor and manage computing tasks through the computing monitoring and management module, the module also draws a line graph showing the occupation of computing resources over time.
  • The computing monitoring and management module lets the user manage task status through the monitoring interface and provides functions for stopping, suspending, and resuming tasks. For a running task, after receiving the stop operation submitted by the user, the module obtains the task execution state and performs the following operations: (1) When the task is in the "created" state, change the task state to "stopped" and notify the computing node building module to terminate the resource allocation work of the k8s cluster. (2) When the task is in the "waiting" state, change the task state to "stopped" and remove the task from the resource waiting queue. (3) When the task is in the "build" state, change the task state to "stopped", notify the docker image build process of the computing node building module to stop building, and cancel the resource allocation in the k8s cluster. (4) When the task is in the "running" state, change the task state to "stopped", notify the k8s cluster to save the current results of the user task to the storage module, and then destroy the corresponding task node container to release computing resources.
  • For a running task, after receiving the pause operation submitted by the user, the computing monitoring and management module obtains the task execution status information from the storage module and performs the following operations according to the task's current execution state: (1) When the task is in the "created" state, the module directly changes the task state to "suspended" and notifies the k8s cluster to suspend resource allocation. (2) When the task is in the "waiting" state, change the task state to "suspended" and remove the task from the resource waiting queue. (3) When the task is in the "build" state, change the task state to "suspended" and notify the docker image build process of the computing node building module to stop building.
  • For a suspended task, after receiving the user's resume operation, the computing monitoring and management module obtains the task execution status information from the storage module and performs the following operations according to the execution state the task was in when it was suspended: (1) If the task was in the "created" state, change the task state to "created" and notify the computing node building module to continue the resource allocation work of the k8s cluster. (2) If the task was in the "waiting" state, change the task state to "waiting" and restore the task to the resource waiting queue. (3) If the task was in the "build" state, change the task state to "build" and notify the docker image build process of the computing node building module to rebuild the image. (4) If the task was in the "running" state, change the task state to "running" and notify the k8s cluster to resume execution of the user code.
  • The computing monitoring and management module stores the above status changes in the storage module.
  • The present invention provides a method and system for integrating a code repository with computing services. Users can initiate artificial intelligence computing tasks directly from the code repository or the computing management module, and computing resources are configured directly on the initiation page, without code migration.
  • Figure 1 is a flow chart of the method of the present invention.
  • A method for integrating a code repository with computing services: gitea is embedded as the code repository module, scalable computing resources are managed and provided in the form of a k8s cluster, the ray framework is used to support distributed machine learning, and distributed storage is provided through ceph, so that the code repository, computing resources, and result storage are managed on a unified platform.
  • As shown in Figure 1, the specific steps are as follows:
  • The user initiates a computing task and provides the new task information: task name, task description, code branch, code version (the latest version by default), task entry file, and computing resources to be used. The system obtains this information through the version control system or the https protocol and verifies whether it contains errors, including whether the task name is a duplicate, whether the code branch exists, and whether the code version exists. If verification passes, the task is created successfully; otherwise the user is shown an error message. After the task is created, the system queries the list of existing cluster resources to determine whether the computing resources specified for the task are available. If not, the new task enters the delayed queue state and is retried automatically when cluster resources become sufficient.
  • If the computing resources are sufficient, the corresponding computing node is allocated through k8s; the task-related code is pulled from the code repository to the computing node, the computing node is started, and storage resources are bound to it. The computing task is started through the system's built-in distributed computing framework on the computing node, and the task execution log and task execution output data are saved to the storage address in real time. The task list is displayed through the interface; on entering the task details interface, the system shows the task list in the computing management interface together with the current task's execution status and statistics, so that the user can monitor and manage computing tasks.
  • Computing tasks mainly have the following execution states, which are shown to the user on the task details page: created, waiting, build, running, paused, stopped, and ended. (1) Created: after the user's new task operation is received and verification passes, the system notifies the k8s cluster to start allocating resources and returns a message to the user that the task has been created. (2) Waiting: the k8s cluster has received the resource allocation notification but has not yet completed resource allocation. (3) Build: resource allocation in the k8s cluster is complete and the container image is being built. (4) Running: resource allocation and container construction are complete and the user code is actually running. (5) Paused: the computing task is suspended while its resources stay reserved and are not released, so execution can continue at any time. (6) Stopped: the system provides a task stop function. After the user triggers it, the system saves the task's current results, stops execution, and releases all resources; the task cannot be resumed. (7) Ended: the state after the task has finished executing.
  • The user can monitor and manage the task status; the system provides functions for stopping, suspending, and resuming tasks.
  • For a running task, after receiving the stop operation submitted by the user, the system performs the following operations according to the task's current execution state: (1) When the task is in the "created" state, change the task state to "stopped" and halt the resource allocation of the k8s cluster. (2) When the task is in the "waiting" state, change the task state to "stopped" and remove the task from the resource waiting queue. (3) When the task is in the "build" state, change the task state to "stopped", notify the docker image build process to stop, and cancel the resource allocation in the k8s cluster. (4) When the task is in the "running" state, change the task state to "stopped", notify the k8s cluster to save the current result of the user task to the storage address, and then destroy the corresponding task node container to release computing resources.
  • For a suspended task, after receiving the user's resume operation, the system performs the following operations according to the execution state the task was in when it was suspended: (1) If the task was in the "created" state, change the task state to "created" and continue the resource allocation work of the k8s cluster. (2) If the task was in the "waiting" state, change the task state to "waiting" and restore the task to the resource waiting queue. (3) If the task was in the "build" state, change the task state to "build" and notify the docker image build process, through the message middleware, to rebuild the image. (4) If the task was in the "running" state, change the task state to "running" and notify the k8s cluster to resume execution of the user code.
  • Multiple containers are run as computing nodes for executing tasks, and user code is imported from the code repository into the containers for later task execution. The object storage and file storage resources, obtained by generating virtual paths, are bound to the computing nodes and serve as the storage address for data input, monitoring data, and result storage of computing tasks. Tasks are registered in the monitoring process, monitoring links are generated, and task execution starts; after execution, logs and results are saved to the storage address.
  • A system for integrating a code repository with computing services comprises a code repository module, a computing node building module, a computing task monitoring and management module, and a storage module.
  • The computing task monitoring and management module uses the new task interface for the user to create new computing tasks; through this interface the user enters the task name, task description, code branch, code version (the latest version by default), task entry file, and computing resources to be used.
  • The computing task monitoring and management module obtains the user's new task information through the version control system or the https protocol and verifies whether it contains errors, including whether the task name is a duplicate, whether the code branch exists, and whether the code version exists.
  • If verification passes, the computing task monitoring and management module reports that the task was created successfully; otherwise it shows the user an error message. After the task is created, the module queries the list of existing cluster resources to determine whether the computing resources specified for the task are available. If they are not, the new task enters the delayed queue state and is retried automatically when cluster resources become sufficient.
  • If the computing resources are sufficient, the computing node building module is triggered: it allocates the corresponding computing node through k8s, pulls the task-related code from the code repository to the computing node, starts the computing node, and binds storage resources to it, at which point the computing node has been built successfully. The computing node then starts executing the computing task through the system's built-in distributed computing framework.
  • The computing node saves the task execution log and task execution output data to the storage module in real time. The computing task monitoring and management module obtains them from the storage module in real time and displays the task list to the user through the interface. When the user enters the task details interface, the module shows the task list in the computing management interface together with the execution status and statistics of the current task, so that the user can monitor and manage computing tasks.
  • The user can send network requests in real time through the operation interface.
  • After the computing monitoring and management module receives the user's network request, it asks the computing node to report the execution status of the computing task and its computing resource occupancy, draws a line chart showing the occupancy of computing resources over time, and displays the execution status of the computing task through the monitoring interface, realizing the user monitoring function.
  • After the user clicks the task monitoring link, the user is taken to the monitoring page, which embeds monitoring tools commonly used for artificial intelligence computing tasks, such as tensorboard, to refresh and display the task's running data in real time.
  • The computing monitoring and management module lets the user manage task status through the monitoring interface and provides functions for stopping, suspending, and resuming tasks.
  • For a running task, after receiving the stop operation submitted by the user, the module obtains the task execution state and performs the following operations: (1) When the task is in the "created" state, change the task state to "stopped" and notify the computing node building module to suspend the resource allocation work of the k8s cluster; this status change, like the ones that follow, is stored in the storage module. (2) When the task is in the "waiting" state, change the task state to "stopped" and remove the task from the resource waiting queue.
  • For a running task, after receiving the pause operation submitted by the user, the computing monitoring and management module performs the following operations according to the task's current execution state: (1) When the task is in the "created" state, the module directly changes the task state to "suspended" and notifies the k8s cluster to suspend resource allocation. (2) When the task is in the "waiting" state, change the task state to "suspended" and remove the task from the resource waiting queue. (3) When the task is in the "build" state, change the task state to "suspended" and notify the docker image build process of the computing node building module, through the message middleware, to stop building.
  • After the computing monitoring and management module receives the user's resume operation, it performs the following operations according to the execution state the task was in when it was suspended: (1) If the task was in the "created" state, change the task state to "created" and notify the k8s cluster to continue resource allocation. (2) If the task was in the "waiting" state, change the task state to "waiting" and restore the task to the resource waiting queue. (3) If the task was in the "build" state, change the task state to "build" and notify the docker image build process of the computing node building module, through the message middleware, to rebuild the image. (4) If the task was in the "running" state, change the task state to "running" and notify the k8s cluster to resume execution of the user code.
  • The storage module serves the task execution logs and task execution output data stored at the storage address to users over HTTP requests; the computing monitoring and management module displays them on the page and provides file download links so that users can download and browse them.
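
The stop, pause, and resume rules listed above amount to a small state machine keyed on the task's execution state. The Python sketch below is one possible reading of those rules, not the applicant's implementation: the names `State`, `Task`, `TaskController`, and the three action tables are hypothetical, the notifications to k8s and docker are reduced to recorded strings, and the pause action for a running task is an assumption because the pause list in this text breaks off after the "build" case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class State(Enum):
    CREATED = "created"
    WAITING = "waiting"
    BUILD = "build"
    RUNNING = "running"
    SUSPENDED = "suspended"
    STOPPED = "stopped"


# States in which the text allows a stop or pause operation.
ACTIVE = (State.CREATED, State.WAITING, State.BUILD, State.RUNNING)

# Side effect requested when a task in the given state is stopped.
STOP_ACTIONS: Dict[State, str] = {
    State.CREATED: "halt k8s resource allocation",
    State.WAITING: "remove task from resource waiting queue",
    State.BUILD: "stop docker image build and cancel k8s resource allocation",
    State.RUNNING: "save current results, then destroy task containers",
}

# Side effect requested when a task in the given state is paused.
PAUSE_ACTIONS: Dict[State, str] = {
    State.CREATED: "suspend k8s resource allocation",
    State.WAITING: "remove task from resource waiting queue",
    State.BUILD: "stop docker image build",
    # Assumption: this case is not spelled out in the text.
    State.RUNNING: "suspend user code, keep resources reserved",
}

# Side effect requested on resume, keyed by the state held before the pause.
RESUME_ACTIONS: Dict[State, str] = {
    State.CREATED: "continue k8s resource allocation",
    State.WAITING: "restore task to resource waiting queue",
    State.BUILD: "rebuild docker image",
    State.RUNNING: "resume execution of user code",
}


@dataclass
class Task:
    name: str
    state: State = State.CREATED
    suspended_from: Optional[State] = None  # state held when the task was paused


class TaskController:
    """Applies stop/pause/resume and records the notifications it would send."""

    def __init__(self) -> None:
        self.notifications: List[str] = []  # stand-in for k8s/docker messages

    def _require_active(self, task: Task, operation: str) -> None:
        if task.state not in ACTIVE:
            raise ValueError(f"cannot {operation} a task in state {task.state.value!r}")

    def stop(self, task: Task) -> None:
        self._require_active(task, "stop")
        self.notifications.append(STOP_ACTIONS[task.state])
        task.state = State.STOPPED
        task.suspended_from = None

    def pause(self, task: Task) -> None:
        self._require_active(task, "pause")
        self.notifications.append(PAUSE_ACTIONS[task.state])
        task.suspended_from = task.state
        task.state = State.SUSPENDED

    def resume(self, task: Task) -> None:
        if task.state is not State.SUSPENDED or task.suspended_from is None:
            raise ValueError("only a suspended task can be resumed")
        self.notifications.append(RESUME_ACTIONS[task.suspended_from])
        task.state = task.suspended_from  # return to the pre-pause state
        task.suspended_from = None
```

Remembering the pre-pause state in `suspended_from` is what lets `resume` restore "created", "waiting", "build", or "running" as the text requires. In this sketch a stopped task rejects every further operation, matching the statement that a stopped task cannot be resumed.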

Abstract

Disclosed are a method and a system for integrating a code repository with a computing service. Gitea is embedded as a code repository module, scalable computing resources are managed and provided in the form of a k8s cluster, distributed machine learning is supported by using the ray framework, and distributed storage is provided by means of ceph, so as to manage the code repository, computing resources, and result storage on a unified platform. By means of the present invention, a user can directly initiate an artificial intelligence computing task in the code repository or the computing management module, and the code and computing resources used for the computing task are configured directly on the initiation page, without the need for code migration.
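
The task-initiation flow this publication describes (verify the new-task information, then either assign cluster resources or park the task in a delayed queue that is retried when capacity frees up) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the platform's actual code: `TaskRequest`, `Scheduler`, the single `gpus` resource count, and the in-memory `branches` mapping standing in for the gitea repository are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Set


@dataclass
class TaskRequest:
    name: str
    branch: str
    version: str       # "latest" selects the default (newest) version
    entry_file: str
    gpus: int          # hypothetical unit for "computing resources used"


class Scheduler:
    def __init__(self, branches: Dict[str, List[str]], free_gpus: int) -> None:
        self.branches = branches              # branch name -> known code versions
        self.free_gpus = free_gpus            # stand-in for the cluster resource list
        self.task_names: Set[str] = set()
        self.running: Dict[str, int] = {}     # task name -> GPUs held
        self.delayed: Deque[TaskRequest] = deque()

    def validate(self, req: TaskRequest) -> List[str]:
        """Run the three checks named in the text and collect error messages."""
        errors = []
        if req.name in self.task_names:
            errors.append("task name already exists")
        if req.branch not in self.branches:
            errors.append("code branch does not exist")
        elif req.version != "latest" and req.version not in self.branches[req.branch]:
            errors.append("code version does not exist")
        return errors

    def submit(self, req: TaskRequest) -> str:
        errors = self.validate(req)
        if errors:
            return "error: " + "; ".join(errors)
        self.task_names.add(req.name)         # the task is now "created"
        return self._place(req)

    def _place(self, req: TaskRequest) -> str:
        if req.gpus > self.free_gpus:
            self.delayed.append(req)          # not enough capacity: delayed queue
            return "delayed"
        self.free_gpus -= req.gpus
        self.running[req.name] = req.gpus
        return "scheduled"

    def finish(self, name: str) -> None:
        """Free a task's resources, then retry each delayed task once, in order."""
        self.free_gpus += self.running.pop(name)
        for _ in range(len(self.delayed)):
            self._place(self.delayed.popleft())
```

For example, with `Scheduler({"main": ["v1", "v2"]}, free_gpus=2)`, a 2-GPU task is scheduled immediately, a second task that asks for 1 GPU is delayed, and calling `finish` on the first task moves the second one out of the delayed queue.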

Description

A method and system for integrating a code repository with computing services
Technical field
The invention relates to a method and system for integrating a code repository with computing services. Through a computing platform, the code repository and artificial intelligence computing can be operated in the same system. The invention belongs to the field of artificial intelligence technology.
背景技术Background technique
通常,人工智能算法研究实验主要包含如下过程:Generally, artificial intelligence algorithm research experiments mainly include the following processes:
(1)编写测试代码,准备实验数据;(2)准备实验环境,实际进行实验。(1) Write the test code and prepare the experimental data; (2) Prepare the experimental environment and actually conduct the experiment.
因此,研究人员的代码仓库与实验环境是分开准备的。Therefore, the researcher's code repository and the experimental environment are prepared separately.
在代码托管部分,一般采用的方案有在线代码托管平台或者本地管理。主流的在线代码托管服务有github、gitlab等。用户在guthub等代码托管平台创建账号,并新建代码仓库后,即可远程编写代码,并通过https或者ssh将代码改动推送到github对应的分支和版本。实际进行实验时,每次调整代码后都需要迁移到计算平台,存在额外的平台切换流程和成本,这部分不应是实验人员所关注的内容。In the code hosting part, the generally adopted solutions are online code hosting platform or local management. Mainstream online code hosting services include github, gitlab, etc. After users create an account on a code hosting platform such as guthub and create a new code warehouse, they can write code remotely, and push code changes to the corresponding branch and version of github via https or ssh. In the actual experiment, each time the code is adjusted, it needs to be migrated to the computing platform. There are additional platform switching procedures and costs. This part should not be the content of the experimenter's attention.
计算平台方面,适用于大规模机器学习的软硬件环境搭建的门槛较高,通常需要高性能计算平台与特定软件环境搭配。In terms of computing platforms, the threshold for building software and hardware environments suitable for large-scale machine learning is relatively high, and high-performance computing platforms are usually required to be paired with specific software environments.
The current mainstream solution is to rent virtual hosts from a cloud service provider, build the experimental environment oneself, and then run training. With this approach, on the one hand, computing resources incur cost continuously once they are rented. On the other hand, before an experiment can start, the software environment has to be installed on the virtual host provided by the cloud service provider. Depending on the network environment and the software being installed, this preparation can take several hours. It consumes a large amount of the experimenter's time, raises the cost of each experiment, and lowers the share of time spent on the experimental work that actually produces value, which is inefficient.
Another option is to purchase hardware directly and build the computing environment from the hardware up. This requires a high one-time hardware investment, the owner is responsible for operation and maintenance, and the cost of idle hardware is significant. For small and medium-sized research institutions and individual researchers, the cost-effectiveness is even lower.
Summary of the Invention
Purpose of the invention: to overcome the prior-art problem of switching between code and computing platforms in artificial intelligence research, the present invention proposes a new method and system that combine code hosting and computing resources in the same system. This reduces needless platform switching and, through pay-as-you-go billing, lowers the idle cost of computing resources.
Technical solution: a method for integrating a code repository with a computing service, in which Gitea is embedded as the code repository module, scalable computing resources are managed and provided in the form of a k8s cluster, the Ray framework supports distributed machine learning, and Ceph provides distributed storage, so that the code repository, computing resources, and result storage are managed on a unified platform. The method specifically includes the following steps:
When a user initiates a computing task, the new-task information is obtained from the user and checked for errors. If the check passes, the task is created successfully; otherwise an error message is shown to the user. Once the task is created, the list of existing cluster resources is queried to determine whether the computing resources specified for the task are available. If they are not, the new task enters a delayed queue state and is retried automatically when cluster resources become sufficient. If they are, the corresponding computing node is allocated; the task-related code is pulled from the code repository to the computing node, the computing node is started, and storage resources are bound to it. The computing task is then started through the system's built-in distributed computing framework, and the task execution log and task output data are saved to the storage address in real time. The system displays the task list in the computing management interface; from there the user can open the task details interface, which shows the execution status and statistics of the current task. This lets the user monitor computing tasks and also perform management operations on them.
When the user monitors and manages a computing task, a network request is sent, and the execution status of the computing task and the usage of computing resources are returned. Resource usage over time is displayed as a line chart, and the execution status of the computing task is displayed in the monitoring interface, which provides the user's monitoring function. After the user clicks a task's monitoring link, a monitoring page is returned to the user; the page embeds a monitoring tool and refreshes the task's run data in real time for display.
A computing task mainly has the following execution states: created, waiting, building, running, paused, and stopped, which are shown to the user on the task details page. (1) Created: after the user's new-task operation is received and the check passes, the task is created successfully and is in the "created" state. (2) Waiting: the state in which the k8s cluster has received the resource allocation notification but has not yet finished allocating resources. (3) Building: resources have been allocated in the k8s cluster and the container image is being built. (4) Running: resource allocation and container building are complete and the user's task code is actually running. (5) Paused: the computing task is paused, its resources are retained rather than released, and execution can continue at any time. (6) Stopped: a task stop function is provided; when the user triggers it, the system saves the task's current results, then stops the run and releases the corresponding resources, and the run cannot be resumed. (7) Finished: the state after the task has run to completion.
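The seven states enumerated above can be written down as a small enumeration. The following is only a minimal sketch: the names `TaskState`, `IN_PROGRESS_STATES`, `can_pause`, `can_stop`, and `can_resume` are hypothetical, and treating exactly the four states created, waiting, building, and running as pausable and stoppable is an assumption drawn from the operations the text describes.

```python
from enum import Enum


class TaskState(Enum):
    """Execution states of a computing task (names are illustrative)."""
    CREATED = "created"
    WAITING = "waiting"
    BUILDING = "building"
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"
    FINISHED = "finished"


# Assumption: these are the states in which a task counts as "in progress"
# and can therefore receive a pause or stop operation.
IN_PROGRESS_STATES = frozenset({
    TaskState.CREATED,
    TaskState.WAITING,
    TaskState.BUILDING,
    TaskState.RUNNING,
})


def can_pause(state: TaskState) -> bool:
    """A pause is only described for tasks that are still in progress."""
    return state in IN_PROGRESS_STATES


def can_stop(state: TaskState) -> bool:
    """A stop is only described for tasks that are still in progress."""
    return state in IN_PROGRESS_STATES


def can_resume(state: TaskState) -> bool:
    """Only a paused task can be resumed; a stopped task cannot."""
    return state is TaskState.PAUSED
```

Keeping the predicates separate from the enumeration makes it easy to change which states accept which operation without touching the state names themselves.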
The monitoring interface lets the user monitor and manage task status and provides functions to stop, pause, and resume tasks. For a task in progress, after a stop operation submitted by the user is received, the following is done according to the task's current execution state: (1) If the task is in the "created" state, the task state is changed to "stopped" and resource allocation in the k8s cluster is aborted. (2) If the task is in the "waiting" state, the task state is changed to "stopped" and the task is removed from the resource waiting queue. (3) If the task is in the "building" state, the task state is changed to "stopped", the Docker image build process is notified to stop building, and the resource allocation is cancelled in the k8s cluster. (4) If the task is in the "running" state, the task state is changed to "stopped" and the k8s cluster is notified to save the current results of the user's task to the storage address, then delete the corresponding task node containers and release the computing resources.
For a task in progress, after a pause operation submitted by the user is received, the following is done according to the task's current execution state: (1) If the task is in the "created" state, the system directly changes the task state to "paused" and suspends resource allocation in the k8s cluster. (2) If the task is in the "waiting" state, the task state is changed to "paused" and the task is removed from the resource waiting queue. (3) If the task is in the "building" state, the task state is changed to "paused" and the Docker image build process is notified to stop building. (4) If the task is in the "running" state, the task state is changed to "paused" and the k8s cluster is notified to suspend execution of the user code without releasing the computing resources, so that execution can continue at any time.
For a paused task, after a resume operation from the user is received, the following is done according to the execution state the task was in when it was paused: (1) If the task was in the "created" state, the task state is changed back to "created" and resource allocation in the k8s cluster continues. (2) If the task was in the "waiting" state, the task state is changed back to "waiting" and the task is returned to the resource waiting queue. (3) If the task was in the "building" state, the task state is changed back to "building" and the Docker image build process is notified to rebuild the image. (4) If the task was in the "running" state, the system changes the task state back to "running" and notifies the k8s cluster to resume execution of the user code.
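Because resuming depends on the state a task was in when it was paused, that state has to be remembered. The sketch below illustrates one way to do this; the `Task` record, its `paused_from` field, and the action labels are hypothetical, and the side effects on the k8s cluster and the Docker image build process are represented only as strings rather than real calls.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only these four states accept a pause operation.
PAUSABLE_STATES = ("created", "waiting", "building", "running")

# Illustrative labels for the side effect performed on resume, keyed by
# the state the task returns to.
RESUME_ACTIONS = {
    "created": "continue_k8s_allocation",
    "waiting": "return_to_wait_queue",
    "building": "rebuild_image",
    "running": "resume_user_code",
}


@dataclass
class Task:
    name: str
    state: str = "created"
    paused_from: Optional[str] = None  # state recorded at pause time


def pause(task: Task) -> None:
    """Record the current state, then mark the task as paused."""
    if task.state not in PAUSABLE_STATES:
        raise ValueError(f"cannot pause a task in state {task.state!r}")
    task.paused_from = task.state
    task.state = "paused"


def resume(task: Task) -> str:
    """Restore the recorded state and return the follow-up action label."""
    if task.state != "paused" or task.paused_from is None:
        raise ValueError(f"cannot resume a task in state {task.state!r}")
    task.state = task.paused_from
    task.paused_from = None
    return RESUME_ACTIONS[task.state]
```

For example, a task paused while building returns to the building state on resume, and the returned label indicates that the image has to be rebuilt rather than continued.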
A system for implementing the above method for integrating a code repository with a computing service, comprising a code repository module, a computing node building module, a computing task monitoring and management module, and a storage module;
The code repository module is used to store the code executed by computing tasks;
The computing task monitoring and management module handles the interaction in which a user creates a new computing task through the new-task interface. The user enters the new-task information in the new-task interface; the computing task monitoring and management module obtains this information and checks it for errors. If the check passes, the module reports to the user that the task was created successfully; otherwise it shows the user an error message. After the task is created, the module queries the list of existing cluster resources to determine whether the computing resources specified for the task are available. If they are not, the new task enters a delayed queue state and is retried automatically when cluster resources become sufficient. If they are, the computing node building module is triggered: it allocates the corresponding computing node through k8s, pulls the task-related code from the code repository to the computing node, starts the computing node, and binds storage resources to the computing node as the storage module, at which point the computing node has been built successfully. The computing node starts executing the computing task through the system's built-in distributed computing framework and saves the task execution log and task output data to the storage module in real time. The computing task monitoring and management module obtains the task execution log and task output data from the storage module in real time and displays the task list in the computing management interface; from there the user can open the task details interface, which shows the execution status and statistics of the current task. This lets the user monitor computing tasks and also perform management operations on them.
When the user monitors and manages a computing task through the computing task monitoring and management module, the operation interface is used to send a network request for task monitoring and management. After receiving the request, the module returns the execution status of the computing task, which is stored on the storage module, to the user, providing the user's monitoring function. After the user clicks a task's monitoring link, the embedded monitoring tool refreshes the task's run data in real time and displays it to the user.
When the user monitors and manages a computing task through the computing task monitoring and management module, the module also displays the usage of computing resources over time to the user as a line chart.
The computing task monitoring and management module lets the user manage task status through the monitoring interface and provides functions to stop, pause, and resume tasks. For a task in progress, after a stop operation submitted by the user is received, the module obtains the task's execution state and does the following: (1) If the task is in the "created" state, the task state is changed to "stopped" and the computing node building module is notified to abort resource allocation in the k8s cluster. (2) If the task is in the "waiting" state, the task state is changed to "stopped" and the task is removed from the resource waiting queue. (3) If the task is in the "building" state, the task state is changed to "stopped", the Docker image build process of the computing node building module is notified to stop building, and the resource allocation is cancelled in the k8s cluster. (4) If the task is in the "running" state, the task state is changed to "stopped" and the k8s cluster is notified to save the current results of the user's task to the storage module, then destroy the corresponding task node containers and release the computing resources.
For a task in progress, after a pause operation submitted by the user is received, the computing task monitoring and management module obtains the task's execution state information from the storage module and does the following according to the task's current execution state: (1) If the task is in the "created" state, the module directly changes the task state to "paused" and sends a notification to suspend resource allocation in the k8s cluster. (2) If the task is in the "waiting" state, the task state is changed to "paused" and the task is removed from the resource waiting queue. (3) If the task is in the "building" state, the task state is changed to "paused" and the Docker image build process of the computing node building module is notified to stop building. (4) If the task is in the "running" state, the task state is changed to "paused" and the k8s cluster is notified to suspend execution of the user code without releasing the computing resources, so that execution can continue at any time.
For a paused task, after the computing task monitoring and management module receives a resume operation from the user, it obtains the task's execution state information from the storage module and does the following according to the execution state the task was in when it was paused: (1) If the task was in the "created" state, the task state is changed back to "created" and the computing node building module is notified to continue resource allocation in the k8s cluster. (2) If the task was in the "waiting" state, the task state is changed back to "waiting" and the task is returned to the resource waiting queue. (3) If the task was in the "building" state, the task state is changed back to "building" and the Docker image build process of the computing node building module is notified to rebuild the image. (4) If the task was in the "running" state, the system changes the task state back to "running" and notifies the k8s cluster to resume execution of the user code.
The computing task monitoring and management module stores the above state changes in the storage module.
Beneficial effects: compared with the prior art, with the method and system provided by the present invention for integrating a code repository with a computing service, a user can initiate an artificial intelligence computing task directly from the code repository or the computing management module. The code and computing resources used by the task are configured directly on the page where it is initiated, and no code migration is needed.
Brief Description of the Drawings
Figure 1 is a flowchart of the method of the present invention.
Detailed Description
The present invention is further explained below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the present invention and not to limit its scope. After reading the present invention, modifications in various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.
In the method for integrating a code repository with a computing service, Gitea is embedded as the code repository module, scalable computing resources are managed and provided in the form of a k8s cluster, the Ray framework supports distributed machine learning, and Ceph provides distributed storage, so that the code repository, computing resources, and result storage are managed on a unified platform. As shown in Figure 1, the method specifically includes the following steps:
A user initiates a computing task and provides the new-task information, which includes the task name, task description, code branch, code version (the latest version by default), task entry file, and the computing resources to use. The new-task information is obtained through the version control system or over the https protocol and is checked for errors, including whether the task name duplicates an existing one, whether the code branch exists, and whether the code version exists. If the check passes, the task is created successfully; otherwise an error message is shown to the user. After the task is created, the list of existing cluster resources is queried to determine whether the computing resources specified for the task are available. If they are not, the new task enters a delayed queue state and is retried automatically when cluster resources become sufficient. If they are, the corresponding computing node is allocated through k8s; the task-related code is pulled from the code repository to the computing node, the computing node is started, and storage resources are bound to it. The computing task is then started through the distributed computing framework built into the computing node system, and the task execution log and task output data are saved to the storage address in real time. The system displays the task list in the computing management interface; from there the user can open the task details interface, which shows the execution status and statistics of the current task. This lets the user monitor computing tasks and also perform management operations on them.
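The validation and delayed-queue behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: `TaskRequest`, `validate`, and `Scheduler` are hypothetical names, the requested computing resources are reduced to a single GPU count, branches are modelled as a plain dictionary instead of a Gitea query, and the retry is performed in first-in, first-out order when resources are released.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Set


@dataclass(frozen=True)
class TaskRequest:
    name: str
    branch: str
    version: str      # e.g. a commit identifier
    entry_file: str
    gpus: int         # assumption: resources reduced to one number


def validate(req: TaskRequest, existing_names: Set[str],
             branches: Dict[str, Set[str]]) -> List[str]:
    """Return a list of error messages; an empty list means the check passed."""
    errors = []
    if req.name in existing_names:
        errors.append("task name already exists")
    if req.branch not in branches:
        errors.append("code branch does not exist")
    elif req.version not in branches[req.branch]:
        errors.append("code version does not exist")
    return errors


class Scheduler:
    """Allocates resources immediately or parks the task in a delayed queue."""

    def __init__(self, free_gpus: int) -> None:
        self.free_gpus = free_gpus
        self.delayed: Deque[TaskRequest] = deque()
        self.started: List[str] = []

    def submit(self, req: TaskRequest) -> str:
        if req.gpus <= self.free_gpus:
            self._start(req)
            return "allocated"
        self.delayed.append(req)
        return "queued"

    def release(self, gpus: int) -> None:
        """Return resources to the pool and retry queued tasks in order."""
        self.free_gpus += gpus
        while self.delayed and self.delayed[0].gpus <= self.free_gpus:
            self._start(self.delayed.popleft())

    def _start(self, req: TaskRequest) -> None:
        self.free_gpus -= req.gpus
        self.started.append(req.name)
```

Retrying only from the head of the queue keeps ordering simple but lets a large task block smaller ones behind it; a real scheduler might choose a different policy.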
When monitoring and managing a computing task, the user can send network requests in real time. After a request is received, the computing node returns the execution status of the computing task and the usage of computing resources. Resource usage over time is displayed as a line chart, and the execution status of the computing task is displayed in the monitoring interface, which provides the user's monitoring function. After the user clicks a task's monitoring link, a monitoring page is returned to the user; the page embeds TensorBoard or another monitoring tool commonly used for artificial intelligence computing tasks and refreshes the task's run data in real time for display.
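To draw resource usage over time as a line chart, the usage samples returned for a task have to be kept as a time series. The sketch below shows one possible in-memory structure; `UsageHistory`, the metric names `cpu_percent` and `memory_mb`, and the fixed window size are assumptions made for illustration and do not come from the source.

```python
from collections import deque
from typing import Deque, List, Tuple

Sample = Tuple[float, float, float]  # (timestamp, cpu_percent, memory_mb)


class UsageHistory:
    """Keeps the most recent usage samples of one task for charting."""

    _COLUMNS = {"cpu_percent": 1, "memory_mb": 2}

    def __init__(self, max_points: int = 300) -> None:
        # A bounded deque silently drops the oldest sample when full.
        self._samples: Deque[Sample] = deque(maxlen=max_points)

    def record(self, timestamp: float, cpu_percent: float,
               memory_mb: float) -> None:
        self._samples.append((timestamp, cpu_percent, memory_mb))

    def series(self, metric: str) -> List[Tuple[float, float]]:
        """Return (timestamp, value) points for one metric, oldest first."""
        column = self._COLUMNS[metric]
        return [(sample[0], sample[column]) for sample in self._samples]
```

A front end could request `series("cpu_percent")` on each refresh and pass the points directly to its charting component.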
A computing task mainly has the following execution states: created, waiting, building, running, paused, and stopped, which are shown to the user on the task details page. (1) Created: after the user's new-task operation is received and the check passes, the k8s cluster is notified to begin allocating resources and a message that the task has been created is returned to the user. (2) Waiting: the state in which the k8s cluster has received the resource allocation notification but has not yet finished allocating resources. (3) Building: resources have been allocated in the k8s cluster and the container image is being built. (4) Running: resource allocation and container building are complete and the user code is actually running. (5) Paused: the computing task is paused, its resources are retained rather than released, and execution can continue at any time. (6) Stopped: a task stop function is provided; when the user triggers it, the system saves the task's current results, then stops the run and releases all resources, and the run cannot be resumed. (7) Finished: the state after the task has run to completion.
The monitoring interface lets the user monitor and manage task status and provides functions to stop, pause, and resume tasks. For a task in progress, after a stop operation submitted by the user is received, the following is done according to the task's current execution state: (1) If the task is in the "created" state, the task state is changed to "stopped" and resource allocation in the k8s cluster is aborted. (2) If the task is in the "waiting" state, the task state is changed to "stopped" and the task is removed from the resource waiting queue. (3) If the task is in the "building" state, the task state is changed to "stopped", the Docker image build process is notified to stop building, and the resource allocation is cancelled in the k8s cluster. (4) If the task is in the "running" state, the task state is changed to "stopped" and the k8s cluster is notified to save the current results of the user's task to the storage address, then destroy the corresponding task node containers and release the computing resources.
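The four stop cases above amount to a lookup from the current state to an ordered list of clean-up steps. The following sketch expresses that as a table; the function name `stop_task` and the step labels are hypothetical, and the steps are returned as strings instead of being executed against a real k8s cluster or Docker build process.

```python
from typing import Dict, List, Tuple

# Ordered clean-up steps for each state in which a stop is described.
STOP_ACTIONS: Dict[str, List[str]] = {
    "created": ["abort_k8s_allocation"],
    "waiting": ["remove_from_wait_queue"],
    "building": ["stop_image_build", "cancel_k8s_allocation"],
    "running": ["save_results", "destroy_containers", "release_resources"],
}


def stop_task(state: str) -> Tuple[str, List[str]]:
    """Return the new state and the clean-up steps to perform, in order."""
    if state not in STOP_ACTIONS:
        raise ValueError(f"stop is not defined for state {state!r}")
    # Copy the list so callers cannot mutate the shared table.
    return "stopped", list(STOP_ACTIONS[state])
```

For a running task the results are saved before the containers are destroyed, which matches the order given in case (4).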
For a task in progress, after a pause operation submitted by the user is received, the following is done according to the task's current execution state: (1) If the task is in the "created" state, the system directly changes the task state to "paused" and suspends resource allocation in the k8s cluster. (2) If the task is in the "waiting" state, the task state is changed to "paused" and the task is removed from the resource waiting queue. (3) If the task is in the "building" state, the task state is changed to "paused" and the Docker image build process is notified through message middleware to stop building. (4) If the task is in the "running" state, the task state is changed to "paused" and the k8s cluster is notified to suspend execution of the user code without releasing the computing resources, so that execution can continue at any time.
For a paused task, after a resume operation from the user is received, the following is done according to the execution state the task was in when it was paused: (1) If the task was in the "created" state, the task state is changed back to "created" and resource allocation in the k8s cluster continues. (2) If the task was in the "waiting" state, the task state is changed back to "waiting" and the task is returned to the resource waiting queue. (3) If the task was in the "building" state, the task state is changed back to "building" and the Docker image build process is notified through message middleware to rebuild the image. (4) If the task was in the "running" state, the system changes the task state back to "running" and notifies the k8s cluster to resume execution of the user code.
Through http requests, the task execution log and task output data saved at the storage address are provided to the user and displayed on the page, together with file download links so that the user can download and browse them.
Multiple containers are run as the computing nodes that execute a task, and the user code is imported from the code repository into the containers for use in the subsequent task execution. Object storage and file storage resources, obtained by generating virtual paths, are bound to the computing nodes and serve as the storage address for the computing task's input data, monitoring data, and results. The task is registered with the monitoring process, a monitoring link is generated, and execution of the task begins. After execution completes, the logs and results are saved to the storage address.
The system for integrating a code repository with a computing service comprises a code repository module, a computing node building module, a computing task monitoring and management module, and a storage module;
The computing task monitoring and management module provides a new-task interface through which the user creates computing tasks. In this interface the user enters the new-task information: task name, task description, code branch, code version (the latest version by default), task entry file, and the computing resources to use. The module obtains this information through the version control system or over HTTPS and validates it, checking whether the task name duplicates an existing one, whether the code branch exists, and whether the code version exists. If validation passes, the module reports to the user that the task was created successfully; otherwise it shows the user an error message.

After the task is created, the computing task monitoring and management module queries the list of existing cluster resources to determine whether the computing resources specified for the task are available. If they are not, the new task enters a delayed queue state and is retried automatically once cluster resources are sufficient. If they are, the computing node building module is triggered: it allocates the corresponding computing node through k8s, pulls the task-related code from the code repository onto the computing node, starts the computing node, and binds storage resources to it, which completes construction of the computing node. The computing node then begins executing the computing task through the system's built-in distributed computing framework.

The computing node saves the task execution log and task execution output data to the storage module in real time. The computing task monitoring and management module reads the log and output data from the storage module in real time and presents a task list to the user in the computing management interface. From there the user can open the task detail interface, which shows the current task's execution status and statistics. This lets the user monitor computing tasks and perform management operations on them.
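The task-creation checks and the delayed queue described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the names `TaskRequest`, `validate`, and `Scheduler` are hypothetical, computing resources are reduced to a single GPU count, and the repository is modeled as a plain dictionary instead of a real version control system.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TaskRequest:
    name: str
    branch: str
    version: str
    entry_file: str
    gpus: int  # requested computing resources, simplified to a GPU count


def validate(req, existing_names, repo_branches):
    """Return a list of error messages; an empty list means the check passed.

    repo_branches maps branch name -> set of known versions. It is a
    hypothetical stand-in for what the version control system would report.
    """
    errors = []
    if req.name in existing_names:
        errors.append(f"task name already exists: {req.name}")
    if req.branch not in repo_branches:
        errors.append(f"branch not found: {req.branch}")
    elif req.version not in repo_branches[req.branch]:
        errors.append(f"version not found: {req.version}")
    return errors


class Scheduler:
    """Admits a task if resources suffice, otherwise parks it in a queue."""

    def __init__(self, free_gpus):
        self.free_gpus = free_gpus
        self.waiting = deque()  # the delayed queue
        self.started = []       # names of tasks handed to node building

    def submit(self, req):
        if req.gpus <= self.free_gpus:
            self.free_gpus -= req.gpus
            self.started.append(req.name)
            return "building"
        self.waiting.append(req)
        return "waiting"

    def release(self, gpus):
        """Return resources to the pool and retry queued tasks in order."""
        self.free_gpus += gpus
        while self.waiting and self.waiting[0].gpus <= self.free_gpus:
            req = self.waiting.popleft()
            self.free_gpus -= req.gpus
            self.started.append(req.name)
```

The queue here is strictly first-in, first-out, so a large task at the head blocks smaller ones behind it. The source text does not say which retry policy is used; this choice is an assumption made to keep the sketch short.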
When the user monitors and manages computing tasks through the computing monitoring and management module, the user can send network requests in real time from the operation interface. On receiving a request, the module asks the computing node to report the execution status of the computing task and its computing resource usage. It plots resource usage over time as a line chart and shows the task's execution status in the monitoring interface, which provides the user's monitoring function. When the user clicks a task's monitoring link, the module returns a monitoring page that embeds monitoring tools commonly used for artificial intelligence computing tasks, such as tensorboard, and refreshes the task's run data in real time.
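To draw resource usage over time as a line chart, the module needs a time-ordered series of samples per task. The sketch below shows one way to hold such a series; the class name `UsageHistory`, the two metrics, and the bounded window are assumptions for illustration and do not come from the source.

```python
from collections import deque


class UsageHistory:
    """Keeps the most recent resource-usage samples for one task."""

    def __init__(self, max_points=3):
        # A bounded deque silently drops the oldest sample when full.
        self.samples = deque(maxlen=max_points)

    def record(self, timestamp, cpu_percent, mem_mb):
        self.samples.append((timestamp, cpu_percent, mem_mb))

    def series(self, field):
        """Return (timestamps, values), the two axes of a line chart."""
        index = {"cpu_percent": 1, "mem_mb": 2}[field]
        times = [s[0] for s in self.samples]
        values = [s[index] for s in self.samples]
        return times, values
```

Each time a computing node reports its usage, the module would call `record`, and the page would request `series("cpu_percent")` or `series("mem_mb")` to refresh the chart.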
The computing monitoring and management module lets the user manage task state through the monitoring interface, providing stop, pause, and resume functions. For a task in progress, when the module receives a stop operation submitted by the user, it reads the task's execution state and acts as follows:
(1) If the task is in the "created" state, the task state is changed to "stopped", the computing node building module is notified to abort resource allocation in the k8s cluster, and the state change is saved to the storage module. The state changes described below are likewise saved to the storage module.
(2) If the task is in the "waiting" state, the task state is changed to "stopped" and the task is removed from the resource waiting queue.
(3) If the task is in the "building" state, the task state is changed to "stopped", the docker image process of the computing node building module is notified through message middleware to stop building, and resource allocation in the k8s cluster is cancelled.
(4) If the task is in the "running" state, the task state is changed to "stopped" and the k8s cluster is notified to save the current results of the user's task to the storage module, after which the corresponding task node container is destroyed and the computing resources are released.
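The four stop cases amount to a dispatch on the task's current state. The following sketch records each side effect as a string instead of calling k8s, docker, or message middleware, since the source does not specify those interfaces; `stop_task` and the dictionary layout of a task are hypothetical.

```python
def stop_task(task, queue, log):
    """Apply the stop operation; `log` collects the side effects as strings.

    `task` is a dict with "name" and "state" keys, and `queue` is the
    resource waiting queue holding task names (both are assumptions).
    """
    state = task["state"]
    if state == "created":
        log.append("abort k8s resource allocation")
    elif state == "waiting":
        queue.remove(task["name"])
    elif state == "building":
        log.append("notify image build to stop")
        log.append("cancel k8s resource allocation")
    elif state == "running":
        log.append("save current results to storage")
        log.append("destroy task container and release resources")
    else:
        raise ValueError(f"cannot stop a task in state {state!r}")
    task["state"] = "stopped"
    # Every state change is persisted, as the description requires.
    log.append("persist state change to storage")
    return task
```

In the running case the save is logged before the container is destroyed, mirroring the order given in the text: results are stored first, and only then are resources released.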
For a task in progress, when the computing monitoring and management module receives a pause operation submitted by the user, it acts according to the task's current execution state:
(1) If the task is in the "created" state, the module changes the task state directly to "paused" and gives notice to pause resource allocation in the k8s cluster.
(2) If the task is in the "waiting" state, the task state is changed to "paused" and the task is removed from the resource waiting queue.
(3) If the task is in the "building" state, the task state is changed to "paused" and the docker image process of the computing node building module is notified through message middleware to stop building.
(4) If the task is in the "running" state, the task state is changed to "paused" and the k8s cluster is notified to suspend execution of the user's code without releasing the computing resources, so that execution can continue at any time.
For a paused task, when the computing monitoring and management module receives a resume operation from the user, it acts according to the execution state the task was in when it was paused:
(1) If the task was in the "created" state, the task state is changed back to "created" and notice is given to continue resource allocation in the k8s cluster.
(2) If the task was in the "waiting" state, the task state is changed back to "waiting" and the task is returned to the resource waiting queue.
(3) If the task was in the "building" state, the task state is changed back to "building" and the docker image process of the computing node building module is notified through message middleware to rebuild the image.
(4) If the task was in the "running" state, the system changes the task state back to "running" and notifies the k8s cluster to resume executing the user's code.
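Resuming depends on the state the task held when it was paused, so that state has to be remembered alongside "paused". A minimal sketch of this bookkeeping follows; the `Task` class, its field names, and the action strings are hypothetical, and the real calls to k8s or message middleware are replaced by a returned description.

```python
PAUSABLE = {"created", "waiting", "building", "running"}

# What resuming means for each pre-pause state, per the description.
RESUME_ACTION = {
    "created": "continue k8s resource allocation",
    "waiting": "put task back on the resource waiting queue",
    "building": "rebuild the container image",
    "running": "resume executing user code",
}


class Task:
    def __init__(self, name, state="created"):
        self.name = name
        self.state = state
        self.state_before_pause = None

    def pause(self):
        if self.state not in PAUSABLE:
            raise ValueError(f"cannot pause a task in state {self.state!r}")
        # Remember where the task was so that resume can restore it.
        self.state_before_pause = self.state
        self.state = "paused"

    def resume(self):
        if self.state != "paused":
            raise ValueError("only a paused task can be resumed")
        self.state, self.state_before_pause = self.state_before_pause, None
        return RESUME_ACTION[self.state]
```

For example, a task paused while building returns to "building" on resume and reports that the image must be rebuilt, which matches case (3) above.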
The storage module serves the task execution logs and task execution output data saved at the storage address to the user over HTTP. The computing monitoring and management module displays them on the page and provides file download links so that the user can download and browse them.

Claims (10)

  1. A method for integrating a code repository with a computing service, characterized in that: when a user initiates a computing task, the user's new computing task information is obtained and checked for errors; if the check passes, the task is created successfully, otherwise an error message is shown to the user; after the task is created successfully, the list of existing cluster resources is queried to determine whether the computing resources specified for the computing task are available; if they are not, the new computing task enters a delayed queue state and is retried automatically when cluster resources are sufficient; if the computing resources are sufficient to execute the computing task, a computing node is allocated to the computing task for executing it; the code related to the computing task is pulled from the code repository onto the computing node, the computing node is started, and storage resources are bound to the corresponding computing node; execution of the computing task begins through the distributed computing framework built into the computing node system, and the task execution log and task execution output data are saved to a storage address in real time; a list of computing tasks is displayed through an interface, from which a task detail interface is entered; the task list is displayed in a computing management interface so that the user can monitor computing tasks, and management operations on computing tasks by the user are supported.
  2. The method for integrating a code repository with a computing service according to claim 1, characterized in that: when the user monitors and manages a computing task, after a user request is received, the execution status of the computing task and the computing resource usage are returned; the computing resource usage over time is shown by plotting a line chart, and the execution status of the computing task is shown in a monitoring interface, providing the user's monitoring function; a monitoring link is provided, and after the user clicks it, the run data of the computing task is refreshed in real time and displayed through an embedded monitoring tool.
  3. The method for integrating a code repository with a computing service according to claim 1, characterized in that: the computing task execution states displayed to the user include six states: created, waiting, building, running, paused, and stopped;
    Created state: the user's new-task operation has been received, the check has passed, and the task has been created successfully;
    Waiting state: while resources are being allocated by the k8s cluster, the state in which the k8s cluster has received the resource allocation notification but has not yet completed the allocation;
    Building state: resource allocation in the k8s cluster is complete and the container image is being built;
    Running state: the resource allocation and the container image build are complete and the computing task code is running;
    Paused state: the computing task is paused, its resources are retained rather than released, and execution can continue at any time;
    Stopped state: a stop function is provided for computing tasks; after the user triggers it, the current result of the computing task is saved, the task then stops running and its resources are released, and it cannot be resumed;
    Finished state: the state after execution of the computing task has completed.
  4. The method for integrating a code repository with a computing service according to claim 3, characterized in that: monitoring and management of task execution state by the user is provided through a monitoring interface, with functions for stopping, pausing, and resuming tasks; for a computing task in progress, after a stop-task operation submitted by the user is received, the following operations are performed according to the current execution state of the computing task: (1) when the computing task is in the created state, the computing task state is changed to stopped and resource allocation in the k8s cluster is aborted; (2) when the computing task is in the waiting state, the computing task state is changed to stopped and the computing task is removed from the resource waiting queue; (3) when the computing task is in the building state, the computing task state is changed to stopped, the docker image process is notified to stop building, and resource allocation in the k8s cluster is cancelled; (4) when the computing task is in the running state, the task state is changed to stopped and the k8s cluster is notified to save the current result of the user's computing task to the storage address, after which the corresponding task node container is deleted and the computing resources are released.
  5. The method for integrating a code repository with a computing service according to claim 4, characterized in that: for a computing task in progress, after a pause-task operation submitted by the user is received, the following operations are performed according to the current execution state of the computing task: (1) when the computing task is in the created state, the task state is changed to paused and resource allocation in the k8s cluster is paused; (2) when the computing task is in the waiting state, the task state is changed to paused and the task is removed from the resource waiting queue; (3) when the computing task is in the building state, the task state is changed to paused and the docker image process is notified to stop building; (4) when the computing task is in the running state, the task state is changed to paused and the k8s cluster is notified to suspend execution of the user's computing task code without releasing the computing resources, so that execution can continue at any time.
  6. The method for integrating a code repository with a computing service according to claim 4, characterized in that: for a computing task in the paused state, after a resume-task operation from the user is received, the following operations are performed according to the execution state the computing task was in when the pause operation was applied: (1) if the computing task was in the created state, the computing task state is changed to created and resource allocation in the k8s cluster continues; (2) if the computing task was in the waiting state, the computing task state is changed to waiting and the computing task is returned to the resource waiting queue; (3) if the computing task was in the building state, the computing task state is changed to building and the docker image process is notified to rebuild the image; (4) if the computing task was in the running state, the computing task state is changed to running and the k8s cluster is notified to resume executing the user's computing task code.
  7. The method for integrating a code repository with a computing service according to claim 1, characterized in that: when the user initiates a computing task, the user's new-task information is obtained by transmission to the computing environment through a version control system or the https protocol; the task execution log and task execution output data saved at the storage address are provided to the user via http requests and displayed on a page, and file download links are provided so that the user can download and browse them.
  8. The method for integrating a code repository with a computing service according to claim 1, characterized in that: the user's new-task information includes: task name, task description, code branch, code version, task entry file, and the computing resources to use; when a computing task is created, if the computing resources can satisfy the computing task, the corresponding computing node is allocated through k8s; multiple containers are run as the computing nodes that execute the computing task, and the user's computing task code is imported from the code repository into the containers.
  9. A system for integrating a code repository with a computing service, characterized in that it comprises a code repository module, a computing node building module, a computing task monitoring and management module, and a storage module;
    the code repository module is used to store the code executed by computing tasks;
    the computing task monitoring and management module provides, through a new-task interface, the interaction by which the user creates a computing task; the user enters new-task information through the new-task interface, and the computing task monitoring and management module obtains the user's new-task information and checks it for errors; if the check passes, the computing task monitoring and management module reports to the user that the task was created successfully, otherwise it shows the user an error message; after the task is created successfully, the computing task monitoring and management module queries the list of existing cluster resources to determine whether the computing resources specified for the task are available; if they are not, the new task enters a delayed queue state and is retried automatically when cluster resources are sufficient; if the computing resources are available, the computing node building module is triggered, and the computing node building module allocates the corresponding computing node through k8s, pulls the task-related code from the code repository onto the computing node, starts the computing node, and binds storage resources to the corresponding computing node as the storage module, whereupon the computing node is successfully built; the computing node begins executing the computing task through the system's built-in distributed computing framework; the computing node saves the task execution log and task execution output data to the storage module in real time; the computing task monitoring and management module obtains the task execution log and task execution output data on the storage module in real time and displays a task list to the user through an interface; the user enters a task detail interface; the computing task monitoring and management module displays the task list in a computing management interface together with the current task's execution status and statistics, so that the user can monitor computing tasks, and management operations on computing tasks by the user are supported.
  10. The system for integrating a code repository with a computing service according to claim 9, characterized in that: when the user monitors and manages computing tasks through the computing monitoring and management module, the user sends network requests concerning task monitoring and management from an operation interface; after receiving the user's network request, the computing monitoring and management module returns to the user the execution status of the computing task stored on the storage module, providing the user's monitoring function; after the user clicks a task's monitoring link, the task run data is refreshed in real time and shown to the user through an embedded monitoring tool; when the user monitors and manages computing tasks through the computing monitoring and management module, the computing monitoring and management module also shows the user the computing resource usage over time by plotting a line chart; the computing monitoring and management module lets the user manage task state through a monitoring interface, providing functions for stopping, pausing, and resuming tasks.
PCT/CN2020/096730 2020-05-25 2020-06-18 Method and system for integrating code repository with computing service WO2021237829A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010445874.3 2020-05-25
CN202010445874.3A CN111338784B (en) 2020-05-25 2020-05-25 Method and system for realizing integration of code warehouse and computing service

Publications (1)

Publication Number Publication Date
WO2021237829A1 true WO2021237829A1 (en) 2021-12-02

Family

ID=71183019

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/096730 WO2021237829A1 (en) 2020-05-25 2020-06-18 Method and system for integrating code repository with computing service

Country Status (2)

Country Link
CN (1) CN111338784B (en)
WO (1) WO2021237829A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035238A (en) * 2020-09-11 2020-12-04 曙光信息产业(北京)有限公司 Task scheduling processing method and device, cluster system and readable storage medium
CN112700014B (en) * 2020-11-18 2023-09-29 脸萌有限公司 Method, device, system and electronic equipment for deploying federal learning application
CN112632113B (en) * 2020-12-31 2022-02-11 北京九章云极科技有限公司 Operator management method and operator management system
CN114691241B (en) * 2022-04-19 2024-01-19 中煤航测遥感集团有限公司 Task execution method, device, electronic equipment and storage medium
CN117009089B (en) * 2023-09-28 2023-12-12 南京庆文信息科技有限公司 Robot cluster supervision and management system based on distributed computing and UWB positioning
CN117519953B (en) * 2024-01-08 2024-04-05 北京大学 Separated memory management method for server-oriented non-perception calculation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529673A (en) * 2016-11-17 2017-03-22 北京百度网讯科技有限公司 Deep learning network training method and device based on artificial intelligence
US9800517B1 (en) * 2013-10-31 2017-10-24 Neil Anderson Secure distributed computing using containers
CN108268308A (en) * 2018-01-22 2018-07-10 广州欧赛斯信息科技有限公司 A kind of continuous integrating method, system and device based on container platform

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909451A (en) * 2017-02-28 2017-06-30 郑州云海信息技术有限公司 A kind of distributed task dispatching system and method
CN107229520B (en) * 2017-04-27 2019-10-18 北京数人科技有限公司 Data center operating system
CN107733977B (en) * 2017-08-31 2020-11-03 北京百度网讯科技有限公司 Cluster management method and device based on Docker
CN109445802B (en) * 2018-09-25 2022-08-26 众安信息技术服务有限公司 Privatized Paas platform based on container and method for publishing application thereof
CN109522025B (en) * 2018-10-30 2021-07-20 深圳市小赢信息技术有限责任公司 Code issuing system based on git

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XUELI XUE: " A Distributed Task Scheduling Platform: XXL-JOB", 17 May 2017 (2017-05-17), pages 1 - 38, XP055872872, Retrieved from the Internet <URL:https://github.com/xuxueli/xxl-job/blob/v1.7/README.md> [retrieved on 20211214] *
ZHANG CHENGCHENG: "Research and Implementation of Container Cluster Management Platform Based on Docker", THESIS FOR MASTER DEGREE, 3 June 2019 (2019-06-03), pages 1 - 85, XP055872861, [retrieved on 20211214] *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114253598A (en) * 2021-12-22 2022-03-29 浪潮卓数大数据产业发展有限公司 Code hosting method and tool of online coding system
CN114253598B (en) * 2021-12-22 2023-09-05 浪潮卓数大数据产业发展有限公司 Code hosting method and tool for online coding system
CN114489942A (en) * 2022-01-19 2022-05-13 西安交通大学 Application cluster-oriented queue task scheduling method and system
CN114489942B (en) * 2022-01-19 2024-02-23 西安交通大学 Queue task scheduling method and system for application cluster
CN115080254A (en) * 2022-08-24 2022-09-20 北京向量栈科技有限公司 Method and system for adjusting computing task resources in computing cluster
CN115080254B (en) * 2022-08-24 2023-09-22 北京向量栈科技有限公司 Method and system for adjusting computing task resources in computing cluster
CN117112157A (en) * 2023-07-04 2023-11-24 中国人民解放军陆军工程大学 General distributed computing system for task based on CLTS scheduling algorithm

Also Published As

Publication number Publication date
CN111338784A (en) 2020-06-26
CN111338784B (en) 2020-12-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20938257

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20938257

Country of ref document: EP

Kind code of ref document: A1