WO2022193855A1 - Task state update method, apparatus, device and medium - Google Patents

Task state update method, apparatus, device and medium

Info

Publication number
WO2022193855A1
WO2022193855A1 PCT/CN2022/074599 CN2022074599W WO2022193855A1 WO 2022193855 A1 WO2022193855 A1 WO 2022193855A1 CN 2022074599 W CN2022074599 W CN 2022074599W WO 2022193855 A1 WO2022193855 A1 WO 2022193855A1
Authority
WO
WIPO (PCT)
Prior art keywords
pod
task
event
state
update
Prior art date
Application number
PCT/CN2022/074599
Other languages
English (en)
French (fr)
Inventor
邢良占
Original Assignee
山东英信计算机技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东英信计算机技术有限公司
Priority to US18/268,307 (US11915035B1)
Publication of WO2022193855A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3017 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is implementing multitasking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/542 Event management; Broadcasting; Multicasting; Notifications
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/4557 Distribution of virtual machine instances; Migration and load balancing

Definitions

  • The present application relates to the field of computer technology, and in particular to a method, apparatus, device and medium for updating task status.
  • A deep learning platform has to manage and control the tasks that run on it. The most important part of this is managing the life cycle of each task, which depends entirely on the task state update mechanism.
  • Existing platforms update task status in one of two ways. In the first, the platform queries the API provided by the underlying K8S (Kubernetes) platform in real time; the API returns the status information of the Pod, and the platform maps it to the status information of the task and returns it. In the second, a background scheduled task periodically queries the API provided by the underlying K8S platform to obtain the status information of the Pod; the platform maps it to the task's status information and saves it to the platform's database.
  • Neither method adapts well to task status updates in scenarios with large-scale clusters, multi-user parallel use and a large number of running tasks. They may even prevent a training task from ending correctly or prevent the task's status information and other related information from being displayed correctly, causing problems for users of the deep learning platform. How to update task status accurately, in real time and effectively, and how to fix the slow response to task status queries in such scenarios, has therefore become key to improving the deep learning platform.
  • The purpose of this application is to provide a task state update method, apparatus, device and medium that improve the timeliness and accuracy of task state updates in scenarios with large-scale clusters, multi-user parallel use and a large number of running tasks, and that also improve the response speed of task status queries.
  • The specific solution is as follows:
  • A first aspect of the present application provides a task state update method, applied to a deep learning platform, including:
  • creating a K8S event listener, a Pod state update listener and a task state update listener;
  • using the K8S event listener to monitor K8S events, obtaining Pod state change events, and generating and publishing corresponding Pod state update events based on the Pod state change events;
  • using the Pod state update listener to monitor the Pod state update event and, when the Pod state update listener detects the Pod state update event, determining the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and generating and publishing the task state update event corresponding to the target task;
  • using the task state update listener to monitor the task state update event and, when the task state update listener detects the task state update event, updating the current state of the target task to the Pod state of the target task.
  • Using the K8S event listener to monitor K8S events to obtain Pod state change events includes:
  • using the K8S event listener to monitor the K8S events and filtering the monitored K8S events to obtain a Pod state change event.
  • The filtering of the monitored K8S events includes:
  • filtering the K8S events according to the space names of the monitored K8S events.
  • Generating and publishing a corresponding Pod state update event based on the Pod state change event includes:
  • extracting target data from the data message corresponding to the Pod state change event, reconstructing the data message with the target data, and generating and publishing the corresponding Pod state update event from the reconstructed data message.
  • Determining the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform includes:
  • mapping the Pod state data corresponding to the Pod state update event to the Pod state of the corresponding target task in the deep learning platform through the Pod state mapper.
  • Updating the current state of the target task to the Pod state of the target task includes:
  • mapping the Pod state of the target task to the current state of the target task through a task state mapper.
  • The method further includes:
  • destroying the K8S event listener, the Pod state update listener and the task state update listener.
  • A second aspect of the present application provides a task state update apparatus, applied to a deep learning platform, including:
  • a creation module, configured to create a K8S event listener, a Pod state update listener and a task state update listener;
  • a first monitoring module, configured to monitor K8S events by using the K8S event listener, obtain a Pod state change event, and generate and publish a corresponding Pod state update event based on the Pod state change event;
  • a second monitoring module, configured to use the Pod state update listener to monitor the Pod state update event and, when the Pod state update listener detects the Pod state update event, determine the Pod state corresponding to that event as the Pod state of the corresponding target task in the deep learning platform, and generate and publish a task state update event corresponding to the target task;
  • a third monitoring module, configured to use the task state update listener to monitor the task state update event and, when the task state update listener detects the task state update event, update the current state of the target task to the Pod state of the target task.
  • A third aspect of the present application provides an electronic device comprising a processor and a memory, wherein the memory stores a computer program that is loaded and executed by the processor to implement the aforementioned task state update method.
  • A fourth aspect of the present application provides a computer-readable storage medium storing computer-executable instructions that, when loaded and executed by a processor, implement the aforementioned task state update method.
  • In the present application, a K8S event listener, a Pod status update listener and a task status update listener are first created. The K8S event listener is then used to monitor K8S events to obtain a Pod status change event, and the corresponding Pod state update event is generated and published based on the Pod status change event. Next, the Pod state update listener is used to monitor the Pod state update event; when it detects the Pod state update event, the Pod state corresponding to that event is determined as the Pod state of the corresponding target task in the deep learning platform, and the task state update event corresponding to the target task is generated and published. Finally, the task state update listener is used to monitor the task state update event; when it detects the task state update event, the current state of the target task is updated to the Pod state of the target task.
  • The application thus uses the K8S event listener, the Pod status update listener and the task status update listener to monitor K8S events, Pod status update events and task status update events, respectively, during the task status update process, and analyzes them in real time to update the status of the task.
  • These steps integrate closely with the capabilities of K8S, improve the timeliness and accuracy of task status updates in scenarios with large-scale clusters, multi-user parallel use and a large number of running tasks, and also improve the response speed of task status queries.
  • FIG. 2 is a schematic diagram of a specific task status update method provided by the application
  • FIG. 3 is a schematic diagram of the correspondence between a task and a Pod provided by the application;
  • FIG. 4 is a schematic diagram of an event generation and publishing process provided by the present application.
  • FIG. 6 is a schematic diagram of a mapping process provided by the present application.
  • FIG. 7 is a schematic structural diagram of a task status update device provided by the present application.
  • FIG. 8 is a structural diagram of an electronic device for updating task status provided by the present application.
  • The existing task status update methods either query the API provided by the underlying K8S platform in real time, obtain the Pod status information, map it to task status information and return it, or use a background scheduled task that periodically queries that API, maps the returned Pod status information to task status information and saves it in the platform database. Neither adapts well to task status updates in scenarios with large-scale clusters, multi-user parallel use and a large number of running tasks.
  • The present application provides a task status update scheme that uses a K8S event listener, a Pod status update listener and a task status update listener to monitor K8S events, Pod status update events and task status update events, respectively, during the task status update process, and analyzes them in real time to update the status of tasks.
  • Through close integration with K8S capabilities, the timeliness and accuracy of task status updates in scenarios with large-scale clusters, multi-user parallel use and a large number of running tasks are improved, as is the response speed of task status queries.
  • FIG. 1 is a flowchart of a task state update method provided by an embodiment of the present application, which is applied to a deep learning platform.
  • The task status update method includes:
  • S11: Create a K8S event listener, a Pod status update listener and a task status update listener.
  • When the deep learning platform is started, a K8S event listener (K8S Event Listener), a Pod status update listener (Pod Status Change Event Listener) and a task status update listener (Task Status Change Event Listener) are created, as shown in FIG. 2.
  • The K8S event listener monitors the events (Filter Event) released by the underlying K8S platform, the Pod status update listener monitors Pod status update events, and the task status update listener monitors task status update events.
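  • As a rough illustration of step S11, the sketch below registers three listeners on a tiny in-process broadcaster at start-up. Everything in it (the `Broadcaster` class, the event type names and the `start_platform` function) is hypothetical and merely stands in for the K8S and Spring listener mechanisms the embodiment relies on.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broadcaster:
    """Hypothetical in-process stand-in for the platform's event broadcaster."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def register(self, event_type: str, listener: Callable[[dict], None]) -> None:
        self._listeners[event_type].append(listener)

    def publish(self, event_type: str, payload: dict) -> int:
        """Deliver payload to every listener of event_type; return how many ran."""
        listeners = list(self._listeners[event_type])
        for listener in listeners:
            listener(payload)
        return len(listeners)


def start_platform(log: List[str]) -> Broadcaster:
    """Create the three listeners of step S11 (event type names are illustrative)."""
    bus = Broadcaster()
    bus.register("k8s_event", lambda e: log.append(f"k8s:{e['name']}"))
    bus.register("pod_status_update", lambda e: log.append(f"pod:{e['pod']}"))
    bus.register("task_status_update", lambda e: log.append(f"task:{e['task']}"))
    return bus


log: List[str] = []
bus = start_platform(log)
bus.publish("k8s_event", {"name": "pod-a.started"})
bus.publish("pod_status_update", {"pod": "pod-a"})
bus.publish("task_status_update", {"task": "train-1"})
print(log)
```

  • Each listener only sees the event type it registered for, which mirrors the division of work among the three listeners described above.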
  • K8S is an open-source system for managing containerized applications across multiple hosts in a cloud platform.
  • K8S provides mechanisms for deploying, planning, updating and maintaining applications, and manages services by defining various types of resources.
  • A Pod is the foundation of all business types. It is a group of one or more containers that share storage, networking and namespaces, together with a specification for how to run them.
  • Within a Pod, all containers are arranged and scheduled together and run in a shared context; in other words, a Pod is the container in which a task runs. A task may be run by multiple Pods at the same time, but a Pod can run only one task at a time. The correspondence between Pods and tasks is shown in FIG. 3.
  • K8S events are stored in etcd and record major events in cluster operation, such as events of resource objects like Pod, Node and Kubelet, as well as events of some custom resource objects.
  • The Pod, the basic unit of scheduling, is one such resource; this embodiment only pays attention to the events of the Pod resource object.
  • AIStation is a deep learning platform that provides containerized deployment and efficient distributed training for training tasks. It is an artificial intelligence development resource platform for enterprise AI training scenarios, offering functions such as containerized deployment, visual development and centralized management. It provides users with high-performance AI computing resources, supporting efficient use of computing power, accurate resource management and scheduling, agile data integration and acceleration, and streamlined integration of AI scenarios with business, and it connects the development environment, computing resources and data resources to improve development efficiency. Users can create different deep learning framework environments through the AIStation platform, and can freely develop models, debug models through command lines, and then quickly through the development platform.
  • K8S events can only be monitored by listeners built on the K8S mechanism.
  • The events inside the deep learning platform AIStation are all based on the Spring framework, so the corresponding listeners are also implemented on the Spring framework. Accordingly, the K8S event listener described in this embodiment is created on the K8S system framework and can only monitor events on the K8S platform.
  • The Pod status update listener is created with the Spring event mechanism of the deep learning platform and monitors Pod status update events. The task status update listener is likewise created with the platform's Spring event mechanism and monitors task status update events.
  • The number of K8S event listeners, Pod status update listeners and task status update listeners can be set flexibly according to the number of events, the number of tasks actually running and so on, to improve the efficiency of event monitoring.
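  • One simple way to make the listener count follow the load is to derive it from the number of pending events. The sketch below is only an assumption about how such a rule might look; the function name, the per-listener capacity and the upper bound are all hypothetical and are not specified by the application.

```python
import math


def listener_count(pending_events: int,
                   events_per_listener: int = 100,
                   max_listeners: int = 8) -> int:
    """Return a listener count that grows with the backlog, within [1, max_listeners]."""
    if events_per_listener <= 0 or max_listeners <= 0:
        raise ValueError("events_per_listener and max_listeners must be positive")
    needed = math.ceil(max(pending_events, 0) / events_per_listener)
    # Always keep at least one listener alive, and never exceed the cap.
    return min(max(needed, 1), max_listeners)
```

  • The cap keeps a burst of events from creating an unbounded number of listeners, while the floor of one listener ensures events are still observed when the platform is idle.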
  • S12: Use the K8S event listener to monitor K8S events, obtain Pod state change events, and generate and publish corresponding Pod state update events based on the Pod state change events.
  • The K8S events record the major events in the running of the cluster, including but not limited to events related to the Pod resource object.
  • The resource that represents the task status is the Pod resource. In this embodiment, therefore, the qualifying change events related to the Pod state, that is, the Pod state change events, must be filtered out from all the K8S events. On this basis, the corresponding Pod state update event is generated and published based on the Pod state change event.
  • The event source of the Pod state update event is the Pod state change event: the Pod state update event is triggered and generated by the Pod state change event and is published by the broadcaster, as shown in FIG. 4.
  • Because the two events are handled by different monitoring mechanisms, the Pod state change event and the Pod state update event are of different types; the Pod state update event is created based on the Spring framework.
  • The Pod state update event contains part of the data of the Pod state change event, so it can be regarded as carrying a simplified version of the information in the Pod state change event.
  • S13: Use the Pod state update listener to monitor the Pod state update event; when the Pod state update listener detects the Pod state update event, determine the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and generate and publish a task state update event corresponding to the target task.
  • The number of Pod state update listeners can be set flexibly according to the number of Pod state update events; for example, it may be positively correlated with the number of Pod state update events.
  • When the Pod state update listener detects the Pod state update event, the Pod state corresponding to that event is determined as the Pod state of the corresponding target task in the deep learning platform. At the same time, the deep learning platform generates and publishes a task status update event corresponding to the target task.
  • The task status update event is generated and published in the same way as the aforementioned Pod status update event: it is triggered and generated by the Pod state update event and is published by the broadcaster.
  • The task status update listener is started and used to monitor the task status update event in real time.
  • The number of task status update listeners can likewise be set flexibly according to the number of events; for example, it may be positively correlated with the number of task status update events.
  • When the task state update listener detects the task state update event, it updates the current state of the target task to the Pod state of the target task.
  • The target task may correspond to multiple Pods, that is, the same target task may be run by multiple Pods at the same time, but a given Pod can correspond to only one target task at any one moment.
  • When the target task corresponds to only one Pod, that is, when the target task is run by one Pod, the state of that Pod is determined directly as the Pod state of the target task.
  • When the target task corresponds to multiple Pods at the same time, that is, when the target task is run by multiple Pods at once, logical judgments must be made on the states of the multiple Pods according to preset rules, and the current status of the target task is then determined comprehensively and updated.
  • The task update method described in this embodiment can update the task status more quickly and more accurately.
  • The embodiments of the present application first create a K8S event listener, a Pod state update listener and a task state update listener. The K8S event listener is then used to monitor K8S events to obtain Pod state change events, and the corresponding Pod state update event is generated and published based on the Pod state change event. Next, the Pod state update listener is used to monitor the Pod state update event; when it detects the Pod state update event, the Pod state corresponding to that event is determined as the Pod state of the corresponding target task in the deep learning platform, and a task state update event corresponding to the target task is generated and published. Finally, the task state update listener is used to monitor the task state update event; when it detects the task state update event, it updates the current state of the target task to the Pod state of the target task.
  • The embodiment thus uses the K8S event listener, the Pod status update listener and the task status update listener to monitor the K8S events, the Pod status update events and the task status update events, respectively, during the task status update process, and analyzes them in real time to update the task status.
  • These steps integrate closely with the capabilities of K8S, which improves the timeliness and accuracy of task status updates in scenarios with large-scale clusters, multi-user parallel use and a large number of running tasks, and also improves the response speed of task status queries.
  • FIG. 5 is a flowchart of a specific task state update method provided by an embodiment of the present application, which is applied to a deep learning platform.
  • The task status update method includes:
  • S21: Create a K8S event listener, a Pod status update listener and a task status update listener.
  • For the specific process of step S21, refer to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
  • The K8S event listener is started and used to monitor the K8S events, and the monitored K8S events are filtered to obtain Pod state change events.
  • The filtering process picks the Pod state change events out of all the K8S events.
  • The K8S events may be filtered according to the space names of the monitored K8S events.
  • The space name is composed of the recorded object and a timestamp, and so reflects whether the event under that space name is a Pod state change event. The K8S events can therefore be filtered by space name to obtain the qualifying Pod state change events.
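  • The filtering step can be sketched as a predicate over the space name. The application does not give the exact format of a space name, so the layout assumed here, `<kind>.<object>.<timestamp>`, is purely hypothetical, as are the `space_name` and `phase` field names.

```python
from typing import Dict, Iterable, List


def is_pod_state_change(event: Dict[str, str]) -> bool:
    """Check an event against the ASSUMED '<kind>.<object>.<timestamp>' layout."""
    kind, _, rest = event.get("space_name", "").partition(".")
    obj, _, timestamp = rest.rpartition(".")
    # Keep only events recorded for a Pod, with a non-empty object name
    # and a numeric timestamp suffix.
    return kind == "Pod" and bool(obj) and timestamp.isdigit()


def filter_pod_events(events: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Pick the Pod state change events out of all monitored events."""
    return [e for e in events if is_pod_state_change(e)]


events = [
    {"space_name": "Pod.train-1-worker-0.1700000000", "phase": "Running"},
    {"space_name": "Node.gpu-node-3.1700000001", "phase": "Ready"},
    {"space_name": "Pod.train-1-worker-1.1700000002", "phase": "Failed"},
]
pod_events = filter_pod_events(events)
```

  • Splitting from both ends lets object names that themselves contain dots pass through unchanged.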
  • S23: Extract target data from the data message corresponding to the Pod state change event, use the target data to reconstruct the data message, and generate and publish a corresponding Pod state update event from the reconstructed data message.
  • The target data is first extracted from the data message corresponding to the Pod state change event and used to reconstruct the data message; the corresponding Pod state update event is then generated from the reconstructed data message and published.
  • Both the K8S event and the Pod state change event exist as data messages that store a large amount of information. Because the K8S events include low-level events from the whole K8S platform, the data volume is large and the corresponding messages are also large. The effective content, that is, the target data, therefore has to be filtered out and extracted; this content is then used to generate the data message of a simplified event based on the Spring event mechanism of the deep learning platform, which yields the Pod state update event, and the event is published by the broadcaster.
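  • The extract-and-rebuild step might look like the sketch below. The shape of the incoming message, the `PodStateUpdateEvent` class and in particular the `task-id` label key are assumptions made for illustration; the application does not say which fields make up the target data.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class PodStateUpdateEvent:
    """Simplified event carrying only an assumed set of target fields."""
    pod_name: str
    task_id: str
    phase: str


def to_update_event(change_event: Dict[str, Any]) -> PodStateUpdateEvent:
    """Extract the target data from a bulky change event and rebuild a small one."""
    pod = change_event["object"]
    return PodStateUpdateEvent(
        pod_name=pod["metadata"]["name"],
        task_id=pod["metadata"]["labels"]["task-id"],  # hypothetical label key
        phase=pod["status"]["phase"],
    )


# Illustrative input only; real messages carry far more fields than this.
raw = {
    "type": "MODIFIED",
    "object": {
        "metadata": {
            "name": "train-1-worker-0",
            "labels": {"task-id": "train-1"},
            "annotations": {"note": "ignored"},
        },
        "spec": {"containers": [{"name": "trainer", "image": "example/image"}]},
        "status": {"phase": "Running", "hostIP": "10.0.0.5"},
    },
}
update_event = to_update_event(raw)
```

  • Everything not copied into the small event (annotations, container specs, host addresses) is dropped, which is the point of the reconstruction: downstream listeners handle a compact message.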
  • S24: Use the Pod state update listener to monitor the Pod state update event; when the Pod state update listener detects the Pod state update event, use the Pod state mapper to map the Pod state data corresponding to the Pod state update event to the Pod state of the corresponding target task in the deep learning platform, and generate and publish a task state update event corresponding to the target task.
  • That is, once the event is detected, the Pod state mapper maps the Pod state data corresponding to the Pod state update event to the Pod state of the corresponding target task in the deep learning platform, and a task state update event corresponding to the target task is generated and published at the same time.
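  • A Pod state mapper can be as small as a lookup table. In the sketch below the keys are the standard Kubernetes Pod phases, while the platform-side state names on the right are hypothetical, since the application does not list them.

```python
# Keys: Kubernetes Pod phases. Values: ASSUMED platform state names.
POD_PHASE_TO_PLATFORM_STATE = {
    "Pending": "waiting",
    "Running": "running",
    "Succeeded": "finished",
    "Failed": "error",
    "Unknown": "unknown",
}


def map_pod_state(phase: str) -> str:
    """Map a K8S Pod phase to the platform's Pod state, defaulting to 'unknown'."""
    return POD_PHASE_TO_PLATFORM_STATE.get(phase, "unknown")
```

  • Falling back to a neutral state for unrecognized input keeps an unexpected value from the cluster from raising inside the listener.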
  • The task status update listener is started and used to monitor the task status update event.
  • When the task status update listener detects the task status update event, the task state mapper maps the Pod state of the target task to the current state of the target task.
  • The process in which the task state mapper maps the Pod state of the target task to the current state of the target task applies the preset rules described in the preceding embodiment; that is, it is the process of determining the current state of the target task from the Pod state of the target task according to the preset rules. Specifically, as shown in FIG.
  • When the target task corresponds to only one Pod, it has a unique Pod state, and that Pod state is directly determined as the current state of the target task.
  • When the target task corresponds to N Pods (Pod1, Pod2, ..., Podn), the target task has N Pod states (Pod1 state, Pod2 state, ..., Podn state), and its current state is determined from them according to the preset rule.
  • The preset rule can be, for example, "if any Pod status is running, the current task status is running" or "if any Pod status is error, the current task status is error". The preset rules can be set according to actual business needs and are not limited in this embodiment.
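  • The two example rules can be combined into a small task state mapper. The ordering chosen here (error first, then running, then finished, otherwise waiting) and the state names are assumptions; the application leaves the rule set to business needs.

```python
from typing import Sequence


def map_task_state(pod_states: Sequence[str]) -> str:
    """Derive a task state from its Pod states under one ASSUMED rule ordering."""
    if not pod_states:
        raise ValueError("a task must have at least one Pod state")
    if len(pod_states) == 1:
        # Single Pod: its state is taken directly as the task state.
        return pod_states[0]
    if "error" in pod_states:
        return "error"      # any Pod in error puts the task in error
    if "running" in pod_states:
        return "running"    # otherwise any running Pod means the task is running
    if all(state == "finished" for state in pod_states):
        return "finished"   # the task finishes only when every Pod has finished
    return "waiting"
```

  • Because the rules are checked in order, a task with one failed Pod and several running Pods is reported as in error; a platform that prefers the opposite behaviour would only need to reorder the checks.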
  • The deep learning platform cyclically executes steps S21 to S26 above to update the status of the tasks it manages and controls in real time, and thereby manages and controls the life cycle of the tasks running on the platform.
  • The K8S event listener, the Pod status update listener and the task status update listener are destroyed to release platform resources.
  • The destruction method is called through an interface implemented by a predefined class to destroy the K8S event listener, the Pod status update listener and the task status update listener.
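  • The destruction step can be pictured as a holder object whose `destroy` method releases every listener. The `ListenerRegistry` class below is a hypothetical sketch that uses a Python context manager in place of the class and interface mentioned above, which the application does not name.

```python
from typing import Callable, List


class ListenerRegistry:
    """Hypothetical holder for the listeners; destroy() releases them all."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[dict], None]] = []
        self.destroyed = False

    def register(self, listener: Callable[[dict], None]) -> None:
        if self.destroyed:
            raise RuntimeError("registry already destroyed")
        self._listeners.append(listener)

    def active_count(self) -> int:
        return len(self._listeners)

    def destroy(self) -> None:
        """Drop all listeners; safe to call more than once."""
        self._listeners.clear()
        self.destroyed = True

    def __enter__(self) -> "ListenerRegistry":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on error, so listeners are always released.
        self.destroy()


with ListenerRegistry() as registry:
    for name in ("k8s_event", "pod_status_update", "task_status_update"):
        registry.register(lambda event, name=name: None)
    count_while_running = registry.active_count()
```

  • Tying destruction to the end of the `with` block means the listeners are released even if the platform loop exits with an exception.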
  • The embodiments of the present application mainly provide a task status update mechanism for a deep learning training platform, suited to task status updates on such a platform under a large-scale cluster.
  • The mechanism addresses the cases in which task status information and other related information are not displayed correctly, which cause problems for users of the deep learning platform.
  • An embodiment of the present application also discloses a task state update apparatus, which is applied to a deep learning platform and includes:
  • a creation module 11, configured to create a K8S event listener, a Pod status update listener and a task status update listener;
  • a first monitoring module 12, configured to use the K8S event listener to monitor K8S events, obtain a Pod state change event, and generate and publish a corresponding Pod state update event based on the Pod state change event;
  • a second monitoring module 13, configured to use the Pod status update listener to monitor the Pod status update event and, when the Pod status update listener detects the Pod status update event, determine the Pod state corresponding to that event as the Pod state of the corresponding target task in the deep learning platform, and generate and publish a task state update event corresponding to the target task;
  • a third monitoring module 14, configured to use the task status update listener to monitor the task status update event and, when the task status update listener detects the task status update event, update the current state of the target task to the Pod state of the target task.
  • In some specific embodiments, the first monitoring module 12 specifically includes:
  • a filtering unit, configured to monitor the K8S events by using the K8S event listener and to filter the monitored K8S events to obtain a Pod state change event;
  • an extraction unit, configured to extract target data from the data message corresponding to the Pod state change event;
  • a reconstruction unit, configured to reconstruct the data message by using the target data, and to generate and publish a corresponding Pod state update event according to the reconstructed data message.
  • In some specific embodiments, the second monitoring module 13 is further configured to map, through a Pod state mapper, the Pod state data corresponding to the Pod state update event to the Pod state of the corresponding target task in the deep learning platform.
  • In some specific embodiments, the third monitoring module 14 is further configured to map, through a task state mapper, the Pod state of the target task to the current state of the target task.
  • In some specific embodiments, the task state updating apparatus further includes a destruction module, configured to destroy the K8S event listener, the Pod state update listener, and the task state update listener.
  • Further, an embodiment of the present application also provides an electronic device. FIG. 8 is a structural diagram of an electronic device 20 according to an exemplary embodiment, and the contents of the figure should not be regarded as limiting the scope of use of the present application in any way.
  • FIG. 8 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application.
  • The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26.
  • The memory 22 is used to store a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps of the task state updating method disclosed in any of the foregoing embodiments.
  • The power supply 23 provides the working voltage for each hardware device on the electronic device 20;
  • the communication interface 24 creates a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application, which is not specifically limited here;
  • the input/output interface 25 obtains external input data or outputs data to the outside, and its specific interface type can be selected according to the needs of the application, which is not specifically limited here.
  • The memory 22 serves as a carrier for resource storage and may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like.
  • The resources stored in the memory 22 may include an operating system 221, a computer program 222, state change data 223, and so on, and the storage may be temporary or permanent.
  • The operating system 221 manages and controls each hardware device on the electronic device 20 as well as the computer program 222, so that the processor 21 can operate on and process the massive state change data 223 in the memory 22; it may be Windows Server, Netware, Unix, Linux, or the like.
  • In addition to the computer program that carries out the task state updating method performed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for other specific tasks.
  • The state change data 223 may include the state change data collected by the electronic device 20.
  • Further, an embodiment of the present application also discloses a storage medium in which a computer program is stored; when the computer program is loaded and executed by a processor, the steps of the task state updating method disclosed in any of the foregoing embodiments are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

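The present application discloses a task state updating method and apparatus, a device, and a medium, applied to a deep learning platform. The method includes: monitoring K8S events by using a K8S event listener to obtain a Pod state change event, and generating a Pod state update event on the basis of the Pod state change event (S12); monitoring the Pod state update event by using a Pod state update listener, and when the Pod state update event is detected, determining the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and generating a task state update event (S13); and monitoring the task state update event by using a task state update listener, and when the task state update event is detected, updating the current state of the target task to the Pod state of the target task (S14). By monitoring K8S events and analyzing them in real time to update task states, the present application improves the timeliness and accuracy of task state updates in scenarios involving large-scale clusters, parallel use by multiple users, and large numbers of running tasks, and also improves the response speed of task state queries.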

Description

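Task state updating method and apparatus, device, and medium
This application claims priority to Chinese patent application No. 202110290936.2, entitled "Task state updating method and apparatus, device, and medium", filed with the China Patent Office on March 18, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a task state updating method and apparatus, a device, and a medium.
Background
At present, artificial intelligence technologies represented by deep learning have developed rapidly and are being applied across industries. With the widespread use of deep learning, many fields have developed a strong demand for training AI models efficiently and conveniently, and such training depends on deep learning training platforms. Managing a deep learning platform requires controlling the platform's tasks, the most important aspect of which is managing the task life cycle, and this in turn depends on the task state updating mechanism.
In current technology, most deep learning platforms support updating of platform task states, and do so mainly in one of two ways. In the first, the platform queries the API provided by the underlying kubernetes (hereinafter K8S) platform in real time, receives Pod state information in real time, and maps it to task state information to return. In the second, a scheduled background job periodically queries the API provided by the underlying K8S platform, receives Pod state information, and the platform maps it to task state information that is saved in the platform database. Neither approach adapts well to task state updating in scenarios involving large-scale clusters, parallel use by multiple users, and large numbers of running tasks. They may even prevent training tasks from ending correctly and prevent task state information and other related information from being displayed correctly, causing problems for users of the deep learning platform. How to update task states accurately, in real time, and effectively in such scenarios, and how to resolve slow responses to task state queries, has therefore become key to the maturing of deep learning platforms.
Summary
In view of this, the purpose of the present application is to provide a task state updating method and apparatus, a device, and a medium that improve the timeliness and accuracy of task state updates in scenarios involving large-scale clusters, parallel use by multiple users, and large numbers of running tasks, and that also improve the response speed of task state queries. The specific solution is as follows: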
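A first aspect of the present application provides a task state updating method, applied to a deep learning platform, including:
creating a K8S event listener, a Pod state update listener, and a task state update listener;
monitoring K8S events by using the K8S event listener to obtain a Pod state change event, and generating and publishing a corresponding Pod state update event on the basis of the Pod state change event;
monitoring the Pod state update event by using the Pod state update listener, and when the Pod state update listener detects the Pod state update event, determining the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and generating and publishing a task state update event corresponding to the target task;
monitoring the task state update event by using the task state update listener, and when the task state update listener detects the task state update event, updating the current state of the target task to the Pod state of the target task.
Optionally, monitoring K8S events by using the K8S event listener to obtain a Pod state change event includes:
monitoring the K8S events by using the K8S event listener, and filtering the monitored K8S events to obtain a Pod state change event.
Optionally, filtering the monitored K8S events includes:
filtering the K8S events according to the space names of the monitored K8S events.
Optionally, generating and publishing a corresponding Pod state update event on the basis of the Pod state change event includes:
extracting target data from the data message corresponding to the Pod state change event, and reconstructing the data message by using the target data;
generating a corresponding Pod state update event according to the reconstructed data message and publishing it.
Optionally, determining the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform includes:
mapping, through a Pod state mapper, the Pod state data corresponding to the Pod state update event to the Pod state of the corresponding target task in the deep learning platform.
Optionally, updating the current state of the target task to the Pod state of the target task includes:
mapping, through a task state mapper, the Pod state of the target task to the current state of the target task.
Optionally, after updating the current state of the target task to the Pod state of the target task, the method further includes:
destroying the K8S event listener, the Pod state update listener, and the task state update listener.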
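A second aspect of the present application provides a task state updating apparatus, applied to a deep learning platform, including:
a creation module, configured to create a K8S event listener, a Pod state update listener, and a task state update listener;
a first monitoring module, configured to monitor K8S events by using the K8S event listener to obtain a Pod state change event, and to generate and publish a corresponding Pod state update event on the basis of the Pod state change event;
a second monitoring module, configured to monitor the Pod state update event by using the Pod state update listener, and when the Pod state update listener detects the Pod state update event, to determine the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and to generate and publish a task state update event corresponding to the target task;
a third monitoring module, configured to monitor the task state update event by using the task state update listener, and when the task state update listener detects the task state update event, to update the current state of the target task to the Pod state of the target task.
A third aspect of the present application provides an electronic device including a processor and a memory, where the memory is used to store a computer program, and the computer program is loaded and executed by the processor to implement the foregoing task state updating method.
A fourth aspect of the present application provides a computer-readable storage medium storing computer-executable instructions which, when loaded and executed by a processor, implement the foregoing task state updating method.
In the present application, a K8S event listener, a Pod state update listener, and a task state update listener are created first; then the K8S event listener is used to monitor K8S events to obtain a Pod state change event, and a corresponding Pod state update event is generated and published on the basis of the Pod state change event; next, the Pod state update listener is used to monitor the Pod state update event, and when the Pod state update listener detects the Pod state update event, the Pod state corresponding to the Pod state update event is determined as the Pod state of the corresponding target task in the deep learning platform, and a task state update event corresponding to the target task is generated and published; finally, the task state update listener is used to monitor the task state update event, and when the task state update listener detects the task state update event, the current state of the target task is updated to the Pod state of the target task. The present application uses the K8S event listener, the Pod state update listener, and the task state update listener to monitor, respectively, the K8S events, Pod state update events, and task state update events that arise during task state updating, and analyzes them in real time to update the task state. These steps deeply integrate the capabilities of K8S, improving the timeliness and accuracy of task state updates in scenarios involving large-scale clusters, parallel use by multiple users, and large numbers of running tasks, and also improving the response speed of task state queries.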
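Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a task state updating method provided by the present application;
FIG. 2 is a schematic diagram of a specific task state updating method provided by the present application;
FIG. 3 is a schematic diagram of the correspondence between tasks and Pods provided by the present application;
FIG. 4 is a schematic diagram of an event generation and publication process provided by the present application;
FIG. 5 is a flowchart of a specific task state updating method provided by the present application;
FIG. 6 is a schematic diagram of a mapping process provided by the present application;
FIG. 7 is a schematic structural diagram of a task state updating apparatus provided by the present application;
FIG. 8 is a structural diagram of an electronic device for task state updating provided by the present application.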
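Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application.
Neither of the existing task state updating approaches adapts well to scenarios involving large-scale clusters, parallel use by multiple users, and large numbers of running tasks. In the first approach, the API provided by the underlying K8S platform is queried in real time, and the Pod state information returned in real time is mapped to task state information and returned. In the second, a scheduled background job periodically queries the API provided by the underlying K8S platform, and the returned Pod state information is mapped to task state information and saved in the platform database. To address these technical shortcomings, the present application provides a task state updating solution that uses a K8S event listener, a Pod state update listener, and a task state update listener to monitor, respectively, the K8S events, Pod state update events, and task state update events that arise during task state updating, and analyzes them in real time to update the task state. By deeply integrating the capabilities of K8S, the solution improves the timeliness and accuracy of task state updates in scenarios involving large-scale clusters, parallel use by multiple users, and large numbers of running tasks, and also improves the response speed of task state queries.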
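FIG. 1 is a flowchart of a task state updating method provided by an embodiment of the present application, applied to a deep learning platform. Referring to FIG. 1, the task state updating method includes:
S11: Create a K8S event listener, a Pod state update listener, and a task state update listener.
In this embodiment, when the deep learning platform starts, a K8S event listener (K8S Event Listener), a Pod state update listener (Pod Status Change Event Listener), and a task state update listener (Task Status Change Event Listener) are created, as shown in FIG. 2. The K8S event listener monitors the events (Filter Event) published by the underlying K8S platform, the Pod state update listener monitors Pod state update events, and the task state update listener monitors task state update events. K8S is an open-source system for managing containerized applications across multiple hosts in a cloud platform. It provides mechanisms for deploying, planning, updating, and maintaining applications, and implements its management functions by defining various types of resources, with the goal of making the deployment of containerized applications simple and efficient. In a K8S cluster, the Pod is the basis of all workload types. It is a group of one or more containers that share storage, network, and namespaces, together with a specification for how to run them. Within a Pod, all containers are scheduled together and run in a shared context. In other words, a Pod is the container in which a task runs: one task may be run by multiple Pods at the same time, but one Pod can run only one task at a time. The correspondence between Pods and tasks is shown in FIG. 3. K8S events are stored in Etcd and record the major events that occur while the cluster runs, such as the Events of resource objects like Pod, Node, and Kubelet, as well as the Events of custom resource objects. The Pod, as the basic scheduling unit, is one kind of resource, and this embodiment is concerned only with the Events of Pod resource objects.
It should be noted that a listener and the events it monitors must share the same event mechanism. Take AIStation as an example. AIStation is a deep learning platform that provides intelligent containerized AI deployment and more efficient distributed training for training tasks. It is an AI development resource platform aimed at enterprise AI training scenarios, and offers containerized deployment, visual development, and centralized management. It provides users with extremely high-performance AI computing resources, delivering efficient computing power support, precise resource management and scheduling, agile data integration and acceleration, and process-oriented integration of AI scenarios and services, and it effectively connects the development environment, computing resources, and data resources to improve development efficiency. Through the AIStation platform, users can create environments for different deep learning frameworks, develop models freely, debug models from the command line, and then quickly submit them from the development platform to the training platform, giving an integrated development and training solution. K8S events can be monitored only by listeners based on the K8S mechanism, whereas events internal to the deep learning platform AIStation are all based on the Spring framework, so the corresponding listeners are also implemented on the Spring framework. Therefore, in this embodiment the K8S event listener is created on the basis of the K8S system framework and can monitor only events of the K8S platform. The Pod state update listener is created on the basis of the Spring event mechanism of the deep learning platform and monitors Pod state update events; likewise, the task state update listener is created on the basis of the Spring event mechanism of the deep learning platform and monitors task state update events. In addition, the numbers of K8S event listeners, Pod state update listeners, and task state update listeners can be set flexibly according to the number of events, the number of tasks actually running, and so on, to improve monitoring efficiency.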
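S12: Monitor K8S events by using the K8S event listener to obtain a Pod state change event, and generate and publish a corresponding Pod state update event on the basis of the Pod state change event.
In this embodiment, the K8S event listener is started and used to monitor the K8S events to obtain a Pod state change event, and a corresponding Pod state update event is then generated and published on the basis of the Pod state change event. As described above, K8S events record the major events that occur while the cluster runs, including but not limited to events related to Pod resource objects, and the resource that can characterize task state is the Pod resource. Therefore, in this embodiment, the qualifying change events related to Pod state, namely the Pod state change events, need to be selected from all K8S events. On this basis, the corresponding Pod state update event is generated from the Pod state change event and published. The event source of the Pod state update event is the Pod state change event: the Pod state update event is triggered by the Pod state change event and published by a broadcaster, as shown in FIG. 4.
In terms of type, the Pod state change event and the Pod state update event are created under different logical architectures, so the listening mechanisms for the two events differ; in this embodiment, the Pod state update event is created on the Spring framework. In terms of information content, the Pod state update event contains part of the data of the Pod state change event, and can be regarded as carrying a condensed version of the information in the Pod state change event.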
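S13: Monitor the Pod state update event by using the Pod state update listener, and when the Pod state update listener detects the Pod state update event, determine the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and generate and publish a task state update event corresponding to the target task.
In this embodiment, the Pod state update listener is started and used to monitor, in real time, the Pod state update events published by the deep learning platform. The number of Pod state update listeners can be set flexibly according to the number of Pod state update events; for example, the number of Pod state update listeners may be positively correlated with the number of Pod state update events. When the Pod state update listener detects the Pod state update event, the Pod state corresponding to the Pod state update event is determined as the Pod state of the corresponding target task in the deep learning platform. At the same time, the deep learning platform generates and publishes a task state update event corresponding to the target task. The task state update event is generated and published in the same way as the Pod state update event described above: it is triggered by the Pod state update event and published by a broadcaster.
S14: Monitor the task state update event by using the task state update listener, and when the task state update listener detects the task state update event, update the current state of the target task to the Pod state of the target task.
In this embodiment, the task state update listener is started and used to monitor the task state update events in real time. Likewise, the number of task state update listeners can be set flexibly according to the number of task state update events; for example, the number of task state update listeners may be positively correlated with the number of task state update events. When the task state update listener detects the task state update event, the current state of the target task is updated to the Pod state of the target task. As described above, there is a correspondence between the target task and Pods: one target task may correspond to multiple Pods, that is, one target task may be run by multiple Pods at the same time, but one Pod can correspond to only one target task at any given moment. When the target task corresponds to only one Pod, that is, when it is run by a single Pod, the state of that Pod is simply determined as the Pod state of the target task. When the target task corresponds to multiple Pods at the same time, that is, when it is run by multiple Pods simultaneously, the states of those Pods must be evaluated logically according to preset rules, and the current state of the target task is then determined from them and updated. Compared with the prior-art approach of querying the API provided by the underlying K8S platform in real time and returning Pod state information in real time to determine the state of the corresponding task, the task state updating method of this embodiment updates task states more quickly. Compared with the prior-art approach of periodically querying the API provided by the underlying K8S platform and storing the returned Pod state information in the deep learning platform's database to determine the state of the corresponding task, the task state updating method of this embodiment updates task states more accurately.
It can be seen that the embodiments of the present application first create a K8S event listener, a Pod state update listener, and a task state update listener; then use the K8S event listener to monitor K8S events to obtain a Pod state change event, and generate and publish a corresponding Pod state update event on the basis of the Pod state change event; next, use the Pod state update listener to monitor the Pod state update event, and when the Pod state update listener detects the Pod state update event, determine the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and generate and publish a task state update event corresponding to the target task; and finally use the task state update listener to monitor the task state update event, and when the task state update listener detects the task state update event, update the current state of the target task to the Pod state of the target task. The embodiments of the present application use the K8S event listener, the Pod state update listener, and the task state update listener to monitor, respectively, the K8S events, Pod state update events, and task state update events that arise during task state updating, and analyze them in real time to update the task state. These steps deeply integrate the capabilities of K8S, improving the timeliness and accuracy of task state updates in scenarios involving large-scale clusters, parallel use by multiple users, and large numbers of running tasks, and also improving the response speed of task state queries.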
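The three-stage chain of steps S11 to S14 can be illustrated with a minimal in-process sketch. It is not the platform's implementation: a plain dictionary of handlers stands in for both the K8S listening mechanism and the Spring event mechanism, and the event field names (`kind`, `name`, `phase`), the `pod_to_task` lookup, and the `task_states` store are all assumptions introduced for illustration.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]

handlers: Dict[str, List[Handler]] = {}   # event type -> registered listeners
task_states: Dict[str, str] = {}          # stand-in for the platform database
pod_to_task = {"pod-a": "task-1"}         # hypothetical Pod-to-task lookup


def subscribe(event_type: str, handler: Handler) -> None:
    """Register a listener for one event type (the 'create listener' step)."""
    handlers.setdefault(event_type, []).append(handler)


def publish(event_type: str, event: dict) -> None:
    """Broadcast an event to every listener registered for its type."""
    for handler in handlers.get(event_type, []):
        handler(event)


def on_k8s_event(event: dict) -> None:
    # Keep only Pod-related events and republish a reduced message.
    if event.get("kind") == "Pod":
        publish("pod_state_update", {"pod": event["name"], "state": event["phase"]})


def on_pod_state_update(event: dict) -> None:
    # Attach the Pod state to its task, then announce the task-level change.
    task = pod_to_task.get(event["pod"])
    if task is not None:
        publish("task_state_update", {"task": task, "pod_state": event["state"]})


def on_task_state_update(event: dict) -> None:
    # Final stage: record the new state of the task.
    task_states[event["task"]] = event["pod_state"]


subscribe("k8s_event", on_k8s_event)
subscribe("pod_state_update", on_pod_state_update)
subscribe("task_state_update", on_task_state_update)

publish("k8s_event", {"kind": "Pod", "name": "pod-a", "phase": "Running"})
publish("k8s_event", {"kind": "Node", "name": "node-1", "phase": "Ready"})
print(task_states)  # {'task-1': 'Running'}
```

The Node event is dropped at the first stage, so only the Pod event reaches the task store. Because each stage reacts to a published event, a state query reads the stored value without calling the K8S API.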
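FIG. 5 is a flowchart of a specific task state updating method provided by an embodiment of the present application, applied to a deep learning platform. Referring to FIG. 5, the task state updating method includes:
S21: Create a K8S event listener, a Pod state update listener, and a task state update listener.
In this embodiment, for the specific process of step S21, reference may be made to the corresponding content disclosed in the foregoing embodiment, which is not repeated here.
S22: Monitor the K8S events by using the K8S event listener, and filter the monitored K8S events to obtain a Pod state change event.
In this embodiment, the K8S event listener is started and used to monitor the K8S events, and the monitored K8S events are filtered to obtain a Pod state change event. Filtering is the process of selecting the Pod state change events from all the K8S events. Specifically, the K8S events can be filtered according to the space names of the monitored K8S events. Generally speaking, K8S events are recorded by the Kubelet to capture what happens while containers run, and the space name is composed of the recorded object and a timestamp, so it indicates whether an event under that space name is a Pod state change event. The K8S events can therefore be filtered by space name to obtain the qualifying Pod state change events.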
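The filtering in step S22 can be sketched as a simple predicate over incoming events. This is only an illustration under stated assumptions: each event is modelled as a dictionary with a hypothetical `space_name` field, and the set of accepted space names is supplied by the caller. The text does not specify the exact format of a space name, so the values used here are invented.

```python
from typing import Iterable, List, Set


def filter_by_space_name(events: Iterable[dict], accepted: Set[str]) -> List[dict]:
    """Keep only the events whose space name is in the accepted set."""
    return [event for event in events if event.get("space_name") in accepted]


# Hypothetical sample events; only the first belongs to an accepted space.
sample_events = [
    {"space_name": "training", "object": "job-1-worker-0", "reason": "Started"},
    {"space_name": "kube-system", "object": "coredns", "reason": "Pulled"},
    {"object": "no-space-name", "reason": "Unknown"},
]

pod_state_change_events = filter_by_space_name(sample_events, {"training"})
print([event["object"] for event in pod_state_change_events])  # ['job-1-worker-0']
```

Events without a space name are dropped as well, because `dict.get` returns `None` for them and `None` is not in the accepted set.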
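S23: Extract target data from the data message corresponding to the Pod state change event, reconstruct the data message by using the target data, and generate and publish a corresponding Pod state update event according to the reconstructed data message.
In this embodiment, target data is first extracted from the data message corresponding to the Pod state change event and used to reconstruct the data message, and the corresponding Pod state update event is then generated from the reconstructed data message and published. Both the K8S events and the Pod state change events exist in the form of data messages that carry a large amount of information. Because K8S events include all kinds of low-level events of the K8S platform, the data volume is large and the corresponding messages are large as well. The useful content, namely the target data, therefore needs to be filtered out and extracted, and this content is then used, on the basis of the Spring event mechanism of the deep learning platform, to generate the data message of a condensed event, which yields the Pod state update event that is published by a broadcaster.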
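The extract-and-reconstruct idea of step S23 can be sketched as follows. The sketch assumes the raw message is laid out like a Kubernetes Pod object (`metadata.name`, `metadata.namespace`, `status.phase`). Which fields count as "target data" is not stated in the text, so the three fields chosen here and the `PodStateUpdateEvent` class are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodStateUpdateEvent:
    """Condensed message carrying only the fields later stages need (assumed set)."""
    pod_name: str
    namespace: str
    state: str


def rebuild_message(raw: dict) -> PodStateUpdateEvent:
    """Extract the target data from a large raw message and rebuild a small one."""
    return PodStateUpdateEvent(
        pod_name=raw["metadata"]["name"],
        namespace=raw["metadata"]["namespace"],
        state=raw["status"]["phase"],
    )


# Hypothetical raw message; most of its content is irrelevant to task state.
raw_message = {
    "metadata": {
        "name": "job-1-worker-0",
        "namespace": "training",
        "labels": {"app": "trainer"},
        "annotations": {"note": "large, unused payload"},
    },
    "spec": {"containers": [{"name": "main", "image": "example/trainer"}]},
    "status": {"phase": "Running", "hostIP": "10.0.0.5"},
}

print(rebuild_message(raw_message))
```

Downstream listeners then work with a three-field record instead of the full message, which is the point of the condensed event described above.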
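S24: Monitor the Pod state update event by using the Pod state update listener, and when the Pod state update listener detects the Pod state update event, map, through a Pod state mapper, the Pod state data corresponding to the Pod state update event to the Pod state of the corresponding target task in the deep learning platform, and generate and publish a task state update event corresponding to the target task.
In this embodiment, the Pod state update listener is started and used to monitor the Pod state update event. When the Pod state update listener detects the Pod state update event, the Pod state data corresponding to the Pod state update event is mapped, through the Pod state mapper, to the Pod state of the corresponding target task in the deep learning platform, and a task state update event corresponding to the target task is generated and published at the same time.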
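A Pod state mapper of the kind used in step S24 can be sketched as a lookup table. The keys are the standard Kubernetes Pod phases (Pending, Running, Succeeded, Failed, Unknown). The platform-side state names in `PlatformPodState` are hypothetical, since the text does not list the platform's own Pod states.

```python
from enum import Enum


class PlatformPodState(Enum):
    """Hypothetical Pod states on the deep learning platform side."""
    WAITING = "waiting"
    RUNNING = "running"
    FINISHED = "finished"
    ERROR = "error"
    UNKNOWN = "unknown"


# Kubernetes Pod phase -> platform Pod state (the mapping itself is an assumption).
_PHASE_TO_STATE = {
    "Pending": PlatformPodState.WAITING,
    "Running": PlatformPodState.RUNNING,
    "Succeeded": PlatformPodState.FINISHED,
    "Failed": PlatformPodState.ERROR,
    "Unknown": PlatformPodState.UNKNOWN,
}


def map_pod_state(phase: str) -> PlatformPodState:
    """Map a K8S Pod phase to a platform Pod state; unrecognized input maps to UNKNOWN."""
    return _PHASE_TO_STATE.get(phase, PlatformPodState.UNKNOWN)


print(map_pod_state("Succeeded").value)  # finished
```

Falling back to UNKNOWN for unrecognized input means a new or malformed phase value cannot raise an exception in the listener.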
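S25: Monitor the task state update event by using the task state update listener, and when the task state update listener detects the task state update event, map, through a task state mapper, the Pod state of the target task to the current state of the target task.
In this embodiment, on the above basis, the task state update listener is started and used to monitor the task state update event. When the task state update listener detects the task state update event, the Pod state of the target task is mapped, through the task state mapper, to the current state of the target task. The process by which the task state mapper maps the Pod state of the target task to the current state of the target task is the process of applying the preset rules described in the foregoing embodiment, that is, determining the current state of the target task from its Pod states according to the preset rules. As shown in FIG. 6, when the target task corresponds to only one Pod, it has a single Pod state, which is directly determined as the current state of the target task. When the target task corresponds to N Pods (Pod1, Pod2, …, PodN), the target task has N Pod states (Pod1 state, Pod2 state, …, PodN state). In this case the preset rule may be "if any Pod state is running, the current task state is running" or "if any Pod state is error, the current task state is error". The preset rules can be set according to actual service requirements and are not limited in this embodiment.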
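The preset rules for a task that runs on several Pods can be sketched as a small aggregation function. The two rules quoted in the text ("any Pod running" and "any Pod error") are given as alternatives and the text leaves the rule set to the implementer, so the precedence used here (error first, then running, then finished, otherwise waiting) and the state names are assumptions.

```python
from typing import Iterable


def map_task_state(pod_states: Iterable[str]) -> str:
    """Derive one task state from the states of all Pods running the task."""
    states = list(pod_states)
    if not states:
        return "unknown"              # no Pod information available
    if len(states) == 1:
        return states[0]              # single Pod: its state is the task state
    if "error" in states:
        return "error"                # assumed rule: any failed Pod fails the task
    if "running" in states:
        return "running"              # rule from the text: any running Pod
    if all(state == "finished" for state in states):
        return "finished"             # assumed rule: all Pods done
    return "waiting"                  # assumed default for mixed pending states


print(map_task_state(["running", "waiting", "error"]))  # error
```

A platform that prefers to keep a task in the running state while any Pod is still working would swap the order of the error and running checks.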
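S26: Destroy the K8S event listener, the Pod state update listener, and the task state update listener.
In this embodiment, the deep learning platform executes the above steps S21 to S26 in a loop and updates, in real time, the states of the tasks it manages, thereby managing the life cycle of the tasks running on the platform. When the deep learning platform stops running, the K8S event listener, the Pod state update listener, and the task state update listener are destroyed to release platform resources. Specifically, a predefined class implements an interface, and the destroy method in that interface is called to destroy the K8S event listener, the Pod state update listener, and the task state update listener.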
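The destruction step can be sketched with an abstract interface that every listener implements. The names `Destroyable`, `Listener`, and `shutdown` are hypothetical. The text says only that a predefined class implements an interface and that its destroy method is called; it does not name the interface.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class Destroyable(ABC):
    """Hypothetical interface exposing a destroy hook."""

    @abstractmethod
    def destroy(self) -> None:
        ...


class Listener(Destroyable):
    """Minimal listener that only tracks whether it is still active."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.active = True

    def destroy(self) -> None:
        # Release whatever the listener holds; here only the flag is cleared.
        self.active = False


def shutdown(listeners: Iterable[Destroyable]) -> List[str]:
    """Destroy every listener at platform stop and report which ones were destroyed."""
    destroyed = []
    for listener in listeners:
        listener.destroy()
        destroyed.append(getattr(listener, "name", type(listener).__name__))
    return destroyed


all_listeners = [
    Listener("k8s_event_listener"),
    Listener("pod_state_update_listener"),
    Listener("task_state_update_listener"),
]
print(shutdown(all_listeners))
```

Calling `shutdown` once, when the platform stops, matches the description that the listeners are destroyed at shutdown to release platform resources.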
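It can be seen that the embodiments of the present application mainly provide a task state updating mechanism for a deep learning training platform, suitable for task state updating on deep learning training platforms running on large-scale clusters. The mechanism addresses how to update task states accurately, in real time, and effectively in scenarios involving large-scale clusters, parallel use by multiple users, and large numbers of running tasks, and overcomes slow responses to task state queries. It also avoids the situation in which a training task cannot end correctly because its state is not updated in time, so that the task's state information and other related information cannot be displayed correctly and users encounter problems when using the deep learning platform.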
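Referring to FIG. 7, an embodiment of the present application correspondingly discloses a task state updating apparatus, applied to a deep learning platform, including:
a creation module 11, configured to create a K8S event listener, a Pod state update listener, and a task state update listener;
a first monitoring module 12, configured to monitor K8S events by using the K8S event listener to obtain a Pod state change event, and to generate and publish a corresponding Pod state update event on the basis of the Pod state change event;
a second monitoring module 13, configured to monitor the Pod state update event by using the Pod state update listener, and when the Pod state update listener detects the Pod state update event, to determine the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and to generate and publish a task state update event corresponding to the target task;
a third monitoring module 14, configured to monitor the task state update event by using the task state update listener, and when the task state update listener detects the task state update event, to update the current state of the target task to the Pod state of the target task.
It can be seen that the embodiments of the present application first create a K8S event listener, a Pod state update listener, and a task state update listener; then use the K8S event listener to monitor K8S events to obtain a Pod state change event, and generate and publish a corresponding Pod state update event on the basis of the Pod state change event; next, use the Pod state update listener to monitor the Pod state update event, and when the Pod state update listener detects the Pod state update event, determine the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and generate and publish a task state update event corresponding to the target task; and finally use the task state update listener to monitor the task state update event, and when the task state update listener detects the task state update event, update the current state of the target task to the Pod state of the target task. The embodiments of the present application use the K8S event listener, the Pod state update listener, and the task state update listener to monitor, respectively, the K8S events, Pod state update events, and task state update events that arise during task state updating, and analyze them in real time to update the task state. These steps deeply integrate the capabilities of K8S, improving the timeliness and accuracy of task state updates in scenarios involving large-scale clusters, parallel use by multiple users, and large numbers of running tasks, and also improving the response speed of task state queries.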
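In some specific embodiments, the first monitoring module 12 specifically includes:
a filtering unit, configured to monitor the K8S events by using the K8S event listener and to filter the monitored K8S events to obtain a Pod state change event;
an extraction unit, configured to extract target data from the data message corresponding to the Pod state change event;
a reconstruction unit, configured to reconstruct the data message by using the target data, and to generate and publish a corresponding Pod state update event according to the reconstructed data message.
In some specific embodiments, the second monitoring module 13 is further configured to map, through a Pod state mapper, the Pod state data corresponding to the Pod state update event to the Pod state of the corresponding target task in the deep learning platform.
In some specific embodiments, the third monitoring module 14 is further configured to map, through a task state mapper, the Pod state of the target task to the current state of the target task.
In some specific embodiments, the task state updating apparatus further includes a destruction module, configured to destroy the K8S event listener, the Pod state update listener, and the task state update listener.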
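Further, an embodiment of the present application also provides an electronic device. FIG. 8 is a structural diagram of an electronic device 20 according to an exemplary embodiment, and the contents of the figure should not be regarded as limiting the scope of use of the present application in any way.
FIG. 8 is a schematic structural diagram of an electronic device 20 provided by an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is used to store a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps of the task state updating method disclosed in any of the foregoing embodiments.
In this embodiment, the power supply 23 provides the working voltage for each hardware device on the electronic device 20. The communication interface 24 creates a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application, which is not specifically limited here. The input/output interface 25 obtains external input data or outputs data to the outside, and its specific interface type can be selected according to the needs of the application, which is not specifically limited here.
In addition, the memory 22 serves as a carrier for resource storage and may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like. The resources stored on it may include an operating system 221, a computer program 222, state change data 223, and so on, and the storage may be temporary or permanent.
The operating system 221 manages and controls each hardware device on the electronic device 20 as well as the computer program 222, so that the processor 21 can operate on and process the massive state change data 223 in the memory 22; it may be Windows Server, Netware, Unix, Linux, or the like. In addition to the computer program that carries out the task state updating method performed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for other specific tasks. The state change data 223 may include the state change data collected by the electronic device 20.
Further, an embodiment of the present application also discloses a storage medium in which a computer program is stored; when the computer program is loaded and executed by a processor, the steps of the task state updating method disclosed in any of the foregoing embodiments are implemented.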
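The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, it is described relatively briefly, and the relevant details can be found in the description of the method.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
The task state updating method, apparatus, device, and storage medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. A person of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present application.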

Claims (10)

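  1. A task state updating method, characterized in that it is applied to a deep learning platform and comprises:
    creating a K8S event listener, a Pod state update listener, and a task state update listener;
    monitoring K8S events by using the K8S event listener to obtain a Pod state change event, and generating and publishing a corresponding Pod state update event on the basis of the Pod state change event;
    monitoring the Pod state update event by using the Pod state update listener, and when the Pod state update listener detects the Pod state update event, determining the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and generating and publishing a task state update event corresponding to the target task;
    monitoring the task state update event by using the task state update listener, and when the task state update listener detects the task state update event, updating the current state of the target task to the Pod state of the target task.
  2. The task state updating method according to claim 1, characterized in that monitoring K8S events by using the K8S event listener to obtain a Pod state change event comprises:
    monitoring the K8S events by using the K8S event listener, and filtering the monitored K8S events to obtain a Pod state change event.
  3. The task state updating method according to claim 2, characterized in that filtering the monitored K8S events comprises:
    filtering the K8S events according to the space names of the monitored K8S events.
  4. The task state updating method according to claim 1, characterized in that generating and publishing a corresponding Pod state update event on the basis of the Pod state change event comprises:
    extracting target data from the data message corresponding to the Pod state change event, and reconstructing the data message by using the target data;
    generating a corresponding Pod state update event according to the reconstructed data message and publishing it.
  5. The task state updating method according to claim 1, characterized in that determining the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform comprises:
    mapping, through a Pod state mapper, the Pod state data corresponding to the Pod state update event to the Pod state of the corresponding target task in the deep learning platform.
  6. The task state updating method according to claim 1, characterized in that updating the current state of the target task to the Pod state of the target task comprises:
    mapping, through a task state mapper, the Pod state of the target task to the current state of the target task.
  7. The task state updating method according to any one of claims 1 to 6, characterized in that, after updating the current state of the target task to the Pod state of the target task, the method further comprises:
    destroying the K8S event listener, the Pod state update listener, and the task state update listener.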
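  8. A task state updating apparatus, characterized in that it is applied to a deep learning platform and comprises:
    a creation module, configured to create a K8S event listener, a Pod state update listener, and a task state update listener;
    a first monitoring module, configured to monitor K8S events by using the K8S event listener to obtain a Pod state change event, and to generate and publish a corresponding Pod state update event on the basis of the Pod state change event;
    a second monitoring module, configured to monitor the Pod state update event by using the Pod state update listener, and when the Pod state update listener detects the Pod state update event, to determine the Pod state corresponding to the Pod state update event as the Pod state of the corresponding target task in the deep learning platform, and to generate and publish a task state update event corresponding to the target task;
    a third monitoring module, configured to monitor the task state update event by using the task state update listener, and when the task state update listener detects the task state update event, to update the current state of the target task to the Pod state of the target task.
  9. An electronic device, characterized in that the electronic device comprises a processor and a memory, wherein the memory is used to store a computer program, and the computer program is loaded and executed by the processor to implement the task state updating method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that it is used to store computer-executable instructions which, when loaded and executed by a processor, implement the task state updating method according to any one of claims 1 to 7.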
PCT/CN2022/074599 2021-03-18 2022-01-28 一种任务状态更新方法、装置、设备及介质 WO2022193855A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/268,307 US11915035B1 (en) 2021-03-18 2022-01-28 Task state updating method and apparatus, device, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110290936.2 2021-03-18
CN202110290936.2A CN113010385B (zh) 2021-03-18 2021-03-18 一种任务状态更新方法、装置、设备及介质

Publications (1)

Publication Number Publication Date
WO2022193855A1 true WO2022193855A1 (zh) 2022-09-22

Family

ID=76409701

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074599 WO2022193855A1 (zh) 2021-03-18 2022-01-28 一种任务状态更新方法、装置、设备及介质

Country Status (3)

Country Link
US (1) US11915035B1 (zh)
CN (1) CN113010385B (zh)
WO (1) WO2022193855A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010385B (zh) 2021-03-18 2022-10-28 山东英信计算机技术有限公司 一种任务状态更新方法、装置、设备及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200301782A1 (en) * 2019-03-20 2020-09-24 International Business Machines Corporation Scalable multi-framework multi-tenant lifecycle management of deep learning applications
CN111741257A (zh) * 2020-05-21 2020-10-02 深圳市商汤科技有限公司 数据处理方法及装置、电子设备及存储介质
CN112000363A (zh) * 2020-07-30 2020-11-27 苏州浪潮智能科技有限公司 一种管理大数据组件配置文件的方法及系统
CN112433818A (zh) * 2020-11-30 2021-03-02 上海天旦网络科技发展有限公司 使Kubernetes持久化的方法和系统
CN112486634A (zh) * 2020-12-09 2021-03-12 浪潮云信息技术股份公司 一种实现容器云平台整体监控的方法
CN113010385A (zh) * 2021-03-18 2021-06-22 山东英信计算机技术有限公司 一种任务状态更新方法、装置、设备及介质

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140013247A1 (en) * 2012-07-03 2014-01-09 salesforce.com,inc. Systems and methods for providing a customized user interface for publishing into a feed
CN107666525B (zh) * 2017-09-08 2020-11-24 北京京东尚科信息技术有限公司 集群容器ip分配的方法和装置
CN108039975B (zh) * 2017-12-21 2020-08-28 北京搜狐新媒体信息技术有限公司 容器集群管理系统及其应用方法
CN108062246B (zh) * 2018-01-25 2019-06-14 北京百度网讯科技有限公司 用于深度学习框架的资源调度方法和装置
CN110213309B (zh) * 2018-03-13 2022-02-01 腾讯科技(深圳)有限公司 一种绑定关系管理的方法、设备及存储介质
CN110321115B (zh) * 2018-03-30 2022-12-13 中移(苏州)软件技术有限公司 一种Pod创建方法及设备
CN109831500B (zh) * 2019-01-30 2020-04-28 无锡华云数据技术服务有限公司 Kubernetes集群中配置文件与Pod的同步方法
CN110162471B (zh) * 2019-04-28 2023-08-11 中国工商银行股份有限公司 一种基于容器云的压力测试方法及系统
CN110502340A (zh) * 2019-08-09 2019-11-26 广东浪潮大数据研究有限公司 一种资源动态调整方法、装置、设备及存储介质
CN110912972B (zh) * 2019-11-07 2022-08-19 北京浪潮数据技术有限公司 一种业务处理方法、系统、电子设备及可读存储介质
CN111431740B (zh) * 2020-03-16 2023-07-14 深信服科技股份有限公司 数据的传输方法、装置、设备及计算机可读存储介质
CN111352717B (zh) * 2020-03-24 2023-04-07 广西梯度科技股份有限公司 一种实现kubernetes自定义调度器的方法
CN111427665A (zh) * 2020-03-27 2020-07-17 合肥本源量子计算科技有限责任公司 一种量子应用云平台及量子计算任务的处理方法
CN111538563A (zh) * 2020-04-14 2020-08-14 北京宝兰德软件股份有限公司 一种对Kubernetes的事件分析方法及装置
CN111897625B (zh) * 2020-06-23 2023-10-20 新浪技术(中国)有限公司 一种基于Kubernetes集群的资源事件回溯方法、系统及电子设备
CN112039963B (zh) * 2020-08-21 2023-04-07 广州虎牙科技有限公司 一种处理器的绑定方法、装置、计算机设备和存储介质
CN112104486A (zh) * 2020-08-31 2020-12-18 中国—东盟信息港股份有限公司 一种基于Kubernetes容器的网络端点切片的方法及其系统
CN112104723B (zh) * 2020-09-07 2024-03-15 腾讯科技(深圳)有限公司 一种多集群的数据处理系统及方法
CN112068935A (zh) * 2020-09-15 2020-12-11 北京值得买科技股份有限公司 kubernetes程序部署监控方法、装置以及设备
CN112087522B (zh) * 2020-09-18 2021-10-22 北京航空航天大学 一种面向工业机器人数据处理的边云协同流程编排系统

Also Published As

Publication number Publication date
US20240054003A1 (en) 2024-02-15
CN113010385A (zh) 2021-06-22
CN113010385B (zh) 2022-10-28
US11915035B1 (en) 2024-02-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22770214

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18268307

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22770214

Country of ref document: EP

Kind code of ref document: A1