CN113157476A - Processing method and device for display card fault in virtual cloud environment - Google Patents

Processing method and device for display card fault in virtual cloud environment

Info

Publication number
CN113157476A
Authority
CN
China
Prior art keywords
display card
fault
change event
card
display
Prior art date
Legal status
Pending
Application number
CN202110384811.6A
Other languages
Chinese (zh)
Inventor
张浩然
程童
赵欢
吕亚霖
Current Assignee
Zuoyebang Education Technology Beijing Co Ltd
Original Assignee
Zuoyebang Education Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Zuoyebang Education Technology Beijing Co Ltd
Priority to CN202110384811.6A
Publication of CN113157476A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 - the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0712 - the processing taking place in a virtual computing platform, e.g. logically partitioned systems
    • G06F11/0793 - Remedial or corrective actions
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 - Arrangements for executing specific programs
    • G06F9/455 - Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 - Hypervisors; Virtual machine monitors
    • G06F9/45558 - Hypervisor-specific management and integration aspects
    • G06F2009/4557 - Distribution of virtual machine instances; Migration and load balancing
    • G06F2009/45575 - Starting, stopping, suspending or resuming virtual machine instances
    • G06F2009/45595 - Network integration; Enabling network access in virtual machine instances
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources to service a request
    • G06F9/5011 - Allocation of resources, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5061 - Partitioning or combining of resources
    • G06F9/5077 - Logical partitioning of resources; Management or configuration of virtualized resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to the technical field of display card fault handling, and discloses a method for processing display card faults in a virtual cloud environment, which comprises the following steps: monitoring a change event representing a display card fault; performing data synchronization processing on stored display card resource information according to the change event, wherein the display card resource information comprises the topological structure among cluster nodes, display cards and services in the virtual cloud environment as well as display card state information; and, based on the synchronized display card resource information, performing subsequent processing on the display card fault in the virtual cloud environment and synchronously updating the display card resource information. With this processing method, the service or cluster node corresponding to a faulty display card can be handled in response to the change event representing the display card fault, so that the faulty display card is shielded and stable operation of the virtualization framework is guaranteed; and the display card resource information is updated synchronously, ensuring that it remains current in real time, so that display card faults in the virtual cloud environment can be monitored and handled in real time.

Description

Processing method and device for display card fault in virtual cloud environment
Technical Field
The invention relates to the technical field of processing of display card faults, in particular to a method and a device for processing display card faults in a virtual cloud environment.
Background
In actual use, graphics card faults occur frequently. A fault in a graphics card that is in normal use makes the service running on it unavailable; that service should no longer be served from the faulty card and needs to be scheduled away. At present the industry has no established way of handling GPU graphics card faults in a virtual cloud environment such as a Kubernetes environment.
Previously, physical-machine GPU clusters had schemes for shielding faulty display cards. When such a cluster is migrated to a virtualized cloud environment based on Kubernetes and containers, however, one of the characteristics of virtualization is that applications are decoupled from servers: the host on which an application actually runs is not fixed and may be migrated frequently by the scheduler. What the client perceives is only an entry address; the runtime environment behind it is not known at all. A faulty-display-card shielding scheme that works in a physical-machine environment therefore becomes invalid, the prior art has no display card fault handling method aimed at the cloud environment, and a redesign for the cloud environment is required.
In addition, in the existing virtualized architecture the scheduling process cannot shield display card faults, so the new architecture cannot operate stably.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
The invention aims to provide a scheme for shielding faulty display cards in a virtual cloud environment, so as to ensure stable operation of the virtualization framework.
In order to achieve the above purpose, the invention provides the following technical scheme:
the invention provides a method for processing a display card fault in a virtual cloud environment, which comprises the following steps:
monitoring a change event representing the fault of the display card;
performing data synchronization processing on stored display card resource information according to the change event, wherein the display card resource information comprises a cluster node in the virtual cloud environment, a topological structure among a display card and a service and display card state information;
and based on the synchronized display card resource information, performing subsequent processing on the display card fault in the virtual cloud environment, and synchronously updating the display card resource information.
As an optional implementation manner of the present invention, the display card resource information is stored by using an adjacency matrix; the synchronization processing is data synchronization processing for the adjacency matrix.
As an optional embodiment of the present invention, the virtual cloud environment is a Kubernetes-based virtual cloud environment, and the processing method further comprises: acquiring display card fault information, and creating a configuration file in a namespace to record the display card fault information;
the monitoring of the change event representing the display card fault comprises: monitoring a change event of the configuration file under the namespace;
optionally, the change event of the configuration file under the namespace is received by maintaining a long connection with the apiserver component of Kubernetes.
As an optional implementation manner of the present invention, the acquiring the display card failure information and creating a configuration file in a namespace to record the display card failure information includes:
step S1001, starting a display card detection program on each GPU machine;
step S1002, the display card detection program periodically acquires the fault information of all display cards;
step S1003, if the display card fault information is acquired, the configuration file is created in the name space;
optionally, in step S1001, the display card detection program is started on each GPU machine by starting a Kubernetes daemon process (DaemonSet);
optionally, in step S1002, the display card detection program periodically obtains the fault information of all display cards through the driver API of the display cards.
As an optional embodiment of the present invention, the performing data synchronization processing on the stored graphics card resource information according to the change event includes one or more of the following:
judging whether the change event is a fault new event or not, if so, acquiring an IP address of a fault display card and a display card number list, and storing the IP address and the display card number into the display card resource information;
judging whether the change event is a display card state change event or not, if so, acquiring the IP address and the display card number list of the display card to be modified, modifying the display card resource information based on the IP address and the display card number list of the display card to be modified, and changing the display card state;
and judging whether the change event is a fault deletion event, if so, acquiring the IP address of the display card corresponding to the fault to be deleted, and deleting the corresponding fault data in the display card resource information based on the IP address.
As an optional embodiment of the present invention, the performing subsequent processing on the graphics card fault in the virtual cloud environment includes:
when the display card is distributed for the newly created service, the fault display card is not used any more;
and expelling the existing service on the fault display card.
As an optional embodiment of the present invention, the evicting of existing services on the failed display card comprises: in a Kubernetes virtual cloud environment, subscribing to the change event, under the namespace, that represents the display card fault, deleting the existing services on the failed display card when the display card fault information is obtained, and synchronizing the service change information to the display card resource information;
optionally, when the display card resource information is stored in the adjacency matrix, the evicting the existing service on the failed display card includes:
subscribing a configuration file change event in the name space and judging the type of the change event, wherein the configuration file is a file which is created in the name space and records the fault information of the display card when the fault information of the display card is acquired;
when the received change event is a configuration file new event or a configuration file modification event, acquiring the information of the faulty display card involved in the change event and judging whether the total number of faulty display cards on the GPU machine involved in the change event exceeds a set value; if it does, traversing the adjacency matrix, deleting the services on all display cards of that GPU machine and deleting the GPU machine, so that Kubernetes completes the subsequent migration and restart of the services of the faulty display cards; if it does not, finding the service list corresponding to the faulty display card involved in the change event in the adjacency matrix and deleting it, so that Kubernetes completes the subsequent migration and restart of the deleted services.
As an optional embodiment of the present invention, in a Kubernetes virtual cloud environment, and when the display card resource information is stored in an adjacency matrix, the allocating of display cards for newly created services without using the failed display card comprises:
when display card allocation is started for a newly created service, acquiring the adjacency matrix, traversing the GPU machines and obtaining the available display card resources, from which the faulty display cards have already been filtered out;
judging whether the available display card resources are sufficient, selecting an available display card to which the newly created service is allocated, finding in the adjacency matrix the GPU machine and display card number corresponding to the display card to which the newly created service is allocated, and updating the service list.
The embodiment of the invention also provides a processing device for display card faults in a virtual cloud environment, which comprises:
the data processing module is used for storing the display card resource information in the virtual cloud environment, and carrying out data synchronization processing on the stored display card resource information according to the type of a change event when the change event representing the display card fault is monitored, wherein the display card resource information comprises cluster nodes, a topological structure among display cards and services and display card state information;
and the subsequent processing module is used for performing subsequent processing on the display card fault in the virtual cloud environment based on the synchronized display card resource information, and synchronously updating the display card resource information through the data processing module.
As an optional implementation manner of this embodiment, the virtual cloud environment is a Kubernetes-based virtual cloud environment, and the processing apparatus further includes a probe module, wherein the probe module is used for acquiring the display card fault information of each GPU machine and creating the configuration file in a namespace to record the display card fault information;
optionally, when the data processing module monitors a profile change event: judging whether the configuration file change event is a configuration file new event or not, if so, acquiring an IP address and a display card number list of a fault display card contained in the new event, and storing fault display card information into display card resource information according to the IP address and the display card number; judging whether the configuration file change event is a display card state change event or not, if so, acquiring an IP address and a display card number list pointed by the display card state change event, modifying the display card resource information based on the IP address and the display card number list, and changing the display card state; judging whether the configuration file change event is a configuration file deletion event or not, if so, acquiring the IP address and the display card number of the display card pointed by the configuration file deletion event, and deleting corresponding fault data in the display card resource information based on the IP address and the display card number;
optionally, the subsequent processing module includes an evictor module, where the evictor module is configured to delete the existing services on a faulty display card and synchronize the service change information to the display card resource information when a display card fault is learned by subscribing to the configuration file change events of the namespace;
optionally, the subsequent processing module includes a scheduling module, where the scheduling module is configured to perform display card allocation for a newly created service, and when the scheduling module starts display card allocation, the scheduling module acquires available display card resources according to the display card resource information, where a fault display card has been filtered out from the available display card resources, and determines whether the available display card resources are sufficient, selects an available display card to allocate the newly created service, and updates a service list corresponding to the display card in the display card resource information;
further optionally, the data processing module stores the display card resource information by using an adjacency matrix;
further optionally, the data processing module starts a single instance in the global scope, registers a namespace, and monitors configuration file change events under the namespace; it judges the type of each change event and performs data synchronization processing on the adjacency matrix according to the type of the change event; optionally, the data processing module receives the change events of the configuration file in the namespace by maintaining a long connection with the apiserver of the Kubernetes system;
further optionally, the probe module is deployed as a DaemonSet, and its main flow is as follows:
a corresponding detection program is started on each GPU machine; it loops over every display card and calls the driver API to judge the card state; if the display card is judged healthy, it returns to the loop, and if the display card is judged unhealthy, it creates/modifies/deletes, in the namespace of the probe module, the configuration file resource containing the node IP and the display card;
optionally, the evictor module starts a single instance in the global scope, the single instance shares the namespace with the data processing module and subscribes to the change events of the configuration file in that namespace; when the change event is a new event or a modification event, it judges whether the number of faulty display cards implied by the new event or modification event exceeds a set value; if so, it traverses the adjacency matrix and deletes the service lists corresponding to all display cards of the GPU machine, and if not, it finds the service list corresponding to the faulty display card in the adjacency matrix and deletes it;
optionally, the scheduling module starts a single instance in the global scope; when display card allocation is started for a newly created service, it acquires the adjacency matrix, traverses the GPU machines, filters out the faulty display cards, judges whether the display card resources are sufficient, allocates the newly created service to an available display card, then finds the corresponding GPU machine and display card in the adjacency matrix and updates the service list.
The invention also provides an electronic device, which comprises a processor and a memory, wherein the memory is used for storing a computer executable program, and when the computer program is executed by the processor, the processor executes the processing method for processing the display card fault in the virtual cloud environment.
Compared with the prior art, the invention has the beneficial effects that:
the processing method for the display card fault in the virtual cloud environment stores the display card resource information in the virtual environment, such as the topological structure and the state among cluster nodes, display cards and services; monitoring a change event representing the fault of the display card, and when the change event is monitored, performing data synchronization processing on the stored display card resource information according to the change event so as to realize real-time updating of the display card resource information; then, based on the synchronized display card resource information, subsequent processing is performed on the display card fault in the virtual cloud environment, and the service or cluster node corresponding to the fault display card can be processed according to the change event of the display card fault, so that the fault display card is shielded, and the stable operation of the virtual frame is guaranteed; and the display card resource information is synchronously updated, the display card resource information is ensured to be updated in real time, real-time monitoring is carried out aiming at the virtual cloud environment, and real-time monitoring and real-time processing are carried out aiming at the display card fault in the virtual cloud environment.
Through the scheme of the invention, related functions can be provided for other modules for inquiring and modifying IP, display card numbers and service data.
According to the method, aiming at the scene that the display card in the virtual environment breaks down, services existing on the broken-down display card can be expelled according to the stored display card resource information in the virtual cloud environment; and controlling the new service to not use the fault display card any more.
The invention can shield the fault display card aiming at the virtual environment and ensure the stable operation of the virtual frame.
Description of the drawings:
FIG. 1 is a block flow diagram of a method for handling graphics card failures in a virtual cloud environment, according to some embodiments of the invention;
fig. 2 is a block diagram of a process of acquiring display card failure information and creating a configuration file in a namespace to record the display card failure information in a processing method of a display card failure in a virtual cloud environment according to some embodiments of the present invention;
fig. 3 is a block flow diagram illustrating a process of performing data synchronization processing on stored display card resource information according to the change event in a processing method of a display card failure in a virtual cloud environment according to some embodiments of the present invention;
FIG. 4 is a schematic diagram of a global processing flow of a method for processing a graphics card failure in a virtual cloud environment according to some embodiments of the present invention;
FIG. 5 is a schematic diagram of the adjacency-matrix storage manner used by the data processing module according to some embodiments of the present invention;
FIG. 6 is a block diagram of a flow of an evictor module evicting an existing service on a failing graphics card, according to some embodiments of the invention;
FIG. 7 is a block flow diagram of a scheduler module newly creating a service, according to some embodiments of the invention;
FIG. 8 is a block flow diagram of the operation of a probe module according to some embodiments of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments.
Thus, the following detailed description of the embodiments of the invention is not intended to limit the scope of the invention as claimed, but is merely representative of some embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the features and technical solutions in the embodiments and embodiments of the present invention may be arbitrarily combined without conflict.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that the terms "upper", "lower", and the like refer to orientations or positional relationships based on those shown in the drawings, or orientations or positional relationships that are conventionally arranged when the products of the present invention are used, or orientations or positional relationships that are conventionally understood by those skilled in the art, and such terms are used for convenience of description and simplification of the description, and do not refer to or imply that the devices or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
Referring to fig. 1, a processing method for a display card fault in a virtual cloud environment provided by this embodiment includes:
step 101, monitoring a change event representing the display card fault.
This step monitors whether a display card in the virtual cloud environment has failed, and acquires the display card fault information required for the subsequent fault handling. There are many possible implementations of this monitoring and they do not affect the effect of the scheme, so this embodiment does not limit them; those skilled in the art can design one according to the specific situation. For example, in a virtual cloud environment implemented with the Kubernetes system, change events of a configuration file in a namespace may be received by maintaining a long connection with the apiserver; a config file resource is created in the namespace whenever a display card fault is detected, which provides the basis for this implementation.
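For illustration only (this is a hedged sketch rather than part of the claimed embodiments), the long-lived watch on configuration files described above could be expressed with the official Kubernetes Python client roughly as follows; the namespace name gpu-fault is an assumed example, not a name prescribed by this embodiment.

# Hedged sketch: watching ConfigMap change events in a dedicated namespace.
# The namespace name "gpu-fault" is an assumed example.
from kubernetes import client, config, watch

def watch_gpu_fault_configmaps(namespace="gpu-fault"):
    config.load_incluster_config()           # running inside the cluster; use load_kube_config() outside
    v1 = client.CoreV1Api()
    w = watch.Watch()
    # The watch keeps a long-lived connection to the apiserver and yields
    # ADDED / MODIFIED / DELETED events for ConfigMaps in the namespace.
    for event in w.stream(v1.list_namespaced_config_map, namespace=namespace):
        kind = event["type"]                 # "ADDED", "MODIFIED" or "DELETED"
        cm = event["object"]                 # V1ConfigMap recording one node's faulty cards
        print(kind, cm.metadata.name, cm.data)

if __name__ == "__main__":
    watch_gpu_fault_configmaps()

Each ADDED, MODIFIED or DELETED event corresponds, respectively, to the fault new, display card state change and fault deletion events handled in step 102 below.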
Step 102, performing data synchronization processing on the stored display card resource information according to the change event, wherein the display card resource information comprises the topological structure among cluster nodes, display cards and services in the virtual cloud environment, and display card state information.
In this embodiment, the topology structures among the cluster nodes, the display cards and the services in the virtual cloud environment and the display card state information are stored in the display card resource information in advance.
When a change event representing a display card fault is monitored, the display card fault information is obtained according to the change event, data synchronization processing is carried out on display card resource information used for storing display card related information in the virtual cloud environment, and the display card fault information can be recorded in time through the data synchronization processing so as to be convenient for subsequent display card fault processing.
Step 103, performing subsequent processing on the display card fault in the virtual cloud environment based on the synchronized display card resource information, and synchronously updating the display card resource information.
In this step, subsequent processing is carried out on the display card fault in the virtual cloud environment according to the synchronized display card resource information, and the display card resource information is updated synchronously when necessary. The subsequent processing of the display card fault mainly includes: no longer using the faulty display card when allocating display cards for newly created services; and evicting the existing services on the faulty display card. The faulty display card refers to a display card that is identified in a change event representing a display card fault and can no longer work normally.
In the processing method for the display card fault in the virtual cloud environment of the embodiment, display card resource information in the virtual environment, such as a topological structure and a state among cluster nodes, display cards and services, is stored; monitoring a change event representing the fault of the display card, and when the change event is monitored, performing data synchronization processing on the stored display card resource information according to the change event so as to realize real-time updating of the display card resource information; then, based on the synchronized display card resource information, subsequent processing is performed on the display card fault in the virtual cloud environment, and the service or cluster node corresponding to the fault display card can be processed according to the change event of the display card fault, so that the fault display card is shielded, and the stable operation of the virtual frame is guaranteed; and the display card resource information is synchronously updated, the display card resource information is ensured to be updated in real time, and real-time monitoring and real-time processing are carried out on the display card fault in the virtual cloud environment.
As an optional implementation manner of this embodiment, this embodiment may employ an adjacency matrix to store the graphics card resource information; the synchronization processing is data synchronization processing for the adjacency matrix.
Illustratively, referring to fig. 5, a schematic diagram of storing the graphics card resource information by using a adjacency matrix is shown. The display card resource information in this example includes the number of the display card, whether the display card is faulty, the total display memory and name of the display card, the used display memory, the service information running on the display card, and the like.
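For illustration only, the following sketch shows one possible in-memory form of such an adjacency-matrix store, mirroring the fields described for FIG. 5 (card number, fault flag, total and used display memory, card name, and the services running on the card); the field and type names are assumptions rather than part of the embodiment.

# Hedged sketch of the adjacency-matrix style store: one row per node IP,
# one cell per display card index. Field names are assumed examples.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CardInfo:
    faulty: bool = False                                  # whether the display card is faulty
    total_memory_mb: int = 0                              # total display memory
    used_memory_mb: int = 0                               # used display memory
    name: str = ""                                        # display card model name
    services: List[str] = field(default_factory=list)    # services running on this card

# matrix[node_ip][card_index] -> CardInfo
AdjacencyMatrix = Dict[str, Dict[int, CardInfo]]

matrix: AdjacencyMatrix = {
    "192.168.1.123": {
        0: CardInfo(total_memory_mb=16384, name="example-gpu", services=["svc-a"]),
        1: CardInfo(faulty=True, total_memory_mb=16384, name="example-gpu"),
    },
}

The later sketches in this description reuse this matrix layout and the CardInfo type.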
In some embodiments, the virtual cloud environment of the present embodiment is a kubernetes-based virtual cloud environment; the processing method further comprises the following steps:
s100, acquiring display card fault information, and creating a configuration file in a name space to record the display card fault information;
step 101, monitoring a change event representing a display card fault includes: monitoring the change event of the configuration file under the namespace. It should be noted that different files can be selected as the carrier depending on the virtual environment system; the carrier is not limited to the configuration file described in this embodiment, as long as it can record the result of display card fault detection and be used to share display card information changes.
Optionally, in a Kubernetes-based virtual cloud environment, the change event of the configuration file in the namespace may be received by maintaining a long connection with the apiserver component of Kubernetes.
For example, referring to fig. 2, in step S100 of this embodiment, acquiring the display card failure information, and creating a configuration file in a namespace to record the display card failure information may include:
step S1001, starting a display card detection program on each GPU machine;
step S1002, the display card detection program periodically acquires the fault information of all display cards;
step S1003, if the display card fault information is acquired, the configuration file is created in the name space;
Alternatively, in step S1001, the display card detection program may be started on each GPU machine by starting a Kubernetes DaemonSet.
Optionally, in step S1002, the display card detection program may periodically obtain the fault information of all display cards through the API of the display card driver. The specific detection period can be determined by a person skilled in the art according to the actual situation, or an optimal period can be determined experimentally to balance system overhead against functionality. API is the abbreviation of Application Programming Interface. Through the API, software written by a programmer can call the programs built into the display card and communicate automatically with the hardware driver, so the programmer only needs to write code that conforms to the interface to make full use of the display card's existing functions, without knowing the specific performance and parameters of the hardware, which greatly simplifies program development.
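As a hedged illustration of such a periodic driver-API check (this embodiment does not prescribe a particular driver or criterion), the sketch below assumes NVIDIA display cards and the pynvml binding to the driver API, and simply treats any failed NVML query as a fault; the real fault criteria depend on the driver in use.

# Hedged sketch: periodic detection loop over all display cards via pynvml.
# Any card whose NVML query fails is treated as faulty; real criteria vary.
import time
import pynvml

def find_faulty_cards():
    faulty = []
    pynvml.nvmlInit()
    try:
        for idx in range(pynvml.nvmlDeviceGetCount()):
            try:
                handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
                pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            except pynvml.NVMLError:
                faulty.append(idx)            # card cannot be queried: mark it as faulty
    finally:
        pynvml.nvmlShutdown()
    return faulty

if __name__ == "__main__":
    while True:
        print("faulty cards:", find_faulty_cards())
        time.sleep(60)                        # detection period, chosen per deployment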
Further, referring to fig. 3, in step S102 of this embodiment, the data synchronization processing performed on the stored display card resource information according to the change event may include one or more of the following (a minimal handling sketch follows this list):
judging whether the change event is a fault new event; if so, acquiring the IP address and display card number list of the faulty display card, storing the IP address and display card numbers into the display card resource information, and changing the display card state;
judging whether the change event is a display card state change event; if so, acquiring the IP address and display card number list of the display card to be modified, and modifying the display card resource information based on them;
judging whether the change event is a fault deletion event; if so, acquiring the IP address of the display card corresponding to the fault to be deleted, and deleting the corresponding fault data in the display card resource information based on that IP address.
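The handling sketch referred to above is given here purely for illustration; it reuses the matrix and CardInfo structures from the earlier adjacency-matrix sketch, and the way the node IP and card list are read from the configuration file (a name suffix plus a GPUs field) is an assumed example based on the config structure shown later in this description.

# Hedged sketch of the three synchronization branches of step S102.
# CardInfo and the matrix layout are defined in the adjacency-matrix sketch above.
def apply_change_event(matrix, event_type, node_ip, card_indexes):
    if event_type == "ADDED":                        # fault new event
        cards = matrix.setdefault(node_ip, {})
        for idx in card_indexes:
            cards.setdefault(idx, CardInfo()).faulty = True
    elif event_type == "MODIFIED":                   # display card state change event
        for idx, info in matrix.get(node_ip, {}).items():
            info.faulty = idx in card_indexes        # listed cards are faulty, the rest recovered
    elif event_type == "DELETED":                    # fault deletion event
        for info in matrix.get(node_ip, {}).values():
            info.faulty = False                      # the node's cards return to normal

def parse_configmap(cm):
    # Assumed encoding, e.g. name "unhealthy-gpu-192.168.1.123", data {"GPUs": "1,7"}.
    node_ip = cm.metadata.name.rsplit("-", 1)[-1]
    cards = [int(x) for x in cm.data.get("GPUs", "").split(",") if x]
    return node_ip, cards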
In the method of this embodiment, the subsequent processing of the display card fault in the virtual cloud environment in step S103 includes: no longer using the faulty display card when allocating display cards for newly created services; and evicting the existing services on the faulty display card.
Further, evicting the existing services on the faulty display card in this embodiment may include: in a Kubernetes virtual cloud environment, subscribing to the change event, under the namespace, that represents the display card fault, deleting the existing services on the faulty display card, and synchronizing the service change information to the display card resource information. In this way, as soon as a faulty display card is found, the existing services on it can be evicted in time, avoiding service failures and data loss.
Referring to fig. 6, when the graphics card resource information is stored in the adjacency matrix, the step includes:
and subscribing a configuration file change event in the name space and judging the type of the change event, wherein the configuration file is a file which is created in the name space and records the fault information of the display card when the fault information of the display card is obtained.
In this embodiment, the display card fault information is recorded through the configuration file, and the configuration file can be shared within the namespace, so that the display card fault information can be shared, queried and synchronized. For example, in a K8s environment, the config file (ConfigMap) provided by K8s can be used as the configuration file; this facility is used to maintain and update the association among the machine, the display card, and the services working on the display card, and the content of the config file can be determined by the developer according to requirements.
When the received change event is a configuration file new event or a configuration file modification event, the information of the faulty display card involved in the change event is acquired, and it is judged whether the total number of faulty display cards on the machine involved in the change event exceeds a set value; if so, the adjacency matrix is traversed, the services on all display cards of the corresponding machine are deleted, and the machine is deleted, so that Kubernetes completes the subsequent migration and restart of the services of the faulty display cards;
if not, the service list corresponding to the faulty display card involved in the change event is found in the adjacency matrix and deleted, so that Kubernetes completes the subsequent migration and restart of the deleted services.
In the Kubernetes environment, when the services on all display cards of the corresponding machine are deleted and the machine is deleted, Kubernetes automatically completes the subsequent migration and restart of the services of the faulty display cards; when the service list corresponding to the faulty display card involved in the change event is found and deleted, Kubernetes automatically completes the subsequent migration and restart of the deleted services. In this embodiment, node deletion and service eviction are therefore completed using native Kubernetes functionality.
In this embodiment, when a configuration file new event or a configuration file modification event occurs, migration restart is performed on a service running on a failed graphics card associated with the configuration file new event or the configuration file modification event, so as to ensure normal and stable running of the service. And for machines with excessive fault display cards, the machines need to be deleted, and related personnel are reminded to maintain.
Optionally, it is determined whether the number of the failed display cards of the GPU machine after the new event of the configuration file or the modification event of the configuration file exceeds half, that is, the set value may be half of the total number of the display cards of the GPU machine.
Referring to fig. 7, in the method of this embodiment, in a Kubernetes virtual cloud environment and with the display card resource information stored in an adjacency matrix, allocating a display card to a newly created service without using the faulty display card includes:
when display card allocation is started for the newly created service, acquiring the adjacency matrix, traversing the GPU machines and obtaining the available display card resources, from which the faulty display cards have already been filtered out;
judging whether the available display card resources are sufficient, selecting an available display card to which the newly created service is allocated, finding in the adjacency matrix the machine and display card number corresponding to the display card to which the newly created service is allocated, and updating the service list.
The processing method for the display card fault in the virtual cloud environment can provide healthy display card distribution service for newly created service, ensure stable operation of the newly created service, maximally configure display card resources, and improve stability and efficiency of operation of a system.
Referring to fig. 4, in particular, the present embodiment provides a processing method for display card faults in a virtual environment based on Kubernetes (K8s for short), involving a data processing module, a probe module, a scheduling module and an evictor module. In this implementation, a configuration file (config file) is created in a namespace to record the received display card fault information, and the display card fault information is obtained by monitoring the change events of the configuration file in the namespace.
The data processing module stores the cluster nodes, display cards and service topology in the form of an adjacency matrix. The data processing module of this embodiment may be a model layer and is deployed as a single instance: a single instance is started in the global scope, a namespace is registered, and the change events of the config file in that namespace are monitored; each change event is judged, and data synchronization processing is performed on the adjacency matrix according to the type of the change event.
The classification and corresponding processing flow for the change event in this embodiment can be as follows:
and if the change event is judged to be a new event (indicating that a new fault occurs), analyzing the config file by a split method, acquiring an IP and a display card number list from the config file, and storing the IP and the display card number into the adjacent matrix.
Preferably, the failing graphics card number data is stored in the adjacency matrix in IP packets.
If the change event is a change event, analyzing the config file by a split method, acquiring an IP and a display card number list from the config file, traversing and comparing the adjacent matrix, finding the changed display card, and updating the state data (recording the fault or recovering to be normal)
If the event is a deletion event: obtaining the ip from the config file by a split method, and updating the data in the ip to be recovered to normal
Preferably, the data processing module may compare the traversal packet with the adjacency matrix according to the IP packet, find a changed graphics card, and update the state data: and recording the display card fault or the display card is recovered to be normal.
In order to monitor the change event in the namespace of the data processing module, the data processing module described in this embodiment receives the change event in the config file in the namespace by maintaining a long connection with the apiserver of kubernetes.
The data processing module of this embodiment may also provide methods to other modules for querying and modifying IP addresses, display card numbers and service data (an illustrative sketch of these helpers follows the list):
1. Query function: traverse the adjacency matrix and query the display card information list and the list of running services by IP address, or by IP address plus display card number.
2. Delete-service function: traverse the adjacency matrix and, by IP address or IP address plus display card number, delete either the service lists of all display cards on a node or the service list on the specified display card.
3. Delete-node function: traverse the adjacency matrix, delete the service lists of all display cards on a node given its IP address, and delete the node.
4. Add function: traverse the adjacency matrix and add a service to the service list of the specified display card on a node, given IP address plus display card number plus service.
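The sketch announced above is only an illustration of these four helpers operating on the matrix structure sketched earlier; the function names are assumptions and not the embodiment's actual interface.

# Hedged sketch of the query / delete-service / delete-node / add helpers.
# CardInfo and the matrix layout come from the adjacency-matrix sketch above.
def query(matrix, node_ip, card_index=None):
    cards = matrix.get(node_ip, {})
    if card_index is None:
        return cards                                   # all cards and service lists on the node
    return cards.get(card_index)                       # a single card's info, or None

def delete_services(matrix, node_ip, card_index=None):
    for idx, info in matrix.get(node_ip, {}).items():
        if card_index is None or idx == card_index:
            info.services.clear()                      # drop the service list

def delete_node(matrix, node_ip):
    delete_services(matrix, node_ip)                   # clear every service list first
    matrix.pop(node_ip, None)                          # then remove the node row itself

def add_service(matrix, node_ip, card_index, service):
    matrix.setdefault(node_ip, {}).setdefault(card_index, CardInfo()).services.append(service)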
Referring to fig. 4, the probe module is deployed as a DaemonSet (a DaemonSet is a Kubernetes concept: a daemon process is started on every server that meets the requirements).
The main functions of the probe module of this embodiment are to check the state of the display cards on its node, organize the data and pass it to the data module. The data structure is as follows:
config file structure:
apiVersion: v1
kind: ConfigMap
metadata:
  name: unhealthy-GPU-192.168.1.123
data:
  GPUs: "1,7"
referring to fig. 8, the main process of the probe module of this embodiment includes:
step S101, starting the DaemonSet, so that a corresponding detection program is started on each GPU machine;
step S102, looping over each display card: the program cyclically obtains the error information of all display cards through the display card driver API; if no error is found, the program sleeps until the next detection and step S102 is repeated; otherwise step S103 is executed;
step S103, creating, in the namespace of the probe module, a config file resource containing the node IP and the display card.
Preferably, the detection program of step S101 may be run periodically, i.e. once every predetermined period.
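For illustration of step S103 only, the sketch below writes the detection result as a ConfigMap matching the structure shown above, using the Kubernetes Python client. Note that Kubernetes object names must be lowercase, so the example uses unhealthy-gpu-<node ip> rather than the mixed-case name shown in the structure above; the namespace gpu-fault is likewise an assumption.

# Hedged sketch: the probe reports its node's faulty card list as a ConfigMap.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def report_faulty_cards(node_ip, faulty_indexes, namespace="gpu-fault"):
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    name = f"unhealthy-gpu-{node_ip}"                  # Kubernetes object names must be lowercase
    body = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name=name, namespace=namespace),
        data={"GPUs": ",".join(str(i) for i in faulty_indexes)},
    )
    try:
        v1.create_namespaced_config_map(namespace, body)       # new fault record
    except ApiException as e:
        if e.status == 409:                                    # record already exists: update it
            v1.replace_namespaced_config_map(name, namespace, body)
        else:
            raise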
For a failed display card in a virtual cloud environment, when the display card failure is detected, services existing on the failed display card need to be expelled, and newly created services are not distributed to the failed display card.
Referring to fig. 4, the evictor module of the present embodiment is used to evict a service that has been running on a failing graphics card.
The evictor module can delete the existing service on the fault display card by subscribing the change event of the config file, and synchronize the service change information to the data layer (the adjacency matrix). The deployment mode of the evictor module in the embodiment is single-instance deployment.
A single instance is started in the global scope of the evictor module; the single instance shares the namespace with the data processing module and subscribes to the change events of the config file in that namespace. When a change event is received, it is judged: if the change event is a new event or a modification event, it is judged whether the number of faulty display cards implied by the new event or modification event exceeds a set value; if so, the adjacency matrix is traversed, the services on all display cards of the corresponding machine are deleted, and the machine is deleted, so that Kubernetes completes the subsequent migration and restart of the services of the faulty display cards; if not, the service list corresponding to the faulty display card involved in the change event is found in the adjacency matrix and deleted, so that Kubernetes completes the subsequent migration and restart of the deleted services.
Preferably, after the current fault new event or display card state change event is judged, it is judged whether the number of faulty display cards on the GPU machine exceeds half.
Specifically, the flow of the processing of the evictor module for the change event is as follows:
for a new event: and acquiring the number of the display cards of the event by a split method, judging whether the number of the display cards with faults of the machine exceeds a set value, if so, performing a machine offline process, and if not, expelling the service process.
For graphics card state change events: and acquiring a display card list of the event fault through a split method, acquiring the display card with the fault through a data processing module, combining two groups of data, judging whether the number of the fault display cards exceeds a set value, if so, performing a machine offline flow, and if not, performing an eviction service flow.
The machine offline flow comprises: calling the data processing module's delete-node method, deleting all services of the machine, removing the node from the adjacency matrix (Kubernetes then completes the migration and restart of the evicted services), and deleting the machine node from Kubernetes (Kubernetes removes the machine from the cluster).
Further, the administrator may also be notified by mail to repair the machine.
The service eviction flow comprises: calling the data processing module's delete-service method with the IP address and card number of the faulty display card as parameters, and deleting the services running on the faulty display card. Kubernetes then completes the migration and restart of the deleted services.
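Purely as an illustration of these two flows, the sketch below deletes the pods recorded in the matrix's service lists and, for the offline flow, removes the node from the cluster; it assumes that the stored service names are pod names in a known namespace, which is a simplification rather than the embodiment's actual call path.

# Hedged sketch of the service eviction flow and the machine offline flow.
# CardInfo and the matrix layout come from the adjacency-matrix sketch above.
from kubernetes import client, config

def evict_services(matrix, node_ip, card_index, workload_namespace="default"):
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    info = matrix.get(node_ip, {}).get(card_index)
    if info is None:
        return
    for pod_name in list(info.services):
        # Deleting the pod lets Kubernetes migrate and restart the service elsewhere.
        v1.delete_namespaced_pod(pod_name, workload_namespace)
    info.services.clear()

def offline_machine(matrix, node_ip, node_name, workload_namespace="default"):
    # Too many faulty cards on one machine: evict everything and remove the node.
    config.load_incluster_config()
    for card_index in list(matrix.get(node_ip, {})):
        evict_services(matrix, node_ip, card_index, workload_namespace)
    matrix.pop(node_ip, None)
    client.CoreV1Api().delete_node(node_name)          # Kubernetes removes the machine from the cluster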
Aiming at the scene that the display card in the virtual environment has a fault, the flow needing to be processed comprises a new service flow which does not use the fault display card any more. Referring to fig. 4, the scheduling module is configured to perform video card allocation on a newly created service, where the deployment mode is single-instance deployment.
Referring to fig. 7, the main flow of the scheduling module of this embodiment includes:
step S201, starting a single instance in a global scope, and sharing a name space with a data processing module;
step S202, starting display card distribution for the newly created service, comprising:
step S2021, acquiring the adjacency matrix, traversing the GPU machines, filtering out the faulty display cards, judging whether the display card resources are sufficient, and selecting an available display card;
step S2022, in the binding phase, the corresponding machine and the graphics card are searched in the adjacency matrix, and the service list is updated.
The scheduling module and the data processing module share a namespace.
The display cards can be allocated by preselecting and sorting (sorting by scoring, preferably Top1) to select the finally available display cards. The preselection includes acquiring the adjacency matrix through a query method provided by the data processing module, traversing all GPU nodes, judging whether the display card resources (quantity and display memory) are sufficient, and judging whether the card has a fault (for the display card with the fault, the fault is filtered out in the step), thereby realizing the effect that new services cannot be dispatched to the fault display card.
In step S2022, the IP, the card number of the graphics card, and the service information are used as parameters to call the adding method provided by the data processing module, and the method traverses the adjacency matrix, finds the corresponding IP and graphics card, and adds the service to the service list of the graphics card.
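As a hedged illustration of this preselection-and-scoring step (the Top1 selection mentioned above), the sketch below filters faulty cards and cards with insufficient free display memory out of the matrix, scores the remainder by free memory, and updates the chosen card's service list in the binding phase; the scoring rule is an assumed example.

# Hedged sketch of the scheduling module: preselect, score, bind.
# CardInfo and the matrix layout come from the adjacency-matrix sketch above.
def allocate_card(matrix, requested_memory_mb, service_name):
    candidates = []
    for node_ip, cards in matrix.items():
        for idx, info in cards.items():
            if info.faulty:
                continue                               # never schedule onto a faulty card
            free = info.total_memory_mb - info.used_memory_mb
            if free >= requested_memory_mb:
                candidates.append((free, node_ip, idx))
    if not candidates:
        return None                                    # display card resources are insufficient
    free, node_ip, idx = max(candidates)               # simple score: the card with the most free memory
    card = matrix[node_ip][idx]
    card.used_memory_mb += requested_memory_mb         # binding phase: update the matrix
    card.services.append(service_name)
    return node_ip, idx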
This embodiment has been explained in detail using a K8s-based virtual environment, but the concept of the invention is also applicable to virtual cloud environments other than K8s. The way the display card information is recorded in a config file under a specific namespace, however, is a specific implementation for the K8s environment and would need to be adapted. Other virtualization implementations, such as Mesos, require targeted modification, for example implementing a custom scheduler module, or recording display card information in external storage instead of watching files; and in other forms of virtualized environments the functionality of the data module and the evictor module of this embodiment may be integrated into the scheduler module, depending on the particular implementation.
The config file of this embodiment is only used to record the result of display card fault detection and to act as the carrier for sharing display card information changes with other modules, such as the scheduler and evictor modules. The working state of a display card is provided by the display card driver; for K8s, registering the display card state information as a config file under a specific namespace is the most suitable way, as it can easily be perceived by the other modules. The adjacency matrix is likewise simply a data structure considered suitable for storing the display cards and the services associated with them; other forms are possible, such as storage in a general key-value store as key-to-list mappings, which the whole cluster can perceive.
Referring to fig. 4, the apparatus for processing a graphics card failure in a virtual cloud environment provided in this embodiment simultaneously includes:
the data processing module, configured to store the display card resource information in the virtual cloud environment and, when a change event representing a display card fault is monitored, to perform data synchronization processing on the stored display card resource information according to the type of the change event, where the display card resource information includes the cluster nodes, the topological structure among display cards and services, and the display card state information;
and the subsequent processing module, configured to perform subsequent processing on the display card fault in the virtual cloud environment based on the synchronized display card resource information, with the display card resource information updated synchronously through the data processing module.
As an optional implementation of this embodiment, the virtual cloud environment is a Kubernetes-based virtual cloud environment,
and the data processing module stores the cluster nodes, the topological structure among display cards and services, and the display card state information in the form of an adjacency matrix;
optionally, the processing apparatus further includes a probe module, configured to acquire display card failure information of each GPU machine, and create a configuration file in a namespace to record the display card failure information.
The probe module cyclically judges, on each GPU machine, whether each display card has a fault; if a display card has a fault, it creates/modifies/deletes, in the probe module's namespace, a file resource containing the node IP and the fault of the corresponding display card.
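A minimal sketch of such a probe loop is shown below; cardHealthy stands in for the display card driver's health query (for example an NVML check in practice), and upsertFaultConfig/deleteFaultConfig stand in for creating, modifying, or deleting the config resource in the namespace — these helpers and their signatures are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// cardHealthy is a placeholder for the display card driver's health query;
// its signature and behaviour are assumed for this sketch.
func cardHealthy(cardID int) bool { return cardID != 2 }

// upsertFaultConfig and deleteFaultConfig stand in for creating/modifying/deleting
// the fault config resource in the probe module's namespace (assumed helpers).
func upsertFaultConfig(nodeIP string, faulty []int) { fmt.Println("upsert fault record:", nodeIP, faulty) }
func deleteFaultConfig(nodeIP string)               { fmt.Println("delete fault record:", nodeIP) }

// probeLoop runs on one GPU machine and periodically re-checks every display card.
func probeLoop(nodeIP string, cardCount int, interval time.Duration) {
	for {
		var faulty []int
		for id := 0; id < cardCount; id++ {
			if !cardHealthy(id) {
				faulty = append(faulty, id)
			}
		}
		if len(faulty) > 0 {
			upsertFaultConfig(nodeIP, faulty) // record node IP + faulty card numbers
		} else {
			deleteFaultConfig(nodeIP) // no faults (or all recovered): remove the record
		}
		time.Sleep(interval)
	}
}

func main() {
	// In the described design this would be started on every GPU machine by a DaemonSet.
	probeLoop("10.0.0.1", 4, 30*time.Second)
}
```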
Optionally, the subsequent processing module includes an evictor module, which subscribes to the change events under the namespace that represent display card faults, deletes the existing services on the faulty display card, and synchronizes the service change information to the display card resource information;
optionally, the subsequent processing module includes a scheduling module; when a newly created service starts display card allocation, the scheduling module acquires the adjacency matrix and traverses the GPU machines to obtain the display card resources, from which faulty display cards have already been filtered out; it then judges whether the display card resources are sufficient, selects an available display card to which the newly created service is allocated, looks up in the adjacency matrix the machine and card number of the display card allocated to the newly created service, and updates the service list.
As an optional implementation of this embodiment, the data processing module receives change events of the configuration file under the namespace by maintaining a long connection with the apiserver of the Kubernetes system; it starts a single instance in the global scope, registers the namespace, and monitors the change events representing display card faults under that namespace; it then judges each change event and performs data synchronization processing on the adjacency matrix according to the type of the change event;
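The event-handling core of the data processing module might look like the sketch below. The watch on the apiserver is abstracted as a plain Go channel (a real implementation would typically use a client-go watch/informer, which is omitted here), and the FaultRecord payload and the recovery behaviour on a delete event are assumptions made for the example.

```go
package datamodule

// EventType mirrors the three kinds of config-file change events described above.
type EventType int

const (
	Added EventType = iota
	Modified
	Deleted
)

// FaultRecord is the assumed payload of the fault config resource.
type FaultRecord struct {
	NodeIP     string
	FaultyGPUs []int
}

// Event abstracts whatever the apiserver watch delivers for the config file.
type Event struct {
	Type   EventType
	Record FaultRecord
}

// CardState and AdjacencyMatrix follow the earlier scheduler sketch.
type CardState struct {
	Healthy  bool
	Services []string
}
type AdjacencyMatrix map[string]map[int]*CardState

// SyncLoop judges each change event and synchronizes the adjacency matrix accordingly.
func SyncLoop(m AdjacencyMatrix, events <-chan Event) {
	for ev := range events {
		cards := m[ev.Record.NodeIP]
		switch ev.Type {
		case Added, Modified:
			// New or changed fault record: mark the listed cards as faulty.
			for _, id := range ev.Record.FaultyGPUs {
				if c, ok := cards[id]; ok {
					c.Healthy = false
				}
			}
		case Deleted:
			// Fault record removed: assumed here to mean the node's cards recovered.
			for _, c := range cards {
				c.Healthy = true
			}
		}
	}
}
```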
The probe module is started as a DaemonSet: a corresponding judgment program is started on each GPU machine, each display card is judged cyclically, and the driver API is called to judge the display card state; if the display card is judged healthy, the flow returns to the cyclic judgment step, and if it is judged unhealthy, a config file resource containing the node IP and the display card is created, modified, or deleted in the probe module's namespace;
The evictor module starts a single instance in the global scope, shares the namespace with the data processing module, and subscribes to change events of the config file under that namespace. When a change event is a new event or a modification event, the evictor judges whether the number of faulty display cards in the event exceeds a set value; if so, it traverses the adjacency matrix and deletes the service lists corresponding to all the display cards of the corresponding machine; if not, it finds the service list corresponding to the faulty display card in the adjacency matrix and deletes it;
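The eviction rule just described can be sketched as below; evictService stands in for the actual deletion of the workload (for example deleting the pod so that Kubernetes reschedules it), and the event and matrix types follow the earlier sketches — all names here are assumptions made for the example.

```go
package evictor

// FaultEvent is the assumed view of a new/modified fault config event.
type FaultEvent struct {
	NodeIP     string
	FaultyGPUs []int
}

// CardState and AdjacencyMatrix follow the earlier sketches.
type CardState struct {
	Healthy  bool
	Services []string
}
type AdjacencyMatrix map[string]map[int]*CardState

// evictService stands in for deleting one service so the cluster can reschedule it;
// the real deletion call is omitted from this sketch.
func evictService(name string) { /* delete the workload here */ }

// HandleFault applies the eviction rule: if the number of faulty cards on one machine
// exceeds maxFaulty, evict every service on that machine; otherwise evict only the
// services bound to the faulty cards, then clear the corresponding service lists.
func HandleFault(m AdjacencyMatrix, ev FaultEvent, maxFaulty int) {
	cards := m[ev.NodeIP]
	if len(ev.FaultyGPUs) > maxFaulty {
		for _, c := range cards {
			for _, svc := range c.Services {
				evictService(svc)
			}
			c.Services = nil
		}
		return
	}
	for _, id := range ev.FaultyGPUs {
		if c, ok := cards[id]; ok {
			for _, svc := range c.Services {
				evictService(svc)
			}
			c.Services = nil
		}
	}
}
```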
The scheduling module starts a single instance in the global scope; when a newly created service starts display card allocation, it acquires the adjacency matrix, traverses the GPU machines, filters out faulty display cards, judges whether the display card resources are sufficient, and selects an available display card; in the binding phase, it looks up the corresponding machine and display card in the adjacency matrix and updates the service list.
An electronic device of this embodiment includes a processor and a memory, where the memory stores a computer-executable program; when the program is executed by the processor, the processor performs the method for processing a display card fault in a virtual cloud environment described in any of the foregoing embodiments.
A computer-readable medium of this embodiment stores a computer-executable program; when the program is executed, the method for processing a display card fault in a virtual cloud environment according to any one of the foregoing embodiments is implemented.
The above embodiments are only intended to illustrate the invention, not to limit the technical solutions described herein. Although the present invention has been described in detail in this specification with reference to the above embodiments, it is not limited to them; any modification or equivalent replacement that does not depart from the spirit and scope of the invention is intended to be covered by this disclosure and the appended claims.

Claims (10)

1. A processing method for display card faults in a virtual cloud environment is characterized by comprising the following steps:
monitoring a change event representing the fault of the display card;
performing data synchronization processing on the stored display card resource information according to the change event, wherein the display card resource information comprises the cluster nodes in the virtual cloud environment, the topological structure among display cards and services, and the display card state information;
and based on the synchronized display card resource information, performing subsequent processing on the display card fault in the virtual cloud environment, and synchronously updating the display card resource information.
2. The method for processing the display card fault in the virtual cloud environment according to claim 1, wherein the display card resource information is stored by adopting an adjacency matrix; the synchronization processing is data synchronization processing for the adjacency matrix.
3. The method for processing the display card fault in the virtual cloud environment according to claim 1, wherein the virtual cloud environment is a kubernetes-based virtual cloud environment; the processing method further comprises the following steps: acquiring display card fault information, and creating a configuration file in a namespace to record the display card fault information;
the monitoring of the change event representing the display card fault comprises: monitoring a change event of the configuration file under the namespace;
optionally, receiving the change event for the configuration file under the namespace by maintaining a long connection with the apiserver component of kubernetes.
4. The method for processing the display card fault in the virtual cloud environment according to claim 3, wherein the acquiring display card fault information and creating a configuration file in a namespace to record the display card fault information includes:
step S1001, starting a display card detection program on each GPU machine;
step S1002, the display card detection program periodically acquires fault information of all display cards;
step S1003, if display card fault information is acquired, creating the configuration file in the namespace;
optionally, in step S1001, the display card detection program is started on each GPU machine by starting a daemonset of kubernetes;
optionally, in step S1002, the display card detection program periodically obtains the fault information of all the display cards through the API interface of the display card driver.
5. The method for processing the display card fault in the virtual cloud environment according to any one of claims 1 to 4, wherein the performing data synchronization processing on the stored display card resource information according to the change event includes one or more of the following steps:
judging whether the change event is a newly-added fault event, and if so, acquiring the IP address of the faulty display card and the display card number list, and storing the IP address and the display card numbers into the display card resource information;
judging whether the change event is a display card state change event, and if so, acquiring the IP address and the display card number list of the display card to be modified, modifying the display card resource information based on the IP address and the display card number list of the display card to be modified, and changing the display card state;
and judging whether the change event is a fault deletion event, and if so, acquiring the IP address of the display card corresponding to the fault to be deleted, and deleting the corresponding fault data in the display card resource information based on the IP address.
6. The method for processing the display card fault in the virtual cloud environment according to any one of claims 1 to 4, wherein the subsequent processing of the display card fault in the virtual cloud environment includes:
when the display card is distributed for the newly created service, the fault display card is not used any more;
and expelling the existing service on the fault display card.
7. The method for processing the display card fault in the virtual cloud environment according to claim 6, wherein the evicting the existing service on the faulty display card comprises: in a kubernetes virtual cloud environment, subscribing to the change event under the namespace that represents the display card fault, and, when the display card fault information is acquired, deleting the existing services on the faulty display card and synchronizing the service change information to the display card resource information;
optionally, when the display card resource information is stored in the adjacency matrix, the evicting the existing service on the failed display card includes:
subscribing to a configuration file change event under the namespace and judging the type of the change event, wherein the configuration file is the file which is created in the namespace and records the display card fault information when the display card fault information is acquired;
when the received change event is a configuration file new event or a configuration file modification event, acquiring the information of the faulty display cards related to the change event, and judging whether the total number of faulty display cards of the GPU machine related to the change event exceeds a set value; if so, traversing the adjacency matrix, deleting the services on all display cards corresponding to the GPU machine, and deleting the GPU machine, so that kubernetes completes the subsequent migration and restart operation of the faulty display card services; if not, finding the service list corresponding to the faulty display card related to the change event in the adjacency matrix and deleting it, so that kubernetes can complete the subsequent migration and restart operation of the deleted services.
8. The method for processing the display card fault in the virtual cloud environment according to claim 6, wherein, in a kubernetes virtual cloud environment and when the display card resource information is stored in the adjacency matrix, no longer using the faulty display card when allocating a display card for the newly created service comprises:
when the newly created service starts display card allocation, acquiring the adjacency matrix and traversing the GPU machines to obtain the available display card resources, from which the faulty display cards have been filtered out;
and judging whether the available display card resources are sufficient, selecting an available display card to which the newly created service is allocated, looking up in the adjacency matrix the corresponding GPU machine and display card number of the display card allocated to the newly created service, and updating the service list.
9. An apparatus for processing a display card fault in a virtual cloud environment, characterized by comprising:
the data processing module is used for storing the display card resource information in the virtual cloud environment, and carrying out data synchronization processing on the stored display card resource information according to the type of a change event when the change event representing the display card fault is monitored, wherein the display card resource information comprises cluster nodes, a topological structure among display cards and services and display card state information;
and the subsequent processing module is used for performing subsequent processing on the display card fault in the virtual cloud environment based on the synchronized display card resource information, and synchronously updating the display card resource information through the data processing module.
10. The apparatus for processing the display card fault in the virtual cloud environment according to claim 9, wherein the virtual cloud environment is a kubernetes-based virtual cloud environment, and the apparatus further comprises: a probe module, configured to acquire the display card fault information of each GPU machine and to create a configuration file in the namespace to record the display card fault information;
optionally, when the data processing module monitors a profile change event: judging whether the configuration file change event is a configuration file new event or not, if so, acquiring an IP address and a display card number list of a fault display card contained in the new event, and storing fault display card information into display card resource information according to the IP address and the display card number; judging whether the configuration file change event is a display card state change event or not, if so, acquiring an IP address and a display card number list pointed by the display card state change event, modifying the display card resource information based on the IP address and the display card number list, and changing the display card state; judging whether the configuration file change event is a configuration file deletion event or not, if so, acquiring the IP address and the display card number of the display card pointed by the configuration file deletion event, and deleting corresponding fault data in the display card resource information based on the IP address and the display card number;
optionally, the subsequent processing module includes an evictor module, and the evictor module is configured to, when a display card fault is obtained by subscribing to a configuration file change event under the namespace, delete the existing services on the faulty display card and synchronize the service change information to the display card resource information;
optionally, the subsequent processing module includes a scheduling module, where the scheduling module is configured to perform display card allocation for a newly created service, and when the scheduling module starts display card allocation, the scheduling module acquires available display card resources according to the display card resource information, where a fault display card has been filtered out from the available display card resources, and determines whether the available display card resources are sufficient, selects an available display card to allocate the newly created service, and updates a service list corresponding to the display card in the display card resource information;
further optionally, the data processing module stores the display card resource information by using an adjacency matrix;
further optionally, the data processing module starts a single instance in the global scope, registers the namespace, and monitors configuration file change events under the namespace; it judges the type of each change event and performs data synchronization processing on the adjacency matrix according to the type of the change event; optionally, the data processing module receives the change events of the configuration file under the namespace by maintaining a long connection with the apiserver of the kubernetes system;
further optionally, the probe module is deployed by using daemonset, and the main process is as follows:
starting a corresponding judgment program on each GPU machine, cyclically judging each display card, and calling the driver API (application program interface) to judge the state of the display card; returning to the cyclic judgment step if the display card is judged healthy, and creating/modifying/deleting, in the probe module's namespace, the configuration file resource containing the node IP and the display card if the display card is judged unhealthy;
optionally, the evictor module starts a single instance in the global scope, shares the namespace with the data processing module, and subscribes to change events of the configuration file under the namespace; when a change event is a new event or a modification event, judging whether the number of faulty display cards of the new or modified event exceeds a set value; if so, traversing the adjacency matrix and deleting the service lists corresponding to all the display cards of the GPU machine, and if not, finding the service list corresponding to the faulty display card in the adjacency matrix and deleting it;
optionally, the scheduling module starts a single instance in the global scope; when the newly created service starts display card allocation, acquiring the adjacency matrix, traversing the GPU machines, filtering out the faulty display cards, judging whether the display card resources are sufficient, allocating the newly created service to an available display card, then looking up the corresponding GPU machine and display card in the adjacency matrix, and updating the service list.
CN202110384811.6A 2021-04-10 2021-04-10 Processing method and device for display card fault in virtual cloud environment Pending CN113157476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110384811.6A CN113157476A (en) 2021-04-10 2021-04-10 Processing method and device for display card fault in virtual cloud environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110384811.6A CN113157476A (en) 2021-04-10 2021-04-10 Processing method and device for display card fault in virtual cloud environment

Publications (1)

Publication Number Publication Date
CN113157476A true CN113157476A (en) 2021-07-23

Family

ID=76889732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110384811.6A Pending CN113157476A (en) 2021-04-10 2021-04-10 Processing method and device for display card fault in virtual cloud environment

Country Status (1)

Country Link
CN (1) CN113157476A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637624A (en) * 2022-05-19 2022-06-17 武汉凌久微电子有限公司 GPU (graphics processing unit) video memory access repairing method and device for active error detection

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309785A (en) * 2012-03-12 2013-09-18 鸿富锦精密工业(深圳)有限公司 Automatic restoration system and method when graphics card has faults
CN107179957A (en) * 2016-03-10 2017-09-19 阿里巴巴集团控股有限公司 Physical machine failure modes processing method, device and virtual machine restoration methods, system
US9836354B1 (en) * 2014-04-28 2017-12-05 Amazon Technologies, Inc. Automated error detection and recovery for GPU computations in a service environment
CN111552556A (en) * 2020-03-24 2020-08-18 合肥中科类脑智能技术有限公司 GPU cluster service management system and method
CN112231063A (en) * 2020-10-23 2021-01-15 新华三信息安全技术有限公司 Fault processing method and device


Similar Documents

Publication Publication Date Title
US11226847B2 (en) Implementing an application manifest in a node-specific manner using an intent-based orchestrator
CN110377395B (en) Pod migration method in Kubernetes cluster
CN111338854B (en) Kubernetes cluster-based method and system for quickly recovering data
CN113296792B (en) Storage method, device, equipment, storage medium and system
CN111897558A (en) Kubernets upgrading method and device for container cluster management system
CN111343219B (en) Computing service cloud platform
CN105814544A (en) System and method for supporting persistence partition recovery in a distributed data grid
CN112269640A (en) Method for realizing life cycle management of container cloud component
CN113204353B (en) Big data platform assembly deployment method and device
CN112199192B (en) Method and system for deploying Kubernetes cluster refined management quota based on server
CN103176831A (en) Virtual machine system and management method thereof
CN114090176A (en) Kubernetes-based container scheduling method
CN114064414A (en) High-availability cluster state monitoring method and system
CN112199178A (en) Cloud service dynamic scheduling method and system based on lightweight container
CN110895487A (en) Distributed task scheduling system
CN113918281A (en) Method for improving cloud resource expansion efficiency of container
CN108776579A (en) A kind of distributed storage cluster expansion method, device, equipment and storage medium
CN109992373B (en) Resource scheduling method, information management method and device and task deployment system
CN113157476A (en) Processing method and device for display card fault in virtual cloud environment
CN113986539A (en) Method, device, electronic equipment and readable storage medium for realizing pod fixed IP
CN110895485A (en) Task scheduling system
CN110196751A (en) The partition method and device of mutual interference service, electronic equipment, storage medium
CN114816272B (en) Magnetic disk management system under Kubernetes environment
CN112291346B (en) Pseudo application deployment management system, method and medium for heterogeneous node cluster
CN112631727B (en) Monitoring method and device for pod group pod

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination