
CN112540880A - Method and device for rapidly shielding fault display card in cluster and electronic equipment

Info

Publication number: CN112540880A
Application number: CN202011522554.XA
Authority: CN
Inventors: 程童; 张浩然; 吕亚霖; 王岩
Original assignee: Zuoyebang Education Technology Beijing Co Ltd
Current assignee: Zuoyebang Education Technology Beijing Co Ltd
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2021-03-23

Abstract

The invention belongs to the technical field of data information processing, and provides a method and a device for rapidly shielding a fault display card in a cluster, electronic equipment and a recording medium, wherein the method comprises the following steps: presetting a display card fault threshold value, and counting the faults of each display card in the cluster; and when the number of the faults exceeds a preset fault threshold value of the display card, carrying out shielding operation on the display card. According to the invention, the faults of each display card are counted, the fault display card is shielded in time, and the request sent to the fault display card is transferred to other display cards, so that the stability of the system is ensured, and the user experience and satisfaction are improved.

Description

Method and device for rapidly shielding fault display card in cluster and electronic equipment

Technical Field

The invention belongs to the technical field of data information processing, is particularly suitable for real-time data information processing in artificial intelligence, and more particularly relates to a method and a device for rapidly shielding a fault display card in a cluster and electronic equipment.

Background

With the development of internet technology and artificial intelligence technology, more and more artificial intelligence is used to process data information in online services.

At present, the amount of data information to be processed by artificial intelligence is huge, and due to the difference of architecture design, a display card is far higher than a central processing unit in the aspect of floating point arithmetic capability, so that the data is processed by the display card generally. With the expansion of services, a large number of display cards are required to be added in the later period to support the artificial intelligent operation.

However, display cards often fail to process data because frequent high-load operation makes them unstable, and if a faulty display card is not shielded in time, a large number of service requests fail to be processed, which gives users a very poor experience.

Disclosure of Invention

(I) Technical problem to be solved

The invention aims to solve the problems that the service request processing fails and the user experience is poor due to the fact that the fault display card cannot be rapidly shielded in the existing online service using artificial intelligence.

(II) Technical solution

In order to solve the above technical problem, an aspect of the present invention provides a method for rapidly shielding a fault graphics card in a cluster, including:

presetting a display card fault threshold value, and counting the faults of each display card in the cluster;

and when the number of the faults exceeds a preset fault threshold value of the display card, carrying out shielding operation on the display card.

According to the preferred embodiment of the invention, the display cards are physical display cards, and the fault of each display card is counted by adopting interprocess communication.

According to a preferred embodiment of the present invention, the inter-process communication is shared-memory communication.

According to the preferred embodiment of the present invention, the counting of the faults occurring in each graphics card in the cluster specifically includes:

if a message of requesting processing failure returned by the display card is received, adding 1 to the fault count of the display card;

and if the returned message is that the request processing is successful, the failure count of the display card is reduced by 1.

According to the preferred embodiment of the present invention, the shielding operation on the graphics card specifically includes shielding an address of the graphics card, and forwarding the service request sent to the graphics card to another graphics card.

According to the preferred embodiment of the invention, a shielding duration is set for the display card: the shielding operation is applied to the display card within the shielding duration, and service requests are again allowed to be sent to the display card once the shielding duration has elapsed.

The second aspect of the present invention provides an apparatus for rapidly shielding a fault graphics card in a cluster, including:

the fault counting module is used for presetting a display card fault threshold value and counting faults of each display card in the cluster;

and the display card shielding module is used for shielding the display card when the number of faults exceeds a preset display card fault threshold value.

A third aspect of the invention proposes an electronic device comprising a processor and a memory for storing a computer-executable program, which, when executed by the processor, performs the method.

The fourth aspect of the present invention also provides a computer-readable medium storing a computer-executable program, which when executed, implements the method.

(III) Advantageous effects

According to the invention, the faults of each display card are counted, the fault display card is shielded in time, and the request sent to the fault display card is transferred to other display cards, so that the stability of the system is ensured, and the user experience and satisfaction are improved.

Drawings

FIG. 1 is a schematic diagram of a task processing system according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for rapidly shielding a failing graphics card in a cluster according to an embodiment of the present invention;

FIG. 3 is a flow diagram illustrating a client service request processing flow according to an embodiment of the invention;

FIG. 4 is a flow chart illustrating the client detecting whether a video card is available according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an apparatus for rapidly shielding a graphics card with a fault in a cluster according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device of one embodiment of the invention;

FIG. 7 is a schematic diagram of a computer-readable recording medium of an embodiment of the present invention.

Detailed Description

In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.

The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.

The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different network and/or processing unit devices and/or microcontroller devices.

The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and thus, a repetitive description thereof may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms. That is, these phrases are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or", "and/or" is intended to include all combinations of any one or more of the listed items.

In recent years, as artificial intelligence (AI) and image processing have developed ever more rapidly, demand has grown for display-card-based computing systems, which must handle task requests of different sizes and process tasks efficiently in both busy and non-busy states.

The display card is also called GPU, display core, visual processor, display chip, and is a microprocessor specially used for image operation on personal computers, workstations, game machines, and some mobile devices (e.g. tablet computers, smart phones, etc.). The display control circuit is used for converting and driving display information required by a computer system, providing a line scanning signal for a display and controlling the display of the display correctly, and is an important element for connecting the display and a personal computer mainboard. Because the display card is far higher than the CPU (central processing unit) in the capability of processing floating point operation, the scenes that a large amount of operations are needed in artificial intelligence and machine learning at present are operated through the display card, the operation efficiency is far higher than that of the CPU, and the display card gradually becomes indispensable equipment in the artificial intelligence.

Fig. 1 is a scene schematic diagram of a task processing system provided in an embodiment of the present application, as shown in fig. 1, the task processing system includes a plurality of servers 101, 102, …, and 1XX, each server is provided with a plurality of display cards, in this embodiment, each server is provided with 8 display cards, and in other embodiments, 2 display cards or 4 display cards may also be provided.

The system comprises a plurality of clients 201, 202, …, 2XX, wherein the clients can be, but are not limited to, smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited herein.

In the embodiment, the client sends a request to the video card in the server according to the task type, for example, the server 101 identifies the emotion of the user, the server 102 is used for calculating the resource limit of the user, and the like. The types of models contained in each server may be the same or different. The client sends a request to the display card, wherein the request comprises data needing to be operated, the display card calls the model and the parameters from the AI operation model, and the data contained in the request is input into the model to be operated. And after the operation is finished, the display card returns the operation result to the client.

The clients and the display cards in the servers are not in one-to-one correspondence; each time, a client sends its request to display cards in different servers according to the task. For example, the client 201 sends requests to the display cards 1011 and 1018 in the server 101 and the display card 1022 in the server 102, and the client 203 sends requests to the display cards 1022 and 1028 in the server 102 and the display card 1XX8 in the server 1XX.

In a conventional task processing system, a server periodically performs a health check on the performance and state of each display card and stops a display card whose performance or state is found to be abnormal. However, the health check period is relatively long, such as 30 seconds or 1 minute, whereas the processing time of a task is often on the order of nanoseconds. When a display card fails before the next health check is due, a large number of requests sent to that display card fail to be processed, the client cannot obtain a correct processing result, and the user experience is very poor.

In order to solve the above technical problem, the present invention provides a method for rapidly shielding a fault graphics card in a cluster, where a flowchart of the method is shown in fig. 2, and the method includes:

S101, presetting a display card fault threshold, and counting faults of each display card in the cluster.

On the basis of the technical scheme, the display cards are physical display cards, and the fault of each display card is counted by adopting interprocess communication.

In this embodiment, the used display card is a physical display card, the cluster providing services for the user includes a plurality of servers, and each server is provided with a plurality of display cards. Each server is connected with a plurality of clients, the service requests sent by the clients are processed by using the internally arranged display card, and each service request corresponds to one process.

An address is allocated to each display card using an IP address and a port number. The client sends a service request to the display card through this address; the display card returns a correct processing result if it operates normally and returns an error code if it is abnormal. Because inter-process communication is used for the statistics, different processes in the same server can communicate with each other, and the state of the same display card returned to different processes can be counted together, so that the state of the display card can be judged quickly.

In this embodiment, the display card fault threshold is set to 60. The threshold may be adjusted according to the traffic volume: it may be increased when traffic is heavy and decreased when traffic falls.

On the basis of the above technical solution, the inter-process communication is a shared memory communication mode.

In this embodiment, the HashTable is selected as the data structure of the shared memory. The area of the shared memory is visible for all processes processed by the same server, and after the processes acquire information returned by the display card, the state of the display card is recorded in the area of the shared memory.
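As a rough illustration of per-card counters that every process on a server can see, the following minimal sketch uses Python's `multiprocessing.shared_memory`. The class name `SharedFaultTable`, the fixed slot layout, and the idea that all processes agree on the same address list are assumptions made for this example; the source only states that a HashTable in shared memory is used.

```python
import struct
from multiprocessing import shared_memory

_SLOT = struct.Struct("<i")  # one signed 32-bit counter per display card


class SharedFaultTable:
    """Fault counters in a named shared-memory block, one slot per card address."""

    def __init__(self, addresses, name=None, create=True):
        # Hypothetical layout: the slot index is the position of the address
        # in a list that every participating process agrees on.
        self.index = {addr: i for i, addr in enumerate(addresses)}
        size = _SLOT.size * len(self.index)
        self.shm = shared_memory.SharedMemory(name=name, create=create, size=size)
        if create:
            self.shm.buf[:size] = bytes(size)  # start every counter at zero

    def get(self, addr):
        """Read the current fault count of one card."""
        return _SLOT.unpack_from(self.shm.buf, self.index[addr] * _SLOT.size)[0]

    def add(self, addr, delta):
        """Add delta to a card's count. Not atomic: real use would need a lock."""
        value = self.get(addr) + delta
        _SLOT.pack_into(self.shm.buf, self.index[addr] * _SLOT.size, value)
        return value

    def close(self, unlink=False):
        """Detach from the block; the creating process also unlinks it."""
        self.shm.close()
        if unlink:
            self.shm.unlink()
```

A second process would attach with `SharedFaultTable(addresses, name=<block name>, create=False)` and see the same counters. A production version would also need synchronization between writers, which this sketch leaves out.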

On the basis of the technical scheme, further, the step of counting the faults of each display card in the cluster specifically comprises the following steps:

In this embodiment, an error code returned by the display card does not necessarily indicate a fault of the display card; it may instead be caused by a model operation error. Therefore, to preserve the utilization rate of the display card, when the display card returns an error code, that is, when the request processing fails, the result is written into the shared memory and the fault count of the display card is incremented by 1. If the display card returns a correct processing result, that is, the request processing succeeds, the result is also written into the shared memory and the fault count of the display card is decremented by 1. In this way, a display card is determined to be faulty only when its request processing fails frequently within a short time, which ensures the stability of the system, keeps the handling imperceptible to the user, and allows the display card to be fully utilized for operation processing.

In this embodiment, when the display card returns an error code, that is, when the request processing fails, the fault count written into the shared memory may alternatively be incremented by 2, so that a faulty display card is shielded more quickly.
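The counting rule described above can be sketched as follows. The function name `record_result`, the plain dictionary standing in for the shared memory, and the floor at zero on success are assumptions for illustration; the source does not say whether the count may go below zero.

```python
FAILURE_STEP = 1  # the text also describes using 2 for faster shielding
SUCCESS_STEP = 1


def record_result(counts, address, succeeded, failure_step=FAILURE_STEP):
    """Update and return the fault count of the card at `address`.

    `counts` is a dict standing in for the shared-memory table (assumption).
    """
    current = counts.get(address, 0)
    if succeeded:
        # Assumption: the count is floored at zero so that a long run of
        # successes cannot hide a later burst of failures.
        current = max(0, current - SUCCESS_STEP)
    else:
        current += failure_step
    counts[address] = current
    return current
```

With the default step, three failures followed by one success leave a count of 2; with `failure_step=2`, two failures already give a count of 4, so the threshold is reached in half as many failed requests.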

S102, when the number of the faults exceeds a preset fault threshold value of the display card, carrying out shielding operation on the display card.

In this embodiment, before the process of the client wants to send a request to the graphics card, the graphics card failure count of the graphics card is read, and it is determined whether the count exceeds a preset threshold, and if the count exceeds the preset threshold, the client does not send a request to the graphics card.

On the basis of the above technical solution, the shielding operation on the display card specifically includes shielding an address of the display card and forwarding a service request sent to the display card to other display cards.

In this embodiment, since the display card is positioned in the form of "IP address + port", when the failure count of the display card in the shared memory exceeds the preset display card failure threshold, the system does not allocate the address of the display card to the client, and sends the addresses of the other display cards whose failure counts do not exceed the preset display card failure threshold to the client. In other embodiments, load balancing processing is also performed on other display cards, the number of current processing requests of the display cards of which the other failure counts do not exceed the preset display card failure threshold is judged, and a service request is preferentially sent to the display card with the smallest processing request number.
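A minimal sketch of this selection step, assuming two dictionaries keyed by card address: `fault_counts` (the counts from shared memory) and `in_flight` (the number of requests each card is currently processing). Both names and the function `pick_card` are hypothetical.

```python
def pick_card(fault_counts, in_flight, threshold=60):
    """Return the address of the least-loaded card that is not shielded.

    A card is skipped when its fault count exceeds `threshold`.
    Returns None when every card is shielded.
    """
    candidates = [
        address
        for address in in_flight
        if fault_counts.get(address, 0) <= threshold
    ]
    if not candidates:
        return None
    # Load balancing: prefer the card with the fewest requests in progress.
    return min(candidates, key=lambda address: in_flight[address])
```

The default threshold of 60 follows the value given in this embodiment; a card whose count equals the threshold is still eligible because the text shields only when the count exceeds it.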

Further, on the basis of the technical scheme, a shielding duration is set for the display card: the shielding operation is applied to the display card within the shielding duration, and service requests are again allowed to be sent to the display card once the shielding duration has elapsed.

In this embodiment, a shielding duration is set for the display card. When the fault count of the display card reaches the set display card fault threshold, the display card is shielded and timing starts at the same time. When the shielding time reaches the set shielding duration, an attempt is made to send a processing request to the display card again. The fault count of the display card is retained at this point: if the service request is processed successfully, the fault count in the shared memory is decremented by 1; if the service request fails, the display card is shielded again.

Because some faults are small recoverable faults, for example, the computing capability of the display card is reduced due to overheating, so that the service request processing fails, and the display card can be recovered to be normal after cooling, the time for setting the shielding time length is not too long, and the value range of the shielding time length is 30 seconds to 5 minutes, usually 1 minute.
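The shield-and-retry behaviour can be sketched as a small state object. The class name `ShieldState`, the use of plain seconds for time, and the choice to shield when the count exceeds (rather than reaches) the threshold are assumptions; the source uses both wordings.

```python
class ShieldState:
    """Tracks one card's fault count and the time its shielding expires."""

    def __init__(self, threshold=60, shield_seconds=60.0):
        self.threshold = threshold
        self.shield_seconds = shield_seconds  # the text suggests 30 s to 5 min
        self.count = 0
        self.shielded_until = None

    def is_available(self, now):
        """A card is usable if it is under the threshold or its shield expired."""
        if self.count <= self.threshold:
            return True
        return now >= self.shielded_until

    def record(self, succeeded, now):
        """Apply one request outcome observed at time `now` (in seconds)."""
        if succeeded:
            self.count = max(0, self.count - 1)  # the count is kept, then reduced
        else:
            self.count += 1
            if self.count > self.threshold:
                # Start (or restart) the shielding timer.
                self.shielded_until = now + self.shield_seconds
```

Because the fault count is retained across the shielding period, a single failed retry after expiry immediately shields the card again for another full duration.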

In other embodiments, when the display card is still in the shielding period, a small-scale heuristic mode may be adopted to attempt to send a service request to the display card for probing to determine whether the display card has recovered. For example, at ordinary times, the display card can process 10 service requests simultaneously, and only 1 service request is sent to the display card at the same time for probing. And if the display card returns a message of successful service processing, the shielding of the display card is cancelled, and the failure count of the display card in the shared memory is reduced by 1.
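One possible shape for this probing step is sketched below. The dictionary keys (`shielded`, `fault_count`, `probes_in_flight`), the function name `probe_shielded_card`, and the `send_request` callable are hypothetical stand-ins for whatever the system actually uses.

```python
def probe_shielded_card(state, send_request, probe_limit=1):
    """Send at most `probe_limit` concurrent probe requests to a shielded card.

    `state` is a dict with keys 'shielded', 'fault_count', 'probes_in_flight'.
    `send_request` is a callable returning True when the card answers correctly.
    Returns True if the probe succeeded and the shield was lifted.
    """
    if not state["shielded"] or state["probes_in_flight"] >= probe_limit:
        return False
    state["probes_in_flight"] += 1
    try:
        succeeded = send_request()
    finally:
        state["probes_in_flight"] -= 1
    if succeeded:
        state["shielded"] = False   # cancel the shielding
        state["fault_count"] -= 1   # reduce the fault count by 1
    return succeeded
```

Keeping the probe limit far below the card's normal concurrency (1 versus 10 in the example above) bounds the number of user requests that can fail while the card is still unhealthy.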

The following describes a service request processing flow of a client according to an embodiment, and a flowchart is shown in fig. 3.

Example one

S201, the client has a service request to be processed.

S202, distributing the display cards to the client side according to load balance, and enabling the client side to obtain the addresses of the display cards.

When the client has a service request to process, it sends an instruction to the system indicating that a service request is pending; the instruction contains the service type, the model to be used, and the priority. Based on the instruction content, the system estimates the operation time of the service request and obtains the current service pressure of each display card in the server. A display card is then allocated to the client according to load balancing, and the address of the display card is sent to the client in the form of IP + port, e.g. 192.168.50.174:22747.

When the display card is distributed to the client, the address of the display card with the display card fault count exceeding the set display card fault threshold is shielded by the system, and the address of the display card is not sent to the client.

S203, the client detects whether the display card is available, if the display card is available, the step S205 is executed, otherwise, the step S204 is executed.

The client queries the corresponding shared memory according to the address of the display card to detect whether the display card is available.

S204, judging whether retry is needed, if yes, returning to the step S202, otherwise, executing the step S209.

When the client detects that the display card is unavailable, retry can be carried out according to the requirement of the service request, the system redistributes the display card, and if retry is not needed, the process is finished directly.

S205, the client sends a service request to the display card.

The client sends a service request to the display card according to the display card address, wherein the service request comprises an operation parameter and a model (algorithm).

S206, the video card processes the service request, if the processing is successful, step S207 is executed, otherwise step S208 is executed.

S207, the client prints a log, writes the result into the shared memory, and decrements the fault count of the display card by 1.

S208, the client prints a log, writes the result into the shared memory, and increments the fault count of the display card by 1.

The display card processes the service request according to the content of the service request, if the processing is successful, a service processing result is returned, the client writes the result into the shared memory, the display card fault count of the display card is reduced by 1, otherwise, the display card returns an error code, and the client adds 1 to the display card fault count of the display card in the shared memory.

S209, ending. The client service request processing flow ends.
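Steps S201 to S209 can be sketched as a single client-side function. The callables `allocate_card`, `is_available`, and `send`, the dictionary `counts` standing in for the shared memory, the retry limit, and the floor at zero are all assumptions made for this illustration.

```python
import logging

log = logging.getLogger("gpu_client")  # hypothetical logger name


def process_request(allocate_card, is_available, send, counts, max_retries=2):
    """Run one service request through the S201-S209 flow.

    allocate_card() -> address or None   (S202, load-balanced allocation)
    is_available(address) -> bool        (S203, shared-memory lookup)
    send(address) -> (ok, result)        (S205/S206, card processes request)
    """
    for _ in range(max_retries + 1):
        address = allocate_card()                          # S202
        if address is None or not is_available(address):   # S203
            continue                                       # S204: retry
        ok, result = send(address)                         # S205, S206
        log.info("card %s processed request, ok=%s", address, ok)
        previous = counts.get(address, 0)
        if ok:
            counts[address] = max(0, previous - 1)         # S207 (floor assumed)
        else:
            counts[address] = previous + 1                 # S208
        return result if ok else None
    return None                                            # S209: give up
```

Whether a failed request should itself be retried on another card is left to the caller here, since the flow only describes retrying when the allocated card is found to be unavailable.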

The following describes in detail the process of S203, according to a second embodiment, of detecting whether a video card is available by the client, where a flowchart is shown in fig. 4.

Example two

S2031, inquiring the shared memory according to the address of the display card.

The display cards are located in the form of 'IP address + port number'. For each display card in the server, the system sets aside a portion of the shared-memory HashTable to record the fault count of that display card. The corresponding shared memory can be queried through the address of the display card.

S2032, judging whether the shared memory has records, if not, executing S2035, and if so, executing S2033.

The corresponding shared memory is queried according to the address of the display card to determine whether a record exists. If no record exists, the display card has had no fault or is being used for the first time; no shielding is needed, and the display card is indicated as available.

S2033, if the record exists, judging whether the display card fault count in the record exceeds the set display card fault threshold, if so, executing S2034, otherwise, executing S2035.

S2034, judging whether the current time exceeds the masking expiration time, if so, executing S2035, otherwise, executing S2036.

The shielding expiration time is the time at which shielding of the display card is cancelled. For example, if the shielding operation is performed on the display card at 14:05 and the set shielding duration is 1 minute, the shielding expiration time is 14:06. Therefore, if the current time exceeds the shielding expiration time, the display card is no longer shielded, so step S2035 is performed and the display card is determined to be available.
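The decision made in steps S2031 to S2036 can be sketched as one lookup function. The record layout (`fault_count`, `shield_expires_at`), the dictionary standing in for the shared-memory HashTable, and the function name `card_available` are assumptions for illustration.

```python
from datetime import datetime, timedelta


def card_available(table, address, threshold, now):
    """Return True if the card at `address` may receive a request."""
    record = table.get(address)                  # S2031: query by address
    if record is None:                           # S2032: no record, never faulted
        return True                              # S2035
    if record["fault_count"] <= threshold:       # S2033: under the threshold
        return True                              # S2035
    # S2034: shielded; available only after the shielding expiration time.
    return now > record["shield_expires_at"]     # S2035 if True, else S2036


# Usage mirroring the 14:05 example; the date itself is arbitrary.
shielded_at = datetime(2020, 12, 22, 14, 5)
table = {
    "192.168.50.174:22747": {
        "fault_count": 61,
        "shield_expires_at": shielded_at + timedelta(minutes=1),
    }
}
```

With this table and a threshold of 60, the card is unavailable at 14:05:30 and available again at 14:06:01.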

S2035, the display card is available.

S2036, the display card is unavailable.

Fig. 5 shows a device 500 for rapidly shielding a fault display card in a cluster according to an embodiment of the present invention, including:

the failure counting module 501 is configured to preset a failure threshold of the display card, and count failures occurring in each display card in the cluster.

The display card shielding module 502 is configured to shield the display card when the number of the faults exceeds a preset display card fault threshold.

On the basis of the above technical solution, the shielding operation on the graphics card is specifically to shield the address of the graphics card and forward the request sent to the graphics card to other graphics cards.

Further, on the basis of the technical scheme, a shielding duration is set for the display card: the shielding operation is applied to the display card within the shielding duration, and service requests are again allowed to be sent to the display card once the shielding duration has elapsed. In this embodiment, when the fault count of the display card reaches the set display card fault threshold, the display card is shielded and timing starts at the same time. When the shielding time reaches the set shielding duration, an attempt is made to send a processing request to the display card again.

The fault count of the display card is retained at this point: if the service request is processed successfully, the fault count in the shared memory is decremented by 1; if the service request fails, the display card is shielded again.

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which includes a processor and a memory, where the memory stores a computer-executable program, and when the computer-executable program is executed by the processor, the processor executes the method for rapidly shielding a fault display card in a cluster.

As shown in fig. 6, the electronic device is in the form of a general purpose computing device. The processor can be one or more and can work together. The invention also does not exclude that distributed processing is performed, i.e. the processors may be distributed over different physical devices. The electronic device of the present invention is not limited to a single entity, and may be a sum of a plurality of entity devices.

The memory stores a computer executable program, typically machine readable code. The computer readable program may be executed by the processor to enable an electronic device to perform the method of the invention, or at least some of the steps of the method.

The memory may include volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may also be non-volatile memory, such as read-only memory (ROM).

Optionally, in this embodiment, the electronic device further includes an I/O interface, which is used for data exchange between the electronic device and an external device. The I/O interface may be a local bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, and/or a memory storage device using any of a variety of bus architectures.

It should be understood that the electronic device shown in fig. 6 is only one example of the present invention, and elements or components not shown in the above example may be further included in the electronic device of the present invention. For example, some electronic devices further include a display unit such as a display screen, and some electronic devices further include a human-computer interaction element such as a button, a keyboard, and the like. Electronic devices are considered to be covered by the present invention as long as the electronic devices are capable of executing a computer-readable program in a memory to implement the method of the present invention or at least a part of the steps of the method.

Fig. 7 is a schematic diagram of a computer-readable recording medium of an embodiment of the present invention. As shown in fig. 7, the computer-readable recording medium stores a computer-executable program, and when the computer-executable program is executed, the method for rapidly shielding a fault display card in a cluster according to the present invention is implemented. The computer-readable medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

From the above description of the embodiments, those skilled in the art will readily appreciate that the present invention can be implemented by hardware capable of executing a specific computer program, such as the system of the present invention, and electronic processing units, servers, clients, mobile phones, control units, processors, etc. included in the system, and the present invention can also be implemented by a vehicle including at least a part of the above system or components. The invention can also be implemented by computer software for performing the method of the invention, for example, by control software executed by a microprocessor, an electronic control unit, a client, a server, etc. of the locomotive side. It should be noted that the computer software for executing the method of the present invention is not limited to be executed by one or a specific hardware entity, but may also be implemented in a distributed manner by hardware entities without specific details, for example, some method steps executed by the computer program may be executed at the locomotive end, and another part may be executed in the mobile terminal or the smart helmet, etc. For computer software, the software product may be stored in a computer readable storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or may be distributed over a network, as long as it enables the electronic device to perform the method according to the present invention.

While the foregoing embodiments describe the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently tied to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement it. The invention is not limited to the specific embodiments described; all modifications, changes and equivalents that come within the spirit and scope of the invention are intended to be covered by it.

Claims

1. A method for rapidly shielding a fault display card in a cluster is characterized in that:

2. The method for rapidly shielding the fault display card in the cluster as claimed in claim 1, wherein the display card is a physical display card, and the faults of each display card are counted using inter-process communication.

3. The method according to claim 2, wherein the inter-process communication is shared memory communication.

4. The method for rapidly shielding the display cards with faults in the cluster as claimed in claim 3, wherein the step of counting the faults of each display card in the cluster specifically comprises the following steps:

5. The method as claimed in claim 4, wherein the shielding operation performed on the graphics card specifically comprises shielding the address of the graphics card and forwarding the service requests sent to the graphics card to other graphics cards.

6. The method as claimed in claim 5, wherein a shielding duration is set for the graphics card; the shielding operation is maintained on the graphics card while it is within the shielding duration, and service requests are allowed to be sent to the graphics card again once the shielding duration has been exceeded.

7. A device for rapidly shielding a fault display card in a cluster, characterized in that:

the display card shielding module is used for shielding the display card when the number of faults exceeds a preset display card fault threshold value; optionally, the display cards are physical display cards, and inter-process communication is adopted to count the faults of each display card;

optionally, the inter-process communication is shared memory communication;

optionally, the counting the faults occurring in each graphics card in the cluster specifically includes:

8. The apparatus for rapidly shielding a failed graphics card in a cluster according to claim 7, wherein the shielding operation performed on the graphics card specifically is to shield an address of the graphics card and forward a service request sent to the graphics card to other graphics cards;

optionally, a shielding duration is set for the display card; the shielding operation is maintained on the display card while it is within the shielding duration, and service requests are allowed to be sent to the display card again once the shielding duration has been exceeded.

9. An electronic device comprising a processor and a memory, the memory for storing a computer-executable program, characterized in that:

the computer program, when executed by the processor, performs the method of any one of claims 1-8.

10. A computer-readable medium storing a computer-executable program, wherein the computer-executable program, when executed, implements the method of any of claims 1-8.
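
The behaviour recited in the method claims (count faults per display card, shield a card once its count exceeds the preset threshold, forward its requests to other cards, and admit requests again after the shielding duration) can be illustrated with a minimal Python sketch. The sketch is illustrative only and is not the implementation of the invention: the class name `FaultShield`, its method names, the injectable `clock` parameter, and the choice to reset the counter after shielding are all assumptions. For brevity, the counters are held in an in-process dictionary rather than in the shared memory described in claims 2 and 3.

```python
import time


class FaultShield:
    """Hypothetical helper: counts faults per card and shields faulty cards."""

    def __init__(self, cards, fault_threshold, shield_seconds,
                 clock=time.monotonic):
        self.cards = list(cards)                # card addresses (assumed strings)
        self.fault_threshold = fault_threshold  # preset display card fault threshold
        self.shield_seconds = shield_seconds    # shielding duration in seconds
        self.clock = clock                      # injectable clock, for testing
        self.fault_counts = {card: 0 for card in self.cards}
        self.shielded_until = {}                # card -> time at which shielding ends

    def record_fault(self, card):
        """Count one fault; shield the card once the count exceeds the threshold."""
        self.fault_counts[card] += 1
        if self.fault_counts[card] > self.fault_threshold:
            self.shielded_until[card] = self.clock() + self.shield_seconds
            # Assumption: counting restarts after a card has been shielded.
            self.fault_counts[card] = 0

    def is_shielded(self, card):
        """Return True while the card is still within its shielding duration."""
        until = self.shielded_until.get(card)
        if until is None:
            return False
        if self.clock() >= until:
            # Shielding duration exceeded: requests are allowed again.
            del self.shielded_until[card]
            return False
        return True

    def route(self, requested_card):
        """Return the card that should serve a request sent to requested_card."""
        if not self.is_shielded(requested_card):
            return requested_card
        # Forward the request to the first card that is not shielded.
        for card in self.cards:
            if not self.is_shielded(card):
                return card
        raise RuntimeError("no unshielded display card available")
```

In this sketch, a caller would invoke `record_fault(card)` whenever a request to a card fails and `route(card)` before dispatching each request. A multi-process deployment would need to keep the counters in shared memory with suitable synchronization, which the sketch does not attempt.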