CN116643906A - Cloud platform fault processing method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN116643906A
Authority
CN
China
Prior art keywords
fault
list
repair
cloud platform
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310644972.3A
Other languages
Chinese (zh)
Inventor
许涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capitalonline Data Service Co ltd
Original Assignee
Capitalonline Data Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capitalonline Data Service Co ltd
Priority to CN202310644972.3A
Publication of CN116643906A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0712 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a virtual computing platform, e.g. logically partitioned systems
    • G06F11/0793 Remedial or corrective actions
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1479 Generic software techniques for error detection or fault masking
    • G06F11/1482 Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G06F11/1484 Generic software techniques for error detection or fault masking by means of middleware or OS functionality involving virtual machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a cloud platform fault processing method and device, an electronic device, and a storage medium. The method may comprise: when a fault is detected using operating parameters of a storage cluster in the cloud platform, determining a list of objects affected by the fault, the objects including virtual machines; and repairing the objects in the object list using an automatic fault repair technique matched to the fault, the automatic fault repair technique being generated in advance based on historical faults. This process achieves automatic detection and automatic repair of faults.

Description

Cloud platform fault processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of cloud computing technologies, and in particular, to a method and an apparatus for processing a cloud platform fault, an electronic device, and a storage medium.
Background
With the rise of technologies such as cloud computing and the internet of things, data is continuously growing and accumulating at an unprecedented speed. The number of servers is continuously increased, and the operation and maintenance pressure of the cloud platform is proportionally increased. Service interruption caused by various software and hardware faults becomes one of important factors affecting the stability of the cloud platform. If the cloud platform is not stable in operation, inconvenience is caused to a plurality of users due to long fault recovery time.
Disclosure of Invention
The embodiment of the application provides a cloud platform fault processing method, a cloud platform fault processing device, electronic equipment and a storage medium, so as to realize automatic detection and automatic repair of faults.
In a first aspect, an embodiment of the present application provides a method for processing a cloud platform fault, where the method may include the following steps:
when a fault is detected using operating parameters of a storage cluster in the cloud platform, determining a list of objects affected by the fault, wherein the objects include virtual machines;
repairing the objects in the object list using an automatic fault repair technique matched to the fault, wherein the automatic fault repair technique is generated in advance based on historical faults.
In a second aspect, an embodiment of the present application provides a device for processing a cloud platform fault, where the device may include:
an object list determining module, configured to determine a list of objects affected by a fault when the fault is detected using operating parameters of a storage cluster in the cloud platform, wherein the objects include virtual machines;
a fault repairing module, configured to repair the fault using an automatic fault repair technique matched to the fault when the object list meets a specified condition, wherein the automatic fault repair technique is generated in advance based on historical faults.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory, the processor implementing any one of the methods described above when the computer program is executed.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements a method as in any of the above.
Compared with the prior art, the application has the following advantages:
According to the embodiments of the application, when a fault is detected in a storage cluster of the cloud platform, the fault can be responded to quickly to determine its scope of impact. At the same time, an automatic fault repair technique is started to repair the fault automatically. The related services of the cloud platform can therefore be restored quickly, minimizing how long users are affected.
The foregoing is only an overview of the technical solution of the present application. So that the technical means of the application can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of the application are more readily apparent, specific embodiments are described below.
Drawings
In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. It is appreciated that these drawings depict only some embodiments according to the application and are not therefore to be considered limiting of its scope.
FIG. 1 is a flow chart of a method for processing a cloud platform fault provided by the application;
FIG. 2 is a schematic diagram of a method for handling a cloud platform failure according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a cloud platform failure handling apparatus according to an embodiment of the present application;
FIG. 4 is a block diagram of an electronic device for implementing an embodiment of the application.
Detailed Description
Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those skilled in the pertinent art, the described embodiments may be modified in numerous different ways without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
In order to facilitate understanding of the technical solutions of the embodiments of the present application, the following describes related technologies of the embodiments of the present application. The following related technologies may be optionally combined with the technical solutions of the embodiments of the present application, which all belong to the protection scope of the embodiments of the present application.
An embodiment of the present application provides a method for processing a cloud platform fault. FIG. 1 is a flowchart of the method, which may include:
step S101: when a fault is detected using operating parameters of a storage cluster in the cloud platform, determining a list of objects affected by the fault, wherein the objects include virtual machines;
step S102: repairing the objects in the object list using an automatic fault repair technique matched to the fault, wherein the automatic fault repair technique is generated in advance based on historical faults.
The method of the application may be executed by a device in the cloud platform that performs fault self-healing, hereinafter called the cloud platform self-healing device. The device may check the operating parameters of the storage cluster periodically to determine whether the storage cluster has failed. Alternatively, it may monitor the operating parameters in real time to determine whether the storage cluster has failed.
A fault can be determined along several dimensions. For example, the work log of the storage cluster may be parsed to determine whether a fault has occurred. As another example, network read-write parameters may be parsed. As a further example, disk state parameters may be parsed. Whether a fault has occurred can thus be judged by analyzing operating parameters in different dimensions.
The judging manner of the fault may include judging the detected operation parameters of the storage cluster based on a preset operation parameter permission range so as to determine whether the fault occurs, the severity of the fault and the like. Furthermore, different operating parameters may have corresponding fault types. For example, the disk state parameter may correspond to a storage failure, the network read-write parameter may correspond to a load balancing failure, etc.
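The range-based judgment described above can be sketched as follows. This is only an illustration under stated assumptions: the parameter names (`disk_error_rate`, `net_latency_ms`), the permitted ranges, and the severity rule are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamRule:
    low: float        # lower bound of the permitted range
    high: float       # upper bound of the permitted range
    fault_type: str   # fault type associated with this parameter


# Hypothetical permitted ranges; real values would come from operations experience.
RULES = {
    "disk_error_rate": ParamRule(0.0, 0.01, "storage fault"),
    "net_latency_ms": ParamRule(0.0, 50.0, "load balancing fault"),
}


@dataclass(frozen=True)
class Fault:
    parameter: str
    fault_type: str
    severity: str


def judge(params: dict[str, float]) -> list[Fault]:
    """Return one Fault per parameter that lies outside its permitted range."""
    faults = []
    for name, value in params.items():
        rule = RULES.get(name)
        if rule is None or rule.low <= value <= rule.high:
            continue  # unknown parameter or value within the permitted range
        # Illustrative severity rule: above twice the upper bound counts as severe.
        severity = "severe" if value > 2 * rule.high else "minor"
        faults.append(Fault(name, rule.fault_type, severity))
    return faults
```

Calling `judge({"disk_error_rate": 0.05, "net_latency_ms": 20.0})` would report a single severe storage fault under these assumed ranges, because only the disk parameter is out of range.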
In addition, fault determination can be performed with a pre-trained fault recognition model. The operating parameters of the storage cluster are input into the trained model, which outputs a conclusion on whether a fault has occurred. The model may be trained using positive and negative operating-parameter samples as input data, with the label of each positive sample (no fault) and the label of each negative sample (fault) as the expected output. During training, the input data are fed to the model to be trained, which outputs a prediction result. The prediction may be expressed as percentages, for example a probability of fault of a% and a probability of no fault of b%, where a% + b% = 100%. The parameters of the model are then optimized iteratively using the difference between the prediction and the label until the output meets the condition. Through sample labeling, the model can also learn to distinguish fault types and severities. For example, negative samples of a first kind may be labeled as type A faults, negative samples of a second kind as type B faults, and so on. The fault recognition model can thus determine whether a fault has occurred, the type of the fault, and its severity.
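The application does not name a model family. As one possible illustration, the training loop described above can be sketched with a small logistic regression in plain Python. The two features, the sample values, the learning rate, and the epoch count are all assumptions made for this sketch; here label 1 marks a negative (faulty) sample and label 0 a positive (no fault) sample.

```python
import math

# Hypothetical normalized features: [disk error level, I/O latency level].
SAMPLES = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15],   # positive samples (no fault)
           [0.8, 0.9], [0.9, 0.7], [0.85, 0.8]]    # negative samples (fault)
LABELS = [0, 0, 0, 1, 1, 1]                        # 1 = fault, 0 = no fault


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=2000, lr=0.5):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # difference between prediction and label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return the fault and no-fault probabilities, which sum to 1."""
    p_fault = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return {"fault": p_fault, "no_fault": 1.0 - p_fault}
```

After `weights, bias = train(SAMPLES, LABELS)`, calling `predict(weights, bias, [0.9, 0.9])` gives the two probabilities that correspond to a% and b% in the text. Distinguishing fault types or severities would need a multi-class model, which this sketch does not attempt.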
When a fault is detected, an automatic fault repair process may be initiated. First, the scope affected by the fault may be determined. The scope may correspond to a range of resources, which may include CPU resources, memory resources, storage resources, network resources, and the like. Virtual machines act as the consumers of these resources, and the storage cluster is bound to multiple virtual machines. Each virtual machine belongs to a user: one user may own one virtual machine or several. Because virtual machines are bound to users, once the virtual machines affected by the fault are determined, the affected users and their products can be determined as well. A user operates a product through a virtual machine. The list of objects affected by the fault may therefore include, for example, a list of affected virtual machines, a list of affected customers, a list of affected products, and so on.
The object list may meet the specified condition when, for example, the number of virtual machines in the list exceeds a virtual machine number threshold, the users in the list are special users, or the audience of a product exceeds an audience number threshold. These conditions are only examples. When the object list meets the specified condition, the automatic fault repair technique matched to the fault can be applied. For example, the technique may automatically mount a repair disk on the virtual machine, automatically repair a failed disk, or the like. The automatic fault repair technique is built in advance based on historical faults. For example, historical faults may be handled manually or automatically, and that handling is used to create the automatic fault repair technique corresponding to each fault. When a fault occurs, the corresponding technique can be invoked directly, so the fault is repaired automatically and quickly.
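One way to picture "a repair technique matched to the fault" is a registry keyed by fault type that is filled in ahead of time. The sketch below is an assumption-laden illustration: the registry, the decorator, the fault type string, the threshold value, and the placeholder repair action are all hypothetical.

```python
from typing import Callable

# Registry of pre-built repair techniques, keyed by fault type (hypothetical).
REPAIR_TECHNIQUES: dict[str, Callable[[str], str]] = {}


def repair_technique(fault_type: str):
    """Decorator that registers a repair function for one fault type."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        REPAIR_TECHNIQUES[fault_type] = func
        return func
    return register


@repair_technique("storage fault")
def mount_repair_disk(vm_id: str) -> str:
    # Placeholder: a real system would mount a repair disk and fix the failed disk.
    return f"repair disk mounted on {vm_id}"


def meets_condition(object_list: list[str], vm_threshold: int = 2) -> bool:
    """Example condition: more virtual machines than the threshold are affected."""
    return len(object_list) > vm_threshold


def auto_repair(fault_type: str, object_list: list[str]) -> list[str]:
    """Apply the matched technique to every object if the condition is met."""
    technique = REPAIR_TECHNIQUES.get(fault_type)
    if technique is None or not meets_condition(object_list):
        return []  # no matching technique, or the list does not meet the condition
    return [technique(vm_id) for vm_id in object_list]
```

Keeping the techniques in a lookup table means a newly analyzed historical fault only adds one registered function, and the dispatch code stays unchanged.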
Through the process, under the condition that the storage clusters in the cloud platform are detected to be faulty, quick response of the faults can be carried out to determine the influence range of the faults. And simultaneously, starting an automatic fault repairing technology, and automatically repairing the faults. Therefore, the related service of the cloud platform can be quickly recovered, and the influence duration on the user is reduced to the greatest extent.
In one embodiment, the determining the object list affected by the fault referred to in step S101 may include:
step S1011: determining an object related to the attribute of the fault by utilizing a pre-constructed knowledge graph; the attribute includes at least one of a type of fault and a severity of the fault;
step S1012: a list is constructed using the objects associated with the attributes of the fault.
When a fault is detected, the attributes of the fault may be determined first. The attributes may include at least one of the type of the fault and the severity of the fault.
In addition, a knowledge graph may be constructed in advance. The knowledge graph may be related to a configuration management database (Configuration Management Database, CMDB). The CMDB is an information base covering all components in the cloud platform and contains detailed information on the configuration items of the cloud platform infrastructure. The knowledge graph related to the CMDB may contain fault impact information. For example, for a first type of fault, the knowledge graph may include at least one first object directly affected by that fault. At least one second object indirectly affected by the fault is then determined from the associations between other objects and the first object. That is, the first type of fault prevents the first object from working properly, and that in turn may prevent the second object from working properly. Note that the first object may be affected because it has itself failed, or for reasons other than its own failure, for example because an unbalanced network load prevents it from reading data normally.
Based on the pre-constructed knowledge graph, the object affected by the fault can be determined according to at least one of the type of the fault and the severity of the fault. Thus, a list can be constructed based on the objects affected by the fault. Thus, on the one hand, the fault coverage can be quickly located on the basis of the list. On the other hand, alarm information can be generated based on the list, so that fault notification can be performed quickly.
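The lookup of directly and indirectly affected objects can be sketched as a breadth-first traversal. The graph below is a toy stand-in for a CMDB knowledge graph: the node names, the two dictionaries, and the fault type string are hypothetical.

```python
from collections import deque

# Objects directly affected by each fault type (hypothetical data).
DIRECT_IMPACT = {"storage fault": ["disk-1"]}

# For each object, the objects that depend on it (hypothetical data).
DEPENDENTS = {
    "disk-1": ["vm-a", "vm-b"],
    "vm-a": ["product-x"],
    "vm-b": [],
    "product-x": [],
}


def affected_objects(fault_type: str) -> list[str]:
    """Collect first objects (direct) and second objects (indirect) in BFS order."""
    queue = deque(DIRECT_IMPACT.get(fault_type, []))
    visited = set(queue)
    ordered = []
    while queue:
        node = queue.popleft()
        ordered.append(node)
        for dependent in DEPENDENTS.get(node, []):
            if dependent not in visited:  # avoid listing an object twice
                visited.add(dependent)
                queue.append(dependent)
    return ordered
```

With this toy graph, `affected_objects("storage fault")` returns `["disk-1", "vm-a", "vm-b", "product-x"]`, and the result could serve as the raw input for building the object list.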
In one embodiment, the construction of the list using the objects related to the attribute of the fault referred to in step S1012 may include:
step S10121: screening the objects related to the attribute of the fault and retaining the target objects;
step S10122: an object list is determined from the target object.
Screening may include determining whether an object related to the attribute of the fault is powered on, is in use, and so on. The storage cluster is bound to multiple virtual machines, so when a fault occurs the knowledge graph generally yields multiple virtual machines related to the attribute of the fault. If all of them appeared in the object list, the list could be inaccurate. For example, a virtual machine that is not powered on or not in use is not affected by the fault.
Therefore, screening the objects related to the attribute of the fault filters out some of the virtual machines and makes the object list more accurate.
In one embodiment, the identifying the object related to the attribute of the fault, which is referred to in step S10121, and retaining the target object may include:
step S101211: traversing the console pages of the objects related to the fault attributes to determine the objects in a read-only state;
step S101212: an object in a read-only state is determined as a target object.
Image recognition may be applied to the console pages of the objects related to the attribute of the fault. That is, the console pages of the virtual machines are recognized as images to find the virtual machines in a read-only state. The read-only state is checked for two reasons. In one case, I/O on the cloud platform host is busy and I/O requests cannot be answered in time, which produces disk I/O errors, and the disk partition is switched to a read-only state to protect its data. In the other case, the host is forcibly shut down, which causes file system errors on the disk and likewise puts it into a read-only state.
Based on this, the object in the read-only state can be determined as a true object related to the attribute of the failure. That is, the object in the read-only state is determined as the object that is truly affected by the failure. Thus, the object in the read-only state can be determined as the target object. Thereby generating an object list using the target object.
In one embodiment, before traversing the console page of the object related to the attribute of the fault, the method may further include:
step S1012123: controlling an object related to the attribute of the fault to execute a click event;
step S1012124: objects that do not respond to the click event are filtered out.
A screening operation may also be performed before traversing the console pages of the objects related to the attribute of the fault. Its purpose is to identify objects that are not powered on or not in a working state. To do so, each object related to the attribute of the fault may be controlled to execute a click event. The click event may simulate a user operation. If the object is powered off or not working, it cannot respond to the click event; if it is working, it can respond, for example by executing the corresponding program.
Filtering out the objects that do not respond to the click event before traversing the console pages reduces the subsequent workload and improves the accuracy with which objects related to the attribute of the fault are judged.
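The two screening stages (click response first, read-only check second) can be sketched as a small filter pipeline. This sketch does not perform image recognition: the `console_text` field stands in for whatever text an image recognition step would read off the console page, and the `"read-only"` marker and the sample console messages are assumptions.

```python
from dataclasses import dataclass


@dataclass
class VirtualMachine:
    vm_id: str
    responds_to_click: bool  # outcome of the simulated click event
    console_text: str        # stand-in for text recognized on the console page


def is_read_only(vm: VirtualMachine) -> bool:
    """Assumed heuristic: the console page mentions a read-only file system."""
    return "read-only" in vm.console_text.lower()


def target_objects(candidates: list[VirtualMachine]) -> list[str]:
    # Stage 1: drop objects that did not respond to the click event.
    responsive = [vm for vm in candidates if vm.responds_to_click]
    # Stage 2: keep only objects whose console page shows a read-only state.
    return [vm.vm_id for vm in responsive if is_read_only(vm)]


CANDIDATES = [
    VirtualMachine("vm-a", True, "Remounting filesystem read-only"),
    VirtualMachine("vm-b", True, "login:"),
    VirtualMachine("vm-c", False, ""),
]
```

Here `target_objects(CANDIDATES)` keeps only `vm-a`: `vm-c` is dropped by the click check and `vm-b` by the read-only check. Running the cheap click check first means the more expensive console analysis only runs on objects that are actually working.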
In one embodiment, the method may further include:
step S103: counting the number of objects in the object list;
step S104: and generating a fault list under the condition that the number is not lower than the corresponding number threshold value.
The number threshold may be determined based on the business, the scenario, or historical experience. When the number of objects in the object list is not lower than the corresponding number threshold, the fault may be considered a platform-level fault. In that case, a fault list may be generated for a platform-level alarm. In addition, the fault list can be sent to the affected users, identified through the users and products corresponding to the virtual machines, so that they are notified in time.
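Steps S103 and S104 can be sketched as a count check followed by grouping the affected virtual machines by owner for notification. The dictionary layout of the fault list and the `vm_owner` mapping are hypothetical choices for this sketch.

```python
from collections import defaultdict
from typing import Optional


def build_fault_list(object_list: list[str],
                     vm_owner: dict[str, str],
                     threshold: int) -> Optional[dict]:
    """Return a fault list when the object count reaches the threshold, else None."""
    if len(object_list) < threshold:
        return None  # below the threshold: not treated as a platform-level fault
    by_user = defaultdict(list)
    for vm_id in object_list:
        # Group affected virtual machines by owner so each user can be notified.
        by_user[vm_owner.get(vm_id, "unknown")].append(vm_id)
    return {
        "level": "platform",
        "affected_count": len(object_list),
        "by_user": dict(by_user),
    }
```

Note that the comparison is "not lower than", so a list whose size equals the threshold already produces a fault list.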
In one embodiment, the method may further include:
step S105: determining the repair progress of the fault;
step S106: and controlling the display end to display the repair progress of the fault.
The repair progress of the fault can be determined based on time or based on the completion of the fault repair. For example, the repair time of the current fault may be estimated based on the fault history, and the estimated time may be obtained to determine the repair progress of the fault in the form of the completion time or the remaining time. For another example, the progress of the repair of the fault may be determined in percent based on the completion of the repair of the fault.
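Both ways of expressing progress can be sketched in a few lines. Using the mean of past repair durations as the time estimate is an assumption of this sketch; the application only says the estimate is based on fault history.

```python
def progress_by_completion(repaired: int, total: int) -> float:
    """Progress as a percentage of objects already repaired."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * repaired / total


def remaining_seconds(elapsed: float, historical_durations: list[float]) -> float:
    """Estimated remaining time, using the mean historical repair duration."""
    if not historical_durations:
        raise ValueError("no repair history available")
    estimate = sum(historical_durations) / len(historical_durations)
    return max(0.0, estimate - elapsed)  # never report negative remaining time
```

For instance, with 3 of 12 objects repaired the completion-based progress is 25%, and with past repairs of 300 s and 500 s and 100 s already elapsed the time-based estimate leaves 300 s.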
After the repair progress is determined, the display end can be controlled to display the repair progress. The display end can be the display end of the fault platform side or the display end of the user side.
In one embodiment, the method may further include:
step S107: detecting the repair result, wherein the detection result is either repair succeeded or repair failed;
step S108: generating a repair report according to the detection result.
After the repair is finished, the operating parameters of the storage cluster in the cloud platform can be used again to check whether the fault has been repaired. That is, the repair result is detected, and the detection result is either success or failure. If the repair succeeded, a repair report may be generated. The report may include the content of the fault, the time the fault occurred, the objects affected by the fault, the repair time of the fault, and so on. If the repair did not succeed, operation and maintenance personnel can be notified to start a manual repair process.
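Steps S107 and S108 can be sketched as a re-check followed by either a report or an escalation. The `detect_fault` callable stands in for re-running fault detection on the storage cluster's operating parameters; the report fields follow the text, while the field names and the returned dictionary layout are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass
class RepairReport:
    fault: str
    occurred_at: str
    affected_objects: list[str]
    repair_seconds: float


def finish_repair(fault: str,
                  occurred_at: str,
                  affected: list[str],
                  repair_seconds: float,
                  detect_fault: Callable[[], bool]) -> dict:
    """Re-run detection; report on success, escalate to manual repair otherwise."""
    if detect_fault():
        # The fault is still detected, so the automatic repair did not succeed.
        return {"result": "repair failed",
                "action": "notify operations staff for manual repair"}
    report = RepairReport(fault, occurred_at, affected, repair_seconds)
    return {"result": "repair succeeded", "report": asdict(report)}
```

Passing detection in as a callable keeps the sketch independent of how the operating parameters are actually collected.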
FIG. 2 is a schematic architecture diagram of the method for processing a cloud platform fault according to the present application. The architecture may include a service layer, a logic layer, and a data layer. The data layer includes a storage cluster module, a disk module, a network module, and the knowledge graph related to the configuration management database (the CMDB knowledge graph).
The logic layer includes an image recognition module, a rule engine module, and a monitoring alarm module. The rule engine module further includes a storage fault identification sub-module and an impact range locating sub-module. The storage fault identification sub-module monitors the health of the storage cluster in real time: it analyzes the storage cluster error log, analyzes network read-write data, reads the disk state and other multi-dimensional data, and identifies faults in the storage cluster. When a fault is identified, the monitoring alarm module can alert operation and maintenance personnel immediately. Illustratively, identifying a fault may mean identifying a Rebuild event, identifying Failed data in a Rebuild event, or the like.
The impact range locating sub-module pulls the lists of affected virtual machines, customers, products, and so on, based on the knowledge graph related to the configuration management database.
The image recognition module uses image recognition to detect anomalies on the virtual machine console pages (corresponding to the dashboard of the service layer), including identifying virtual machines in a read-only state. If the number of virtual machines in a read-only state exceeds a number threshold, the fault can be classified as a platform-level fault. A fault list is then generated and users are alerted through the monitoring alarm module.
When a fault occurs, the self-repair system of the service layer, without human intervention, automatically mounts a repair disk on the virtual machine, automatically repairs the failed disk, and automatically checks whether the repair succeeded.
Corresponding to the application scenarios and the method provided by the embodiments of the application, an embodiment of the application also provides a device for processing a cloud platform fault. FIG. 3 is a block diagram of the device, which may include:
an object list determining module 301, configured to determine, when a failure is detected by using an operation parameter of a storage cluster in the cloud platform, an object list affected by the failure, where the object includes a virtual machine;
the fault repairing module 302 is configured to repair a fault by using an automatic fault repairing technology matched with the fault when the object list meets a specified condition; the automatic fault repair technique is pre-generated based on historical faults.
In one embodiment, the object list determining module 301 may include:
an object determination submodule for determining an object related to the attribute of the fault by utilizing a pre-constructed knowledge graph; the attribute includes at least one of a type of fault and a severity of the fault;
a list construction sub-module for constructing a list using the objects associated with the properties of the fault.
In one embodiment, the list construction sub-module may include:
an identifying unit, configured to screen the objects related to the attribute of the fault and retain the target objects;
a list construction unit, configured to determine the object list according to the target objects.
In one embodiment, the identifying unit may include:
a read-only state identification subunit, configured to traverse the console pages of the objects related to the attribute of the fault and determine the objects in a read-only state;
a target object determining subunit, configured to determine the objects in a read-only state as target objects.
In one embodiment, the identifying unit may further include:
a click event execution control subunit, configured to control the objects related to the attribute of the fault to execute a click event;
a filtering subunit, configured to filter out the objects that do not respond to the click event.
In one embodiment, the method may further include:
the object number counting module is used for counting the number of objects in the object list;
and the fault list generation module is used for generating a fault list under the condition that the number is not lower than the corresponding number threshold value.
In one embodiment, the method may further include:
the repair progress determining module is used for determining the repair progress of the fault;
and the repair progress display control module is used for controlling the display end to display the repair progress of the fault.
In one embodiment, the device may further include:
a repair result detection module for detecting the repair result, wherein the detection result is either repair success or repair failure;
and a repair report generation module for generating a repair report according to the detection result.
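As an illustration of result detection and report generation, the sketch below assumes that a per-object health check has already produced a pass or fail flag for each affected object. The report fields and the `fault_id` parameter are hypothetical.

```python
from typing import Dict


def build_repair_report(fault_id: str, check_results: Dict[str, bool]) -> Dict:
    """Summarize per-object checks into a repair success or failure report."""
    # Objects whose post-repair check failed.
    unrepaired = sorted(name for name, ok in check_results.items() if not ok)
    return {
        "fault_id": fault_id,
        "result": "repair failure" if unrepaired else "repair success",
        "unrepaired_objects": unrepaired,
    }


report = build_repair_report("fault-001", {"vm-01": True, "vm-02": False})
```

Treating any single failed check as an overall failure is a design choice of this sketch, not a requirement stated in the application.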
For the functions of each module in the devices of the embodiments of the present application, refer to the corresponding descriptions in the methods above; the modules provide the corresponding beneficial effects, which are not repeated here.
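How the object list determining module and the fault repairing module might cooperate can be sketched end to end. Everything concrete here is an assumption: the operation parameter `io_latency_ms`, the latency limit, the use of a minimum object count as the "specified condition", and the registry that maps fault types to repair routines prepared in advance from historical faults.

```python
from typing import Callable, Dict, List, Optional

# A repair routine takes the affected objects and reports whether it succeeded.
RepairFn = Callable[[List[str]], bool]


def detect_fault(params: Dict[str, float], latency_limit_ms: float = 500.0) -> Optional[str]:
    """Hypothetical detector: flag a fault when storage I/O latency is too high."""
    if params.get("io_latency_ms", 0.0) > latency_limit_ms:
        return "storage_latency"
    return None


def handle_fault(params: Dict[str, float], object_list: List[str],
                 min_objects: int, repairs: Dict[str, RepairFn]) -> str:
    """Detect a fault, check the object list condition, then run the matched repair."""
    fault_type = detect_fault(params)
    if fault_type is None:
        return "no fault"
    # Assumed form of the specified condition: enough objects are affected.
    if len(object_list) < min_objects:
        return "condition not met"
    repair = repairs.get(fault_type)
    if repair is None:
        return "no matching repair"
    return "repaired" if repair(object_list) else "repair failed"


# Registry of repair routines, standing in for techniques pre-generated from history.
repairs: Dict[str, RepairFn] = {"storage_latency": lambda objs: True}
status = handle_fault({"io_latency_ms": 900.0}, ["vm-01", "vm-02"], 2, repairs)
```

Keeping the repair routines in a registry keyed by fault type makes the "matched" relationship explicit and lets new routines be added without changing the handler.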
Fig. 4 is a block diagram of an electronic device for implementing an embodiment of the application. As shown in fig. 4, the electronic device includes: memory 410 and processor 420, memory 410 stores a computer program executable on processor 420. The processor 420, when executing the computer program, implements the methods of the above-described embodiments. The number of memories 410 and processors 420 may be one or more.
The electronic device further includes:
a communication interface 430 for communicating with external devices and exchanging data.
If the memory 410, the processor 420, and the communication interface 430 are implemented independently, they may be connected to and communicate with each other through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is drawn as a single thick line in Fig. 4, but this does not mean that there is only one bus or only one type of bus.
Alternatively, in a specific implementation, if the memory 410, the processor 420, and the communication interface 430 are integrated on a chip, the memory 410, the processor 420, and the communication interface 430 may communicate with each other through internal interfaces.
The embodiment of the application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the method provided in the embodiment of the application.
The embodiment of the application also provides a chip comprising a processor configured to call instructions stored in a memory and run them, so that a communication device on which the chip is installed executes the method provided by the embodiment of the application.
The embodiment of the application also provides a chip comprising an input interface, an output interface, a processor, and a memory, which are connected through an internal connection path. The processor is configured to execute code in the memory, and when the code is executed, the processor performs the method provided by the embodiment of the application.
It should be appreciated that the processor may be a central processing unit (CPU), but may also be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor. It is noted that the processor may be a processor supporting the Advanced RISC Machines (ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory. The memory may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory, among others. The volatile memory may include random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, for example static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synclink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions in accordance with the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
In this specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples, provided that they do not contradict one another.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Any process or method described in flow charts or otherwise herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application also includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order, depending on the functions involved.
Logic and/or steps described in the flowcharts or otherwise described herein may, for example, be considered an ordered listing of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the embodiments described above may be completed by a program instructing the associated hardware; when executed, the program performs one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules described above, if implemented in the form of software functional modules and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is merely an exemplary embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various changes or substitutions within the technical scope of the present application, and these should be covered in the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (11)

1. A cloud platform fault processing method, characterized by comprising:
in the case that a fault is detected by using the operation parameters of a storage cluster in a cloud platform, determining an object list affected by the fault, wherein the objects comprise virtual machines;
in the case that the object list meets a specified condition, repairing the fault by using an automatic fault repair technique matched with the fault; the automatic fault repair technique is pre-generated based on historical faults.
2. The method of claim 1, wherein said determining a list of objects affected by said fault comprises:
determining an object related to the attribute of the fault by utilizing a pre-constructed knowledge graph; the attribute includes at least one of a type of fault and a severity of the fault;
constructing the list using the objects related to the attribute of the fault.
3. The method of claim 2, wherein constructing the list using objects related to the attributes of the fault comprises:
identifying the objects related to the attribute of the fault, and retaining a target object;
and determining the object list according to the target object.
4. The method of claim 3, wherein said identifying the objects related to the attribute of the fault and retaining a target object comprises:
traversing the console pages of the objects related to the attribute of the fault to determine the objects in a read-only state;
and determining the object in the read-only state as a target object.
5. The method of claim 4, further comprising, prior to traversing the console pages of the objects related to the attribute of the fault:
controlling an object related to the attribute of the fault to execute a click event;
filtering out objects which do not respond to the click event.
6. The method as recited in claim 1, further comprising:
counting the number of objects in the object list;
and generating a fault list under the condition that the number is not lower than a corresponding number threshold value.
7. The method as recited in claim 1, further comprising:
determining the repair progress of the fault;
and controlling a display end to display the repair progress of the fault.
8. The method as recited in claim 1, further comprising:
detecting the repair result, wherein the detection result is either repair success or repair failure;
and generating a repair report according to the detection result.
9. A cloud platform fault handling device, comprising:
an object list determining module for determining an object list affected by a fault when the fault is detected by using the operation parameters of a storage cluster in the cloud platform, wherein the objects comprise virtual machines;
a fault repairing module for repairing the fault by using an automatic fault repair technique matched with the fault in the case that the object list meets a specified condition; the automatic fault repair technique is pre-generated based on historical faults.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory, the processor implementing the method of any one of claims 1-8 when the computer program is executed.
11. A computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-8.
CN202310644972.3A 2023-06-01 2023-06-01 Cloud platform fault processing method and device, electronic equipment and storage medium Pending CN116643906A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310644972.3A CN116643906A (en) 2023-06-01 2023-06-01 Cloud platform fault processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116643906A true CN116643906A (en) 2023-08-25

Family

ID=87624436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310644972.3A Pending CN116643906A (en) 2023-06-01 2023-06-01 Cloud platform fault processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116643906A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117762464A (en) * 2023-12-29 2024-03-26 中睿信数字技术有限公司 Cloud computing-based software operation and maintenance system and method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103167004A (en) * 2011-12-15 2013-06-19 中国移动通信集团上海有限公司 Cloud platform host system fault correcting method and cloud platform front control server
US20140059379A1 (en) * 2012-08-24 2014-02-27 Vmware, Inc. Proactive resource reservation for protecting virtual machines
WO2018054081A1 (en) * 2016-09-22 2018-03-29 华为技术有限公司 Fault processing method, virtual infrastructure management system and service management system
CN108108255A (en) * 2016-11-25 2018-06-01 中兴通讯股份有限公司 The detection of virtual-machine fail and restoration methods and device
CN108206747A (en) * 2016-12-16 2018-06-26 中国移动通信集团山西有限公司 Method for generating alarm and system
CN110955550A (en) * 2019-11-24 2020-04-03 济南浪潮数据技术有限公司 Cloud platform fault positioning method, device, equipment and storage medium
CN113722134A (en) * 2021-07-29 2021-11-30 浪潮电子信息产业股份有限公司 Cluster fault processing method, device and equipment and readable storage medium
CN115292003A (en) * 2022-08-01 2022-11-04 中国电信股份有限公司 Server failure recovery method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination