CN117349128A - Fault monitoring method, device and equipment of server cluster and storage medium - Google Patents

Fault monitoring method, device and equipment of server cluster and storage medium

Info

Publication number
CN117349128A
Authority
CN
China
Prior art keywords
server
server cluster
fault
information
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311654834.XA
Other languages
Chinese (zh)
Other versions
CN117349128B (en)
Inventor
陈栋
李春
魏兴华
李建辉
杨禹航
吴炎
臧冰凌
张文件
罗春
王显伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Woqu Technology Co ltd
Original Assignee
Hangzhou Woqu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Woqu Technology Co ltd
Priority to CN202311654834.XA
Publication of CN117349128A
Application granted
Publication of CN117349128B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention relates to the technical field of server monitoring, and in particular to a fault monitoring method, device and equipment for a server cluster and a storage medium. The method comprises the following steps: acquiring relation information of a server cluster; processing the relation information of the server cluster to generate a connection association diagram of the server cluster; acquiring current attribute information of the server cluster; and generating a fault state diagram of the server cluster according to the connection association diagram and the current attribute information, so that faults of the server cluster are monitored according to the fault state diagram. The fault state of each server in the server cluster can thus be queried intuitively, and the probability of an overall fault can be determined from the fault states and the connection relations between the servers, thereby realizing fault monitoring of the server cluster.

Description

Fault monitoring method, device and equipment of server cluster and storage medium
Technical Field
The present invention relates to the field of server monitoring technologies, and in particular, to a method, an apparatus, a device, and a storage medium for monitoring a failure of a server cluster.
Background
Database all-in-one machines are connected to one another by network communication and together form a logical service cluster, namely a database all-in-one machine cluster. In such a cluster, a server may crash or enter another abnormal state for various reasons, which can cause data anomalies; faults of the servers therefore need to be monitored.
Disclosure of Invention
In view of the above technical problem, the invention provides a fault monitoring method for a server cluster, which comprises the following steps:
acquiring relation information of the server cluster;
processing the relation information of the server cluster to generate a connection association diagram of the server cluster;
acquiring current attribute information of the server cluster; and
generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
The invention also provides a fault monitoring device for a server cluster, which comprises:
a first acquisition module, configured to acquire relation information of the server cluster;
a first generating module, configured to process the relation information of the server cluster and generate a connection association diagram of the server cluster;
a second acquisition module, configured to acquire current attribute information of the server cluster; and
a second generating module, configured to generate a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
The invention also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above fault monitoring method for a server cluster when executing the computer program.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above fault monitoring method for a server cluster.
Compared with the prior art, the invention has clear advantages and beneficial effects. With the above technical solution, the fault monitoring method, device, equipment and storage medium for a server cluster provided by the invention achieve considerable technical progress and practicality and have broad industrial applicability, with at least the following advantages:
The invention discloses a fault monitoring method, device and equipment for a server cluster and a storage medium, wherein the method comprises the following steps: acquiring relation information of a server cluster; processing the relation information of the server cluster to generate a connection association diagram of the server cluster; acquiring current attribute information of the server cluster; and generating a fault state diagram of the server cluster according to the connection association diagram and the current attribute information, so that faults of the server cluster are monitored according to the fault state diagram. The fault state of each server in the server cluster can thus be queried intuitively, and the probability of an overall fault can be determined from the fault states and the connection relations between the servers, thereby realizing fault monitoring of the server cluster.
The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that its features and advantages become more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flowchart of a fault monitoring method for a server cluster according to a first embodiment of the present invention;
FIG. 2 is a flowchart of step S2 according to the first embodiment of the present invention;
FIG. 3 is a flowchart of step S4 according to the first embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a fault monitoring device for a server cluster according to a second embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the first generating module 2 according to the second embodiment of the present invention;
FIG. 6 is a schematic structural diagram of the second generating module 4 according to the second embodiment of the present invention.
Detailed Description
To further describe the technical means adopted by the present invention to achieve its intended purpose, and the effects thereof, specific implementations and effects of the fault monitoring method for a server cluster according to the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and in the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
Embodiment 1
As shown in FIG. 1, the first embodiment provides a fault monitoring method for a server cluster, the method comprising:
S1, acquiring relation information of a server cluster, wherein the server cluster comprises a plurality of servers; for example, the servers are database all-in-one machines.
Specifically, the relation information of the server cluster includes the relation information of the plurality of servers, where the relation information of each server refers to the communication connection relation between that server and the other servers.
S2, processing the relation information of the server cluster to generate a connection association diagram of the server cluster.
Specifically, processing the relation information of the server cluster to generate the connection association diagram of the server cluster further includes the following steps, as shown in FIG. 2:
S21, determining a connection association server ID set of the server cluster according to the relation information of the server cluster.
In a specific embodiment, the connection association server ID set of the server cluster is determined by the following steps:
S211, acquiring the server ID list $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the server cluster, where $A_i$ is the $i$-th server ID, $i = 1, 2, \ldots, m$, and $m$ is the number of server IDs corresponding to the server cluster.
Specifically, the server ID is the unique identity of the server.
S212, acquiring, according to the relation information of the server cluster corresponding to $A$, the connection association server ID set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, with $B_i = \{B_{i1}, B_{i2}, \ldots, B_{ij}, \ldots, B_{in(i)}\}$, where $B_{ij}$ is the $j$-th connection association server ID corresponding to $A_i$, $j = 1, 2, \ldots, n(i)$, and $n(i)$ is the number of connection association server IDs corresponding to $A_i$; the connection association server ID set of the server cluster is $B$.
Further, a connection association server ID corresponding to $A_i$ is the unique identity of a server that has relation information with the server corresponding to $A_i$.
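By way of non-limiting illustration, steps S211 and S212 can be sketched as follows. The sketch assumes that the relation information is available as a list of undirected communication links between server IDs; the function name `build_association_sets` and the sample IDs are hypothetical and are not part of the disclosure.

```python
def build_association_sets(server_ids, links):
    """Return {A_i: B_i}, where B_i is the set of server IDs linked to A_i.

    Assumption: `links` is an iterable of (id_a, id_b) pairs, each describing
    one communication connection between two servers of the cluster.
    """
    assoc = {sid: set() for sid in server_ids}
    for a, b in links:
        if a == b:
            continue  # a server is not associated with itself
        if a not in assoc or b not in assoc:
            raise KeyError(f"link ({a}, {b}) references an unknown server ID")
        # Communication connections are treated as symmetric.
        assoc[a].add(b)
        assoc[b].add(a)
    return assoc


# Server ID list A and sample relation information (hypothetical values).
A = ["s1", "s2", "s3", "s4"]
links = [("s1", "s2"), ("s2", "s3"), ("s2", "s4")]

B = build_association_sets(A, links)
# n(i): number of connection association server IDs for each A_i.
n = {sid: len(B[sid]) for sid in A}
```

Here `B["s2"]` contains the three servers linked to `s2`, and `n` holds the counts $n(i)$ used in the subsequent steps.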
Specifically, the connection association diagram of the server cluster is a tree-structured association diagram comprising a connection-associated root node and $s$ layers of connection-associated leaf nodes, where the number of leaf nodes is not uniform across layers. The method further includes:
S22, generating the connection association diagram of the server cluster according to the connection association server ID set of the server cluster, which includes the following steps:
S221, acquiring the list of connection association server ID numbers $n = \{n(1), n(2), \ldots, n(i), \ldots, n(m)\}$ corresponding to $A$.
S222, determining the connection-associated root node according to $n$.
In a specific embodiment, when exactly one $n(i)$ is the minimum number of associated server IDs in $n$, the connection-associated root node is $A_i$.
In another specific embodiment, in step S222 the connection-associated root node is determined by the following steps:
S2221, acquiring, according to $n$, the first intermediate server ID set $C = \{C_1, C_2, \ldots, C_x, \ldots, C_p\}$, where $C_x$ is the $x$-th first intermediate server ID, $x = 1, 2, \ldots, p$, and $p$ is the number of first intermediate server IDs.
Further, a first intermediate server ID is a server ID corresponding to the minimum value in $n$.
S2222, acquiring from $B$ the list of associated server ID numbers $z = \{z(1), z(2), \ldots, z(x), \ldots, z(p)\}$ corresponding to $C$, where $z(x)$ is the number of associated server IDs corresponding to $C_x$.
S2223, when any $z(x)$ is the minimum number of associated server IDs in $z$, determining $C_x$ as the connection-associated root node.
The root node and the leaf nodes are found through the association relations, so that a reasonable tree-structured connection association diagram can be constructed and later combined with fault information; this allows the probability of an overall fault to be determined and the faults of the server cluster to be monitored effectively.
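By way of non-limiting illustration, the root node selection of steps S221 to S2223 can be sketched as follows. The description does not spell out how $z(x)$ differs from $n(i)$ for the tied candidates, so this sketch assumes, purely for illustration, that $z(x)$ is the total number of associated server IDs held by the servers associated with $C_x$. The function name `choose_root` and the sample data are hypothetical.

```python
def choose_root(assoc):
    """Pick the connection-associated root node from {server_id: set_of_neighbours}."""
    # S221: n(i) is the number of associated server IDs of each server.
    n = {sid: len(neigh) for sid, neigh in assoc.items()}
    min_n = min(n.values())
    # S2221: first intermediate server IDs C are the servers with minimal n(i).
    C = [sid for sid in assoc if n[sid] == min_n]
    if len(C) == 1:
        return C[0]  # exactly one minimum: it is the root
    # S2222 (assumed reading): z(x) sums the association counts of the
    # servers associated with C_x.
    z = {c: sum(n[b] for b in assoc[c]) for c in C}
    min_z = min(z.values())
    # S2223: any C_x with minimal z(x) may serve as root; take the first one.
    return next(c for c in C if z[c] == min_z)


# Unique minimum: s1 has a single association.
unique = {"s1": {"s2"}, "s2": {"s1", "s3", "s4"},
          "s3": {"s2", "s4"}, "s4": {"s2", "s3"}}
root_unique = choose_root(unique)

# Tie between a, c and e (one association each), resolved through z(x).
tied = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"},
        "d": {"b", "e"}, "e": {"d"}}
root_tied = choose_root(tied)
```

In the tied example, `a` and `c` are both attached to `b` (three associations) while `e` is attached to `d` (two associations), so the assumed tie-break selects `e`.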
S223, determining all leaf nodes $D = \{D_1, D_2, \ldots, D_r, \ldots, D_s\}$ according to the connection-associated root node, with $D_r = \{D_{r1}, D_{r2}, \ldots, D_{ry}, \ldots, D_{rq(r)}\}$, where $D_{ry}$ is the $y$-th leaf node in the $r$-th layer, $r = 1, 2, \ldots, s$, $y = 1, 2, \ldots, q(r)$, and $q(r)$ is the number of leaf nodes in the $r$-th layer. It can be understood that $D_{ry}$ is any server ID in $A$, other than the server IDs in the list corresponding to $D_{r-1}$ and the server ID corresponding to the connection-associated root node, whose number of associated server IDs is not greater than a preset first server ID number threshold.
Further, in step S223, $D_{1y}$ is determined by the following steps:
S2231, acquiring the second intermediate server ID list $U = \{U_1, U_2, \ldots, U_g, \ldots, U_v\}$ corresponding to the connection-associated root node, where $U_g$ is the $g$-th second intermediate server ID corresponding to the connection-associated root node, $g = 1, 2, \ldots, v$, and $v$ is the number of second intermediate server IDs corresponding to the connection-associated root node.
Further, a second intermediate server ID is an associated server ID that corresponds to the connection-associated root node in $B$.
S2232, acquiring the number of associated server IDs corresponding to each $U_g$, and taking each $U_g$ whose number of associated server IDs is not greater than the preset first server ID number threshold as $D_{1y}$.
Preferably, the first server ID number threshold may be determined by a person skilled in the art according to the level of the leaf node, which is not described in detail here.
Because the leaf nodes are determined from the number of associated server IDs, reasonable leaf nodes can be selected accurately, which makes the estimated probability of an overall fault more reasonable and in turn supports effective monitoring of the faults of the server cluster.
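By way of non-limiting illustration, the layer-by-layer determination of leaf nodes in step S223 can be sketched as follows. The first layer follows steps S2231 and S2232 directly; for deeper layers the sketch assumes, for illustration only, that candidates are the not-yet-placed servers associated with the previous layer. The function name `build_layers` and the per-layer threshold callable are hypothetical.

```python
def build_layers(assoc, root, threshold_for_layer):
    """Return the leaf layers [D_1, D_2, ...] below `root`.

    assoc: {server_id: set_of_associated_ids}
    threshold_for_layer: callable r -> first server ID number threshold for
    layer r (assumption: the threshold may vary with the layer).
    """
    placed = {root}
    layers = []
    frontier = [root]
    r = 1
    while frontier:
        limit = threshold_for_layer(r)
        layer = []
        for parent in frontier:
            # Sorted only to make the layer order deterministic.
            for sid in sorted(assoc[parent]):
                # Keep a candidate only if its association count does not
                # exceed the threshold and it is not already in the tree.
                if sid not in placed and len(assoc[sid]) <= limit:
                    placed.add(sid)
                    layer.append(sid)
        if not layer:
            break
        layers.append(layer)
        frontier = layer
        r += 1
    return layers


assoc = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"},
         "d": {"b", "e"}, "e": {"d"}}

layers_loose = build_layers(assoc, "e", lambda r: 3)
layers_strict = build_layers(assoc, "e", lambda r: 2)
```

With a threshold of 3 every server is placed in the tree, whereas a threshold of 2 excludes `b` (three associations) and the tree stops after the first layer.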
S3, acquiring the current attribute information of the server cluster.
Specifically, the current attribute information of the server cluster includes the current attribute information of each server, where the current attribute information of each server includes: current hardware state information of the server, current network state information of the server, and current software state information of the server.
S4, generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
In a specific embodiment, generating the fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster further includes the following steps, as shown in FIG. 3:
S41, determining the current fault label vector corresponding to each server according to the current attribute information corresponding to each server in the server cluster.
S42, generating the fault state diagram of the server cluster according to the current fault label vector of each server and the connection association diagram of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
Specifically, step S41 further includes the following steps:
S411, acquiring the current attribute information $A^0 = \{A^0_1, A^0_2, \ldots, A^0_i, \ldots, A^0_m\}$ corresponding to $A$, with $A^0_i = (A^0_{i1}, A^0_{i2}, A^0_{i3})$, where $A^0_{i1}$ is the current hardware state information of $A_i$, $A^0_{i2}$ is the current network state information of $A_i$, and $A^0_{i3}$ is the current software state information of $A_i$.
S412, acquiring, according to $A^0$, the current fault label vector set $B^0 = \{B^0_1, B^0_2, \ldots, B^0_i, \ldots, B^0_m\}$ corresponding to $A^0$, with $B^0_i = (B^0_{i1}, B^0_{i2}, B^0_{i3})$, where $B^0_{i1}$ is the fault probability value corresponding to $A^0_{i1}$, $B^0_{i2}$ is the fault probability value corresponding to $A^0_{i2}$, and $B^0_{i3}$ is the fault probability value corresponding to $A^0_{i3}$. It can be understood that $A^0_{i1}$, $A^0_{i2}$ and $A^0_{i3}$ are each input into a corresponding trained neural network model to obtain their respective fault probability values. Methods for obtaining a fault probability value with a neural network model are known to those skilled in the art and are not described here.
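By way of non-limiting illustration, steps S411 and S412 can be sketched as follows. The description relies on trained neural network models that are not specified; this sketch substitutes a simple logistic function as a stand-in for each model and assumes that each kind of state information has already been encoded as a list of numeric features. The names `logistic_model` and `fault_label_vector`, the weights, and the feature values are all hypothetical.

```python
import math


def logistic_model(weights, bias):
    """Stand-in for a trained model: maps a feature list to a probability."""
    def predict(features):
        score = bias + sum(w * x for w, x in zip(weights, features))
        return 1.0 / (1.0 + math.exp(-score))
    return predict


def fault_label_vector(attributes, models):
    """Map A0_i = (hardware, network, software) to B0_i = (p_hw, p_net, p_sw)."""
    hardware, network, software = attributes
    hw_model, net_model, sw_model = models
    return (hw_model(hardware), net_model(network), sw_model(software))


# One stand-in model per kind of state information (hypothetical parameters).
models = (
    logistic_model([2.0], -1.0),  # hardware state
    logistic_model([1.0], 0.0),   # network state
    logistic_model([4.0], -2.0),  # software state
)

# A0: current attribute information per server ID (hypothetical encoding).
A0 = {
    "s1": ([0.5], [0.0], [0.5]),
    "s2": ([3.0], [-2.0], [0.0]),
}

# B0: current fault label vector per server ID.
B0 = {sid: fault_label_vector(attrs, models) for sid, attrs in A0.items()}
```

Each entry of `B0` is a triple of values between 0 and 1, one per kind of state information, matching the shape of $B^0_i$.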
Specifically, step S42 further includes the following steps:
S421, acquiring the connection association diagram of the server cluster;
S422, recording each $B^0_i$ on the node corresponding to its server ID in the connection association diagram of the server cluster to generate the fault state diagram of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
The fault state diagram is constructed by combining the fault probabilities with the connection association diagram, so that a reasonable probability of an overall fault can be determined and the faults of the server cluster can be monitored effectively.
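By way of non-limiting illustration, steps S421 and S422 can be sketched as follows. The sketch assumes that the connection association diagram is given as a root ID plus a list of leaf layers, and that the fault label vectors are keyed by server ID. The helper `flag_faulty` and its alert limit are hypothetical additions used only to show how the resulting diagram might be queried; the description does not prescribe them.

```python
def build_fault_state_graph(root, layers, fault_vectors):
    """Record each fault label vector on the node of its server ID."""
    graph = {root: {"layer": 0, "fault": fault_vectors[root]}}
    for r, layer in enumerate(layers, start=1):
        for sid in layer:
            graph[sid] = {"layer": r, "fault": fault_vectors[sid]}
    return graph


def flag_faulty(graph, limit):
    """Return the server IDs whose highest fault probability reaches `limit`."""
    return sorted(sid for sid, node in graph.items()
                  if max(node["fault"]) >= limit)


# Connection association diagram (hypothetical): root "e" with three layers.
root = "e"
layers = [["d"], ["b"], ["a", "c"]]

# Current fault label vectors (hardware, network, software), hypothetical.
fault_vectors = {
    "e": (0.10, 0.10, 0.10),
    "d": (0.20, 0.10, 0.30),
    "b": (0.90, 0.20, 0.10),
    "a": (0.05, 0.05, 0.05),
    "c": (0.30, 0.40, 0.20),
}

fault_state_graph = build_fault_state_graph(root, layers, fault_vectors)
suspects = flag_faulty(fault_state_graph, 0.8)
```

Keeping the layer index next to each fault vector lets a reader of the diagram see both where a server sits in the tree and how likely it is to be faulty.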
The fault monitoring method for a server cluster in this embodiment includes: acquiring relation information of a server cluster; processing the relation information of the server cluster to generate a connection association diagram of the server cluster; acquiring current attribute information of the server cluster; and generating a fault state diagram of the server cluster according to the connection association diagram and the current attribute information, so that faults of the server cluster are monitored according to the fault state diagram. The fault state of each server in the server cluster can thus be queried intuitively, and the probability of an overall fault can be determined from the fault states and the connection relations between the servers, thereby realizing fault monitoring of the server cluster.
Embodiment 2
As shown in FIG. 4, the second embodiment provides a fault monitoring device for a server cluster, the device comprising:
The first acquisition module 1, configured to acquire relation information of a server cluster, wherein the server cluster comprises a plurality of servers; for example, the servers are database all-in-one machines.
Specifically, the relation information of the server cluster includes the relation information of the plurality of servers, where the relation information of each server refers to the communication connection relation between that server and the other servers.
The first generating module 2, configured to process the relation information of the server cluster and generate a connection association diagram of the server cluster.
Specifically, as shown in FIG. 5, the first generating module 2 further includes:
a first determining module 21, configured to determine a connection association server ID set of the server cluster according to the relation information of the server cluster;
a first graph generating module 22, configured to generate the connection association diagram of the server cluster according to the connection association server ID set of the server cluster.
In a specific embodiment, the first determining module 21 includes:
a server ID list acquisition module 211, configured to acquire the server ID list $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the server cluster, where $A_i$ is the $i$-th server ID, $i = 1, 2, \ldots, m$, and $m$ is the number of server IDs corresponding to the server cluster.
Specifically, the server ID is the unique identity of the server.
a connection association server ID set acquisition module 212, configured to acquire, according to the relation information of the server cluster corresponding to $A$, the connection association server ID set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, with $B_i = \{B_{i1}, B_{i2}, \ldots, B_{ij}, \ldots, B_{in(i)}\}$, where $B_{ij}$ is the $j$-th connection association server ID corresponding to $A_i$, $j = 1, 2, \ldots, n(i)$, and $n(i)$ is the number of connection association server IDs corresponding to $A_i$; the connection association server ID set of the server cluster is $B$.
Further, a connection association server ID corresponding to $A_i$ is the unique identity of a server that has relation information with the server corresponding to $A_i$.
Specifically, the connection association diagram of the server cluster is a tree-structured association diagram comprising a connection-associated root node and $s$ layers of connection-associated leaf nodes, where the number of leaf nodes is not uniform across layers, and the first graph generating module 22 includes:
a connection association server ID number list acquisition module 221, configured to acquire the list of connection association server ID numbers $n = \{n(1), n(2), \ldots, n(i), \ldots, n(m)\}$ corresponding to $A$.
a root node determining module 222, configured to determine the connection-associated root node according to $n$.
In a specific embodiment, when exactly one $n(i)$ is the minimum number of associated server IDs in $n$, the connection-associated root node is $A_i$.
In another specific embodiment, the root node determining module 222 includes:
a first intermediate server ID set acquisition module 2221, configured to acquire, according to $n$, the first intermediate server ID set $C = \{C_1, C_2, \ldots, C_x, \ldots, C_p\}$, where $C_x$ is the $x$-th first intermediate server ID, $x = 1, 2, \ldots, p$, and $p$ is the number of first intermediate server IDs.
Further, a first intermediate server ID is a server ID corresponding to the minimum value in $n$.
a first execution module 2222, configured to acquire from $B$ the list of associated server ID numbers $z = \{z(1), z(2), \ldots, z(x), \ldots, z(p)\}$ corresponding to $C$, where $z(x)$ is the number of associated server IDs corresponding to $C_x$.
a second execution module 2223, configured to determine $C_x$ as the connection-associated root node when any $z(x)$ is the minimum number of associated server IDs in $z$.
a leaf node determining module 223, configured to determine all leaf nodes $D = \{D_1, D_2, \ldots, D_r, \ldots, D_s\}$ according to the connection-associated root node, with $D_r = \{D_{r1}, D_{r2}, \ldots, D_{ry}, \ldots, D_{rq(r)}\}$, where $D_{ry}$ is the $y$-th leaf node in the $r$-th layer, $r = 1, 2, \ldots, s$, $y = 1, 2, \ldots, q(r)$, and $q(r)$ is the number of leaf nodes in the $r$-th layer. It can be understood that $D_{ry}$ is any server ID in $A$, other than the server IDs in the list corresponding to $D_{r-1}$ and the server ID corresponding to the connection-associated root node, whose number of associated server IDs is not greater than a preset first server ID number threshold.
Further, the leaf node determining module 223 includes:
a third execution module 2231, configured to acquire the second intermediate server ID list $U = \{U_1, U_2, \ldots, U_g, \ldots, U_v\}$ corresponding to the connection-associated root node, where $U_g$ is the $g$-th second intermediate server ID corresponding to the connection-associated root node, $g = 1, 2, \ldots, v$, and $v$ is the number of second intermediate server IDs corresponding to the connection-associated root node.
Further, a second intermediate server ID is an associated server ID that corresponds to the connection-associated root node in $B$.
a fourth execution module 2232, configured to acquire the number of associated server IDs corresponding to each $U_g$, and to take each $U_g$ whose number of associated server IDs is not greater than the preset first server ID number threshold as $D_{1y}$.
Preferably, the first server ID number threshold may be determined by a person skilled in the art according to the level of the leaf node, which is not described in detail here.
The second acquisition module 3, configured to acquire the current attribute information of the server cluster.
Specifically, the current attribute information of the server cluster includes the current attribute information of each server, where the current attribute information of each server includes: current hardware state information of the server, current network state information of the server, and current software state information of the server.
The second generating module 4, configured to generate a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
In a specific embodiment, as shown in FIG. 6, the second generating module 4 further includes:
a second determining module 41, configured to determine the current fault label vector corresponding to each server according to the current attribute information corresponding to each server in the server cluster.
a second graph generating module 42, configured to generate the fault state diagram of the server cluster according to the current fault label vector of each server and the connection association diagram of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
Specifically, the second determining module 41 includes:
a current attribute information acquisition module 411, configured to acquire the current attribute information $A^0 = \{A^0_1, A^0_2, \ldots, A^0_i, \ldots, A^0_m\}$ corresponding to $A$, with $A^0_i = (A^0_{i1}, A^0_{i2}, A^0_{i3})$, where $A^0_{i1}$ is the current hardware state information of $A_i$, $A^0_{i2}$ is the current network state information of $A_i$, and $A^0_{i3}$ is the current software state information of $A_i$.
a current fault label vector set acquisition module 412, configured to acquire, according to $A^0$, the current fault label vector set $B^0 = \{B^0_1, B^0_2, \ldots, B^0_i, \ldots, B^0_m\}$ corresponding to $A^0$, with $B^0_i = (B^0_{i1}, B^0_{i2}, B^0_{i3})$, where $B^0_{i1}$ is the fault probability value corresponding to $A^0_{i1}$, $B^0_{i2}$ is the fault probability value corresponding to $A^0_{i2}$, and $B^0_{i3}$ is the fault probability value corresponding to $A^0_{i3}$. It can be understood that $A^0_{i1}$, $A^0_{i2}$ and $A^0_{i3}$ are each input into a corresponding trained neural network model to obtain their respective fault probability values. Methods for obtaining a fault probability value with a neural network model are known to those skilled in the art and are not described here.
Specifically, the second graph generating module 42 includes:
a fifth execution module 421, configured to acquire the connection association diagram of the server cluster;
a sixth execution module 422, configured to record each $B^0_i$ on the node corresponding to its server ID in the connection association diagram of the server cluster to generate the fault state diagram of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
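By way of non-limiting illustration, the cooperation of the four top-level modules of the device can be sketched as follows. Each module is modelled as a plain callable, which is an assumption made for brevity; the class name `FaultMonitoringDevice`, its field names, and the stub implementations are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FaultMonitoringDevice:
    """Wires the four modules together in the order S1 to S4."""
    first_acquisition: Callable[[], Any]           # module 1: relation information
    first_generating: Callable[[Any], Any]         # module 2: association diagram
    second_acquisition: Callable[[], Any]          # module 3: current attributes
    second_generating: Callable[[Any, Any], Any]   # module 4: fault state diagram

    def run(self):
        relation_info = self.first_acquisition()
        association_diagram = self.first_generating(relation_info)
        attributes = self.second_acquisition()
        return self.second_generating(association_diagram, attributes)


# Stub modules standing in for real acquisition and generation logic.
device = FaultMonitoringDevice(
    first_acquisition=lambda: [("s1", "s2")],
    first_generating=lambda links: {"root": "s1", "children": ["s2"]},
    second_acquisition=lambda: {"s1": "ok", "s2": "degraded"},
    second_generating=lambda diagram, attrs: {
        sid: attrs[sid] for sid in [diagram["root"], *diagram["children"]]
    },
)

state = device.run()
```

Passing the modules in as callables keeps the division of functions flexible, so that any one module can be replaced without touching the others.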
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring relation information of a server cluster;
processing the relation information of the server cluster to generate a connection association diagram of the server cluster;
acquiring current attribute information of a server cluster;
and generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring relation information of a server cluster;
processing the relation information of the server cluster to generate a connection association diagram of the server cluster;
acquiring current attribute information of a server cluster;
and generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is given as an example. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The present invention is not limited to the above-mentioned embodiments. Any person skilled in the art may make changes or modifications to obtain equivalent embodiments without departing from the scope of the present invention, and all simple modifications, equivalent changes, and modifications made according to the technical substance of the present invention fall within the scope of the technical solution of the present invention.

Claims (14)

1. A method for monitoring a failure of a server cluster, the method comprising:
acquiring relation information of a server cluster;
processing the relation information of the server cluster to generate a connection association diagram of the server cluster;
acquiring current attribute information of a server cluster;
and generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
2. The method for fault monitoring of a server cluster according to claim 1, wherein the server cluster comprises a number of servers.
3. The failure monitoring method of a server cluster according to claim 2, wherein the relationship information of the server cluster includes relationship information of a plurality of servers, wherein the relationship information of each server refers to a communication connection relationship between any one server and other servers except itself.
4. The method for monitoring a failure of a server cluster according to claim 3, wherein processing the relation information of the server cluster to generate a connection association diagram of the server cluster further comprises the steps of:
determining a connection association server ID set of the server cluster according to the relation information of the server cluster;
and generating a connection association diagram of the server cluster according to the connection association server ID set of the server cluster.
5. The method for fault monitoring of a server cluster as claimed in claim 2, wherein,
the current attribute information of the server cluster includes current attribute information of each server, wherein the current attribute information of each server includes: current hardware state information of the server, current network state information of the server, and current software state information of the server.
6. The method for monitoring a failure of a server cluster according to claim 5, wherein generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster, further comprises the steps of:
determining a current fault label vector corresponding to each server according to the current attribute information corresponding to each server in the server cluster;
generating a fault state diagram of the server cluster according to the current fault label vector of each server and the connection association diagram of the server cluster, so that faults of the server cluster are monitored according to the fault state diagram of the server cluster.
7. A fault monitoring device for a server cluster, the device comprising:
the first acquisition module is used for acquiring the relation information of the server cluster;
the first generation module is used for processing the relation information of the server cluster and generating a connection association diagram of the server cluster;
the second acquisition module is used for acquiring the current attribute information of the server cluster;
and the second generation module is used for generating a fault state diagram of the server cluster according to the connection association diagram of the server cluster and the current attribute information of the server cluster, so that the fault of the server cluster is monitored according to the fault state diagram of the server cluster.
8. The fault monitoring device of a server cluster according to claim 7, wherein the server cluster comprises a number of servers.
9. The fault monitoring device of a server cluster according to claim 8, wherein the relationship information of the server cluster includes relationship information of a plurality of servers, wherein the relationship information of each server refers to a communication connection relationship between any one server and other servers except itself.
10. The fault monitoring device of a server cluster according to claim 9, wherein the first generating module comprises:
the first determining module is used for determining a connection association server ID set of the server cluster according to the relation information of the server cluster;
and the first graph generation module is used for generating a connection association graph of the server cluster according to the connection association server ID set of the server cluster.
11. The fault monitoring device of a server cluster according to claim 8, wherein the current attribute information of the server cluster includes current attribute information of each server, wherein the current attribute information of each server includes: current hardware state information of the server, current network state information of the server, and current software state information of the server.
12. The fault monitoring device of a server cluster according to claim 11, wherein the second generating module comprises:
the second determining module is used for determining a current fault label vector corresponding to each server according to the current attribute information corresponding to each server in the server cluster;
and the second graph generating module is used for generating a fault state graph of the server cluster according to the current fault label vector of each server and the connection association graph of the server cluster, so that faults of the server cluster are monitored according to the fault state graph of the server cluster.
13. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements a fault monitoring method of a server cluster according to any of claims 1-6 when the computer program is executed.
14. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the method for fault monitoring of a server cluster according to any one of claims 1 to 6.
CN202311654834.XA 2023-12-05 2023-12-05 Fault monitoring method, device and equipment of server cluster and storage medium Active CN117349128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311654834.XA CN117349128B (en) 2023-12-05 2023-12-05 Fault monitoring method, device and equipment of server cluster and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311654834.XA CN117349128B (en) 2023-12-05 2023-12-05 Fault monitoring method, device and equipment of server cluster and storage medium

Publications (2)

Publication Number Publication Date
CN117349128A CN117349128A (en) 2024-01-05
CN117349128B CN117349128B (en) 2024-03-22

Family

ID=89365340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311654834.XA Active CN117349128B (en) 2023-12-05 2023-12-05 Fault monitoring method, device and equipment of server cluster and storage medium

Country Status (1)

Country Link
CN (1) CN117349128B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100185894A1 (en) * 2009-01-20 2010-07-22 International Business Machines Corporation Software application cluster layout pattern
CN111752759A (en) * 2020-06-30 2020-10-09 重庆紫光华山智安科技有限公司 Kafka cluster fault recovery method, device, equipment and medium
CN111984498A (en) * 2020-07-24 2020-11-24 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Server cluster monitoring and management system
CN112202617A (en) * 2020-10-09 2021-01-08 腾讯科技(深圳)有限公司 Resource management system monitoring method and device, computer equipment and storage medium
CN113609139A (en) * 2021-09-30 2021-11-05 苏州浪潮智能科技有限公司 Monitoring data management method and device, electronic equipment and storage medium
CN114064438A (en) * 2021-11-24 2022-02-18 建信金融科技有限责任公司 Database fault processing method and device
CN115643163A (en) * 2022-11-03 2023-01-24 平安科技(深圳)有限公司 Fault equipment positioning method, device, equipment and storage medium
CN115643158A (en) * 2022-10-25 2023-01-24 平安银行股份有限公司 Equipment cluster repairing method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Li: "Time synchronization method for cloud computing server clusters", Command Information System and Technology, no. 04, 28 August 2018 (2018-08-28) *

Also Published As

Publication number Publication date
CN117349128B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
Borghesi et al. Anomaly detection using autoencoders in high performance computing systems
US11269718B1 (en) Root cause detection and corrective action diagnosis system
CN108920272B (en) Data processing method, device, computer equipment and storage medium
CN108509433A (en) The method, apparatus and electronic equipment of formation sequence number based on distributed system
CN110807064A (en) Data recovery device in RAC distributed database cluster system
CN112737800A (en) Service node fault positioning method, call chain generation method and server
CN113065912A (en) Method, apparatus, device and medium for monitoring orders with unsynchronized order states
CN113704018A (en) Application operation and maintenance data processing method and device, computer equipment and storage medium
CN117349128B (en) Fault monitoring method, device and equipment of server cluster and storage medium
CN110363381B (en) Information processing method and device
CN108924772B (en) Short message sending method and device, computer equipment and storage medium
CN111966461B (en) Virtual machine cluster node guarding method, device, equipment and storage medium
CN112465048A (en) Deep learning model training method, device, equipment and storage medium
CN114205214B (en) Power communication network fault identification method, device, equipment and storage medium
US10254333B2 (en) Method of generating quality affecting factor for semiconductor manufacturing process and generating system for the same
CN115544008A (en) Transaction order database and table processing method and system, electronic device and storage medium
CN110716101A (en) Power line fault positioning method and device, computer and storage medium
CN106789711B (en) Flow distribution method and device
CN112231142A (en) System backup recovery method and device, computer equipment and storage medium
CN112668730A (en) Self-service equipment module replacement method and device, computer equipment and storage medium
CN114281578B (en) Interaction method, device, computer equipment and medium of distributed file storage system
CN113010120B (en) Method for realizing distributed storage of voice data in round robin mode
Bagora et al. Data Labeling for Fault Detection in Cloud: A Test Suite-Based Active Learning Approach
CN115473793B (en) Automatic recovery method, device, terminal and medium for cluster EI host environment
CN114490157A (en) Fault detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant