CN115509783A - Link failure processing method, system, electronic device and storage medium - Google Patents

Publication number
CN115509783A
Authority
CN
China
Prior art keywords: fault, port, cascade, cabinet, expansion cabinet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211165734.6A
Other languages
Chinese (zh)
Inventor
郑强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202211165734.6A
Publication of CN115509783A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0793 Remedial or corrective actions

Abstract

The application provides a link fault processing method, a system, an electronic device and a storage medium. The method comprises: establishing a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers; acquiring topology information of each controller, judging whether a fault exists in the storage cluster based on the topology information, and reporting fault information; and automatically repairing the fault based on the fault information and a preset fault baseline library. The topology can be identified and faults detected automatically, and a fault can be repaired automatically from the fault baseline library after it occurs, so that the customer is less aware of the fault and the reliability of the system is improved. In addition, the disclosed method can automatically identify the number of single-port expansion cabinets cascaded from a multi-port expansion cabinet, has strong expansion capability, and can adapt to storage main cabinets fitted with different numbers of controllers and to the corresponding types of multi-port expansion cabinets, thereby enhancing competitiveness.

Description

Link failure processing method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of controller storage technologies, and in particular, to a method and a system for processing a link failure, an electronic device, and a storage medium.
Background
With the advent of the big data era, data volumes keep expanding, so storage demand keeps growing, the bandwidth required for interaction keeps rising, the tolerated command execution latency keeps falling, and the demand for computing power keeps increasing. Multi-controller storage main cabinets have therefore emerged to process large amounts of data and relieve the storage pressure.
However, the original single-port expansion cabinet used to mount additional storage space cannot adapt to the upgraded multi-controller storage main cabinet, so the original link fault reporting method designed for the single-port expansion cabinet can no longer be used normally.
Therefore, a link failure processing method suitable for a multi-port expansion cabinet is needed to solve the above technical problems in the prior art.
Disclosure of Invention
In order to solve the defects of the prior art, a primary object of the present invention is to provide a link failure processing method, a system, an electronic device and a storage medium, so as to solve the above technical problems of the prior art.
In order to achieve the above object, a first aspect of the present invention provides a link failure processing method, including:
establishing a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers;
acquiring topology information of each controller, judging whether a fault exists in the storage cluster or not based on the topology information, and reporting fault information;
and automatically repairing the fault based on the fault information and a preset fault baseline library.
In some embodiments, said establishing a storage cluster comprises:
each controller is connected by cable from its downlink port to an uplink port of the multi-port expansion cabinet;
the multi-port expansion cabinet is cascaded with N-stage single-port expansion cabinets, and N is greater than or equal to 0.
In some embodiments, said obtaining topology information for each said controller comprises:
determining, based on the cascade port address of the multi-port expansion cabinet and a preset tail port address, the stage number, the cascade port address and the upper-level address of each cascade expansion cabinet cascaded from the multi-port expansion cabinet connected to the current controller, so as to generate the topology information of the current controller;
the multi-port expansion cabinet and the cascade expansion cabinets are cascaded through cascade ports, and the cascade ports corresponding to the cascade expansion cabinets forming one cascade are marked as a cascade band.
In some embodiments, the determining the stage number, the cascade port address and the upper-level address of each cascade expansion cabinet cascaded from the multi-port expansion cabinet connected to the current controller comprises:
traversing all single-port expansion cabinets which can be identified by a current controller, and acquiring cascade port addresses and upper-level addresses of the single-port expansion cabinets;
if the upper-level address of the identifiable single-port expansion cabinet is not equal to the port address of the multi-port expansion cabinet, determining that the multi-port expansion cabinet is not cascaded and the value of N is 0;
if the upper-level address of the identifiable single-port expansion cabinet is equal to the port address of the multi-port expansion cabinet, adding 1 to the value N, determining that the single-port expansion cabinet is a cascade expansion cabinet and the number of stages is N, and updating the cascade port address of the single-port expansion cabinet which is currently determined as the cascade expansion cabinet to be the tail port address;
if the upper address of the identifiable single-port expansion cabinet is equal to the address of the tail port, adding 1 to the value of N and determining that the single-port expansion cabinet is a cascade expansion cabinet with the number of stages being N, and updating the cascade port address of the single-port expansion cabinet which is currently determined as the cascade expansion cabinet into a tail port address, and repeating the step until the tail port address is equal to a preset tail port address.
In some embodiments, the determining whether a failure exists in the storage cluster based on the topology information and reporting failure information includes:
when the storage main cabinet and the controller are both valid, traversing each cascade band, and if the number of cascade expansion cabinets contained in a cascade band is greater than the maximum cascade number, judging that a fault exists in the storage cluster and reporting fault information;
if any two cascade expansion cabinets in one cascade band have the same stage number and the same upper-level address, judging that a fault exists in the storage cluster and reporting fault information;
if the cascade bands corresponding to one controller differ in the SAS port number of the corresponding multi-port expansion cabinet and/or in the number of cascade expansion cabinets they contain, judging that a fault exists in the storage cluster and reporting fault information;
if the cascade bands corresponding to the upper control and the lower control of the same cascade expansion cabinet have different cascade band indexes, judging that a fault exists in the storage cluster and reporting fault information;
and if a plurality of storage main cabinets identify the same multi-port expansion cabinet, judging that a fault exists in the storage cluster and reporting fault information.
In some embodiments, the method further comprises cable fault detection:
detecting the working state of the cable, if the cable cannot transmit stable data, judging that the cable is abnormal and reporting fault information;
detecting the online state of the controller, and if the controller state is neither the online state nor the degraded state, judging that the cable is abnormal and reporting fault information;
and when the controller state is an online state or a degraded state, if the controller downlink port state is an offline state and/or the uplink port of the multi-port expansion cabinet is an offline state, judging that the cable is missing and reporting fault information.
In some embodiments, automatically repairing the fault based on the fault information and a preset fault baseline library includes:
the fault baseline library prestores a mapping table of faults and fault repair methods;
based on the fault contained in the fault information, querying the corresponding fault repair method in the mapping table to realize automatic fault repair;
if the fault information contains a fault to be repaired that has no corresponding fault repair method in the mapping table, generating a fault alarm to prompt a worker to carry out manual fault repair;
and recording the fault to be repaired and the corresponding manual fault repair method in the mapping table, so that the fault can subsequently be repaired automatically by using the mapping table.
In a second aspect, the present application provides a link failure handling system, the system comprising:
the preparation module is used for establishing a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers;
the identification module is used for acquiring the topology information of each controller;
the detection module is used for judging whether a fault exists in the storage cluster based on the topology information and reporting fault information;
and the repairing module is used for automatically repairing the fault based on the fault information and a preset fault baseline library.
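The way the four modules hand results to one another can be illustrated with a minimal sketch. It is not the applicant's implementation: the function name process_link_faults and the four callables are hypothetical stand-ins for the preparation, identification, detection and repairing modules, and the data passed between them is left abstract.

```python
from typing import Any, Callable, List


def process_link_faults(
    prepare: Callable[[], Any],            # preparation module: builds the cluster
    identify: Callable[[Any], Any],        # identification module: cluster -> topology
    detect: Callable[[Any], List[Any]],    # detection module: topology -> fault list
    repair: Callable[[Any], Any],          # repairing module: fault -> repair result
) -> List[Any]:
    """Chain the four modules in the order the system describes them."""
    cluster = prepare()
    topology = identify(cluster)
    faults = detect(topology)
    # Each reported fault is handed to the repairing module in turn.
    return [repair(fault) for fault in faults]
```

Passing the modules in as callables keeps the sketch independent of any particular hardware interface; a real system would replace each callable with platform-specific logic.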
In a third aspect, the present application provides an electronic device, comprising:
one or more processors;
and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
establishing a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers;
acquiring topology information of each controller, judging whether a fault exists in the storage cluster or not based on the topology information, and reporting fault information;
and automatically repairing the fault based on the fault information and a preset fault baseline library.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, the computer program causing a computer to perform the following operations:
establishing a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers;
acquiring topology information of each controller, judging whether a fault exists in the storage cluster based on the topology information, and reporting fault information;
and automatically repairing the fault based on the fault information and a preset fault baseline library.
The beneficial effects achieved by the present application are as follows:
the application provides a link fault processing method, which comprises the following steps: establishing a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers; acquiring topology information of each controller, judging whether a fault exists in the storage cluster based on the topology information, and reporting fault information; and automatically repairing the fault based on the fault information and a preset fault baseline library. The topology can be identified and faults detected automatically, and a fault can be repaired automatically from the fault baseline library after it occurs, so that the customer is less aware of the fault and the reliability of the system is improved. In addition, the disclosed method can automatically identify the number of single-port expansion cabinets cascaded from a multi-port expansion cabinet, has strong expansion capability, and can adapt to storage main cabinets fitted with different numbers of controllers and to the corresponding types of multi-port expansion cabinets, thereby enhancing competitiveness.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort, wherein:
FIG. 1 is a physical topology diagram provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a link failure processing method provided by an embodiment of the present application;
FIG. 3 is a diagram of a link failure handling system architecture provided by an embodiment of the present application;
FIG. 4 is a structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be understood that throughout the description and claims of this application, unless the context clearly requires otherwise, the words "comprise", "comprising", and the like, are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
It will be further understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
It should be noted that the labels "S1", "S2", etc. are used only for convenience in describing the method of the present application; they do not denote an order of the steps and are not intended to limit the present application. In addition, the technical solutions of the various embodiments may be combined with each other, provided that a person skilled in the art can implement the combination; when the technical solutions are contradictory or the combination cannot be implemented, the combination should be considered not to exist and is not within the protection scope of the present application.
Example one
The embodiment of the application provides a method for repairing a link fault in a storage cluster in the scenario where multi-controller storage is cascaded with multi-port expansion cabinets. Specifically, the process of handling a fault with the method disclosed in this embodiment comprises the following steps:
S1, establishing a storage cluster.
Specifically, as shown in the physical topology diagram of FIG. 1, the storage main cabinet includes a plurality of controllers, wherein each of the downlink ports of the controllers is connected to the uplink port of the upper control and the downlink port of the lower control of the multi-port expansion cabinet (JBOD) through a Serial Attached SCSI (SAS) cable. In addition, the multi-port expansion cabinet is cascaded with single-port expansion cabinets so as to further expand the storage space of the storage main cabinet; the multi-port expansion cabinet can be cascaded with N stages of expansion cabinets, where N is greater than or equal to 0; when N is equal to 0, only one multi-port expansion cabinet is mounted on the storage main cabinet.
S2, identifying the expansion cabinets connected to each controller in the storage main cabinet, forming cascade and/or parallel topology information, and reporting the topology information of each controller to the storage cluster so that the storage cluster can subsequently check the validity of the link topology.
Specifically, the topology information acquisition process includes the following steps:
After the storage cluster is established, the storage cluster notifies each controller in the storage main cabinet to perform topology identification, which includes acquiring the hardware information and alarm states of all the expansion cabinets, wherein the hardware information at least comprises the port numbers, port addresses, cascade port addresses and upper-level addresses of all the expansion cabinets. The storage cluster initializes the ports of each expansion cabinet to the port count supported by the current platform according to the platform configuration, and acquires the port address for each port number. The multi-port expansion cabinet comprises as many ports as there are controllers, plus a cascade port. At this point, if an alarm exists in the storage cluster, an expansion cabinet has failed, and the alarm can be cleared by replacing the expansion cabinet that generated it.
The current controller traverses all identifiable single-port expansion cabinets and acquires the cascade port address and the upper-level address of each. Using the cascade port address of the multi-port expansion cabinet obtained in the previous step, it judges whether any of these single-port expansion cabinets is cascaded with the multi-port expansion cabinet. If the upper-level address of the single-port expansion cabinet is not consistent with the cascade port address of the multi-port expansion cabinet, the multi-port expansion cabinet does not cascade any single-port expansion cabinet downwards, and N = 0. If the upper-level address of a single-port expansion cabinet is consistent with the cascade port address of the multi-port expansion cabinet, that single-port expansion cabinet is cascaded with the multi-port expansion cabinet and is a cascade expansion cabinet; N is increased by 1, so the stage number N of the single-port expansion cabinet is 1 and it is the first-stage cascade expansion cabinet. The cascade port address of the first-stage cascade expansion cabinet is then taken as the new tail port address.
Next, the cascade port address of the first-stage cascade expansion cabinet is used to judge whether a single-port expansion cabinet is cascaded with the first-stage cascade expansion cabinet. Similarly, if the upper-level address of the single-port expansion cabinet is not consistent with the cascade port address of the first-stage cascade expansion cabinet, the first-stage cascade expansion cabinet does not cascade any single-port expansion cabinet downwards, and the value of N is final. If the upper-level address of a single-port expansion cabinet is consistent with the cascade port address of the first-stage cascade expansion cabinet, that single-port expansion cabinet is cascaded with the first-stage cascade expansion cabinet and is a cascade expansion cabinet; N + 1 = 2, that is, the stage number N of the single-port expansion cabinet is 2 and it is the second-stage cascade expansion cabinet, and the cascade port address of the second-stage cascade expansion cabinet is taken as the new tail port address. These steps are repeated to look for further cascade expansion cabinets until the updated tail port address is equal to the preset tail port address or no further cascade expansion cabinet exists downwards.
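The stage-by-stage identification described above can be sketched as a simple chain walk. This is a minimal illustration under stated assumptions, not the applicant's code: the class SinglePortCabinet, the function walk_cascade and the string addresses are hypothetical, and the visited set is an added safeguard against malformed (looping) address data that the text does not mention.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SinglePortCabinet:
    name: str
    cascade_port_addr: str  # address of this cabinet's own cascade port
    upper_addr: str         # address of the port this cabinet is attached to


def walk_cascade(
    multi_port_cascade_addr: str,
    cabinets: List[SinglePortCabinet],
    preset_tail_addr: str,
) -> List[Tuple[int, SinglePortCabinet]]:
    """Return (stage number, cabinet) pairs for the chain below the multi-port cabinet."""
    chain: List[Tuple[int, SinglePortCabinet]] = []
    tail_addr = multi_port_cascade_addr  # start from the multi-port cascade port
    n = 0
    visited = set()  # assumption: guards against cyclic address data
    while True:
        # Find a cabinet whose upper-level address matches the current tail port.
        nxt = next(
            (c for c in cabinets
             if c.upper_addr == tail_addr and c.name not in visited),
            None,
        )
        if nxt is None:
            break  # nothing cascaded further down; N is final
        n += 1
        visited.add(nxt.name)
        chain.append((n, nxt))
        tail_addr = nxt.cascade_port_addr  # this cabinet's cascade port is the new tail
        if tail_addr == preset_tail_addr:
            break  # reached the preset tail port address
    return chain
```

The value of N is the length of the returned list; an empty list corresponds to N = 0, where only the multi-port expansion cabinet is mounted.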
For the current controller, the stage number, cascade port address and upper-level address of each identified cascade expansion cabinet cascaded from the connected multi-port expansion cabinet are recorded, together with the port number, port address, cascade port address and upper-level address of the multi-port expansion cabinet itself; the recorded information forms part of the topology information. It can be understood that the multi-port expansion cabinet and the cascade expansion cabinets are cascaded through cascade ports, and the cascade ports corresponding to the cascade expansion cabinets forming one cascade are marked as a cascade band. Generally, an expansion cabinet comprises an upper control and a lower control; when cascading, upper controls are cascaded with upper controls and lower controls with lower controls, so one expansion cabinet can correspond to two cascade bands.
S3, after the storage cluster finishes acquiring the topology information, automatically detecting the link topology of all the expansion cabinets in the storage cluster (that is, judging whether a link has a fault), reporting fault information when the link topology is illegal (that is, a link has a fault), wherein the fault information at least comprises a fault type, and repairing the fault according to the preset fault baseline library.
Specifically, it is first judged whether the storage main cabinet and the controller are valid. If the storage main cabinet and/or the controller is invalid, a first fault exists in the storage cluster and fault information is reported; link topology detection is performed only when both are valid. Whether storage main cabinets and controllers with different serial numbers are valid can be judged by whether the storage cluster can identify them: those that can be identified are valid. Each cascade band is then traversed. If the number of cascade expansion cabinets contained in a cascade band is greater than the maximum cascade number, it is judged that a second fault (namely, an overload fault of the cascade expansion cabinets) exists in the storage cluster and fault information is reported. If any two cascade expansion cabinets in one cascade band have the same stage number and the same upper-level address, which shows that two cascade expansion cabinets are connected to the same cascade expansion cabinet on the same cascade band, it is judged that a third fault exists in the storage cluster and fault information is reported.
Two cascade bands on the same controller usually start from the same SAS port number and have the same number of cascade stages. Therefore, if the cascade bands corresponding to one controller differ in the SAS port number of the corresponding multi-port expansion cabinet and/or in the number of cascade expansion cabinets they contain, it is judged that a fourth fault exists in the storage cluster and fault information is reported, wherein the SAS port number represents the ordinal number of each port among the uplink ports of the expansion cabinet and is contained in the hardware information of the expansion cabinet. If the cascade bands corresponding to the upper control and the lower control of the same cascade expansion cabinet have different cascade band indexes, it is judged that a fifth fault exists in the storage cluster and fault information is reported, wherein the cascade band index identifies a cascade band, and the cascade band indexes under the same controller are normally consistent. If a plurality of storage main cabinets identify the same multi-port expansion cabinet, it is judged that a sixth fault exists in the storage cluster and fault information is reported, because although one storage main cabinet can be connected with a plurality of multi-port expansion cabinets, one multi-port expansion cabinet can usually be connected with only one storage main cabinet. The first fault, the second fault, the third fault and so on are preset alarm levels; this embodiment gives only one example setting, and the specific alarm levels can be modified for different scenarios.
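The second to sixth checks can be expressed as a short validation routine. The sketch below is illustrative only: CascadedCabinet, CascadeBand, check_topology and the seen_by mapping are hypothetical names, the first fault (validity of the main cabinet and controller) is assumed to have been ruled out beforehand, and the fifth check assumes that a cabinet's upper control and lower control each appear in one band record, so the same cabinet name appearing under two different band indexes signals the fault.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class CascadedCabinet:
    name: str
    stage: int
    upper_addr: str


@dataclass
class CascadeBand:
    index: int        # cascade band index
    controller: str   # controller this band belongs to
    sas_port: int     # SAS port number on the multi-port expansion cabinet
    cabinets: List[CascadedCabinet]


def check_topology(
    bands: List[CascadeBand],
    seen_by: Dict[str, Set[str]],  # main cabinet -> multi-port cabinets it identifies
    max_cascade: int,
) -> List[str]:
    """Return the sorted names of the faults found (second to sixth)."""
    faults = set()
    by_controller = defaultdict(list)
    band_indexes = defaultdict(set)
    for band in bands:
        # Second fault: more cascaded cabinets than the allowed maximum.
        if len(band.cabinets) > max_cascade:
            faults.add("second")
        # Third fault: two cabinets with the same stage and upper-level address.
        keys = [(c.stage, c.upper_addr) for c in band.cabinets]
        if len(keys) != len(set(keys)):
            faults.add("third")
        by_controller[band.controller].append(band)
        for c in band.cabinets:
            band_indexes[c.name].add(band.index)
    # Fourth fault: bands of one controller differ in SAS port or cabinet count.
    for group in by_controller.values():
        if len({(b.sas_port, len(b.cabinets)) for b in group}) > 1:
            faults.add("fourth")
    # Fifth fault: one cabinet appears under different cascade band indexes.
    if any(len(ix) > 1 for ix in band_indexes.values()):
        faults.add("fifth")
    # Sixth fault: one multi-port cabinet identified by several main cabinets.
    owners = defaultdict(set)
    for main, jbods in seen_by.items():
        for jbod in jbods:
            owners[jbod].add(main)
    if any(len(m) > 1 for m in owners.values()):
        faults.add("sixth")
    return sorted(faults)
```

Returning fault names rather than raising on the first problem lets all detected faults be reported together, which matches the reporting step that follows each check.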
According to the fault type in the reported fault information, automatic fault repair is realized based on the fault repair methods prestored in the fault baseline library. In the fault baseline library, the repair method for each fault is collected in advance through big data and stored in the form of a script. Weights can be assigned according to the alarm levels so that repair is prioritized accordingly, and the number of uses can be recorded so as to update the weights. If a fault cannot be repaired by any existing fault repair method in the fault baseline library, an alarm prompt can be generated to prompt workers to repair it manually, and the manual fault repair method can be synchronized into the fault baseline library to continuously improve it. Preferably, the faults and their fault repair methods can be stored in the fault baseline library in the form of a mapping table to facilitate lookup.
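The lookup, alarm and write-back behaviour of the fault baseline library can be sketched as a small mapping table. This is an assumption-laden illustration: the class FaultBaselineLibrary and its methods are hypothetical, repair scripts are modelled as Python callables returning True on success, and the weighting by alarm level is reduced to a simple usage counter.

```python
from typing import Callable, Dict


class FaultBaselineLibrary:
    """Mapping table from fault type to repair method (hypothetical sketch)."""

    def __init__(self) -> None:
        self._repairs: Dict[str, Callable[[], bool]] = {}
        self.use_count: Dict[str, int] = {}  # number of times each repair was used

    def register(self, fault_type: str, repair: Callable[[], bool]) -> None:
        # Also used to synchronize a manual repair method back into the library.
        self._repairs[fault_type] = repair
        self.use_count.setdefault(fault_type, 0)

    def handle(self, fault_type: str) -> str:
        repair = self._repairs.get(fault_type)
        if repair is None:
            return "alarm"  # no known repair: prompt a worker to repair manually
        self.use_count[fault_type] += 1
        return "repaired" if repair() else "alarm"
```

A fault that first produces "alarm" produces "repaired" on later occurrences once the manual method has been registered, which is the continuous improvement of the library described above.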
S4, detecting the connection state of the SAS cable, and if a cable is missing, reporting an alarm and realizing automatic fault repair according to the fault baseline library.
Specifically, SAS cable anomaly detection is performed in the EN (enable) fault polling stage. If a cable cannot transmit stable data, the cable is judged to be abnormal and fault information is reported. The online state of the controller can also be detected: if the controller state is neither the online state nor the degraded state, that is, the controller does not exist or its communication is abnormal, the cable is judged to be abnormal and fault information is reported. When the controller state is the online state or the degraded state, if the downlink port of the controller is in the offline state and/or the uplink port of the multi-port expansion cabinet is in the offline state, the cable is judged to be missing and fault information is reported. Similarly, automatic fault repair can be realized based on the fault repair methods prestored in the fault baseline library, and manual repair is performed when automatic repair is not possible.
The method disclosed in this embodiment can identify the topology and detect faults automatically, and can repair a fault automatically from the fault baseline library after it occurs, so that the customer is less aware of the fault and the reliability of the system is improved. In addition, the disclosed method can automatically identify the number of single-port expansion cabinets cascaded from a multi-port expansion cabinet, has strong expansion capability, and can adapt to storage main cabinets fitted with different numbers of controllers and to the corresponding types of multi-port expansion cabinets, thereby enhancing competitiveness.
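The three cable checks can be written as one decision function. The sketch is a hedged illustration: the State enum, the function detect_cable_fault and its parameters are hypothetical, and the order in which the checks are evaluated is an assumption, since the text lists the conditions without fixing a sequence.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    ONLINE = "online"
    DEGRADED = "degraded"
    OFFLINE = "offline"


def detect_cable_fault(
    stable_data: bool,   # whether the cable can transmit stable data
    controller: State,   # controller state
    downlink: State,     # state of the controller downlink port
    uplink: State,       # state of the multi-port cabinet uplink port
) -> Optional[str]:
    """Return a fault description, or None when no cable fault is detected."""
    if not stable_data:
        return "cable abnormal"
    if controller not in (State.ONLINE, State.DEGRADED):
        # Controller absent or communication abnormal.
        return "cable abnormal"
    if downlink is State.OFFLINE or uplink is State.OFFLINE:
        return "cable missing"
    return None
```

For example, an online controller whose downlink port is offline yields "cable missing", which would then be looked up in the fault baseline library like any other fault type.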
Example two
Corresponding to the first embodiment, the present application further provides a link failure processing method, as shown in fig. 2, specifically as follows:
2100. establishing a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers;
preferably, the establishing a storage cluster includes:
2110. each controller is connected by cable from its downlink port to an uplink port of the multi-port expansion cabinet;
2120. the multi-port expansion cabinet is cascaded with N-stage single-port expansion cabinets, and N is greater than or equal to 0.
2200. Acquiring topology information of each controller, judging whether a fault exists in the storage cluster based on the topology information, and reporting fault information;
preferably, the acquiring topology information of each controller includes:
2210. determining, based on the cascade port address of the multi-port expansion cabinet and a preset tail port address, the number of stages, the cascade port address, and the upper-level address of each cascade expansion cabinet cascaded under the multi-port expansion cabinet connected to the current controller, so as to generate the topology information of the current controller;
2220. the multi-port expansion cabinet and the cascade expansion cabinets are cascaded through cascade ports, and the cascade ports of the cascade expansion cabinets forming one cascade are recorded as a cascade band.
Preferably, the determining the number of stages, the cascade port address, and the upper-level address of each cascade expansion cabinet cascaded under the multi-port expansion cabinet connected to the current controller includes:
2211. traversing all single-port expansion cabinets that the current controller can identify, and acquiring the cascade port address and the upper-level address of each single-port expansion cabinet;
2212. if no identifiable single-port expansion cabinet has an upper-level address equal to the port address of the multi-port expansion cabinet, determining that the multi-port expansion cabinet has no cascade and that N is 0;
2213. if an identifiable single-port expansion cabinet has an upper-level address equal to the port address of the multi-port expansion cabinet, adding 1 to N, determining that this single-port expansion cabinet is a cascade expansion cabinet of stage N, and recording the cascade port address of this cabinet as the tail port address;
2214. if an identifiable single-port expansion cabinet has an upper-level address equal to the tail port address, adding 1 to N, determining that this single-port expansion cabinet is a cascade expansion cabinet of stage N, recording the cascade port address of this cabinet as the new tail port address, and repeating this step until the tail port address equals the preset tail port address.
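Steps 2211 to 2214 amount to following a chain of addresses from the multi-port expansion cabinet down to the last cascaded cabinet. The following Python sketch illustrates one way to read them. The `Cabinet` record, the function name `walk_cascade`, and the integer addresses are hypothetical, and the sketch assumes the preset tail port address acts as an end-of-chain marker; the application does not prescribe any of these details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cabinet:
    """Hypothetical view of a single-port expansion cabinet seen by one controller."""
    name: str
    cascade_port_addr: int  # address of this cabinet's own cascade port
    upper_addr: int         # address of the port this cabinet is attached to


def walk_cascade(multi_port_addr, preset_tail_addr, cabinets):
    """Return the cascade expansion cabinets in stage order (stage 1 first).

    An empty list corresponds to step 2212 (no cascade, N = 0).
    """
    chain = []
    visited = set()
    tail = multi_port_addr  # the first lookup uses the multi-port cabinet's port address
    while tail != preset_tail_addr:
        # Steps 2213/2214: find a cabinet whose upper-level address equals the tail.
        match = next(
            (c for c in cabinets if c.upper_addr == tail and c.name not in visited),
            None,
        )
        if match is None:
            break  # chain ends before the preset tail address is reached
        visited.add(match.name)  # guards against a mis-cabled loop
        chain.append(match)      # its stage number is len(chain)
        tail = match.cascade_port_addr
    return chain
```

The number of stages N is then `len(walk_cascade(...))`, and each cabinet's stage is its 1-based position in the returned list. Taking the first match on each pass keeps the walk simple; cabinets that share a stage and an upper-level address are a fault condition in their own right and are not handled here.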
Preferably, the determining whether a fault exists in the storage cluster based on the topology information and reporting fault information includes:
2230. traversing each cascade band, provided that the storage main cabinet and the controller are valid; if the number of cascade expansion cabinets contained in a cascade band is greater than the maximum cascade number, judging that a fault exists in the storage cluster and reporting fault information;
2240. if any cascade expansion cabinets in one cascade band have the same stage number and the same upper-level address, judging that a fault exists in the storage cluster and reporting fault information;
2250. if, among the cascade bands corresponding to one controller, the SAS port numbers of the corresponding multi-port expansion cabinets differ and/or the numbers of cascade expansion cabinets contained differ, judging that a fault exists in the storage cluster and reporting fault information;
2260. if the cascade band indexes of the cascade bands corresponding to the upper controller and the lower controller of the same cascade expansion cabinet differ, judging that a fault exists in the storage cluster and reporting fault information;
2270. if a plurality of storage main cabinets identify the same multi-port expansion cabinet, judging that a fault exists in the storage cluster and reporting fault information.
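Three of the checks above (2230, 2240, and 2270) can be expressed directly over a list of cascade bands. The Python sketch below is illustrative only: the `CascadeCabinet` and `CascadeBand` records, the fault-code strings, and the function `check_bands` are hypothetical names, and checks 2250 and 2260 are left out because they depend on how bands are paired between controllers, which the text does not spell out.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CascadeCabinet:
    cabinet_id: str
    stage: int
    upper_addr: int


@dataclass
class CascadeBand:
    main_cabinet_id: str        # storage main cabinet that owns the controller
    multi_port_cabinet_id: str  # multi-port expansion cabinet at the head of the band
    cabinets: list = field(default_factory=list)


def check_bands(bands, max_cascade):
    """Return a list of (subject, fault_code) tuples; an empty list means no fault found."""
    faults = []
    for index, band in enumerate(bands):
        # Check 2230: too many cascade expansion cabinets in one band.
        if len(band.cabinets) > max_cascade:
            faults.append((index, "cascade_limit_exceeded"))
        # Check 2240: two cabinets share both stage number and upper-level address.
        pairs = Counter((c.stage, c.upper_addr) for c in band.cabinets)
        if any(count > 1 for count in pairs.values()):
            faults.append((index, "duplicate_stage_and_upper_address"))
    # Check 2270: one multi-port cabinet identified by several storage main cabinets.
    owners = {}
    for band in bands:
        owners.setdefault(band.multi_port_cabinet_id, set()).add(band.main_cabinet_id)
    for cabinet_id, mains in sorted(owners.items()):
        if len(mains) > 1:
            faults.append((cabinet_id, "multi_port_cabinet_shared"))
    return faults
```

Returning every fault found, rather than stopping at the first, matches the idea of reporting fault information for later lookup in the fault baseline library.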
2300. And automatically repairing the fault based on the fault information and a preset fault baseline library.
Preferably, the automatically repairing the fault based on the fault information and a preset fault baseline library includes:
2310. the fault baseline library pre-stores a mapping table of faults and fault repair methods;
2320. querying the mapping table for the fault repair method corresponding to the fault contained in the fault information, so as to repair the fault automatically;
2330. if the fault information contains a fault to be repaired for which the mapping table has no corresponding repair method, generating a fault alarm to prompt a worker to repair the fault manually;
2340. recording the fault to be repaired and the corresponding manual repair method in the mapping table, so that the fault can subsequently be repaired automatically using the mapping table.
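Steps 2310 to 2340 describe a lookup table that grows as workers resolve new faults. A minimal Python sketch of that behavior follows. The class name `FaultBaselineLibrary`, its methods, and the return strings are hypothetical, and repair methods are modeled as plain callables purely for illustration.

```python
class FaultBaselineLibrary:
    """Hypothetical mapping table from fault codes to repair callables."""

    def __init__(self, mapping=None):
        self._repairs = dict(mapping or {})  # step 2310: pre-stored mapping table
        self.pending = []                    # faults waiting for manual repair

    def handle(self, fault_code):
        """Repair automatically if a method is known, otherwise raise an alarm."""
        repair = self._repairs.get(fault_code)  # step 2320: query the table
        if repair is None:
            # Step 2330: no known repair method, so prompt a worker.
            if fault_code not in self.pending:
                self.pending.append(fault_code)
            return "alarm"
        repair()
        return "auto_repaired"

    def record_manual_repair(self, fault_code, repair):
        """Step 2340: store the manual fix so the same fault is automatic next time."""
        self._repairs[fault_code] = repair
        if fault_code in self.pending:
            self.pending.remove(fault_code)
```

With this shape, a fault that triggers an alarm the first time is repaired automatically on later occurrences once `record_manual_repair` has been called for it.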
2400. Preferably, the method further comprises cable fault detection:
2410. detecting the working state of the cable, and if the cable cannot transmit data stably, judging that the cable is abnormal and reporting fault information;
2420. detecting the online state of the controller, and if the controller is in neither the online state nor the degraded state, judging that the cable is abnormal and reporting fault information;
2430. when the controller is in the online state or the degraded state, if the downlink port of the controller is offline and/or the uplink port of the multi-port expansion cabinet is offline, judging that the cable is missing and reporting fault information.
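The three cable checks in steps 2410 to 2430 form a short decision sequence. The Python sketch below shows that sequence. The `State` enumeration, the function name `classify_cable`, and the result strings are hypothetical, and how the inputs are actually measured (for example, what counts as stable data) is outside the sketch.

```python
from enum import Enum


class State(Enum):
    ONLINE = "online"
    DEGRADED = "degraded"
    OFFLINE = "offline"


def classify_cable(stable_data, controller_state, downlink_state, uplink_state):
    """Return "cable_abnormal", "cable_missing", or None when no cable fault is found."""
    # Step 2410: the cable cannot carry data stably.
    if not stable_data:
        return "cable_abnormal"
    # Step 2420: controller absent or communication abnormal.
    if controller_state not in (State.ONLINE, State.DEGRADED):
        return "cable_abnormal"
    # Step 2430: controller usable, but either end of the link is offline.
    if downlink_state is State.OFFLINE or uplink_state is State.OFFLINE:
        return "cable_missing"
    return None
```

The returned string can serve as the fault code passed on for repair lookup in the fault baseline library.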
Example three
As shown in fig. 3, corresponding to the first embodiment and the second embodiment, an embodiment of the present application provides a link failure processing system, where the system includes:
a preparation module 310, configured to establish a storage cluster, where the storage cluster includes a plurality of storage main cabinets, and each storage main cabinet includes a plurality of controllers;
an identification module 320, configured to obtain topology information of each controller;
a detection module 330, configured to determine whether a fault exists in the storage cluster based on the topology information and report fault information;
and a repair module 340, configured to automatically repair the fault based on the fault information and a preset fault baseline library.
In some embodiments, the preparation module 310 is configured to connect the plurality of controllers, through their downlink ports and by cable, to the uplink port of a multi-port expansion cabinet; the preparation module 310 is further configured to cascade N stages of single-port expansion cabinets under the multi-port expansion cabinet, where N is greater than or equal to 0.
In some embodiments, the identification module 320 is further configured to determine, based on the cascade port address of the multi-port expansion cabinet and a preset tail port address, the number of stages, the cascade port address, and the upper-level address of each cascade expansion cabinet cascaded under the multi-port expansion cabinet connected to the current controller, so as to generate topology information of the current controller; the multi-port expansion cabinet and the cascade expansion cabinets are cascaded through cascade ports, and the cascade ports of the cascade expansion cabinets forming one cascade are recorded as a cascade band.
In some embodiments, the identification module 320 is further configured to traverse all single-port expansion cabinets that the current controller can identify, and obtain the cascade port address and the upper-level address of each single-port expansion cabinet; if no identifiable single-port expansion cabinet has an upper-level address equal to the port address of the multi-port expansion cabinet, the identification module 320 determines that the multi-port expansion cabinet has no cascade and that N is 0; the identification module 320 is further configured to, when an identifiable single-port expansion cabinet has an upper-level address equal to the port address of the multi-port expansion cabinet, add 1 to N, determine that this single-port expansion cabinet is a cascade expansion cabinet of stage N, and record the cascade port address of this cabinet as the tail port address; the identification module 320 is further configured to, when an identifiable single-port expansion cabinet has an upper-level address equal to the tail port address, add 1 to N, determine that this single-port expansion cabinet is a cascade expansion cabinet of stage N, record the cascade port address of this cabinet as the new tail port address, and repeat this step until the tail port address equals the preset tail port address.
In some embodiments, the detection module 330 is further configured to traverse each cascade band, provided that the storage main cabinet and the controller are valid, and if the number of cascade expansion cabinets contained in a cascade band is greater than the maximum cascade number, determine that a fault exists in the storage cluster and report fault information; the detection module 330 is further configured to determine that a fault exists in the storage cluster and report fault information when any cascade expansion cabinets in one cascade band have the same stage number and the same upper-level address; the detection module 330 is further configured to, if among the cascade bands corresponding to one controller the SAS port numbers of the corresponding multi-port expansion cabinets differ and/or the numbers of cascade expansion cabinets contained differ, determine that a fault exists in the storage cluster and report fault information; the detection module 330 is further configured to determine that a fault exists in the storage cluster and report fault information if the cascade band indexes of the cascade bands corresponding to the upper controller and the lower controller of the same cascade expansion cabinet differ; the detection module 330 is further configured to determine that a fault exists in the storage cluster and report fault information if a plurality of storage main cabinets identify the same multi-port expansion cabinet.
In some embodiments, the detection module 330 is further configured to detect the working state of the cable, and if the cable cannot transmit data stably, determine that the cable is abnormal and report fault information; the detection module 330 is further configured to detect the online state of the controller, and if the controller is in neither the online state nor the degraded state, determine that the cable is abnormal and report fault information; and when the controller is in the online state or the degraded state, if the downlink port of the controller is offline and/or the uplink port of the multi-port expansion cabinet is offline, determine that the cable is missing and report fault information.
In some embodiments, the repair module 340 includes a fault baseline library, which pre-stores a mapping table of faults and fault repair methods; the repair module 340 queries the mapping table for the fault repair method corresponding to the fault contained in the fault information, so as to repair the fault automatically; if the fault information contains a fault to be repaired for which the mapping table has no corresponding repair method, the repair module 340 generates a fault alarm to prompt a worker to repair the fault manually; the repair module 340 records the fault to be repaired and the corresponding manual repair method in the mapping table, so that the fault can subsequently be repaired automatically using the mapping table.
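The way the identification, detection, and repair modules hand results to one another can be sketched as a small pipeline. In the Python sketch below the function `process_link_faults` and its callable parameters are hypothetical stand-ins for modules 320, 330, and 340; the preparation module 310 is assumed to have already built the cluster, and the application does not specify any such interface.

```python
def process_link_faults(controllers, identify, detect, repair):
    """Run identification, detection, and repair for each controller in turn.

    identify(controller) -> topology information   (stands in for module 320)
    detect(topology)     -> iterable of fault codes (stands in for module 330)
    repair(fault_code)   -> outcome of the repair   (stands in for module 340)
    """
    report = {}
    for controller in controllers:
        topology = identify(controller)
        faults = detect(topology)
        # Pair each reported fault with the outcome of its repair attempt.
        report[controller] = [(fault, repair(fault)) for fault in faults]
    return report
```

Passing the modules in as callables keeps each one replaceable, which fits the statement that some or all of the modules may be selected according to actual needs.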
Example four
Corresponding to all the above embodiments, an embodiment of the present application provides an electronic device, including:
one or more processors; and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of:
step A, establishing a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers;
step B, acquiring topology information of each controller, judging whether a fault exists in the storage cluster based on the topology information, and reporting fault information;
and step C, automatically repairing the fault based on the fault information and a preset fault baseline library.
Fig. 4 schematically shows an architecture of an electronic device, which may specifically include a processor 410, a video display adapter 411, a disk drive 412, an input/output interface 413, a network interface 414, and a memory 420. The processor 410, the video display adapter 411, the disk drive 412, the input/output interface 413, the network interface 414, and the memory 420 may be communicatively connected by a bus 430.
The processor 410 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solution provided by the present Application.
The Memory 420 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 420 may store an operating system 421 for controlling execution of the electronic device 400, a Basic Input Output System (BIOS) 422 for controlling low-level operation of the electronic device 400. In addition, a web browser 423, a data storage management system 424, an icon font processing system 424, and the like may also be stored. The icon font processing system 424 may be an application program that implements the operations of the foregoing steps in this embodiment of the application. In summary, when the technical solution provided in the present application is implemented by software or firmware, the relevant program code is stored in the memory 420 and called to be executed by the processor 410.
The input/output interface 413 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component within the device (not shown) or may be external to the device to provide corresponding functionality. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 414 is used to connect a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, bluetooth and the like).
Bus 430 includes a path that transfers information between the various components of the device, such as processor 410, video display adapter 411, disk drive 412, input/output interface 413, network interface 414, and memory 420.
In addition, the electronic device 400 may also obtain information of specific pickup conditions from a virtual resource object pickup condition information database for performing condition judgment, and the like.
It should be noted that although the above-mentioned devices only show the processor 410, the video display adapter 411, the disk drive 412, the input/output interface 413, the network interface 414, the memory 420, the bus 430 and so on, in a specific implementation, the device may also include other components necessary for normal execution. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
Example five
Corresponding to all the above embodiments, an embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, where the computer program causes a computer to perform the following operations:
establishing a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers;
acquiring topology information of each controller, judging whether a fault exists in the storage cluster based on the topology information, and reporting fault information;
and automatically repairing the fault based on the fault information and a preset fault baseline library.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The system embodiments described above are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for handling link failure, the method comprising:
establishing a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers;
acquiring topology information of each controller, judging whether a fault exists in the storage cluster based on the topology information, and reporting fault information;
and automatically repairing the fault based on the fault information and a preset fault baseline library.
2. The method of claim 1, wherein the establishing a storage cluster comprises:
connecting the plurality of controllers, through their downlink ports and by cable, to the uplink port of a multi-port expansion cabinet;
cascading N stages of single-port expansion cabinets under the multi-port expansion cabinet, where N is greater than or equal to 0.
3. The method of claim 2, wherein the obtaining topology information of each of the controllers comprises:
determining, based on the cascade port address of the multi-port expansion cabinet and a preset tail port address, the number of stages, the cascade port address, and the upper-level address of each cascade expansion cabinet cascaded under the multi-port expansion cabinet connected to the current controller, so as to generate the topology information of the current controller;
the multi-port expansion cabinet and the cascade expansion cabinets are cascaded through cascade ports, and the cascade ports of the cascade expansion cabinets forming one cascade are recorded as a cascade band.
4. The method of claim 3, wherein the determining the number of stages, the cascade port address, and the upper-level address of each cascade expansion cabinet cascaded under the multi-port expansion cabinet connected to the current controller comprises:
traversing all single-port expansion cabinets that the current controller can identify, and acquiring the cascade port address and the upper-level address of each single-port expansion cabinet;
if no identifiable single-port expansion cabinet has an upper-level address equal to the port address of the multi-port expansion cabinet, determining that the multi-port expansion cabinet has no cascade and that N is 0;
if an identifiable single-port expansion cabinet has an upper-level address equal to the port address of the multi-port expansion cabinet, adding 1 to N, determining that this single-port expansion cabinet is a cascade expansion cabinet of stage N, and recording the cascade port address of this cabinet as the tail port address;
if an identifiable single-port expansion cabinet has an upper-level address equal to the tail port address, adding 1 to N, determining that this single-port expansion cabinet is a cascade expansion cabinet of stage N, recording the cascade port address of this cabinet as the new tail port address, and repeating this step until the tail port address equals the preset tail port address.
5. The method of claim 4, wherein the determining whether a failure exists in the storage cluster based on the topology information and reporting failure information comprises:
traversing each cascade band, provided that the storage main cabinet and the controller are valid; if the number of cascade expansion cabinets contained in a cascade band is greater than the maximum cascade number, judging that a fault exists in the storage cluster and reporting fault information;
if any cascade expansion cabinets in one cascade band have the same stage number and the same upper-level address, judging that a fault exists in the storage cluster and reporting fault information;
if, among the cascade bands corresponding to one controller, the SAS port numbers of the corresponding multi-port expansion cabinets differ and/or the numbers of cascade expansion cabinets contained differ, judging that a fault exists in the storage cluster and reporting fault information;
if the cascade band indexes of the cascade bands corresponding to the upper controller and the lower controller of the same cascade expansion cabinet differ, judging that a fault exists in the storage cluster and reporting fault information;
and if a plurality of storage main cabinets identify the same multi-port expansion cabinet, judging that a fault exists in the storage cluster and reporting fault information.
6. The method of claim 2, further comprising cable fault detection:
detecting the working state of the cable, and if the cable cannot transmit data stably, judging that the cable is abnormal and reporting fault information;
detecting the online state of the controller, and if the controller is in neither the online state nor the degraded state, judging that the cable is abnormal and reporting fault information;
and when the controller is in the online state or the degraded state, if the downlink port of the controller is offline and/or the uplink port of the multi-port expansion cabinet is offline, judging that the cable is missing and reporting fault information.
7. The method of claim 1, wherein automatically repairing the fault based on the fault information and a preset fault baseline library comprises:
the fault baseline library pre-stores a mapping table of faults and fault repair methods;
querying the mapping table for the fault repair method corresponding to the fault contained in the fault information, so as to repair the fault automatically;
if the fault information contains a fault to be repaired for which the mapping table has no corresponding repair method, generating a fault alarm to prompt a worker to repair the fault manually;
and recording the fault to be repaired and the corresponding manual repair method in the mapping table, so that the fault can subsequently be repaired automatically using the mapping table.
8. A link failure handling system, the system comprising:
a preparation module, configured to establish a storage cluster, wherein the storage cluster comprises a plurality of storage main cabinets, and each storage main cabinet comprises a plurality of controllers;
an identification module, configured to acquire topology information of each controller;
a detection module, configured to judge whether a fault exists in the storage cluster based on the topology information and report fault information;
and a repair module, configured to automatically repair the fault based on the fault information and a preset fault baseline library.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
and memory associated with the one or more processors for storing program instructions which, when read and executed by the one or more processors, perform the method of any of claims 1-8.
10. A computer-readable storage medium, characterized in that it stores a computer program which causes a computer to execute the method of any one of claims 1-7.
CN202211165734.6A 2022-09-23 2022-09-23 Link failure processing method, system, electronic device and storage medium Pending CN115509783A (en)

Publications (1)

Publication Number Publication Date
CN115509783A true CN115509783A (en) 2022-12-23

Family

ID=84505174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211165734.6A Pending CN115509783A (en) 2022-09-23 2022-09-23 Link failure processing method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115509783A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117155938A (en) * 2023-10-30 2023-12-01 北京腾达泰源科技有限公司 Cluster node fault reporting method, device, equipment and storage medium
CN117155938B (en) * 2023-10-30 2024-01-12 北京腾达泰源科技有限公司 Cluster node fault reporting method, device, equipment and storage medium
CN117421185A (en) * 2023-12-18 2024-01-19 苏州元脑智能科技有限公司 Cascade topology structure detection method, system, device and medium
CN117421185B (en) * 2023-12-18 2024-03-19 苏州元脑智能科技有限公司 Cascade topology structure detection method, system, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination