CN113162808A - Storage link fault processing method and device, electronic equipment and storage medium

Storage link fault processing method and device, electronic equipment and storage medium

Info

Publication number
CN113162808A
Authority
CN
China
Prior art keywords: storage link, link, information, storage, fault
Prior art date
Legal status
Granted
Application number
CN202110478680.8A
Other languages
Chinese (zh)
Other versions
CN113162808B (en)
Inventor
任岗
吴晓晔
周炜
潘磊
Current Assignee
Industrial and Commercial Bank of China Ltd (ICBC)
Original Assignee
Industrial and Commercial Bank of China Ltd (ICBC)
Priority date
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202110478680.8A
Publication of CN113162808A
Application granted
Publication of CN113162808B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0677: Localisation of faults
    • H04L 41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0659: Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H04L 41/0661: Management of faults, events, alarms or notifications using network fault recovery by reconfiguring faulty entities

Landscapes

  • Engineering & Computer Science
  • Computer Networks & Wireless Communication
  • Signal Processing
  • Debugging And Monitoring

Abstract

The disclosure provides a storage link fault processing method and apparatus, an electronic device, and a storage medium, which can be applied to the field of big data technology. The storage link fault processing method includes the following steps: determining a queue of storage links to be processed; acquiring fault location basic information for each storage link to be processed, wherein the fault location basic information includes disk machine state information, multipath software state information, and network link state information of each storage link to be processed, and the network link state information includes connection state information of the network link and port information corresponding to the network link; determining the fault occurrence position of each storage link to be processed according to the fault location basic information; and executing a remote recovery script one or more times at the fault occurrence position of each storage link to be processed and acquiring the fault recovery result of that storage link.

Description

Storage link fault processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of big data technology, and in particular to a storage link fault processing method and apparatus, an electronic device, a storage medium, and a computer program product.
Background
With the continuous development of information technology, continuous business operation requires that large volumes of valuable data can be reliably accessed and stored. Compared with the earlier approach in which each network management system built and maintained its own storage, centralized FC (Fibre Channel) storage takes a disk array system as the core device of centralized storage. It enables unified management of data, offers excellent scalability and very high access performance, effectively guarantees the security of stored data, and realizes centralized management and application of data.
In the course of implementing the disclosed concept, the inventors found that the related art of centralized FC storage has at least the following problem: at present, the handling and recovery of storage FC link faults mainly depend on manual operation.
Disclosure of Invention
In view of the above, the present disclosure provides a storage link failure processing method and apparatus, an electronic device, a storage medium, and a computer program product.
One aspect of the present disclosure provides a storage link failure processing method, including:
determining a queue of storage links to be processed, wherein the queue includes at least one storage link to be processed;
acquiring fault location basic information for each storage link to be processed, wherein the fault location basic information includes disk machine state information, multipath software state information, and network link state information of each storage link to be processed, and the network link state information includes connection state information of the network link and port information corresponding to the network link;
determining the fault occurrence position of each storage link to be processed according to the fault location basic information;
and executing a remote recovery script one or more times at the fault occurrence position of each storage link to be processed and acquiring the fault recovery result of that storage link, wherein the fault recovery result is either that the normal connection has been recovered or that it has not been recovered.
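Purely as an illustration of how these four steps could fit together, the following sketch orchestrates them in Python; it is not the patented implementation, and every name in it (FaultInfo, collect_fault_info, locate_fault, run_recovery_script, and the retry parameters) is a hypothetical placeholder.

```python
import time
from dataclasses import dataclass

@dataclass
class FaultInfo:
    disk_machine_state: dict  # disk machine state information
    multipath_state: dict     # multipath software state information
    link_state: dict          # connection state and port information

def process_storage_links(pending_queue, collect_fault_info, locate_fault,
                          run_recovery_script, max_attempts=3, interval_s=60):
    """Sketch of the claimed flow: locate each fault, then run a remote
    recovery script one or more times until the link recovers or the
    attempts are exhausted."""
    results = {}
    for link in pending_queue:                      # process links one by one
        info: FaultInfo = collect_fault_info(link)  # fault location basic info
        position = locate_fault(info)               # fault occurrence position
        recovered = False
        for _ in range(max_attempts):               # one or more executions
            recovered = run_recovery_script(link, position)
            if recovered:
                break
            time.sleep(interval_s)                  # preset time interval
        results[link] = "recovered" if recovered else "not recovered"
    return results
```

Any concrete system would substitute its own collectors and recovery hooks for the injected callables.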
According to an embodiment of the present disclosure, the method further includes:
determining self-healing storage links and non-self-healing storage links according to the fault recovery results of the storage links to be processed; and, when a storage link to be processed is a non-self-healing storage link, sending the fault location basic information of the non-self-healing storage link to a centralized monitoring system, so that the non-self-healing storage link is managed through the centralized monitoring system.
According to an embodiment of the present disclosure, the method further includes:
determining the historical error count of a self-healing storage link when the storage link to be processed is a self-healing storage link; and sending the fault location basic information of any target self-healing storage link whose historical error count satisfies a preset-count condition to the centralized monitoring system, so that the target self-healing storage link is managed through the centralized monitoring system.
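Illustratively, both escalation rules can be captured in a few lines; in this sketch the report_to_monitoring callable, the error_history mapping, and the threshold of three are hypothetical stand-ins for the centralized monitoring system and the preset-count condition:

```python
def escalate(link, recovered, fault_info, error_history,
             report_to_monitoring, threshold=3):
    """Non-self-healing links are escalated immediately; self-healing links
    are escalated only once their historical error count hits the threshold."""
    if not recovered:
        # Non-self-healing: hand over to the centralized monitoring system.
        report_to_monitoring(link, fault_info, reason="non-self-healing")
        return
    error_history[link] = error_history.get(link, 0) + 1
    if error_history[link] >= threshold:  # preset-count condition
        report_to_monitoring(link, fault_info, reason="frequent errors")
```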
According to an embodiment of the present disclosure, the method further includes:
determining failed disk machines according to the disk machine state information of each storage link to be processed;
acquiring association information for each failed disk machine, wherein the association information characterizes the number of storage links to be processed that are associated with that failed disk machine;
determining which disk machines require focused attention according to the association information of each failed disk machine;
and sending the disk machine state information of the disk machines requiring focused attention to the centralized monitoring system, so that those disk machines are managed through the centralized monitoring system.
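Since the association information is essentially a per-disk-machine count of affected links, a sketch can be very small; disk_machine_of, is_failed, and the min_links threshold are assumptions rather than anything the disclosure specifies:

```python
from collections import Counter

def disk_machines_requiring_attention(pending_links, disk_machine_of,
                                      is_failed, min_links=2):
    """Count how many pending storage links hang off each failed disk machine
    (the 'association information') and flag the heavily affected ones."""
    counts = Counter(
        disk_machine_of(link)
        for link in pending_links
        if is_failed(disk_machine_of(link))
    )
    return [dm for dm, n in counts.items() if n >= min_links]
```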
According to an embodiment of the present disclosure, determining the queue of storage links to be processed includes: acquiring target error report information and server state information on each storage link, wherein the target error report information includes disk machine storage state errors, network link connection state errors, and multipath software running state errors on each storage link, and the server state information includes first state information representing whether the server on each storage link is in a production state and second state information representing whether the server on each storage link is working normally; and determining the queue of storage links to be processed according to the target error report information and the server state information on each storage link.
According to an embodiment of the present disclosure, determining the queue of storage links to be processed according to the target error report information and the server state information on each storage link includes:
determining a first queue of storage links to be checked according to the target error report information, wherein the first queue includes at least one storage link to be checked;
determining the operation state of each storage link to be checked according to the second state information, wherein the operation state is either normal or abnormal;
filtering the storage links whose operation state is normal out of the first queue to obtain a second queue of storage links to be checked;
determining white-list storage links according to the first state information, wherein a white-list storage link is a storage link whose server is not in a production state;
and filtering the white-list storage links out of the second queue to determine the queue of storage links to be processed (a sketch of this filter appears after this list).
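A minimal sketch of the two-stage filter just described, assuming three hypothetical predicates that answer whether a link has target error reports, whether its server is working normally, and whether its server is in production:

```python
def build_pending_queue(links, has_target_error, is_working_normally,
                        is_in_production):
    """Start from links with error reports, drop those that have already
    returned to normal operation, then drop white-list links whose server
    is not in a production state."""
    first_queue = [l for l in links if has_target_error(l)]
    second_queue = [l for l in first_queue if not is_working_normally(l)]
    whitelist = {l for l in links if not is_in_production(l)}
    return [l for l in second_queue if l not in whitelist]
```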
According to an embodiment of the present disclosure, acquiring the target error report information and the server state information on each storage link includes:
acquiring the system log information of the server on each storage link; and screening the target error report information and the server state information on each storage link out of the system log information.
According to an embodiment of the present disclosure, screening the target error report information and the server state information on each storage link out of the system log information includes:
storing the system log information in a target database; and screening the target error report information and the server state information on each storage link out of the target database.
Another aspect of the present disclosure provides a storage link fault processing apparatus, including a first determining module, a first obtaining module, a positioning module, and a recovery module.
The first determining module is configured to determine a queue of storage links to be processed, wherein the queue includes at least one storage link to be processed.
The first obtaining module is configured to obtain the fault location basic information of each storage link to be processed, wherein the fault location basic information includes disk machine state information, multipath software state information, and network link state information of each storage link to be processed, and the network link state information includes connection state information of the network link and port information corresponding to the network link.
The positioning module is configured to determine the fault occurrence position of each storage link to be processed according to the fault location basic information.
The recovery module is configured to execute a remote recovery script one or more times at the fault occurrence position of each storage link to be processed and to acquire the fault recovery result of that storage link, wherein the fault recovery result is either that the normal connection has been recovered or that it has not been recovered.
According to an embodiment of the present disclosure, the apparatus further includes a second determining module and a first sending module. The second determining module is configured to determine self-healing and non-self-healing storage links according to the fault recovery results of the storage links to be processed. The first sending module is configured to, when a storage link to be processed is a non-self-healing storage link, send the fault location basic information of that link to the centralized monitoring system, so that the non-self-healing storage link is managed through the centralized monitoring system.
According to an embodiment of the present disclosure, the apparatus further includes a third determining module and a second sending module. The third determining module is configured to determine the historical error count of a self-healing storage link when the storage link to be processed is a self-healing storage link. The second sending module is configured to send the fault location basic information of any target self-healing storage link whose historical error count satisfies the preset-count condition to the centralized monitoring system, so that the target self-healing storage link is managed through the centralized monitoring system.
According to an embodiment of the present disclosure, the apparatus further includes a fourth determining module, a second obtaining module, a fifth determining module, and a third sending module. The fourth determining module is configured to determine failed disk machines according to the disk machine state information of each storage link to be processed. The second obtaining module is configured to obtain the association information of each failed disk machine, wherein the association information characterizes the number of storage links to be processed that are associated with that failed disk machine. The fifth determining module is configured to determine which disk machines require focused attention according to the association information of each failed disk machine. The third sending module is configured to send the disk machine state information of the disk machines requiring focused attention to the centralized monitoring system, so that those disk machines are managed through the centralized monitoring system.
According to an embodiment of the present disclosure, the first determining module includes an obtaining unit and a determining unit. The obtaining unit is configured to acquire the target error report information and the server state information on each storage link, wherein the target error report information includes disk machine storage state errors, network link connection state errors, and multipath software running state errors on each storage link, and the server state information includes first state information representing whether the server on each storage link is in a production state and second state information representing whether the server on each storage link is working normally. The determining unit is configured to determine the queue of storage links to be processed according to the target error report information and the server state information on each storage link.
According to an embodiment of the present disclosure, the determining unit includes a first determining subunit, a second determining subunit, a first filtering subunit, a third determining subunit, and a second filtering subunit. The first determining subunit is configured to determine a first queue of storage links to be checked according to the target error report information, wherein the first queue includes at least one storage link to be checked. The second determining subunit is configured to determine the operation state of each storage link to be checked according to the second state information, wherein the operation state is either normal or abnormal. The first filtering subunit is configured to filter the storage links whose operation state is normal out of the first queue to obtain a second queue of storage links to be checked. The third determining subunit is configured to determine white-list storage links according to the first state information, wherein a white-list storage link is a storage link whose server is not in a production state. The second filtering subunit is configured to filter the white-list storage links out of the second queue to determine the queue of storage links to be processed.
According to an embodiment of the present disclosure, the obtaining unit includes an acquiring subunit and a screening subunit. The acquiring subunit is configured to acquire the system log information of the server on each storage link. The screening subunit is configured to screen the target error report information and the server state information on each storage link out of the system log information.
According to an embodiment of the present disclosure, screening the target error report information and the server state information on each storage link out of the system log information includes: storing the system log information in a target database; and screening the target error report information and the server state information on each storage link out of the target database.
Another aspect of the present disclosure provides an electronic device including one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the method described above.
Another aspect of the disclosure provides a computer program product comprising computer-executable instructions which, when executed, implement the method described above.
According to the embodiments of the present disclosure, by determining the faulty storage links to be processed and further acquiring the fault location basic information of each of them, every element on a faulty storage link that may have failed can be checked, so the inspection range is relatively comprehensive; this at least partially solves the problems of false, incorrect, and missed fault alarms that may exist in the related art and improves the accuracy of fault processing. In addition, according to the embodiments of the present disclosure, faults are recovered automatically by executing a remote recovery script one or more times; if a recovery attempt fails, the script can be executed again after each preset time interval to retry the recovery. This addresses the related-art problems of poor timeliness, poor fault tolerance, and high labor cost caused by relying on manual operation without effective closed-loop management, and improves troubleshooting and repair efficiency.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an exemplary system architecture to which the storage link failure handling methods and apparatus of the present disclosure may be applied;
FIG. 2 schematically illustrates a flow chart of a storage link failure handling method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a storage link failure handling method according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of a storage link failure handling method according to another embodiment of the present disclosure;
FIG. 5 schematically shows a block diagram of a storage link failure handling apparatus according to an embodiment of the present disclosure; and
FIG. 6 schematically shows a block diagram of an electronic device for implementing a storage link failure handling method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, such a construction is in general intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, such a construction is in general intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the process of implementing the present disclosure, it was found that the handling and recovery of storage FC link faults are currently completed mainly by manual operation. Specifically, the related art suffers from some or all of the following disadvantages: 1. when an alarm occurs, manual intervention is needed, so timeliness is poor; 2. false and incorrect alarms occur; 3. recovery depends on manual operation, so fault tolerance is poor; 4. alarms require personnel to track them, so no effective closed-loop management is formed for subsequent recovery, and as the platform expands substantially, the number of link alarms grows, so a large amount of manpower must be invested in the inspection and maintenance of storage backup groups and in system duty, occupying considerable labor cost.
Based on the above, the present disclosure provides a storage link fault processing method and apparatus, an electronic device, and a storage medium. The storage link fault processing method includes the following steps: determining a queue of storage links to be processed; acquiring fault location basic information for each storage link to be processed, wherein the fault location basic information includes disk machine state information, multipath software state information, and network link state information of each storage link to be processed, and the network link state information includes connection state information of the network link and port information corresponding to the network link; determining the fault occurrence position of each storage link to be processed according to the fault location basic information; and executing a remote recovery script one or more times at the fault occurrence position of each storage link to be processed and acquiring the fault recovery result of that storage link.
Before the embodiments of the present disclosure are explained in detail, the system structure and the application scenario related to the method provided by the embodiments of the present disclosure are described as follows.
Fig. 1 schematically illustrates an exemplary system architecture 100 to which the storage link failure handling methods and apparatus of the present disclosure may be applied. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
The method provided by the embodiments of the disclosure can be based on a centralized FC storage system architecture, which takes a disk array system as the core device of centralized storage, effectively guarantees the security of stored data, and realizes centralized management and application of data. As shown in FIG. 1, the system architecture 100 according to this embodiment includes a server cluster 101, disk drives 102, storage links 103, a database server 104, and a remote terminal device 105.
Under the centralized FC storage system architecture, a storage link 103 is a link that provides communication between the server cluster 101 and the disk drives 102. The storage links 103 are SAN links, and the servers in the server cluster 101 are connected to the disk drives 102 through a plurality of storage links 103 composed of fabric switches. The disk drives 102 may include a plurality of storage hard disks, whose capacity may be divided into a plurality of partitions according to client requirements; the partitions are mapped to the servers that need them, providing a hard disk storage service for the server cluster 101.
The storage links 103 between the server cluster 101 and the disk drives 102 may be redundant; that is, two different storage links 103 may be provided between each server and the disk drives 102, so that when one link fails, data can still be accessed and stored through the other link, effectively ensuring continuous and stable operation of the business.
During system operation, network faults need to be discovered and handled promptly to avoid the loss of stored data. In order to grasp the state of each storage link 103, the system log of each server may be imported into the database server 104; the remote terminal device 105 accesses the database to obtain the system logs and derives the state information of each storage link 103 from them, so that faults are found in time and are monitored and managed centrally.
It should be noted that the storage link failure handling method provided by the embodiment of the present disclosure may be generally executed by the remote terminal device 105. Accordingly, the storage link failure processing apparatus provided by the embodiment of the present disclosure may be generally disposed in the remote terminal device 105. The storage link failure handling method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the remote terminal device 105 and is capable of communicating with the database server 104. Accordingly, the storage link failure processing apparatus provided in the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the remote terminal device 105 and capable of communicating with the database server 104.
It should be noted that the storage link failure handling method and apparatus of the present disclosure may be used in the field of big data technology and may also be applied in any field other than big data.
Fig. 2 schematically shows a flow chart of a storage link failure handling method according to an embodiment of the present disclosure. As shown in fig. 2, the method includes operations S201 to S204.
In operation S201, a queue of storage links to be processed is determined, wherein the queue includes at least one storage link to be processed. According to the embodiments of the present disclosure, the storage links are FC (Fibre Channel) network links over which the servers and the disk drives communicate, and each storage link may pass through multiple fabric switches. Because there is at least one storage link between each server and the disk drives (to ensure reliable data transmission, at least two redundant links are generally adopted), multiple storage links exist between the server cluster and the disk drive cluster.
To troubleshoot and repair network failures during system operation, the storage links on which a network failure may exist must first be determined, so that they can be examined and repaired afterwards. According to the embodiments of the present disclosure, a storage link to be processed is a storage link on which a network failure may exist, and such links form the queue of storage links to be processed, so that they can be examined and handled one by one in order.
According to the embodiments of the present disclosure, when determining the storage links on which a network failure may exist, the system log information of the server on each storage link may be obtained first, and the system log information may then be analyzed to screen out the error reports related to storage and to the links, for example whether a disk machine reports an error, whether a network link is disconnected, whether a route is reversed, and so on. The storage links to be processed are then determined according to the screened-out error reports, and the queue of storage links to be processed is established.
In operation S202, the fault location basic information of each storage link to be processed is obtained, wherein the fault location basic information includes disk machine state information, multipath software state information, and network link state information of each storage link to be processed, and the network link state information includes connection state information of the network link and port information corresponding to the network link.
In operation S203, the fault occurrence position of each storage link to be processed is determined according to the fault location basic information.
On the storage link between each server and the disk drives, a failure may occur in many places: the disk drives may fail, the multipath software that implements storage between the servers and the disk drives may fail, a network switch link may fail, a server may fail, and so on. In operations S202 and S203, on the basis of the storage links with network faults determined in operation S201, the position where the fault occurred must be located more precisely. The fault location basic information of each storage link to be processed, that is, its disk machine state information, multipath software state information, and network link state information (the latter including the connection state information of the network link and the port information corresponding to the network link), may be obtained from the error reports related to storage and to the links, and the fault position is determined from it. For example, if the disk machine reports a fault, it may be determined that the fault occurred at the disk machine; if a route is reversed, it may be determined that the fault occurred on the communication path of the network link.
Within the fault location basic information, the disk machine state information distinguishes a normal state from an abnormal state and is used to determine whether the disk machine, or the link on which it sits, has a fault; the multipath software state information describes the running state of the multipath software and is used to determine whether the multipath software may be running abnormally; and within the network link state information, the connection state information of the network link is used to determine which kind of network link failure is present (for example, whether the link is disconnected or a route is reversed), while the port information corresponding to the network link is used to locate the link on which the failure occurred.
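As a toy illustration of this mapping from state information to fault occurrence position (reusing the hypothetical FaultInfo fields from the earlier sketch; the field names and the decision order are assumptions):

```python
def locate_fault(info):
    """Map the fault location basic information to a fault position.
    A real system would apply far richer diagnostics."""
    if info.disk_machine_state.get("status") == "abnormal":
        return "disk_machine"
    if info.multipath_state.get("status") == "abnormal":
        return "multipath_software"
    if not info.link_state.get("connected", True):
        # The port information identifies which physical link is affected.
        return "network_link:" + str(info.link_state.get("port", "unknown"))
    return "unknown"
```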
After the operations above have determined the fault occurrence position on the storage link, in operation S204 a remote recovery script is executed one or more times at the fault occurrence position of each storage link to be processed, and the fault recovery result of that storage link is acquired, wherein the fault recovery result is either that the normal connection has been recovered or that it has not been recovered. In this operation, the remote terminal device may execute the remote recovery script to recover the fault automatically; if a recovery attempt fails, the remote recovery script may be executed again after each preset time interval to retry the recovery. One possible way to execute such a script remotely is sketched below.
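The disclosure does not name a transport for executing the recovery script on a remote server; purely as an assumption, the sketch below uses SSH via paramiko, with a hypothetical script path:

```python
import paramiko  # assumed transport; the disclosure does not specify one

def run_recovery_script(host, script_path="/opt/recover_fc_link.sh"):
    """Run a recovery script on the remote server; exit status 0 is taken
    to mean the link recovered."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host)  # assumes key-based authentication is configured
    try:
        _, stdout, _ = client.exec_command("sh " + script_path)
        return stdout.channel.recv_exit_status() == 0
    finally:
        client.close()
```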
According to the embodiments of the present disclosure, by determining the faulty storage links to be processed and further acquiring the fault location basic information of each of them, every element on a faulty storage link that may have failed can be checked, so the inspection range is relatively comprehensive; this at least partially solves the problems of false, incorrect, and missed fault alarms that may exist in the related art and improves the accuracy of fault processing. Moreover, faults are recovered automatically by executing a remote recovery script one or more times, and if a recovery attempt fails, the script can be executed again after each preset time interval to retry the recovery; this addresses the related-art problems of poor timeliness, poor fault tolerance, and high labor cost caused by relying on manual operation without effective closed-loop management, and improves troubleshooting and repair efficiency.
Fig. 3 schematically shows a flow chart of a storage link failure handling method according to another embodiment of the present disclosure. As shown in fig. 3, the method includes operations S301 to S308.
In this embodiment, operations S301-S304 may refer to operations S201-S204 described with reference to FIG. 2. For simplicity of description, the description of operations S301 to S304 is omitted here, and only operations S305 to S308 are explained. Wherein:
In operation S305, self-healing storage links and non-self-healing storage links are determined according to the fault recovery results of the storage links to be processed.
In operation S306, when the storage link to be processed is a non-self-healing storage link, the fault location basic information of the non-self-healing storage link is sent to the centralized monitoring system, so that the non-self-healing storage link is managed through the centralized monitoring system.
According to the embodiments of the present disclosure, among the various faults that may occur on a storage link, some can be recovered automatically, without human intervention, by the remote terminal device executing the remote recovery script; a reversed network route is one example. Such a situation is a self-healing situation, and the link involved is a self-healing storage link. Other faults cannot be recovered automatically by executing the remote recovery script; these include hardware faults such as a disconnected network cable, a damaged disk, failed server hardware, and other situations that cannot recover by themselves. These are non-self-healing situations that require human intervention, and the links involved are non-self-healing storage links.
According to the embodiments of the present disclosure, when repeated recovery attempts on a link fail, the link may be determined to be a non-self-healing storage link; its fault location basic information is then sent to the centralized monitoring system, and network operation and maintenance staff manage the non-self-healing storage link through the centralized monitoring system so that the fault can be recovered with human intervention.
In operation S307, when the storage link to be processed is a self-healing storage link, the historical error count of the self-healing storage link is determined.
In operation S308, the fault location basic information of any target self-healing storage link whose historical error count satisfies the preset-count condition is sent to the centralized monitoring system, so that the target self-healing storage link is managed through the centralized monitoring system.
According to the embodiments of the present disclosure, if a self-healing storage link reports errors many times, it deserves attention; its fault location basic information can therefore be sent to the centralized monitoring system so that network operation and maintenance staff can examine the link comprehensively, with human intervention, and determine why the fault occurs frequently.
According to the embodiments of the present disclosure, sending the fault location basic information of non-self-healing storage links to the centralized monitoring system allows network operation and maintenance staff to repair their faults in time, which improves fault processing efficiency. Sending the fault location basic information of target self-healing storage links whose historical error count satisfies the preset-count condition to the centralized monitoring system allows fault-prone links to be given focused, comprehensive examination, so that the causes of frequent faults can be determined and the fault frequency reduced.
FIG. 4 schematically shows a flow diagram of a storage link failure handling method according to another embodiment of the present disclosure. As shown in fig. 4, the method includes operations S401 to S408.
In this embodiment, operations S401 to S404 may refer to operations S201 to S204 described with reference to FIG. 2. For simplicity of description, the description of operations S401 to S404 is omitted here, and only operations S405 to S408 are explained. Wherein:
In operation S405, failed disk machines are determined according to the disk machine state information of each storage link to be processed. After the disk machine state information of each storage link to be processed has been acquired in operation S402, whether the link on which a disk machine sits has a fault may be determined from whether the disk machine state is normal.
In operation S406, the association information of each failed disk machine is acquired, wherein the association information characterizes the number of storage links to be processed that are associated with that failed disk machine.
In operation S407, the disk machines requiring focused attention are determined according to the association information of each failed disk machine. For example, if several storage links associated with a certain disk machine all have failures, that disk machine needs focused attention: it may have a hardware fault and require human intervention.
In operation S408, the disk machine state information of the disk machines requiring focused attention is sent to the centralized monitoring system, so that network operation and maintenance staff can manage, examine, and recover those disk machines with human intervention.
According to the embodiments of the present disclosure, determining the disk machines requiring focused attention from the association information of each failed disk machine allows fault-prone disk machines and their links to be examined with priority, so that the causes of frequent faults can be determined and the fault frequency reduced.
According to an embodiment of the present disclosure, the specific operation of determining the queue of storage links to be processed shown in FIG. 2 includes: acquiring target error report information and server state information on each storage link, wherein the target error report information includes disk machine storage state errors, network link connection state errors, and multipath software running state errors on each storage link, and the server state information includes first state information representing whether the server on each storage link is in a production state and second state information representing whether the server on each storage link is working normally; and determining the queue of storage links to be processed according to the target error report information and the server state information on each storage link.
According to the embodiments of the present disclosure, before fault repair is performed, the storage links on which a network fault may exist must be determined, so the error information related to the storage links, that is, the target error report information in the operation above, must be acquired first. The target error report information must cover every element of a storage link that may fail, so it includes the error reports of each such element, namely the disk machine storage state errors, the network link connection state errors, and the multipath software running state errors on each storage link.
In addition, the state information of the server on each link must be acquired, because special situations may exist. For example, if a server is not in a production state, then even though the link it sits on has a fault and reports errors, the link is not an object of concern and must be excluded; for this, first state information representing whether the server on each storage link is in a production state is acquired. For another example, a link may have historical error reports but have since returned to normal, in which case it must also be excluded; whether the link is currently working normally can be determined from the second state information on whether the server on each storage link is working normally.
After the target error report information and the server state information on each storage link are obtained, the queue of storage links to be processed can be determined from this information. According to an embodiment of the present disclosure, determining the queue of storage links to be processed according to the target error report information and the server state information on each storage link includes the following operations:
First, a first queue of storage links to be checked is determined according to the target error report information, wherein the first queue includes at least one storage link to be checked; that is, the queue of possibly faulty links is determined preliminarily from the target error report information. For example, if at least one error report exists on a link, the link may preliminarily be determined to be a possibly faulty link.
Then the operation state of each storage link to be checked is determined according to the second state information, wherein the operation state is either normal or abnormal.
Next, the storage links whose operation state is normal are filtered out of the first queue to obtain a second queue of storage links to be checked, which eliminates links that may have historical error reports but have since returned to normal.
Then, white-list storage links are determined according to the first state information, wherein a white-list storage link is a storage link whose server is not in a production state.
Finally, the white-list storage links are filtered out of the second queue to determine the queue of storage links to be processed, which eliminates links that, although faulty and reporting errors, are not objects of concern.
According to the embodiments of the present disclosure, eliminating links that may have historical error reports but have since returned to normal, and eliminating links that are faulty but not objects of concern, avoids incorrect alarms, reduces unnecessary recovery operations, and improves fault processing efficiency.
According to an embodiment of the present disclosure, in the operation above, the target error report information and the server state information on each storage link may be acquired as follows: the system log information of the server on each storage link is obtained first, and the target error report information and the server state information on each storage link are then screened out of the system log information. Specifically, they can be screened out of the system log information by searching for keywords (e.g., "I/O", "FC Port", "error", "warning").
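A sketch of such keyword screening over raw syslog lines; the keywords come from the text above, while the regular-expression framing and case-insensitive matching are assumptions:

```python
import re

# Keywords named in the text above; case-insensitive matching is an assumption.
KEYWORDS = re.compile(r"I/O|FC Port|error|warning", re.IGNORECASE)

def screen_log_lines(log_lines):
    """Keep only the syslog lines that mention storage-link-related keywords."""
    return [line for line in log_lines if KEYWORDS.search(line)]
```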
According to an embodiment of the present disclosure, screening the target error report information and the server state information on each storage link out of the system log information includes: storing the system log information in a target database; and screening the target error report information and the server state information on each storage link out of the target database.
According to the embodiments of the present disclosure, a Syslog tool can be used to import the system log information of every server into one target database server, so that the system log information is processed centrally through the database server, which facilitates quick screening and management of the useful information. The database may be MongoDB. The system logs can be obtained by accessing the target database server from the remote terminal device, and the state information of each storage link can be derived from them, so that faults are found in time and are monitored and managed centrally.
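A minimal sketch of querying such a target database with pymongo; the URI, database, collection, and field names are illustrative assumptions, not taken from the disclosure:

```python
from pymongo import MongoClient  # assumes the logs were imported into MongoDB

def fetch_target_errors(mongo_uri="mongodb://dbserver:27017"):
    """Pull storage-related error records out of the centralized log store."""
    client = MongoClient(mongo_uri)
    syslog = client["syslog_db"]["messages"]  # hypothetical names
    cursor = syslog.find(
        {"message": {"$regex": "I/O|FC Port|error|warning", "$options": "i"}}
    )
    return list(cursor)
```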
According to the embodiments of the present disclosure, all of the fault information acquired during fault processing, such as the fault location basic information of each storage link to be processed, the target error report information and server state information on each storage link, and the fault recovery results acquired after executing the remote recovery script, is stored on the target database server; this information serves as the base data that supports execution of the storage link fault processing method.
FIG. 5 schematically shows a block diagram 500 of a storage link failure handling apparatus according to an embodiment of the present disclosure. The storage link failure handling apparatus 500 may be used to implement the method described with reference to FIG. 2.
As shown in FIG. 5, the storage link failure handling apparatus 500 includes a first determining module 501, a first obtaining module 502, a positioning module 503, and a recovery module 504.
The first determining module 501 is configured to determine a queue of storage links to be processed, wherein the queue includes at least one storage link to be processed. The first obtaining module 502 is configured to obtain the fault location basic information of each storage link to be processed, wherein the fault location basic information includes disk machine state information, multipath software state information, and network link state information of each storage link to be processed, and the network link state information includes connection state information of the network link and port information corresponding to the network link. The positioning module 503 is configured to determine the fault occurrence position of each storage link to be processed according to the fault location basic information. The recovery module 504 is configured to execute a remote recovery script one or more times at the fault occurrence position of each storage link to be processed and to acquire the fault recovery result of that storage link, wherein the fault recovery result is either that the normal connection has been recovered or that it has not been recovered.
According to the embodiments of the present disclosure, because the first determining module 501 determines the faulty storage links to be processed and the first obtaining module 502 further acquires the fault location basic information of each of them, every element on a faulty storage link that may have failed can be checked, so the inspection range is relatively comprehensive; this at least partially solves the problems of false, incorrect, and missed fault alarms that may exist in the related art and improves the accuracy of fault processing. In addition, the recovery module 504 recovers faults automatically by executing a remote recovery script one or more times, and if a recovery attempt fails, the script can be executed again after each preset time interval to retry the recovery; this addresses the related-art problems of poor timeliness, poor fault tolerance, and high labor cost caused by relying on manual operation without effective closed-loop management, and improves troubleshooting and repair efficiency.
According to an embodiment of the present disclosure, the storage link failure handling apparatus 500 further includes a second determining module and a first sending module.
The second determining module is configured to determine self-healing and non-self-healing storage links according to the fault recovery results of the storage links to be processed. The first sending module is configured to, when a storage link to be processed is a non-self-healing storage link, send the fault location basic information of that link to the centralized monitoring system, so that the non-self-healing storage link is managed through the centralized monitoring system.
According to an embodiment of the present disclosure, the load prediction apparatus 500 further includes a third determining module and a second sending module.
The third determining module is configured to determine the historical error reporting times of the self-healing storage link when the storage link to be processed is the self-healing storage link. And the second sending module is used for sending the fault positioning basic information of the target self-healing storage link with the historical error reporting times meeting the preset times condition to the centralized monitoring system so as to manage the target self-healing storage link through the centralized monitoring system.
According to an embodiment of the present disclosure, the storage link failure handling apparatus 500 further includes a fourth determining module, a second obtaining module, a fifth determining module, and a third sending module.
The fourth determining module is configured to determine failed disk machines according to the disk machine state information of each storage link to be processed. The second obtaining module is configured to obtain the association information of each failed disk machine, wherein the association information characterizes the number of storage links to be processed that are associated with that failed disk machine. The fifth determining module is configured to determine which disk machines require focused attention according to the association information of each failed disk machine. The third sending module is configured to send the disk machine state information of the disk machines requiring focused attention to the centralized monitoring system, so that those disk machines are managed through the centralized monitoring system.
According to an embodiment of the present disclosure, the first determining module 501 includes an obtaining unit and a determining unit.
The obtaining unit is configured to acquire the target error report information and the server state information on each storage link, wherein the target error report information includes disk machine storage state errors, network link connection state errors, and multipath software running state errors on each storage link, and the server state information includes first state information representing whether the server on each storage link is in a production state and second state information representing whether the server on each storage link is working normally. The determining unit is configured to determine the queue of storage links to be processed according to the target error report information and the server state information on each storage link.
According to an embodiment of the present disclosure, the determining unit includes a first determining subunit, a second determining subunit, a first filtering subunit, a third determining subunit, and a second filtering subunit.
The first determining subunit is configured to determine a first queue of storage links to be checked according to the target error report information, wherein the first queue includes at least one storage link to be checked. The second determining subunit is configured to determine the operation state of each storage link to be checked according to the second state information, wherein the operation state is either normal or abnormal. The first filtering subunit is configured to filter the storage links whose operation state is normal out of the first queue to obtain a second queue of storage links to be checked. The third determining subunit is configured to determine white-list storage links according to the first state information, wherein a white-list storage link is a storage link whose server is not in a production state. The second filtering subunit is configured to filter the white-list storage links out of the second queue to determine the queue of storage links to be processed.
According to an embodiment of the present disclosure, the above-mentioned obtaining unit includes an obtaining subunit and a screening subunit. The obtaining subunit is configured to obtain the system log information of the server on each storage link. The screening subunit is configured to screen out the target error reporting information and the server state information on each storage link from the system log information.
According to the embodiment of the disclosure, screening out target error reporting information and server state information on each storage link from system log information comprises: storing the system log information to a target database; and screening out target error reporting information and server state information on each storage link from the target database.
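Purely as an assumption-laden sketch, the step above might look like the following, with SQLite standing in for the target database and a simple regular expression standing in for the screening rules; neither choice is specified by the embodiment.

import re
import sqlite3

TARGET_ERROR = re.compile(r"(disk|link|multipath).*(error|fail)", re.IGNORECASE)

def screen_logs(log_lines):
    conn = sqlite3.connect(":memory:")   # in-memory stand-in for the target database
    conn.execute("CREATE TABLE syslog (line TEXT)")
    conn.executemany("INSERT INTO syslog VALUES (?)", [(l,) for l in log_lines])
    rows = conn.execute("SELECT line FROM syslog").fetchall()
    # screen the target error reporting information out of the stored log
    return [line for (line,) in rows if TARGET_ERROR.search(line)]

print(screen_logs([
    "kernel: multipath path fc0 error, marking path down",
    "sshd: session opened for user ops",
]))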
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any plurality of the first determining module 501, the obtaining module 502, the positioning module 503 and the recovering module 504 may be combined and implemented in one module/unit/sub-unit, or any one of the modules/units/sub-units may be split into a plurality of modules/units/sub-units. Alternatively, at least part of the functionality of one or more of these modules/units/sub-units may be combined with at least part of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to an embodiment of the present disclosure, at least one of the first determining module 501, the obtaining module 502, the positioning module 503 and the recovering module 504 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of three implementations of software, hardware and firmware, or any suitable combination of any of them. Alternatively, at least one of the first determining module 501, the obtaining module 502, the positioning module 503 and the restoring module 504 may be at least partly implemented as a computer program module, which when executed may perform a corresponding function.
Fig. 6 schematically shows a block diagram of an electronic device for implementing a storage link failure handling method according to an embodiment of the present disclosure. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, an electronic device 600 according to an embodiment of the present disclosure includes a processor 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 601 may also include onboard memory for caching purposes. The processor 601 may include a single processing unit or multiple processing units for performing different actions of a method flow according to embodiments of the disclosure.
In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. The processor 601 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 602 and/or the RAM 603. It is to be noted that the programs may also be stored in one or more memories other than the ROM 602 and the RAM 603. The processor 601 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 600 may further include an input/output (I/O) interface 605, which is also connected to the bus 604. The electronic device 600 may also include one or more of the following components connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it is installed into the storage section 608 as needed.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program, when executed by the processor 601, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 602 and/or the RAM 603 described above and/or one or more memories other than the ROM 602 and the RAM 603.
Embodiments of the present disclosure also include a computer program product comprising a computer program that contains program code for performing the method provided by the embodiments of the present disclosure; when the computer program product runs on an electronic device, the program code causes the electronic device to implement the storage link failure handling method provided by the embodiments of the present disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed through the communication section 609, and/or installed from the removable medium 611. The computer program containing the program code may be transmitted using any suitable network medium, including but not limited to wireless and wired media, or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, the program code for carrying out the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
According to embodiments of the present disclosure, executing the storage link fault processing method makes it possible to determine the storage links to be processed and then obtain the fault positioning basic information of each such link, so that every segment of a faulty storage link that may have failed can be checked. The inspection scope is therefore relatively comprehensive, which at least partially solves the problems of false alarms, false positives, and missed faults in the related art, and improves the accuracy of fault handling. In addition, according to the embodiments of the present disclosure, a fault is recovered automatically by executing the remote recovery script one or more times; if the first attempt fails, the script can be executed again after each preset time interval to retry the recovery. This addresses the poor timeliness, the poor fault tolerance, and the high labor cost that arise in the related art, where reliance on manual operation prevents effective closed-loop management, and it improves the efficiency of troubleshooting and repair. A minimal sketch of this retry behaviour is given below.
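The following Python sketch illustrates the retry loop under explicit assumptions: SSH reachability of the faulty host, a hypothetical /opt/scripts/recover.sh recovery script, and illustrative values for the retry interval and budget; none of these are fixed by the embodiment.

import subprocess
import time

RETRY_INTERVAL_S = 60   # hypothetical "preset time interval"
MAX_ATTEMPTS = 3        # hypothetical retry budget

def remote_recover(host):
    # Returns True once the remote script reports success (recovered normal
    # connection), False if every attempt fails (unrecovered normal connection).
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = subprocess.run(["ssh", host, "/opt/scripts/recover.sh"],
                                capture_output=True)
        if result.returncode == 0:
            return True
        if attempt < MAX_ATTEMPTS:
            time.sleep(RETRY_INTERVAL_S)   # wait the preset interval, then retry
    return False   # hand the link off to the centralized monitoring system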
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined in various ways, even if such combinations are not expressly recited in the present disclosure. In particular, such combinations may be made without departing from the spirit and teaching of the present disclosure, and all of them fall within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (12)

1. A storage link failure handling method comprises the following steps:
determining a storage link queue to be processed, wherein the storage link queue to be processed comprises at least one storage link to be processed;
acquiring fault positioning basic information of each to-be-processed storage link, wherein the fault positioning basic information comprises disk machine state information, multi-path software state information and network link state information of each to-be-processed storage link, and the network link state information comprises connection state information of the network link and port information corresponding to the network link;
determining the fault occurrence position of each storage link to be processed according to the fault positioning basic information; and
executing a remote recovery script one or more times at the fault occurrence position of each to-be-processed storage link, and acquiring a fault recovery result of the to-be-processed storage link, wherein the fault recovery result comprises: normal connection recovered, and normal connection not recovered.
2. The method of claim 1, further comprising:
determining a self-healing storage link and a non-self-healing storage link according to the fault recovery result of the storage link to be processed;
and under the condition that the storage link to be processed is the non-self-healing storage link, sending the fault positioning basic information of the non-self-healing storage link to a centralized monitoring system so as to manage the non-self-healing storage link through the centralized monitoring system.
3. The method of claim 2, further comprising:
determining the historical error reporting times of the self-healing storage link under the condition that the storage link to be processed is the self-healing storage link;
and sending the fault positioning basic information of the target self-healing storage link with the historical error reporting times meeting a preset time condition to a centralized monitoring system so as to manage the target self-healing storage link through the centralized monitoring system.
4. The method of claim 1 or 3, further comprising:
determining a failed disk machine according to the disk machine state information of each to-be-processed storage link;
acquiring association information of each failed disk machine, wherein the association information represents the number of to-be-processed storage links associated with the failed disk machine;
determining a key-attention disk machine according to the association information of each failed disk machine;
and sending the disk machine state information of the key-attention disk machine to a centralized monitoring system so as to manage the key-attention disk machine through the centralized monitoring system.
5. The method of claim 1, wherein the determining a pending storage link queue comprises:
acquiring target error reporting information and server state information on each storage link, wherein the target error reporting information comprises disk machine storage state error reporting information on each storage link, network link connection state error reporting information and multi-path software running state error reporting information, and the server state information comprises first state information used for representing whether a server on each storage link is in a production state and second state information used for representing whether the server on each storage link normally works;
and determining the storage link queue to be processed according to the target error reporting information and the server state information on each storage link.
6. The method of claim 5, wherein determining the pending storage link queue according to the target error reporting information and the server state information on each storage link comprises:
determining a first storage link queue to be checked according to the target error reporting information, wherein the first storage link queue to be checked comprises at least one storage link to be checked;
determining the running state of each storage link to be checked according to the second state information, wherein the running state comprises normal and abnormal;
filtering the storage link with normal operation state from the first storage link queue to be checked to obtain a second storage link queue to be checked;
determining a white list storage link according to the first state information, wherein the white list storage link is a storage link where a server which is not in a production state is located;
and filtering the white list storage link from the second storage link queue to be checked so as to determine the storage link queue to be processed.
7. The method of claim 5, wherein the obtaining target error information and server state information on each storage link comprises:
acquiring system log information of a server on each storage link;
and screening the target error reporting information and the server state information on each storage link from the system log information.
8. The method of claim 7, wherein screening the system log information for the target error information and server status information on the each storage link comprises:
storing the system log information to a target database;
and screening the target error reporting information and the server state information on each storage link from the target database.
9. A storage link failure handling apparatus comprising:
the device comprises a first determining module, a second determining module and a processing module, wherein the first determining module is used for determining a storage link queue to be processed, and the storage link queue to be processed comprises at least one storage link to be processed;
a first obtaining module, configured to obtain basic fault location information of each to-be-processed storage link, where the basic fault location information includes disk machine state information, multi-path software state information, and network link state information of each to-be-processed storage link, and the network link state information includes connection state information of the network link and port information corresponding to the network link;
the positioning module is used for determining the fault occurrence position of each storage link to be processed according to the fault positioning basic information; and
and the recovery module is used for executing a remote recovery script one or more times at the fault occurrence position of each to-be-processed storage link and acquiring a fault recovery result of the to-be-processed storage link, wherein the fault recovery result comprises: normal connection recovered, and normal connection not recovered.
10. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-8.
11. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 8.
12. A computer program product comprising computer executable instructions for implementing the method of any one of claims 1 to 8 when executed.
CN202110478680.8A 2021-04-30 2021-04-30 Storage link fault processing method and device, electronic equipment and storage medium Active CN113162808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110478680.8A CN113162808B (en) 2021-04-30 2021-04-30 Storage link fault processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113162808A (en) 2021-07-23
CN113162808B CN113162808B (en) 2023-01-06

Family

ID=76872857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110478680.8A Active CN113162808B (en) 2021-04-30 2021-04-30 Storage link fault processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113162808B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001034509A (en) * 1999-07-16 2001-02-09 Hitachi Ltd Fault recovering method of information processor
US8996924B2 (en) * 2011-02-24 2015-03-31 Fujitsu Limited Monitoring device, monitoring system and monitoring method
CN111901399A (en) * 2020-07-08 2020-11-06 苏州浪潮智能科技有限公司 Cloud platform block equipment exception auditing method, device, equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114793192A (en) * 2022-04-22 2022-07-26 中国工商银行股份有限公司 Fault location method, apparatus, device, medium, and program product
CN114793192B (en) * 2022-04-22 2024-02-02 中国工商银行股份有限公司 Fault positioning method, device, equipment and medium
CN114785673A (en) * 2022-04-26 2022-07-22 杭州迪普科技股份有限公司 Method and device for acquiring abnormal information during main/standby switching under multi-master control VSM environment
CN114785673B (en) * 2022-04-26 2023-08-22 杭州迪普科技股份有限公司 Method and device for acquiring abnormal information during active-standby switching
CN115133983A (en) * 2022-06-15 2022-09-30 国网青海省电力公司信息通信公司 Method, system and device for determining communication link and electronic equipment
CN115133983B (en) * 2022-06-15 2024-04-19 国网青海省电力公司信息通信公司 Communication link determination method, system, device and electronic equipment
WO2024066771A1 (en) * 2022-09-28 2024-04-04 中兴通讯股份有限公司 Fault root cause location method and apparatus for fronthaul link

Also Published As

Publication number Publication date
CN113162808B (en) 2023-01-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant