CN114116282B

CN114116282B - Method and device for reporting and repairing network attached storage faults

Info

Publication number: CN114116282B
Application number: CN202111342238.9A
Authority: CN
Inventors: 郑强
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2023-08-18
Anticipated expiration: 2041-11-12
Also published as: CN114116282A

Abstract

The application provides a method, a system, a device and a storage medium for reporting and repairing network attached storage faults. The method comprises the following steps: acquiring an alarm information file of the network attached storage, and filling in the alarm data information in the alarm information file; judging in sequence, according to the filled-in alarm data information, whether each alarm event triggers an alarm; in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and calling a repair function in a fault mode library to repair the alarm event according to the identifier of the alarm event that occurred. The application makes network attached storage alarms visible to the user, so that faults can be handled effectively and the stability of the system is ensured. In addition, some alarms can be repaired automatically without manual intervention and without the user noticing, which increases user acceptance.

Description

Method and device for reporting and repairing network attached storage faults

Technical Field

The present application relates to the field of storage, and in particular, to a method, system, device, and storage medium for reporting and repairing a network attached storage failure.

Background

In the big data era, the requirements on storage reliability and on accurate problem localization are becoming higher and higher. However, when the NAS (Network Attached Storage) service of the current MCS (a trimmed-down Linux system based on the Linux kernel) fails during use, the GUI (Graphical User Interface) shows no alarm event prompt related to the network attached storage service. The user therefore cannot obtain the fault information in time and cannot take measures in time, which creates a hidden danger for the stable operation of the system.

Disclosure of Invention

In view of the above, an object of the embodiments of the present application is to provide a method, a system, a computer device and a computer readable storage medium for reporting and repairing a network attached storage failure.

Based on the above object, an aspect of the embodiments of the present application provides a method for reporting and repairing a network attached storage failure, including the following steps: acquiring an alarm information file of the network attached storage, and filling in the alarm data information in the alarm information file; judging in sequence, according to the filled-in alarm data information, whether each alarm event triggers an alarm; in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and calling a repair function in a fault mode library to repair the alarm event according to the identifier of the alarm event that occurred.

In some embodiments, calling the reporting function to report the alarm event includes: activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether any other manager has activated the error; and, in response to no other manager having activated the error, mapping the error code to the true error code of the node and setting an error flag.

In some embodiments, the method further comprises: in response to the alarm event not triggering an alarm, calling a clearing function to clear the alarm event.

In some embodiments, calling the clearing function to clear the alarm event comprises: clearing the error code information in the cache and judging whether the error code is a preset value; and, in response to the error code being the preset value, clearing the current mode of the platform main process and setting the platform main process to the normal mode.

In another aspect of the embodiments of the present application, a system for reporting and repairing a network attached storage failure is provided, including: an acquisition module configured to acquire an alarm information file of the network attached storage and fill in the alarm data information in the alarm information file; a judging module configured to judge in sequence, according to the filled-in alarm data information, whether each alarm event triggers an alarm; a reporting module configured to call a reporting function to report the alarm event in response to an alarm event triggering an alarm; and a repair module configured to call a repair function in the fault mode library to repair the alarm event according to the identifier of the alarm event that occurred.

In some embodiments, the reporting module is configured to: activate the error corresponding to the alarm event in the manager corresponding to the alarm event, and check whether any other manager has activated the error; and, in response to no other manager having activated the error, map the error code to the true error code of the node and set an error flag.

In some embodiments, the system further comprises a clearing module configured to: in response to the alarm event not triggering an alarm, call a clearing function to clear the alarm event.

In some embodiments, the clearing module is further configured to: clear the error code information in the cache and judge whether the error code is a preset value; and, in response to the error code being the preset value, clear the current mode of the platform main process and set the platform main process to the normal mode.

In yet another aspect of the embodiment of the present application, there is also provided a computer apparatus, including: at least one processor; and a memory storing computer instructions executable on the processor, which when executed by the processor, perform the steps of the method as above.

In yet another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method steps as described above.

The application has the following beneficial technical effects: network attached storage alarms are displayed intuitively on the user's page, and when such an alarm appears, automatic repair reduces manual intervention, increases user acceptance, and improves the stability of the system.

Drawings

In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application and that other embodiments may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of an embodiment of a method for reporting and repairing a network attached storage failure provided by the present application;

FIG. 2 is a schematic diagram of an embodiment of a system for reporting and repairing network attached storage failures provided by the present application;

FIG. 3 is a schematic hardware architecture diagram of an embodiment of a computer device for reporting and repairing a network attached storage failure provided by the present application;

FIG. 4 is a schematic diagram of an embodiment of a computer storage medium for reporting and repairing network attached storage failures provided by the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the following embodiments of the present application will be described in further detail with reference to the accompanying drawings.

It should be noted that all uses of "first" and "second" in the embodiments of the present application are intended to distinguish two different entities or parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present application; this note is not repeated in the following embodiments.

In a first aspect of the embodiment of the present application, an embodiment of a method for reporting and repairing a network attached storage failure is provided. Fig. 1 is a schematic diagram of an embodiment of a method for reporting and repairing a network attached storage failure provided by the present application. As shown in fig. 1, the embodiment of the present application includes the following steps:

S1, acquiring an alarm information file of the network attached storage, and filling in the alarm data information in the alarm information file;

S2, judging in sequence, according to the filled-in alarm data information, whether each alarm event triggers an alarm;

S3, in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and

S4, calling a repair function in a fault mode library to repair the alarm event according to the identifier of the alarm event that occurred.
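Steps S1 to S4 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation described in the application: the one-line-per-event `event_id=0|1` file layout, the `AlarmRecord` type, the `process` function and the dictionary used as a fault mode library are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class AlarmRecord:
    event_id: str   # hypothetical identifier of the alarm event
    active: bool    # activation flag filled in during S1


def load_alarm_records(text: str) -> List[AlarmRecord]:
    """S1: parse the alarm information file and fill in the alarm data.

    Assumed format (not from the source): one 'event_id=0|1' pair per line.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        event_id, _, flag = line.partition("=")
        records.append(AlarmRecord(event_id.strip(), flag.strip() == "1"))
    return records


def process(
    text: str,
    report: Callable[[AlarmRecord], None],
    clear: Callable[[AlarmRecord], None],
    repair_library: Dict[str, Callable[[AlarmRecord], None]],
) -> List[Tuple[str, str]]:
    """Run S2-S4 over every record and return a log of the actions taken."""
    log = []
    for rec in load_alarm_records(text):        # S2: judge events in sequence
        if rec.active:
            report(rec)                         # S3: report the alarm
            log.append(("report", rec.event_id))
            repair = repair_library.get(rec.event_id)
            if repair is not None:              # S4: repair by identifier
                repair(rec)
                log.append(("repair", rec.event_id))
        else:
            clear(rec)                          # no alarm: clear the event
            log.append(("clear", rec.event_id))
    return log
```

Passing the reporting, clearing and repair functions in as parameters keeps the sketch independent of any particular MCS interface; alarms with no entry in the library are reported but left for manual handling.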

A plurality of fault sensors are embedded in the network attached storage virtual machine. If a fault occurs, the sensors quickly capture it and report an alarm to the MCS system. The collected items include network attached storage node failover, NFS (Network File System) service, CIFS (Common Internet File System) service, FTP (File Transfer Protocol) service, minios service, network attached storage restart faults, network attached storage Ethernet port faults, file system capacity and the like, which are made available for the MCS system to call. The implementation flow is as follows:

This is implemented on the MCS by the daemon vm_daemon.py, which invokes nas_alarmd once every 5 seconds. nas_alarmd connects to the virtual machine through SSH (Secure Shell) and performs the query on a per-node basis. It obtains the states of the network attached storage node in the virtual machine, such as failover, NFS service, CIFS service, FTP service, minios service, restart, network card and file system, and, if the query succeeds, writes them to a FIFO file for the MCS alarm code to query.

The alarm information file of the network attached storage is acquired, and the alarm data information in the alarm information file is filled in. Whether each alarm event triggers an alarm is then judged in sequence according to the filled-in alarm data information.
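The polling behaviour described above can be sketched as follows. The sketch is hedged: it does not reproduce vm_daemon.py or nas_alarmd, whose internals the description does not give. The SSH query is replaced by an injected `query` callable, the output line format and the key names in `CHECKS` are assumptions, and an in-memory text stream stands in for the FIFO file.

```python
import io
import time
from typing import Callable, Dict, Iterable, Optional, TextIO

# Items named in the description; the key spellings are assumptions.
CHECKS = ("failover", "nfs", "cifs", "ftp", "minios", "reboot", "nic", "filesystem")


def poll_once(
    nodes: Iterable[str],
    query: Callable[[str], Dict[str, bool]],
    out: TextIO,
) -> int:
    """Query each node once; write one line per node only if the query succeeds."""
    written = 0
    for node in nodes:
        try:
            state = query(node)   # stands in for the SSH query of one node
        except OSError:
            continue              # query failed: nothing is written for this node
        fields = " ".join(f"{k}={int(bool(state.get(k, False)))}" for k in CHECKS)
        out.write(f"{node} {fields}\n")
        written += 1
    out.flush()
    return written


def run(
    nodes: Iterable[str],
    query: Callable[[str], Dict[str, bool]],
    out: TextIO,
    interval: float = 5.0,        # the description states a 5-second period
    rounds: Optional[int] = None,  # None means poll forever
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call poll_once repeatedly, sleeping `interval` seconds between rounds."""
    nodes = list(nodes)
    done = 0
    while rounds is None or done < rounds:
        poll_once(nodes, query, out)
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)


def demo() -> str:
    """Poll one reachable and one unreachable node into an in-memory buffer."""
    def fake_query(node: str) -> Dict[str, bool]:
        if node == "node2":
            raise OSError("unreachable")
        return {"nfs": True}      # True marks a detected fault in this sketch

    buf = io.StringIO()
    poll_once(["node1", "node2"], fake_query, buf)
    return buf.getvalue()
```

Injecting `query` and `sleep` makes the loop testable without a virtual machine or real delays; a deployment would substitute an actual SSH client and a real FIFO path.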

In response to an alarm event triggering an alarm, a reporting function is called to report the alarm event. Alarm detection and processing in the MCS system is completed by two modules, the EC module and the PL module. The EC module reads the network attached storage alarm information file, judges the alarm events in sequence, fills in information such as error records, state data and activation flags, and then processes the alarm events in sequence according to the filled-in information, calling the alarm reporting function if an alarm exists and the alarm clearing function otherwise. The PL module sorts the error codes according to the received alarm event information and reports the alarm events.

The specific flow is as follows: the MCS system checks whether the event state is "starting" and, if so, exits; the MCS system reads the NAS alarm information state file, judges whether the acquired information is valid, and exits if it is invalid; the MCS system judges the NAS alarm information in sequence and fills in the error records, state data, activation flags and other information; the alarm events are then processed in sequence according to the alarm data information filled in during the previous step, the ecmgr_sensor_report_node_error function being called to report the alarm if an alarm exists for an alarm event, and the ecmgr_sensor_clear_node_error function being called to clear the alarm if it does not.
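The EC and PL flow described above might look roughly like the following. The sketch rests on several assumptions: the signatures of ecmgr_sensor_report_node_error and ecmgr_sensor_clear_node_error are not given in the description, so plain callables stand in for them; the `ErrorRecord` fields, the validity rule (every event name must appear in a code table) and the ascending sort order are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ErrorRecord:
    error_code: int   # error record
    state: str        # state data
    active: bool      # activation flag


def ec_process(
    event_state: str,
    raw: Optional[Dict[str, int]],
    code_table: Dict[str, int],
    report_node_error: Callable[[ErrorRecord], None],
    clear_node_error: Callable[[ErrorRecord], None],
) -> Optional[List[ErrorRecord]]:
    """EC module sketch: check state, validate, fill records, report or clear."""
    if event_state == "starting":
        return None   # event state is still "starting": exit
    if not raw or not all(name in code_table for name in raw):
        return None   # acquired information is invalid: exit
    # Fill in error records, state data and activation flags in sequence.
    records = [
        ErrorRecord(code_table[name], str(value), bool(value))
        for name, value in raw.items()
    ]
    # Process the alarm events in sequence according to the filled-in data.
    for rec in records:
        if rec.active:
            report_node_error(rec)   # stands in for ecmgr_sensor_report_node_error
        else:
            clear_node_error(rec)    # stands in for ecmgr_sensor_clear_node_error
    return records


def pl_order(records: List[ErrorRecord]) -> List[ErrorRecord]:
    """PL module sketch: order the active alarms by error code (assumed ascending)."""
    return sorted((r for r in records if r.active), key=lambda r: r.error_code)
```

Returning `None` for the two early exits lets a caller distinguish "nothing processed" from "processed, no alarms", which is a design choice of this sketch rather than something the description specifies.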

In some embodiments, calling the reporting function to report the alarm event includes: activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether any other manager has activated the error; and, in response to no other manager having activated the error, mapping the error code to the true error code of the node and setting an error flag. Whether the error code is 0x522 is then checked: if so, the platform main process is forcibly set to 522 mode; if not, the function is called to report the alarm. The error code is cached so that the error information is not lost when the IO process exits.
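A minimal sketch of this reporting logic is given below. Only the value 0x522 comes from the description; the `Node` structure, the manager names, the `code_map` lookup, and the choice to compare the mapped code (rather than the raw code) against 0x522 are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

MODE_522_CODE = 0x522   # error code named in the description


@dataclass
class Node:
    managers: Dict[str, Set[int]] = field(default_factory=dict)  # active errors per manager
    error_flags: Set[int] = field(default_factory=set)
    cache: Set[int] = field(default_factory=set)     # survives an IO process exit
    mode: str = "normal"                             # mode of the platform main process
    reported: List[int] = field(default_factory=list)


def report_node_error(node: Node, manager: str, code: int, code_map: Dict[int, int]) -> bool:
    """Activate `code` in `manager`; report it unless another manager already has it."""
    node.managers.setdefault(manager, set()).add(code)
    already_active = any(
        code in errors for name, errors in node.managers.items() if name != manager
    )
    if already_active:
        return False                       # another manager activated it first
    true_code = code_map.get(code, code)   # map to the node's true error code
    node.error_flags.add(true_code)        # set the error flag
    node.cache.add(true_code)              # cache the code against process exit
    if true_code == MODE_522_CODE:
        node.mode = "522"                  # force the main process into 522 mode
    else:
        node.reported.append(true_code)    # otherwise report the alarm
    return True
```

Checking the other managers before mapping and reporting is what prevents the same underlying fault from being raised twice when several managers observe it.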

In some embodiments, calling the clearing function to clear the alarm event comprises: clearing the error code information in the cache and judging whether the error code is a preset value; and, in response to the error code being the preset value, clearing the current mode of the platform main process and setting the platform main process to the normal mode. Specifically, whether the error code is 0x522 is checked; if so, the 522 mode of the platform main process is cleared and the platform main process is set to the normal mode.
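The clearing logic can be sketched in the same spirit. The `PlatformState` structure and the boolean return value are hypothetical; the preset value 0x522 and the cache clearing step come from the description.

```python
from dataclasses import dataclass, field
from typing import Set

PRESET_CODE = 0x522   # preset value named in the description


@dataclass
class PlatformState:
    cache: Set[int] = field(default_factory=set)   # cached error codes
    mode: str = "normal"                           # mode of the platform main process


def clear_node_error(state: PlatformState, code: int) -> bool:
    """Clear `code` from the cache; return True if the main process mode was reset."""
    state.cache.discard(code)            # clear the cached error code information
    if code == PRESET_CODE and state.mode == "522":
        state.mode = "normal"            # clear 522 mode and return to normal mode
        return True
    return False
```

Using `discard` rather than `remove` makes clearing idempotent, so calling the function for an event that never raised an alarm is harmless.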

Calling the clearing function to clear the alarm event also includes: activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether any other manager has activated the error; and, in response to no other manager having activated the error, mapping the error code to the true error code of the node.

A repair function in the fault mode library is called to repair the alarm event according to the identifier of the alarm event that occurred.

NAS-related alarm event information can be displayed in the alarm interface at the front end of the graphical user interface. The interface lists the error code, timestamp, state, description, object type, object identifier and object name of the current alarm event, and operations such as viewing attributes, clearing logs and running a repair can be performed on an alarm event by right-clicking it. Some alarms are registered through a big data background script, after which an automatic repair module is called to locate and repair the fault automatically. The automatic repair module in the fault mode library is called, according to the principle of the automatic repair module and the identifier of the alarm that occurred, to perform the automatic repair.
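One plausible way to organise a fault mode library keyed by alarm identifier is a registry of repair functions, sketched below. Everything here is hypothetical: the description does not specify how repair modules are registered, and the identifier 0x101, the `repair_for` decorator and the `restart_nfs` placeholder are invented for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical fault mode library: alarm identifier -> repair function.
FAULT_MODE_LIBRARY: Dict[int, Callable[[str], bool]] = {}


def repair_for(alarm_id: int) -> Callable[[Callable[[str], bool]], Callable[[str], bool]]:
    """Decorator that registers a repair function under an alarm identifier."""
    def register(func: Callable[[str], bool]) -> Callable[[str], bool]:
        FAULT_MODE_LIBRARY[alarm_id] = func
        return func
    return register


@repair_for(0x101)   # 0x101 is an invented identifier for an NFS service alarm
def restart_nfs(node: str) -> bool:
    # Placeholder: a real repair would restart the service on `node`
    # and verify that it came back up.
    return True


def run_repair(alarm_id: int, node: str) -> Optional[bool]:
    """Look up the repair by identifier; None means no automatic repair exists."""
    repair = FAULT_MODE_LIBRARY.get(alarm_id)
    if repair is None:
        return None   # leave the alarm for manual handling in the GUI
    return repair(node)
```

Returning `None` for unregistered identifiers matches the statement that only some alarms are repaired automatically; the rest remain visible in the alarm interface.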

It should be noted that the steps in the foregoing embodiments of the method for reporting and repairing a network attached storage failure may be interleaved, replaced, added to and deleted. Methods for reporting and repairing a network attached storage failure obtained through such reasonable permutations and combinations therefore also fall within the protection scope of the present application, and the protection scope of the present application should not be limited to the embodiments.

Based on the above object, a second aspect of the embodiments of the present application provides a system for reporting and repairing a network attached storage failure. As shown in fig. 2, the system 200 includes the following modules: an acquisition module configured to acquire an alarm information file of the network attached storage and fill in the alarm data information in the alarm information file; a judging module configured to judge in sequence, according to the filled-in alarm data information, whether each alarm event triggers an alarm; a reporting module configured to call a reporting function to report the alarm event in response to an alarm event triggering an alarm; and a repair module configured to call a repair function in the fault mode library to repair the alarm event according to the identifier of the alarm event that occurred.

Based on the above object, a third aspect of the embodiments of the present application provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to perform the following steps: S1, acquiring an alarm information file of the network attached storage, and filling in the alarm data information in the alarm information file; S2, judging in sequence, according to the filled-in alarm data information, whether each alarm event triggers an alarm; S3, in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and S4, calling a repair function in a fault mode library to repair the alarm event according to the identifier of the alarm event that occurred.

In some embodiments, the steps further comprise: in response to the alarm event not triggering an alarm, calling a clearing function to clear the alarm event.

Fig. 3 is a schematic hardware structure diagram of an embodiment of the computer device for reporting and repairing the network attached storage fault provided by the present application.

Taking the device shown in fig. 3 as an example, the device includes a processor 301 and a memory 302.

The processor 301 and the memory 302 may be connected by a bus or in another manner; connection by a bus is taken as an example in fig. 3.

The memory 302 is used as a non-volatile computer readable storage medium, and may be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to a method for reporting and repairing network attached storage failures in an embodiment of the present application. The processor 301 executes various functional applications and data processing of the server, that is, a method of reporting and repairing network attached storage failures, by running nonvolatile software programs, instructions, and modules stored in the memory 302.

Memory 302 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the method of network attached storage failure reporting and repair, etc. In addition, memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 302 may optionally include memory located remotely from processor 301, which may be connected to the local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

One or more computer instructions 303 corresponding to the method for reporting and repairing a network attached storage failure are stored in the memory 302 and, when executed by the processor 301, perform the method for reporting and repairing a network attached storage failure in any of the method embodiments described above.

Any embodiment of a computer device that performs the method for reporting and repairing a network attached storage failure described above may achieve the same or similar effects as any of the method embodiments described above that correspond to the embodiment.

The application also provides a computer readable storage medium storing a computer program which when executed by a processor performs a method of reporting and repairing network attached storage failures.

Fig. 4 is a schematic diagram of an embodiment of a computer storage medium for reporting and repairing the network attached storage failure provided by the present application. Taking the computer storage medium shown in fig. 4 as an example, the computer readable storage medium 401 stores a computer program 402 that, when executed by a processor, performs the above method.

Finally, it should be noted that, as will be appreciated by those skilled in the art, all or part of the flows of the above method embodiments may be implemented by a computer program instructing related hardware. The program of the method for reporting and repairing a network attached storage failure may be stored in a computer readable storage medium and, when executed, may include the flows of the above method embodiments. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like. The computer program embodiments described above may achieve the same or similar effects as any of the corresponding method embodiments described above.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the relative merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the application, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the application, and many other variations of the different aspects of the embodiments of the application as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present application.

Claims

1. A method for reporting and repairing a network attached storage fault, characterized by comprising the following steps:

acquiring an alarm information file of the network attached storage, and filling in the alarm data information in the alarm information file;

judging in sequence, according to the filled-in alarm data information, whether each alarm event triggers an alarm;

in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and

2. The method of claim 1, wherein calling the reporting function to report the alarm event comprises:

activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether any other manager has activated the error; and

in response to no other manager having activated the error, mapping the error code to the true error code of the node and setting an error flag.

3. The method according to claim 1, wherein the method further comprises:

in response to the alarm event not triggering an alarm, calling a clearing function to clear the alarm event.

4. The method of claim 3, wherein calling the clearing function to clear the alarm event comprises:

clearing the error code information in the cache and judging whether the error code is a preset value; and

in response to the error code being the preset value, clearing the current mode of the platform main process and setting the platform main process to the normal mode.

5. A system for reporting and repairing a network attached storage failure, comprising:

an acquisition module configured to acquire an alarm information file of the network attached storage and fill in the alarm data information in the alarm information file;

a judging module configured to judge in sequence, according to the filled-in alarm data information, whether each alarm event triggers an alarm;

a reporting module configured to call a reporting function to report the alarm event in response to an alarm event triggering an alarm; and

a repair module configured to call a repair function in the fault mode library to repair the alarm event according to the identifier of the alarm event.

6. The system of claim 5, wherein the reporting module is configured to:

7. The system of claim 5, further comprising a purge module configured to:

8. The system of claim 7, wherein the purge module is further configured to:

9. A computer device, comprising:

at least one processor; and

a memory storing computer instructions executable on the processor, which when executed by the processor, perform the steps of the method of any one of claims 1-4.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any of claims 1-4.