
CN114116282A - Method and device for reporting and repairing network attached storage fault

Info

Publication number: CN114116282A
Application number: CN202111342238.9A
Authority: CN
Inventors: 郑强
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2022-03-01
Anticipated expiration: 2041-11-12
Also published as: CN114116282B

Abstract

The present invention provides a method, system, device and storage medium for reporting and repairing network attached storage faults. The method includes: acquiring an alarm information file stored in a network attached storage, and filling alarm data information in the alarm information file; sequentially determining, according to the filled alarm data information, whether each alarm event triggers an alarm; in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and calling a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event. The invention displays network attached storage alarms so that they are visible to the user and faults can be handled efficiently, which helps ensure system stability. Some alarms are repaired automatically without manual intervention, so the repair is transparent to the user and user acceptance increases.

Description

Method and device for reporting and repairing network attached storage fault

Technical Field

The present invention relates to the field of storage, and in particular, to a method, a system, a device, and a storage medium for reporting and repairing a network attached storage failure.

Background

In the big data era, requirements for storage reliability and for accurate localization of problems keep rising. However, when the NAS (Network Attached Storage) service of the existing MCS system (a simplified Linux based on the Linux kernel) fails, the GUI (Graphical User Interface) shows no alarm event information related to the network attached storage service. Users therefore cannot obtain the fault information in time, the fault cannot be diagnosed and handled promptly, and a hidden risk to the stable operation of the system remains.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, a system, a computer device, and a computer-readable storage medium for reporting and repairing a network attached storage failure. Network attached storage alarms are displayed visually on the user's page, and when such an alarm occurs it is repaired automatically, which reduces manual intervention, increases user acceptance, and improves system stability.

Based on the above object, an aspect of the embodiments of the present invention provides a method for reporting and repairing a network attached storage failure, including the following steps: acquiring an alarm information file stored in the network attached storage, and filling alarm data information in the alarm information file; sequentially determining whether each alarm event triggers an alarm according to the filled alarm data information; in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and calling a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event.

In some embodiments, calling the reporting function to report the alarm event includes: activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether other managers have activated the error; and, in response to no other manager having activated the error, mapping the error code to the node's real error code and setting an error flag.

In some embodiments, the method further comprises: in response to the alarm event not triggering an alarm, calling a clearing function to clear the alarm event.

In some embodiments, calling the clearing function to clear the alarm event comprises: clearing the error code information in the cache, and determining whether the error code is a preset value; and, in response to the error code being the preset value, clearing the current mode of the platform main process and setting the platform main process to normal mode.

In another aspect of the embodiments of the present invention, a system for reporting and repairing a network attached storage fault is provided, including: an acquisition module configured to acquire an alarm information file stored in the network attached storage and fill alarm data information in the alarm information file; a judging module configured to sequentially determine whether each alarm event triggers an alarm according to the filled alarm data information; a reporting module configured to call a reporting function to report an alarm event in response to the alarm event triggering an alarm; and a repair module configured to call a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event.

In some embodiments, the reporting module is configured to: activate the error corresponding to the alarm event in the manager corresponding to the alarm event, and check whether other managers have activated the error; and, in response to no other manager having activated the error, map the error code to the node's real error code and set an error flag.

In some embodiments, the system further comprises a clearing module configured to: in response to the alarm event not triggering an alarm, call a clearing function to clear the alarm event.

In some embodiments, the clearing module is further configured to: clear the error code information in the cache, and determine whether the error code is a preset value; and, in response to the error code being the preset value, clear the current mode of the platform main process and set the platform main process to normal mode.

In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method as above.

In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, which stores a computer program that implements the above method steps when executed by a processor.

The invention has the following beneficial technical effects: by visually displaying network attached storage alarms on the user's page and repairing them automatically when they occur, manual intervention is reduced, user acceptance is increased, and system stability is improved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings without creative effort.

Fig. 1 is a schematic diagram of an embodiment of a method for reporting and repairing a network attached storage failure according to the present invention;

Fig. 2 is a schematic diagram of an embodiment of a system for reporting and repairing a network attached storage failure according to the present invention;

Fig. 3 is a schematic diagram of a hardware structure of an embodiment of a computer device for reporting and repairing a network attached storage failure according to the present invention;

Fig. 4 is a schematic diagram of an embodiment of a computer storage medium for reporting and repairing a network attached storage failure according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that the expressions "first" and "second" in the embodiments of the present invention are used only to distinguish two non-identical entities or non-identical parameters that share the same name. They are merely for convenience of description, should not be construed as limiting the embodiments of the present invention, and are not explained again in the following embodiments.

A first aspect of an embodiment of the present invention provides an embodiment of a method for reporting and repairing a network attached storage fault. Fig. 1 is a schematic diagram illustrating an embodiment of a method for reporting and repairing a network attached storage failure according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:

S1, acquiring an alarm information file stored in the network attached storage, and filling alarm data information in the alarm information file;

S2, sequentially determining whether each alarm event triggers an alarm according to the filled alarm data information;

S3, in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and

S4, calling a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event.
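Steps S1 to S4 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the `AlarmEvent` type, the "ok" state string, the event identifiers, and the function names are hypothetical and do not come from the MCS code base.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class AlarmEvent:
    event_id: str    # hypothetical identifier, e.g. "nfs_service"
    triggered: bool  # True when the filled data says the alarm fires


def fill_alarm_data(raw_status: Dict[str, str]) -> List[AlarmEvent]:
    """S1: turn the contents of the alarm information file into filled records.

    Assumption: any state other than "ok" means the alarm is triggered.
    """
    return [AlarmEvent(name, state != "ok") for name, state in raw_status.items()]


def process_alarms(
    raw_status: Dict[str, str],
    report: Callable[[AlarmEvent], None],
    repair_library: Dict[str, Callable[[AlarmEvent], None]],
) -> Tuple[List[str], List[str]]:
    """S2-S4: judge each event in order, report it, then look up a repair."""
    reported, repaired = [], []
    for event in fill_alarm_data(raw_status):        # S2: sequential judgement
        if not event.triggered:
            continue
        report(event)                                # S3: reporting function
        reported.append(event.event_id)
        repair = repair_library.get(event.event_id)  # S4: lookup by identifier
        if repair is not None:
            repair(event)
            repaired.append(event.event_id)
    return reported, repaired
```

Passing the reporting function and the repair library in as arguments keeps the sketch independent of any particular reporting back end.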

A number of fault sensors are embedded in the network attached storage virtual machine. If a fault occurs, the sensors capture it quickly and report it to the MCS system. The MCS system collects alarms such as network attached storage node failover, NFS (Network File System) service, CIFS (Common Internet File System) service, FTP (File Transfer Protocol) service, Minioss service, network attached storage restart fault, network attached storage Ethernet port fault, and file system capacity. The implementation flow is as follows:

The collection is implemented by the daemon vm_daemon.py, which calls nas_alarmd once every 5 seconds. nas_alarmd connects to the virtual machine through ssh (Secure Shell) and executes nas_alarm.py to run the query; the query is performed per node. nas_alarmd obtains the states of network attached storage node failover, the network file system service, the common internet file system service, the file transfer protocol service, the Minioss service, restart, the network card, and the file system in the virtual machine. If the query succeeds, it writes a fifo file for the MCS alarm code to query.
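One polling cycle of this kind might look like the sketch below. It is an illustration only: the ssh call is replaced by an injected `query` callable, a regular JSON file written atomically stands in for the fifo file, and the item names, file format, and function name are assumptions rather than details of nas_alarmd.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Optional

POLL_INTERVAL_SECONDS = 5  # polling interval stated in the description

# Hypothetical keys for the states the description lists.
MONITORED_ITEMS = (
    "failover", "nfs", "cifs", "ftp", "minioss",
    "restart", "network_card", "file_system",
)


def poll_once(query: Callable[[], Optional[Dict[str, str]]], status_path: str) -> bool:
    """Run one query and publish the result; return False if the query failed."""
    states = query()
    if states is None:  # query failed: leave any previous file untouched
        return False
    # Write to a temporary file and rename it, so a reader never sees a partial file.
    directory = os.path.dirname(os.path.abspath(status_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({item: states.get(item, "unknown") for item in MONITORED_ITEMS}, handle)
    os.replace(tmp_path, status_path)
    return True
```

A daemon loop would call `poll_once` and then sleep for `POLL_INTERVAL_SECONDS` before the next cycle.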

The alarm information file stored in the network attached storage is acquired, and alarm data information is filled in the alarm information file. Whether each alarm event triggers an alarm is then determined in sequence according to the filled alarm data information.

In response to an alarm event triggering an alarm, a reporting function is called to report the alarm event. Alarm detection and processing in the MCS system is carried out by an EC module and a PL module. The EC module reads the network attached storage alarm information file, evaluates the alarm events in sequence, and fills in information such as error records, state data, and activation flags. It then processes the alarm events in sequence according to the filled information: if an alarm exists, it calls the alarm reporting function; otherwise it calls the alarm clearing function. The PL module sorts the error codes according to the received alarm event information and reports the alarm events.

The specific process is as follows. The MCS system checks whether the event state is "starting" and exits if it is. The MCS system reads the NAS alarm information state file and checks whether the acquired information is valid, exiting if it is invalid. The MCS system then evaluates the NAS alarm information in sequence and fills in information such as error records, state data, and activation flags. Finally, it processes the alarm events in turn according to the alarm data information filled in the previous step: if an alarm event has an alarm, it calls the ecmgr_sensor_report_node_error function to report the alarm, and if there is no alarm, it calls the ecmgr_sensor_clear_node_error function to clear the alarm.
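The EC and PL processing described above can be sketched as follows. The record fields follow the text (error record, state data, activation flag), but the error code values, the validity rule, and the ascending sort order are assumptions made for illustration. The `report` and `clear` callables stand in for ecmgr_sensor_report_node_error and ecmgr_sensor_clear_node_error, whose real signatures are not given in the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class AlarmRecord:
    error_code: int  # error record
    state: str       # state data
    active: bool     # activation flag


# Hypothetical mapping from monitored item to error code; values are illustrative.
ERROR_CODES = {"nfs": 0x101, "cifs": 0x102, "ftp": 0x103}


def ec_process(
    starting: bool,
    status: Optional[Dict[str, str]],
    report: Callable[[int], None],
    clear: Callable[[int], None],
) -> List[AlarmRecord]:
    """EC module: check state, validate input, fill records, report or clear."""
    if starting:  # exit while the event state is "starting"
        return []
    # Assumed validity rule: the file must be non-empty and contain only known items.
    if not status or not set(status) <= set(ERROR_CODES):
        return []
    records = [
        AlarmRecord(ERROR_CODES[name], state, state != "ok")
        for name, state in status.items()
    ]
    for record in records:  # process the alarm events in turn
        (report if record.active else clear)(record.error_code)
    return records


def pl_order(records: List[AlarmRecord]) -> List[int]:
    """PL module: order the active error codes (ascending order is assumed)."""
    return sorted(r.error_code for r in records if r.active)
```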

In some embodiments, calling the reporting function to report the alarm event includes: activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether other managers have activated the error; and, in response to no other manager having activated the error, mapping the error code to the node's real error code and setting an error flag. Whether the error code is 0x522 is then checked: if so, the platform main process is forced into 522 mode; if not, a function is called to report the alarm. The error code is cached to prevent error information from being lost when the io process (input/output process) exits.
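A reporting function along these lines might be sketched as below. The class, the manager names, and the code mapping table are hypothetical. The sketch also assumes that the 0x522 check is made on the mapped node-level code, which the text does not state explicitly.

```python
from typing import Callable, Dict, Iterable, Set

MODE_522_ERROR = 0x522  # error code named in the description


class NodeErrorState:
    """Tracks which managers have an error active and what the node has reported."""

    def __init__(self, managers: Iterable[str], code_map: Dict[int, int]):
        self.active: Dict[str, Set[int]] = {m: set() for m in managers}
        self.code_map = code_map             # hypothetical: manager code -> node code
        self.error_flags: Set[int] = set()
        self.cache: Set[int] = set()         # survives an io process exit in the real system
        self.platform_mode = "normal"

    def report(self, manager: str, error_code: int, raise_alarm: Callable[[int], None]) -> bool:
        """Activate the error for one manager; report at node level only once."""
        others_active = any(
            error_code in codes
            for name, codes in self.active.items()
            if name != manager
        )
        self.active[manager].add(error_code)
        if others_active:
            return False  # another manager already activated this error
        node_code = self.code_map.get(error_code, error_code)
        self.error_flags.add(node_code)
        self.cache.add(node_code)
        if node_code == MODE_522_ERROR:
            self.platform_mode = "522"  # force the platform main process into 522 mode
        else:
            raise_alarm(node_code)
        return True
```

Checking the other managers before mapping the code means a fault seen by several managers produces a single node-level alarm.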

In some embodiments, calling the clearing function to clear the alarm event comprises: clearing the error code information in the cache, and determining whether the error code is a preset value; and, in response to the error code being the preset value, clearing the current mode of the platform main process and setting the platform main process to normal mode. It is checked whether the error code is 0x522: if so, the 522 mode of the platform main process is cleared; if not, the platform main process is set to normal mode.

Calling the clearing function to clear the alarm event also includes: activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether other managers have activated the error; and, in response to no other manager having activated the error, mapping the error code to the node's real error code.
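The cache and mode handling of the clearing function can be sketched as follows. Only the cache clearing and the reset of the platform mode for the preset value 0x522 are shown. The per-manager bookkeeping is left out, and the class and function names are hypothetical.

```python
from typing import Set

PRESET_ERROR = 0x522  # the preset value named in the description


class PlatformState:
    """Minimal stand-in for the cached error codes and the main process mode."""

    def __init__(self) -> None:
        self.cache: Set[int] = set()
        self.mode = "normal"


def clear_node_error(state: PlatformState, error_code: int) -> None:
    """Drop the cached error code and leave 522 mode if that code is cleared."""
    state.cache.discard(error_code)  # clear the error code information in the cache
    if error_code == PRESET_ERROR:
        state.mode = "normal"        # clear the current mode, return to normal mode
```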

According to the identifier of the alarm event, a repair function in a failure mode library is called to repair the alarm event.

Information about NAS alarm events can be displayed in the alarm interface at the front end of the graphical user interface. The interface lists the error code, timestamp, state, description, object type, object identifier, and object name of each current alarm event, and right-clicking an alarm event allows operations such as viewing its attributes, clearing logs, and running a repair. For some alarms, a big data background script is registered, and the automatic repair module is then called to locate and repair the fault automatically. The automatic repair module works by calling the corresponding repair function in the failure mode library according to the identifier of the alarm.
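A failure mode library keyed by alarm identifier could be organized as a simple registry, as in the sketch below. The identifier string, the placeholder repair function, and the returned status strings are assumptions for illustration; the text does not describe how the library is structured internally.

```python
from typing import Callable, Dict

# Registry: alarm identifier -> repair function returning True on success.
FAILURE_MODE_LIBRARY: Dict[str, Callable[[], bool]] = {}


def repair_for(alarm_id: str):
    """Decorator that registers a repair function under an alarm identifier."""
    def register(func: Callable[[], bool]) -> Callable[[], bool]:
        FAILURE_MODE_LIBRARY[alarm_id] = func
        return func
    return register


@repair_for("nfs_service_down")  # hypothetical identifier
def restart_nfs() -> bool:
    # Placeholder: a real repair would restart the service and verify its state.
    return True


def auto_repair(alarm_id: str) -> str:
    """Look up and run the repair for an alarm; fall back to manual handling."""
    repair = FAILURE_MODE_LIBRARY.get(alarm_id)
    if repair is None:
        return "manual"  # no automatic repair registered for this alarm
    return "repaired" if repair() else "failed"
```

Alarms without a registered repair fall through to manual handling, which matches the statement that only some alarms are repaired automatically.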

It should be particularly noted that, in the embodiments of the foregoing method for reporting and repairing a network attached storage failure, the steps may be interleaved, replaced, added, or deleted. Methods for reporting and repairing a network attached storage failure obtained through such reasonable permutations and combinations therefore also fall within the protection scope of the present invention, and the protection scope of the present invention shall not be limited to the embodiments.

Based on the above object, a second aspect of the embodiments of the present invention provides a system for reporting and repairing a network attached storage failure. As shown in Fig. 2, the system 200 includes the following modules: an acquisition module configured to acquire an alarm information file stored in the network attached storage and fill alarm data information in the alarm information file; a judging module configured to sequentially determine whether each alarm event triggers an alarm according to the filled alarm data information; a reporting module configured to call a reporting function to report an alarm event in response to the alarm event triggering an alarm; and a repair module configured to call a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event.

In view of the above object, a third aspect of the embodiments of the present invention provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to perform the following steps: S1, acquiring an alarm information file stored in the network attached storage, and filling alarm data information in the alarm information file; S2, sequentially determining whether each alarm event triggers an alarm according to the filled alarm data information; S3, in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and S4, calling a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event.

In some embodiments, the steps further comprise: in response to the alarm event not triggering an alarm, calling a clearing function to clear the alarm event.

Fig. 3 is a schematic diagram of a hardware structure of an embodiment of the computer device for reporting and repairing the network attached storage failure according to the present invention.

Taking the device shown in fig. 3 as an example, the device includes a processor 301 and a memory 302.

The processor 301 and the memory 302 may be connected by a bus or other means, such as the bus connection in fig. 3.

The memory 302 is used as a non-volatile computer-readable storage medium, and can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the method for reporting and repairing a network attached storage failure in the embodiment of the present application. The processor 301 executes various functional applications and data processing of the server by running the nonvolatile software programs, instructions and modules stored in the memory 302, that is, a method for reporting and repairing a network attached storage failure is realized.

The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a method of network-attached storage failure reporting and repairing, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 302 optionally includes memory located remotely from processor 301, which may be connected to a local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Computer instructions 303 corresponding to one or more methods for reporting and repairing a network attached storage failure are stored in the memory 302, and when executed by the processor 301, perform the method for reporting and repairing a network attached storage failure in any of the above-described method embodiments.

Any embodiment of the computer device executing the method for reporting and repairing the network attached storage failure can achieve the same or similar effects as any corresponding method embodiment.

The invention also provides a computer-readable storage medium, which stores a computer program that performs the method for reporting and repairing a network attached storage fault when executed by a processor.

Fig. 4 is a schematic diagram of an embodiment of a computer storage medium for reporting and repairing the network attached storage failure according to the present invention. Taking the computer storage medium as shown in fig. 4 as an example, the computer readable storage medium 401 stores a computer program 402 which, when executed by a processor, performs the method as described above.

Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes in the methods of the foregoing embodiments may be implemented by instructing relevant hardware by a computer program, and the program of the method for reporting and repairing a network-attached storage failure may be stored in a computer-readable storage medium, and when executed, may include the processes of the foregoing embodiments of the methods. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. A method for reporting and repairing a network attached storage fault, characterized by comprising the following steps:

acquiring an alarm information file stored in a network attached storage, and filling alarm data information in the alarm information file;

sequentially determining whether each alarm event triggers an alarm according to the filled alarm data information;

in response to an alarm event triggering an alarm, calling a reporting function to report the alarm event; and

calling a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event.

2. The method according to claim 1, wherein the calling a reporting function to report the alarm event comprises:

activating the error corresponding to the alarm event in the manager corresponding to the alarm event, and checking whether other managers have activated the error; and

in response to no other manager having activated the error, mapping the error code to the node's real error code and setting an error flag.

3. The method according to claim 1, wherein the method further comprises:

in response to the alarm event not triggering an alarm, calling a clearing function to clear the alarm event.

4. The method according to claim 3, wherein the calling a clearing function to clear the alarm event comprises:

clearing the error code information in the cache, and determining whether the error code is a preset value; and

in response to the error code being the preset value, clearing the current mode of the platform main process and setting the platform main process to normal mode.

5. A system for reporting and repairing network-attached storage faults, comprising:

an acquisition module, configured to acquire an alarm information file stored in a network attached storage, and fill alarm data information in the alarm information file;

a judging module, configured to sequentially determine whether each alarm event triggers an alarm according to the filled alarm data information;

a reporting module, configured to call a reporting function to report an alarm event in response to the alarm event triggering an alarm; and

a repair module, configured to call a repair function in a failure mode library to repair the alarm event according to the identifier of the alarm event.

6. The system according to claim 5, wherein the reporting module is configured to:

7. The system according to claim 5, wherein the system further comprises a clearing module configured to:

8. The system of claim 7, wherein the clearing module is further configured to:

9. A computer device, characterized by comprising:

at least one processor; and

a memory storing computer instructions executable on the processor, the instructions implementing the steps of the method of any one of claims 1-4 when executed by the processor.

10. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the steps of the method of any one of claims 1-4 are implemented.