CN114911578A - Storage system monitoring and fault collecting method and device, terminal and storage medium - Google Patents
Storage system monitoring and fault collecting method and device, terminal and storage medium
- Publication number
- CN114911578A (application CN202210589713.0A)
- Authority
- CN
- China
- Prior art keywords
- storage system
- fault
- monitoring
- state
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45587—Isolation or security of virtual machine instances
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45591—Monitoring or debugging support
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention relates to the field of storage system monitoring and discloses a storage system monitoring and fault collecting method, device, terminal and storage medium. A monitoring server is set up and put in communication with a storage system; the monitoring server logs in to the storage system, accesses it periodically and queries its state; when the state of the storage system is abnormal, dump file collection is triggered; and the cause of the storage system fault is analyzed and located from the data in the collected dump file. Because the monitoring server collects dump information or OSES information promptly when a fault occurs, the collected information contains the fault information needed for analysis and location, which avoids the predicament of a fault that cannot be reproduced or is difficult to reproduce.
Description
Technical Field
The invention relates to the field of storage system monitoring, in particular to a storage system monitoring and fault collecting method, a storage system monitoring and fault collecting device, a terminal and a storage medium.
Background
During testing, a tester cannot watch the storage system continuously. Fault injection tests or repeated tests may need to run for a long time and are scheduled by scripts, so the state of the storage system cannot be checked frequently. The logs of the storage system grow over time and may be overwritten, so that by the time an abnormality is noticed, the logs from the moment the problem occurred can no longer be viewed. Some faults are probabilistic and do not recur on every run, so once the information from the moment of the fault is missed, reproducing the fault costs a large amount of labor and time.
Disclosure of Invention
In order to solve the above problems, the present invention provides a storage system monitoring and fault collecting method, device, terminal and storage medium that collect all required information after a fault occurs, so that the fault can be analyzed and located.
In a first aspect, a technical solution of the present invention provides a storage system monitoring and fault collecting method, including the following steps:
setting up a monitoring server so that the monitoring server communicates with a storage system;
logging in to the storage system, periodically accessing the storage system, and querying the state of the storage system;
when the state of the storage system is abnormal, triggering dump file collection;
and analyzing and locating the cause of the storage system fault according to the data in the collected dump file.
Further, the monitoring server is connected to each controller of the storage system through a serial port;
the method further comprises the following steps:
if the storage system cannot be logged in to, attempting login at a preset time interval;
if the storage system still cannot be logged in to after a preset number of attempts, entering the chassis management service of each controller;
under the chassis management service, querying specified information by command and recording it;
and analyzing and locating the cause of the storage system fault according to the recorded specified information.
Further, when the storage system can be logged in to normally, the period for accessing the storage system is the same as the fault injection period of the storage system.
Further, the queried storage system state comprises a cluster state and an alarm event;
storage system state anomalies include a cluster state not meeting expectations or an unexpected alarm event being generated.
In a second aspect, the technical solution of the present invention provides a storage system monitoring and fault collecting device, wherein a monitoring server is set up so that the monitoring server communicates with a storage system;
the device comprises:
a login module: logging in to the storage system;
a state query module: periodically accessing the storage system and querying the state of the storage system;
a file collection triggering module: when the state of the storage system is abnormal, triggering dump file collection;
a first fault analysis and location module: analyzing and locating the cause of the storage system fault according to the data in the collected dump file.
Further, the monitoring server is connected to each controller of the storage system through a serial port;
if the login module cannot log in to the storage system, login is attempted at a preset time interval;
the device further comprises:
a chassis management service entry module: if the storage system still cannot be logged in to after a preset number of attempts, entering the chassis management service of each controller;
a specified information query and recording module: under the chassis management service, querying specified information by command and recording it;
a second fault analysis and location module: analyzing and locating the cause of the storage system fault according to the recorded specified information.
Further, the period of accessing the storage system by the state query module is the same as the fault injection period of the storage system.
Further, the storage system state queried by the state query module comprises a cluster state and an alarm event;
storage system state anomalies include a cluster state not meeting expectations or an unexpected alarm event being generated.
In a third aspect, a technical solution of the present invention provides a terminal, including:
a memory for storing a storage system monitoring and fault collecting program;
and a processor for implementing the steps of the storage system monitoring and fault collecting method according to any one of the above when executing the storage system monitoring and fault collecting program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a storage system monitoring and fault collection program is stored on the readable storage medium, and when executed by a processor, the storage system monitoring and fault collection program implements the steps of the storage system monitoring and fault collection method according to any one of the above.
Compared with the prior art, the storage system monitoring and fault collecting method, device, terminal and storage medium provided by the invention have the following beneficial effects: a monitoring server is set up, and dump information or OSES information is collected promptly when a fault occurs; the collected information contains the fault information, so the fault can be analyzed and located, which avoids the predicament of a fault that cannot be reproduced or is difficult to reproduce.
Drawings
To illustrate the embodiments or technical solutions of the present application more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a storage system monitoring and fault collection method according to an embodiment of the present invention.
Fig. 2 is a schematic block diagram of a storage system monitoring and fault collection apparatus according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The English terms used in the present invention are explained below.
Dump: a dump file is a memory image of a process; the execution state of a program can be saved to a dump file through a debugger.
OSES: short for Organic SAS Enclosure Service, a unified SAS chassis service. OSES serves as the whole-chassis management module of the storage device; it can monitor the running state of the device in real time and interact with and manage each system module. SAS is short for Serial Attached SCSI.
LDBE: a tool for viewing dump information.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The core of the invention addresses the problem that storage system logs may be overwritten, leaving fault information unavailable. A monitoring server is set up to monitor the state of the storage system and, when the storage system fails, to actively collect dump files or record related information through OSES, so that the fault can be analyzed and located even if the logs have been overwritten.
Fig. 1 is a schematic flow chart of a storage system monitoring and fault collection method according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps.
S101, a monitoring server is set up so that the monitoring server communicates with the storage system.
A monitoring server is set up in advance and used for monitoring the storage system and issuing instructions. Because the monitoring server communicates with the storage system, it can both monitor the storage system and collect data from it.
The monitoring server can be a Linux server.
S102, logging in to the storage system, periodically accessing the storage system, and querying the state of the storage system.
The monitoring server first logs in to the storage system. Password-free login from the monitoring server to the storage system can be configured; password login can also be used, depending on the specific conditions and user requirements. The choice of login mode does not affect the implementation of the embodiment of the application.
The monitoring server accesses the storage system periodically, that is, once per fixed interval, queries the state of the storage system and judges whether the storage system is abnormal.
S103, triggering dump file collection when the state of the storage system is abnormal.
The storage system provides a CLI command for collecting dump files; when the monitoring server finds that the storage system is abnormal, it triggers collection of the dump file by issuing this CLI command to the storage system.
S104, analyzing and locating the cause of the storage system fault according to the data in the collected dump file.
It can be understood that the collected dump files are transmitted to the monitoring server, and the monitoring server performs storage system failure cause analysis and failure cause location on the data in the dump files.
When the storage system hits an unexpected fault or other problem, the runtime state of the process is analyzed through the dump file to determine the cause. A dump can be triggered manually or collected automatically by the system: the storage system triggers dump log collection on an unexpected restart or other service fault. The dump can be viewed with the LDBE tool so that maintenance or development personnel can analyze the cause of the problem.
According to the storage system monitoring and fault collecting method provided by the embodiment of the invention, the monitoring server is set up, dump information or OSES information is collected in time when a fault occurs, and the collected information contains fault information, so that the fault information is analyzed and positioned, and the dilemma that the fault problem cannot be reproduced or is difficult to reproduce is avoided.
On the basis of the above embodiment, as a preferred implementation, when dump file collection cannot be performed because the storage system cannot be logged in to, the method records relevant information through the chassis management service (OSES) for fault analysis and location, specifically:
if the storage system cannot be logged in to, attempting login at a preset time interval;
if the storage system still cannot be logged in to after a preset number of attempts, entering the chassis management service of each controller;
under the chassis management service, querying specified information by command and recording it;
and analyzing and locating the cause of the storage system fault according to the recorded specified information.
It should be noted that after the monitoring server is set up, it is connected to each controller of the storage system through a serial port, so that when the storage system later cannot be logged in to, the monitoring server can enter the chassis management service of the controller and record relevant information. The storage system manages its chassis through the chassis management service. During use or testing, some faults prevent the storage system from starting, so dump files cannot be collected and the problem cannot be located. In that case, OSES must be logged in to in order to check the state of the hardware or the underlying software and analyze why the storage system cannot start.
When the storage system cannot be logged in to, login is first retried at a fixed interval; if login still fails after a set number of attempts, the chassis management service of each controller is entered and the specified information is queried by command and recorded. For example, a login attempt is made every 30 s and the procedure times out after 20 attempts. The interval and the number of attempts can be set according to the actual situation and user requirements, and the chosen values do not affect the implementation of the embodiment of the present application.
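The retry rule in the preceding paragraph can be sketched as follows. This is only an illustrative sketch: the function name `login_with_retry` and the injected `try_login` callable are hypothetical and stand in for whatever login mechanism the monitoring server actually uses; the defaults of 30 s and 20 attempts are the example values given in the text.

```python
import time
from typing import Callable


def login_with_retry(
    try_login: Callable[[], bool],
    interval_s: float = 30.0,   # example interval from the description
    max_attempts: int = 20,     # example attempt count from the description
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as one login attempt succeeds, False on timeout.

    try_login is a hypothetical callable that performs a single login
    attempt against the storage system and reports success.
    """
    for attempt in range(1, max_attempts + 1):
        if try_login():
            return True
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts:
            sleep(interval_s)
    return False
```

A return value of False corresponds to the timeout case, after which the chassis management service of each controller would be entered over the serial connection.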
On the basis of the above embodiment, as a preferred embodiment, the method further includes:
when the storage system can be logged in to normally, the period for accessing the storage system is the same as the fault injection period of the storage system.
When the monitoring server can log in to the storage system normally and the storage system is undergoing a fault injection test, the period at which the monitoring server accesses the storage system can be the same as the fault injection period. For example, if the current fault injection period is 30 min, the state query keeps the same frequency, so that a state check is performed after each fault injection. The access period is determined by the test actually being run: if a conventional stress test is run instead of a periodic fault injection test, a monitoring period can be chosen freely. The specific monitoring period does not affect the implementation of the embodiment of the present invention.
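As a rough illustration of the period selection described above, the sketch below reuses the fault injection period when one exists and otherwise falls back to a freely chosen period. The names `pick_monitoring_period` and `monitor`, the 300 s fallback, and the `check_once` callable are all assumptions made for this example and do not come from the patent.

```python
import time
from typing import Callable, Optional


def pick_monitoring_period(
    fault_injection_period_s: Optional[float],
    default_period_s: float = 300.0,  # assumed fallback, not from the source
) -> float:
    """Use the fault injection period if one is running, else the fallback."""
    if fault_injection_period_s is not None:
        return fault_injection_period_s
    return default_period_s


def monitor(
    check_once: Callable[[], bool],
    period_s: float,
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run check_once `cycles` times, one check per period.

    check_once is a hypothetical callable returning True when the queried
    state is abnormal. The number of abnormal checks is returned.
    """
    abnormal_count = 0
    for cycle in range(cycles):
        if check_once():
            abnormal_count += 1
        # Sleep between checks, but not after the last one.
        if cycle < cycles - 1:
            sleep(period_s)
    return abnormal_count
```

With a 30 min fault injection period, `pick_monitoring_period(30 * 60)` gives 1800 s, so one state check follows each injection.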
On the basis of the foregoing embodiment, as a preferred embodiment, the items that the monitoring server checks when querying the storage system state can be determined by the test content. For example, the queried storage system state includes a cluster state and alarm events; correspondingly, a storage system state abnormality includes the cluster state not meeting expectations or an unexpected alarm event being generated. When either occurs, dump file collection is triggered on the storage system.
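The abnormality rule above (cluster state not as expected, or an alarm event that was not expected) can be sketched as a small predicate. The `StorageState` record, the string values such as "online", and the idea of passing a set of expected alarms are illustrative assumptions; the patent does not specify how states or alarms are represented.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class StorageState:
    """Hypothetical snapshot of one state query."""
    cluster_state: str
    alarm_events: List[str] = field(default_factory=list)


def is_abnormal(
    state: StorageState,
    expected_cluster_state: str,
    expected_alarms: Set[str],
) -> bool:
    """Return True if the cluster state differs from the expected one,
    or if any alarm event is outside the set of expected alarms."""
    if state.cluster_state != expected_cluster_state:
        return True
    return any(event not in expected_alarms for event in state.alarm_events)
```

Passing the alarms that a fault injection is known to cause as `expected_alarms` keeps those from triggering a dump, while any other alarm does.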
To further understand the present invention, an embodiment is provided below, which further illustrates the present invention and includes the following steps.
Step one: a Linux server is set up for monitoring the storage system and issuing instructions. Password-free login from the Linux server to the storage system needs to be configured.
Step two: each controller of the storage system is connected by a serial cable through a serial port server, so that logs can be collected when the storage system cannot be started.
Step three: a monitoring script on the Linux server periodically accesses the storage system and queries its state.
The access period is determined by the actual test. For example, if fault injection currently runs with a period of 30 min, the storage state query keeps the same frequency, ensuring that a storage state check is performed after each fault injection. Which items the state check covers is likewise determined by the test content, for example the usual cluster state and alarm events.
If the test is not a periodic fault injection test but, for example, a conventional stress test, a monitoring period can be chosen freely.
Step four: if the queried storage state is found to be inconsistent with expectations, or an unexpected alarm event is generated, livedump collection is triggered through a CLI command.
Step five: if the storage system cannot be logged in to when the state query is performed, login is retried once every 30 s (the interval is configurable) and times out after 20 attempts.
Step six: after the timeout, each controller is connected through the serial port server over the serial port, and OSES is then entered by command.
Step seven: under the OSES command line, the specified information is queried by command and recorded.
With the information collected in step four and step seven, the root cause of the fault can be located.
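The branching in steps three to seven can be summarized in a short sketch of one monitoring cycle. All four callables are hypothetical placeholders: the patent does not name the CLI command that triggers livedump collection or the OSES commands that query the specified information, so they are injected here rather than invented.

```python
from typing import Callable


def run_monitoring_cycle(
    login: Callable[[], bool],
    state_is_abnormal: Callable[[], bool],
    collect_dump: Callable[[], None],
    record_oses_info: Callable[[], None],
) -> str:
    """Run one monitoring cycle and report which branch was taken.

    login             -- True if login succeeded (after any retries)
    state_is_abnormal -- True if the queried state is not as expected
    collect_dump      -- triggers dump collection via the storage CLI
    record_oses_info  -- records specified information via OSES over serial
    """
    if not login():
        # Steps five to seven: login timed out, fall back to OSES.
        record_oses_info()
        return "oses_info_recorded"
    if state_is_abnormal():
        # Step four: abnormal state, trigger dump collection.
        collect_dump()
        return "dump_collected"
    return "healthy"
```

Injecting the actions keeps the control flow testable without a real storage system; a monitoring script would call this once per monitoring period.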
The embodiment of the storage system monitoring and fault collecting method is described in detail above, and based on the storage system monitoring and fault collecting method described in the above embodiment, the embodiment of the present invention further provides a storage system monitoring and fault collecting device, which is used for implementing the storage system monitoring and fault collecting method.
Fig. 2 is a schematic block diagram of a storage system monitoring and fault collecting apparatus according to an embodiment of the present invention. As shown in fig. 2, the apparatus includes: a login module 101, a state query module 102, a file collection triggering module 103, a first fault analysis and location module 104, a chassis management service entry module 105, a specified information query and recording module 106 and a second fault analysis and location module 107.
A monitoring server is set up so that the monitoring server communicates with the storage system. The monitoring server is connected to each controller of the storage system through a serial port.
The login module 101: logging in to the storage system.
The state query module 102: periodically accessing the storage system and querying the state of the storage system.
The file collection triggering module 103: when the state of the storage system is abnormal, triggering dump file collection.
The first fault analysis and location module 104: analyzing and locating the cause of the storage system fault according to the data in the collected dump file.
If the login module 101 cannot log in to the storage system, login is attempted at a preset time interval. The storage system state queried by the state query module 102 includes a cluster state and an alarm event; storage system state abnormalities include the cluster state not meeting expectations or an unexpected alarm event being generated.
The chassis management service entry module 105: if the storage system still cannot be logged in to after a preset number of attempts, entering the chassis management service of each controller.
The specified information query and recording module 106: under the chassis management service, querying specified information by command and recording it.
The second fault analysis and location module 107: analyzing and locating the cause of the storage system fault according to the recorded specified information.
The storage system monitoring and fault collecting apparatus of this embodiment is used to implement the foregoing storage system monitoring and fault collecting method. For its specific implementation, refer to the description of the corresponding parts of the method embodiment above, which is not repeated here.
In addition, since the storage system monitoring and fault collecting device of the embodiment is used for implementing the storage system monitoring and fault collecting method, the function of the storage system monitoring and fault collecting device corresponds to that of the method, and details are not described here.
Fig. 3 is a schematic structural diagram of a terminal device 300 according to an embodiment of the present invention, including: a processor 310, a memory 320, and a communication unit 330. The processor 310 is configured to execute the storage system monitoring and fault collection program stored in the memory 320, and implement the following steps:
logging in to a storage system, periodically accessing the storage system, and querying the state of the storage system;
when the state of the storage system is abnormal, triggering dump file collection;
and analyzing and locating the cause of the storage system fault according to the data in the collected dump file.
In the invention, a monitoring server is set up and dump information or OSES information is collected promptly when a fault occurs; the collected information contains the fault information, so the fault can be analyzed and located, which avoids the predicament of a fault that cannot be reproduced or is difficult to reproduce.
In some embodiments, when the processor 310 executes the storage system monitoring and fault collection subroutine stored in the memory 320, the following steps may be specifically implemented: if the storage system cannot be logged in to, attempting login at a preset time interval; if the storage system still cannot be logged in to after a preset number of attempts, entering the chassis management service of each controller; under the chassis management service, querying specified information by command and recording it; and analyzing and locating the cause of the storage system fault according to the recorded specified information.
In some embodiments, when the processor 310 executes the storage system monitoring and fault collection subroutine stored in the memory 320, the following steps may be specifically implemented: when the storage system is normally logged in, the period for accessing the storage system is the same as the fault injection period of the storage system.
In some embodiments, when the processor 310 executes the storage system monitoring and fault collection subroutine stored in the memory 320, the following steps may be specifically implemented: the inquired storage system state comprises a cluster state and an alarm event; storage system state anomalies include a cluster state not meeting expectations or an unexpected alarm event being generated.
The terminal device 300 includes a processor 310, a memory 320, and a communication unit 330. The components communicate via one or more buses, and those skilled in the art will appreciate that the architecture of the servers shown in the figures is not intended to be limiting, and may be a bus architecture, a star architecture, a combination of more or less components than those shown, or a different arrangement of components.
The memory 320 may be used for storing instructions executed by the processor 310, and the memory 320 may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The executable instructions in memory 320, when executed by processor 310, enable terminal 300 to perform some or all of the steps in the method embodiments described above.
The processor 310 is a control center of the storage terminal, connects various parts of the entire electronic terminal using various interfaces and lines, and performs various functions of the electronic terminal and/or processes data by operating or executing software programs and/or modules stored in the memory 320 and calling data stored in the memory. The processor may be composed of an Integrated Circuit (IC), for example, a single packaged IC, or a plurality of packaged ICs connected with the same or different functions. For example, the processor 310 may include only a Central Processing Unit (CPU). In the embodiment of the present invention, the CPU may be a single operation core, or may include multiple operation cores.
The communication unit 330 is configured to establish a communication channel so that the storage terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.
The present invention also provides a computer storage medium, wherein the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
The computer storage medium stores a storage system monitoring and fault collection program, which when executed by a processor implements the steps of:
logging in to the storage system, periodically accessing the storage system, and querying the state of the storage system;
when the state of the storage system is abnormal, triggering dump file collection;
and analyzing and locating the cause of the storage system fault according to the data collected in the dump file.
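The three steps above can be sketched as a polling loop. This is a minimal illustration only, not the program claimed here: `monitor_storage` and the hooks `query_state`, `is_abnormal`, and `collect_dump` are hypothetical names, and the choice to keep polling after a dump has been collected is an assumption that the text does not state.

```python
import time
from typing import Callable, List


def monitor_storage(
    query_state: Callable[[], dict],
    is_abnormal: Callable[[dict], bool],
    collect_dump: Callable[[dict], str],
    period_s: float,
    max_cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll the storage system and collect a dump whenever its state is abnormal.

    Every callable is a hypothetical hook supplied by the caller; the
    description does not define these interfaces.
    """
    dumps: List[str] = []
    for cycle in range(max_cycles):
        state = query_state()  # periodic access and state query
        if is_abnormal(state):
            # Abnormal state: trigger dump collection and remember where
            # the dump was stored for later fault analysis.
            dumps.append(collect_dump(state))
        if cycle < max_cycles - 1:
            sleep(period_s)  # wait one monitoring period between queries
    return dumps
```

The returned list of dump identifiers is what a later analysis step would consume. Passing `sleep` as a parameter lets the loop be exercised without real delays.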
In the present invention, a monitoring server is set up so that dump information or OSES information is collected promptly when a fault occurs. Because the collected information contains the fault information, the fault can be analyzed and located, which avoids the awkward situation in which a fault cannot be reproduced or is difficult to reproduce.
In some embodiments, when the storage system monitoring and fault collection program stored in the readable storage medium is executed by the processor, the following steps may further be implemented: if the storage system cannot be logged in to, attempting login at preset time intervals; if login still fails after a preset number of attempts, entering the chassis management service of each controller; under the chassis management service, querying specified information by command and recording it; and analyzing and locating the cause of the storage system fault according to the recorded specified information.
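The retry-then-fallback behavior described above might look like the following sketch. All names are hypothetical: `try_login` stands for whatever login mechanism is used and is assumed to return `None` on failure, and `query_chassis_info` stands for entering a controller's chassis management service and issuing the query command, whose actual syntax is not given in the text.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple


def login_or_fall_back(
    try_login: Callable[[], Optional[object]],
    query_chassis_info: Callable[[str], str],
    controllers: Iterable[str],
    retry_interval_s: float,
    max_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[object], Dict[str, str]]:
    """Try to log in up to max_attempts times, then fall back to the chassis service.

    Returns (session, {}) on success, or (None, records) where records maps
    each controller name to the information recorded from its chassis
    management service.
    """
    for attempt in range(max_attempts):
        session = try_login()
        if session is not None:
            return session, {}
        if attempt < max_attempts - 1:
            sleep(retry_interval_s)  # preset interval between login attempts
    # Login failed the preset number of times: query every controller
    # through its chassis management service and record the result.
    records = {name: query_chassis_info(name) for name in controllers}
    return None, records
```

The recorded dictionary plays the role of the "specified information" that is then used to analyze and locate the fault.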
In some embodiments, when the storage system monitoring and fault collection program stored in the readable storage medium is executed by the processor, the following may further be implemented: when the storage system can be logged in to normally, the period for accessing the storage system is the same as the fault injection period of the storage system.
In some embodiments, when the storage system monitoring and fault collection program stored in the readable storage medium is executed by the processor, the following may further be implemented: the queried storage system state includes a cluster state and alarm events; a storage system state anomaly includes the cluster state not meeting expectations or an unexpected alarm event being generated.
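As an illustration of the anomaly criterion above, the sketch below compares a queried state against expectations. The function name, the representation of cluster states and alarm events as strings, and the idea of passing in the set of alarms expected for the current test are all assumptions made for the example.

```python
from typing import Iterable, List, Set


def find_anomalies(
    cluster_state: str,
    expected_states: Set[str],
    alarm_events: Iterable[str],
    expected_alarms: Set[str],
) -> List[str]:
    """Return a description of each anomaly; an empty list means the state is normal."""
    anomalies: List[str] = []
    # Criterion 1: the cluster state does not meet expectations.
    if cluster_state not in expected_states:
        anomalies.append(f"cluster state {cluster_state!r} not in expected set")
    # Criterion 2: an alarm event was generated that was not expected.
    for event in alarm_events:
        if event not in expected_alarms:
            anomalies.append(f"unexpected alarm event {event!r}")
    return anomalies
```

A non-empty result would be the condition under which dump file collection is triggered. Alarms that an injected fault is expected to raise can be listed in `expected_alarms` so that they do not count as anomalies.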
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented in software together with the necessary general-purpose hardware platform. On this understanding, the technical solutions of the embodiments may be embodied as a software product stored in a storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk. The software product includes instructions that enable a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, or the like) to perform all or part of the steps of the method in the embodiments of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice. For example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The above disclosure is only for the preferred embodiments of the present invention, but the present invention is not limited thereto, and any non-inventive changes that can be made by those skilled in the art and several modifications and amendments made without departing from the principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A storage system monitoring and fault collection method is characterized by comprising the following steps:
setting up a monitoring server so that the monitoring server communicates with a storage system;
logging in to the storage system, periodically accessing the storage system, and querying the state of the storage system;
when the state of the storage system is abnormal, triggering dump file collection;
and analyzing and locating the cause of the storage system fault according to the data collected in the dump file.
2. The storage system monitoring and fault collection method according to claim 1, wherein the monitoring server is connected to each controller of the storage system through a serial port;
the method further comprises the following steps:
if the storage system cannot be logged in to, attempting login at preset time intervals;
if login still fails after a preset number of attempts, entering the chassis management service of each controller;
under the chassis management service, querying specified information by command and recording it;
and analyzing and locating the cause of the storage system fault according to the recorded specified information.
3. The storage system monitoring and fault collection method according to claim 2, wherein when the storage system is normally logged in, the period of accessing the storage system is the same as the fault injection period of the storage system.
4. The storage system monitoring and fault collection method according to claim 3, wherein the queried storage system states include cluster states, alarm events;
storage system state anomalies include a cluster state not meeting expectations or an unexpected alarm event being generated.
5. A storage system monitoring and fault collecting device, characterized in that a monitoring server is set up so that the monitoring server communicates with a storage system;
the device comprises:
a login module, configured to log in to the storage system;
a state query module, configured to periodically access the storage system and query the state of the storage system;
a file collection triggering module, configured to trigger dump file collection when the state of the storage system is abnormal;
a first fault analysis positioning module, configured to analyze and locate the cause of the storage system fault according to the data collected in the dump file.
6. The storage system monitoring and fault collection apparatus of claim 5, wherein the monitoring server is connected to each controller of the storage system via a serial port;
if the login module cannot log in to the storage system, it attempts login at preset time intervals;
the device further comprises:
a chassis management service entering module, configured to enter the chassis management service of each controller if login still fails after a preset number of attempts;
a specified information query recording module, configured to query specified information by command under the chassis management service and record it;
a second fault analysis positioning module, configured to analyze and locate the cause of the storage system fault according to the recorded specified information.
7. The storage system monitoring and fault collection apparatus of claim 6, wherein the access period of the status query module to the storage system is the same as the fault injection period of the storage system.
8. The storage system monitoring and fault collection device according to claim 7, wherein the storage system state queried by the state query module includes a cluster state and alarm events;
storage system state anomalies include a cluster state not meeting expectations or an unexpected alarm event being generated.
9. A terminal, comprising:
a memory for storing a storage system monitoring and fault collection program;
a processor for implementing the steps of the storage system monitoring and fault collection method according to any one of claims 1 to 4 when executing the storage system monitoring and fault collection program.
10. A computer-readable storage medium, wherein the computer-readable storage medium has stored thereon a storage system monitoring and fault collection program which, when executed by a processor, implements the steps of the storage system monitoring and fault collection method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210589713.0A CN114911578A (en) | 2022-05-27 | 2022-05-27 | Storage system monitoring and fault collecting method and device, terminal and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114911578A (en) | 2022-08-16 |
Family
ID=82768769
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210589713.0A Pending CN114911578A (en) | 2022-05-27 | 2022-05-27 | Storage system monitoring and fault collecting method and device, terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114911578A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115827413A (en) * | 2023-02-14 | 2023-03-21 | 北京大道云行科技有限公司 | Storage monitoring system and method based on large-page memory |
CN115827413B (en) * | 2023-02-14 | 2023-04-18 | 北京大道云行科技有限公司 | Storage monitoring system and method based on large-page memory |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6167358A (en) | System and method for remotely monitoring a plurality of computer-based systems | |
CN114077525A (en) | Abnormal log processing method and device, terminal equipment, cloud server and system | |
CN111881014B (en) | System test method, device, storage medium and electronic equipment | |
CN114328102A (en) | Equipment state monitoring method, device, equipment and computer readable storage medium | |
CN115033419B (en) | Method and system for realizing hardware fault self-healing | |
CN113708986B (en) | Server monitoring apparatus, method and computer-readable storage medium | |
CN110515799B (en) | MySQL monitoring system based on python language and implementation method | |
US20150370619A1 (en) | Management system for managing computer system and management method thereof | |
CN115658420A (en) | Database monitoring method and system | |
CN114911578A (en) | Storage system monitoring and fault collecting method and device, terminal and storage medium | |
CN108809729A (en) | The fault handling method and device that CTDB is serviced in a kind of distributed system | |
CN115080340A (en) | Method, system, computer device and storage medium for monitoring floppy disk array | |
CN114116330A (en) | Server performance test method, system, terminal and storage medium | |
CN114090382B (en) | Health inspection method and device for super-converged cluster | |
CN115827298A (en) | Server startup fault positioning method and device, terminal and storage medium | |
CN115470056A (en) | Method, system, device and medium for troubleshooting power-on starting of server hardware | |
CN109686017A (en) | A kind of tax controlling equipment management method and system | |
CN114138600A (en) | Storage method, device, equipment and storage medium for firmware key information | |
CN114970476A (en) | Data processing method, system, electronic device and storage medium | |
CN115129544B (en) | Out-of-band one-key acquisition method, system and device for RAID (redundant array of independent disks) logs and storage medium | |
JP2001216166A (en) | Maintenance control method for information processor, information processor, creating method for software and software | |
US20090192818A1 (en) | Systems and method for continuous health monitoring | |
CN111858528B (en) | BMC log collection and management method, system, terminal and storage medium | |
CN115550158A (en) | Processing method, processing device and electronic equipment | |
CN118747165A (en) | Method, device, computer equipment and storage medium for reading log data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||