CN114911578A

CN114911578A - Storage system monitoring and fault collection method, device, terminal and storage medium

Info

Publication number: CN114911578A
Application number: CN202210589713.0A
Authority: CN
Inventors: 王福军
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2022-05-27
Filing date: 2022-05-27
Publication date: 2022-08-16

Abstract

The invention relates to the field of storage system monitoring, and specifically discloses a storage system monitoring and fault collection method, device, terminal and storage medium. A monitoring server is built so that it communicates with the storage system; the monitoring server logs in to the storage system, accesses it periodically, and queries the storage system status; when the storage system status is abnormal, dump file collection is triggered; the cause of the storage system fault is then analyzed and located based on the data in the collected dump file. By building a monitoring server and collecting dump information or OSES information promptly when a fault occurs, the invention ensures that the collected information includes the fault information, so that the fault can be analyzed and located, avoiding the situation in which a fault cannot be reproduced or is difficult to reproduce.

Description

Storage system monitoring and fault collection method, device, terminal and storage medium

Technical field

The invention relates to the field of storage system monitoring, and in particular to a storage system monitoring and fault collection method, device, terminal and storage medium.

Background

During testing, testers cannot watch the storage system continuously, and some fault injection or repeated tests that run for a long time must be scheduled by scripts, so the storage system status cannot be checked frequently. Meanwhile, the storage system logs grow over time and may be overwritten, so that when an anomaly is discovered, the logs from the time the problem occurred can no longer be viewed. Some faults occur only with a certain probability and cannot be reproduced every time, so once the information from the moment of the fault is missed, reproducing the fault costs a great deal of labor and time.

Summary of the invention

To solve the above problems, the present invention provides a storage system monitoring and fault collection method, device, terminal and storage medium, which collect all the required information after a fault occurs so that the fault can be analyzed and located.

In a first aspect, the technical solution of the present invention provides a storage system monitoring and fault collection method, comprising the following steps:

building a monitoring server so that the monitoring server communicates with the storage system;

logging in to the storage system, accessing the storage system periodically, and querying the storage system status;

triggering dump file collection when the storage system status is abnormal;

analyzing and locating the cause of the storage system fault based on the data in the collected dump file.

Further, the monitoring server is connected to each controller of the storage system through a serial port;

the method further comprises the following steps:

if the storage system cannot be logged in to, making one login attempt at every preset time interval;

if the storage system still cannot be logged in to after a preset number of login attempts, entering the chassis management service of each controller;

under the chassis management service, querying specified information through commands and recording it;

analyzing and locating the cause of the storage system fault based on the recorded specified information.

Further, when the storage system can be logged in to normally, the period for accessing the storage system is the same as the fault injection period of the storage system.

Further, the queried storage system status includes the cluster status and alarm events;

an abnormal storage system status includes the cluster status not matching expectations or an unexpected alarm event being generated.

In a second aspect, the technical solution of the present invention provides a storage system monitoring and fault collection device, characterized in that a monitoring server is built so that the monitoring server communicates with the storage system;

the device includes:

a login module, which logs in to the storage system;

a status query module, which periodically accesses the storage system and queries the storage system status;

a file collection trigger module, which triggers dump file collection when the storage system status is abnormal;

a first fault analysis and location module, which analyzes and locates the cause of the storage system fault based on the data in the collected dump file.

If the login module cannot log in to the storage system, it makes one login attempt at every preset time interval;

the device further includes:

a chassis management service entry module, which enters the chassis management service of each controller if the storage system still cannot be logged in to after a preset number of login attempts;

a specified information query and record module, which queries specified information through commands under the chassis management service and records it;

a second fault analysis and location module, which analyzes and locates the cause of the storage system fault based on the recorded specified information.

Further, the period at which the status query module accesses the storage system is the same as the fault injection period of the storage system.

Further, the storage system status queried by the status query module includes the cluster status and alarm events.

In a third aspect, the technical solution of the present invention provides a terminal, including:

a memory, configured to store a storage system monitoring and fault collection program;

a processor, configured to implement the steps of the storage system monitoring and fault collection method described in any one of the above when executing the storage system monitoring and fault collection program.

In a fourth aspect, the technical solution of the present invention provides a computer-readable storage medium on which a storage system monitoring and fault collection program is stored; when executed by a processor, the program implements the steps of the storage system monitoring and fault collection method described in any one of the above.

Compared with the prior art, the storage system monitoring and fault collection method, device, terminal and storage medium provided by the present invention have the following beneficial effects: a monitoring server is built, and dump information or OSES information is collected promptly when a fault occurs; the collected information contains the fault information, so that the fault can be analyzed and located, avoiding the situation in which a fault cannot be reproduced or is difficult to reproduce.

Description of the drawings

To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a storage system monitoring and fault collection method according to an embodiment of the present invention.

FIG. 2 is a schematic structural block diagram of a storage system monitoring and fault collection device according to an embodiment of the present invention.

FIG. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed description

The English terms used in the present invention are explained below.

Dump: a dump file is a memory image of a process; the execution state of a program can be saved to a dump file through a debugger.

OSES: short for Organic SAS Enclosure Service, a unified SAS chassis service. As the whole-chassis management module of a storage device, OSES can both monitor the running status of the device in real time and interact with and manage the system modules of the storage device. SAS is short for Serial Attached SCSI, a serial connection interface.

LDBE: a tool for viewing dump information.

To help those skilled in the art better understand the solution of the present application, the application is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments in the present application without creative effort fall within the protection scope of the present application.

The core of the present invention addresses the problem that storage system logs may be overwritten, leaving fault information unobtainable. A monitoring server is created to monitor the status of the storage system; when the storage system fails, dump files are actively collected or relevant information is recorded through OSES, so that the fault can be analyzed and located, avoiding the situation in which the logs are overwritten and the fault can no longer be analyzed.

FIG. 1 is a schematic flowchart of a storage system monitoring and fault collection method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps.

S101: build a monitoring server so that the monitoring server communicates with the storage system.

A monitoring server is built in advance to monitor the storage system and issue commands. It can be understood that communication between the monitoring server and the storage system allows the monitoring server both to monitor the storage system and to collect data from it.

The monitoring server may be a Linux server.

S102: log in to the storage system, access the storage system periodically, and query the storage system status.

The monitoring server first logs in to the storage system. Password-free login from the monitoring server to the storage system can be configured; password login can of course also be set up according to the specific situation and user needs. The choice of login method does not affect the implementation of the embodiments of the present application.

The monitoring server accesses the storage system periodically, that is, once every fixed interval, queries the storage system status, and determines whether the storage system is abnormal.
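The periodic access in S102 can be pictured as a simple polling loop. The following is a minimal sketch rather than the patent's implementation: the function name `poll_status` and the injected callables `query_status` and `on_status` are hypothetical, since the source does not specify how the status query is issued.

```python
import time
from typing import Callable, Optional


def poll_status(query_status: Callable[[], dict],
                on_status: Callable[[dict], None],
                period_s: float,
                max_cycles: Optional[int] = None,
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Query the storage system once per period and pass each result on.

    query_status and on_status are hypothetical hooks supplied by the caller.
    max_cycles=None polls forever; the return value is the number of cycles run.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        on_status(query_status())
        cycles += 1
        # Wait one period before the next query, but not after the last one.
        if max_cycles is None or cycles < max_cycles:
            sleep(period_s)
    return cycles
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in a long-running test the default `time.sleep` would be used.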

S103: when the storage system status is abnormal, trigger dump file collection.

The storage system provides a CLI command for collecting dump files. When the monitoring server finds that the storage system is abnormal, it triggers the storage system to issue that CLI command and collect the dump files.
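A minimal sketch of the trigger in S103 follows. The source does not name the dump-collection CLI command, so `dump_command` is left as a caller-supplied placeholder, and `run_cli` is a hypothetical hook that runs a command on the storage system and returns its output.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


def trigger_dump_if_abnormal(abnormal: bool,
                             run_cli: Callable[[str], str],
                             dump_command: str,
                             now: Callable[[], datetime] = _utc_now
                             ) -> Optional[dict]:
    """Issue the dump-collection command only when the status is abnormal.

    Returns a small record of the trigger, or None if nothing was collected.
    run_cli and dump_command are assumptions, not part of the source.
    """
    if not abnormal:
        return None
    triggered_at = now()
    output = run_cli(dump_command)
    return {
        "triggered_at": triggered_at.isoformat(),
        "command": dump_command,
        "output": output,
    }
```

Recording the trigger time alongside the command output is a design choice of this sketch; it makes it easier to match a collected dump to the fault that caused it.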

S104: analyze and locate the cause of the storage system fault based on the data in the collected dump file.

It can be understood that the collected dump file is transferred to the monitoring server, which analyzes the data in the dump file to determine and locate the cause of the storage system fault.

When an unexpected failure or other problem occurs in the storage system, the dump file is used to analyze the runtime state of the process and thereby the cause of the problem. A dump can be triggered manually or collected automatically by the system. The storage system triggers dump log collection when it encounters an unexpected restart or certain other service failures, and the LDBE tool can be used to view the dump so that maintenance or R&D personnel can analyze the cause of the problem.

In the storage system monitoring and fault collection method provided by this embodiment of the present invention, a monitoring server is built, and dump information or OSES information is collected promptly when a fault occurs. The collected information contains the fault information, so that the fault can be analyzed and located, avoiding the situation in which a fault cannot be reproduced or is difficult to reproduce.

On the basis of the above embodiment, as a preferred implementation, when the storage system cannot be logged in to and dump files therefore cannot be collected, the method records relevant information through the chassis management service (OSES) for fault analysis and location, as follows.

It should be noted that, after the monitoring server is created, it is connected to each controller of the storage system through a serial port, so that when the storage system later cannot be logged in to, the chassis management service of each controller can be entered to record relevant information. The storage system manages its chassis through the chassis management service. During use or testing of the storage system, some faults may prevent the storage system from starting, so that dump files cannot be collected and the problem cannot be located. In that case it is necessary to log in to OSES and check the status of the hardware or the underlying software in order to analyze why the storage system cannot start.

When the storage system cannot be logged in to, one login attempt is first made at every fixed interval. If login still fails after a certain number of attempts, the chassis management service of each controller is entered, and under the chassis management service the specified information is queried through commands and recorded. For example, one login attempt is made every 30 s, and the attempts time out after 20 tries. The interval and the number of attempts can of course be set according to the actual situation and user needs; the values chosen do not affect the implementation of the embodiments of the present application.
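The retry-then-fallback behaviour described above can be sketched as follows. This is an illustration under assumptions: `try_login` and `record_chassis_info` are hypothetical hooks, and whether the very first failed login counts toward the 20 attempts is not stated in the source, so here every call to `try_login` is counted.

```python
import time
from typing import Callable, Dict, Iterable


def login_with_retry(try_login: Callable[[], bool],
                     interval_s: float = 30.0,
                     max_attempts: int = 20,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Attempt login up to max_attempts times, waiting interval_s between tries."""
    for attempt in range(1, max_attempts + 1):
        if try_login():
            return True
        if attempt < max_attempts:
            sleep(interval_s)
    return False  # timed out


def collect_chassis_info_if_unreachable(
        try_login: Callable[[], bool],
        record_chassis_info: Callable[[str], str],
        controllers: Iterable[str],
        **retry_kwargs) -> Dict[str, str]:
    """Fall back to the chassis management service of every controller on timeout.

    Returns an empty dict when login succeeds, otherwise the information
    recorded per controller. record_chassis_info is a hypothetical hook that
    enters the chassis management service over the serial link and queries it.
    """
    if login_with_retry(try_login, **retry_kwargs):
        return {}
    return {name: record_chassis_info(name) for name in controllers}
```

The defaults of 30 s and 20 attempts mirror the example values in the text and would normally be tuned to the test being run.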

On the basis of the above embodiment, as a preferred embodiment, the method further includes:

when the storage system can be logged in to normally, the period for accessing the storage system is the same as the fault injection period of the storage system.

When the monitoring server can log in to the storage system normally and the storage system is undergoing a fault injection test, the period at which the monitoring server accesses the storage system can be the same as the fault injection period. For example, if faults are currently injected once every 30 min, the storage system status is queried at the same frequency, ensuring that the storage status is checked after every fault injection. The access period is of course determined by the test actually being run: if the test is not a periodic fault injection test but, for example, a regular stress test, a monitoring period can be chosen freely. It can be understood that the specific value of the monitoring period does not affect the implementation of the embodiments of the present invention.

On the basis of the above embodiment, as a preferred embodiment, what the monitoring server checks when it queries the storage system status can also be determined by the test content. For example, the queried storage system status includes the cluster status and alarm events; correspondingly, an abnormal storage system status includes the cluster status not matching expectations or an unexpected alarm event being generated. When either condition occurs, the storage system is triggered to collect dump files.
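The abnormality rule above reduces to two checks, sketched below. The status values and alarm names are hypothetical, and the idea that a test supplies a set of alarms it expects (for instance, those caused by the injected fault itself) is an assumption of this sketch, not something the source spells out.

```python
from typing import Iterable, Set


def is_abnormal(cluster_state: str,
                expected_cluster_state: str,
                alarm_events: Iterable[str],
                expected_alarms: Set[str]) -> bool:
    """Return True if the cluster state differs from the expected state,
    or if any alarm event is not in the set of expected alarms."""
    if cluster_state != expected_cluster_state:
        return True
    return any(event not in expected_alarms for event in alarm_events)
```

For example, with a hypothetical expected alarm `"disk_removed"`, a status reporting only that alarm is treated as normal, while an additional `"node_offline"` alarm makes the status abnormal and would trigger dump collection.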

To aid further understanding of the present invention, a specific embodiment is provided below; it includes the following steps.

Step 1: build a Linux server to monitor the storage system and issue commands. Password-free login from the Linux server to the storage system needs to be configured.

Step 2: connect every controller of the storage system by serial cable through a serial port server, so that logs can be collected when the storage system cannot start.

Step 3: a monitoring script on the Linux server periodically accesses the storage system and queries its status.

The access period is determined by the test actually being run. For example, if faults are currently injected once every 30 min, the storage status is queried at the same frequency, ensuring that the storage status is checked after every fault injection. What the storage status check covers can also be determined by the test content, for example the usual cluster status and alarm events.

If the test is not a periodic fault injection test but, for example, a regular stress test, a monitoring period can be chosen freely.

Step 4: if the queried storage status does not match expectations, or an unexpected alarm event is generated, livedump collection is triggered through a CLI command.

Step 5: if the storage system cannot be logged in to during a status query, one attempt is made every 30 s (the interval is configurable), and the attempts time out after 20 tries.

Step 6: after the timeout, connect to each controller over the serial port through the serial port server, and then enter OSES through a command.

Step 7: at the OSES command line, query the specified information through commands and record it.

The information collected in steps 4 and 7 makes it possible to locate the root cause of the fault.
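Steps 4 and 7 are the two alternative collection paths of one monitoring cycle, which the sketch below makes explicit. It assumes the outcome of the login attempts in step 5 is already known and passed in as `logged_in`; all callables and the returned labels are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable


def handle_cycle(logged_in: bool,
                 query_status: Callable[[], dict],
                 is_abnormal: Callable[[dict], bool],
                 collect_livedump: Callable[[], None],
                 record_oses_info: Callable[[str], None],
                 controllers: Iterable[str]) -> str:
    """Decide what to collect in one monitoring cycle.

    - Login timed out: record OSES information for every controller (steps 6-7).
    - Logged in and status abnormal: trigger livedump collection (step 4).
    - Logged in and status normal: collect nothing.
    """
    if not logged_in:
        for name in controllers:
            record_oses_info(name)
        return "oses_recorded"
    if is_abnormal(query_status()):
        collect_livedump()
        return "livedump_collected"
    return "healthy"
```

Returning a label per cycle is a choice of this sketch; it lets the monitoring script log which collection path, if any, was taken.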

An embodiment of the storage system monitoring and fault collection method has been described in detail above. Based on that method, an embodiment of the present invention further provides a storage system monitoring and fault collection device for implementing the method.

FIG. 2 is a schematic structural block diagram of a storage system monitoring and fault collection device according to an embodiment of the present invention. As shown in FIG. 2, the device includes a login module 101, a status query module 102, a file collection trigger module 103, a first fault analysis and location module 104, a chassis management service entry module 105, a specified information query and record module 106, and a second fault analysis and location module 107.

A monitoring server is built so that the monitoring server communicates with the storage system. The monitoring server is connected to each controller of the storage system through a serial port.

Login module 101: logs in to the storage system.

Status query module 102: periodically accesses the storage system and queries the storage system status.

File collection trigger module 103: triggers dump file collection when the storage system status is abnormal.

First fault analysis and location module 104: analyzes and locates the cause of the storage system fault based on the data in the collected dump file.

If the login module 101 cannot log in to the storage system, it makes one login attempt at every preset time interval. The storage system status queried by the status query module 102 includes the cluster status and alarm events; an abnormal storage system status includes the cluster status not matching expectations or an unexpected alarm event being generated.

Chassis management service entry module 105: if the storage system still cannot be logged in to after a preset number of login attempts, enters the chassis management service of each controller.

Specified information query and record module 106: under the chassis management service, queries specified information through commands and records it.

Second fault analysis and location module 107: analyzes and locates the cause of the storage system fault based on the recorded specified information.

The storage system monitoring and fault collection device of this embodiment is used to implement the storage system monitoring and fault collection method described above. Its specific implementation therefore corresponds to the method embodiments given earlier, and reference may be made to the descriptions of the corresponding parts, which are not repeated here.

In addition, because the device of this embodiment implements the method described above, its effects correspond to those of the method and are not repeated here.

FIG. 3 is a schematic structural diagram of a terminal device 300 according to an embodiment of the present invention, which includes a processor 310, a memory 320 and a communication unit 330. When executing the storage system monitoring and fault collection program stored in the memory 320, the processor 310 implements the following:

The present invention builds a monitoring server and collects dump information or OSES information promptly when a fault occurs. The collected information contains the fault information, so that the fault can be analyzed and located, avoiding the situation in which a fault cannot be reproduced or is difficult to reproduce.

In some specific embodiments, when the processor 310 executes the storage system monitoring and fault collection subprogram stored in the memory 320, it can specifically implement the following: if the storage system cannot be logged in to, one login attempt is made at every preset time interval; if the storage system still cannot be logged in to after a preset number of login attempts, the chassis management service of each controller is entered; under the chassis management service, specified information is queried through commands and recorded; the cause of the storage system fault is analyzed and located based on the recorded specified information.

In some specific embodiments, when the processor 310 executes the storage system monitoring and fault collection subprogram stored in the memory 320, it can specifically implement the following: when the storage system can be logged in to normally, the period for accessing the storage system is the same as the fault injection period of the storage system.

In some specific embodiments, when the processor 310 executes the storage system monitoring and fault collection subprogram stored in the memory 320, it can specifically implement the following: the queried storage system status includes the cluster status and alarm events; an abnormal storage system status includes the cluster status not matching expectations or an unexpected alarm event being generated.

该终端装置300包括处理器310、存储器320及通信单元330。这些组件通过一条或多条总线进行通信，本领域技术人员可以理解，图中示出的服务器的结构并不构成对本发明的限定，它既可以是总线形结构，也可以是星型结构，还可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。The terminal device 300 includes a processor 310 , a memory 320 and a communication unit 330 . These components communicate through one or more buses. Those skilled in the art can understand that the structure of the server shown in the figure does not constitute a limitation of the present invention. It can be either a bus structure, a star structure, or a More or fewer components than shown may be included, or some components may be combined, or a different arrangement of components.

其中，该存储器320可以用于存储处理器310的执行指令，存储器320可以由任何类型的易失性或非易失性存储终端或者它们的组合实现，如静态随机存取存储器（SRAM），电可擦除可编程只读存储器（EEPROM），可擦除可编程只读存储器（EPROM），可编程只读存储器（PROM），只读存储器（ROM），磁存储器，快闪存储器，磁盘或光盘。当存储器320中的执行指令由处理器310执行时，使得终端300能够执行以下上述方法实施例中的部分或全部步骤。Wherein, the memory 320 can be used to store the execution instructions of the processor 310, and the memory 320 can be implemented by any type of volatile or non-volatile storage terminal or their combination, such as static random access memory (SRAM), electrical Erasable Programmable Read Only Memory (EEPROM), Erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic Disk or Optical Disk . When the execution instructions in the memory 320 are executed by the processor 310, the terminal 300 is enabled to execute some or all of the steps in the following method embodiments.

处理器310为存储终端的控制中心，利用各种接口和线路连接整个电子终端的各个部分，通过运行或执行存储在存储器320内的软件程序和/或模块，以及调用存储在存储器内的数据，以执行电子终端的各种功能和/或处理数据。所述处理器可以由集成电路(Integrated Circuit，简称IC) 组成，例如可以由单颗封装的IC 所组成，也可以由连接多颗相同功能或不同功能的封装IC而组成。举例来说，处理器310可以仅包括中央处理器(Central Processing Unit，简称CPU)。在本发明实施方式中，CPU可以是单运算核心，也可以包括多运算核心。The processor 310 is the control center of the storage terminal, using various interfaces and lines to connect various parts of the entire electronic terminal, by running or executing the software programs and/or modules stored in the memory 320, and calling the data stored in the memory, To perform various functions of the electronic terminal and/or process data. The processor may be composed of an integrated circuit (Integrated Circuit, IC for short), for example, may be composed of a single packaged IC, or may be composed of a plurality of packaged ICs connected with the same function or different functions. For example, the processor 310 may only include a central processing unit (Central Processing Unit, CPU for short). In the embodiment of the present invention, the CPU may be a single computing core, or may include multiple computing cores.

通信单元330，用于建立通信信道，从而使所述存储终端可以与其它终端进行通信。接收其他终端发送的用户数据或者向其他终端发送用户数据。The communication unit 330 is used for establishing a communication channel, so that the storage terminal can communicate with other terminals. Receive user data sent by other terminals or send user data to other terminals.

The present invention further provides a computer storage medium. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The computer storage medium stores a storage system monitoring and fault collection program which, when executed by a processor, implements the following steps:

In some specific embodiments, when the storage system monitoring and fault collection subprogram stored in the readable storage medium is executed by the processor, the following may be implemented: if the storage system cannot be logged in to, a login attempt is made at every preset interval; if login still fails after a preset number of attempts, the chassis management service of each controller is entered; under the chassis management service, specified information is queried through commands and recorded; and the cause of the storage system fault is analyzed and located based on the recorded information.
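The login fallback described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function `collect_on_login_failure`, the callables `try_login` and `query_chassis_info`, and the default interval and attempt count are all hypothetical names and values chosen for the example. How the chassis management service is actually reached (for example, over the serial connection to each controller) is left to the injected callable.

```python
import time
from typing import Callable, Dict, List


def collect_on_login_failure(
    try_login: Callable[[], bool],             # hypothetical: returns True when login succeeds
    query_chassis_info: Callable[[str], str],  # hypothetical: queries one controller's chassis service
    controllers: List[str],
    retry_interval_s: float = 30.0,            # assumed "preset time" between attempts
    max_attempts: int = 5,                     # assumed "preset number" of attempts
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Retry login; if every attempt fails, record chassis information per controller.

    Returns an empty dict when a login attempt succeeds, otherwise a mapping
    of controller name to the recorded chassis-management output.
    """
    for attempt in range(1, max_attempts + 1):
        if try_login():
            return {}  # login recovered, nothing to collect via the chassis service
        if attempt < max_attempts:
            sleep(retry_interval_s)  # wait the preset interval before the next attempt
    # All attempts failed: fall back to the chassis management service of each controller.
    return {name: query_chassis_info(name) for name in controllers}
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; the returned mapping is what a later analysis step would inspect to locate the fault.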

In some specific embodiments, when the storage system monitoring and fault collection subprogram stored in the readable storage medium is executed by the processor, the following may be implemented: when login to the storage system succeeds, the period at which the storage system is accessed is the same as the fault injection period of the storage system.
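As a small illustration of matching the access period to the fault injection period, the sketch below computes one status query time per injection cycle. The function `poll_schedule` and the `settle_s` delay (time allowed for a fault to take effect before querying) are assumptions made for this example and do not appear in the source.

```python
from typing import List


def poll_schedule(
    start_s: float,
    injection_period_s: float,
    cycles: int,
    settle_s: float = 0.0,  # assumed delay after each injection before querying
) -> List[float]:
    """Return one status-query time per fault-injection cycle.

    The query period equals the injection period, so each injected fault is
    followed by exactly one status check.
    """
    return [start_s + i * injection_period_s + settle_s for i in range(cycles)]
```

With a 60-second injection period and a 10-second settle delay, for instance, consecutive queries are also 60 seconds apart, so no injection cycle goes unobserved.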

In some specific embodiments, when the storage system monitoring and fault collection subprogram stored in the readable storage medium is executed by the processor, the following may be implemented: the queried storage system status includes the cluster status and alarm events; an abnormal storage system status includes the cluster status not matching expectations or an unexpected alarm event being generated.
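The abnormality check described above, together with the dump-collection trigger, might look like the following sketch. The `StorageStatus` type, the state and alarm strings, and the `trigger_dump` callable are hypothetical; the source does not specify how the status is represented or which command starts dump collection.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class StorageStatus:
    """Hypothetical snapshot of one status query."""
    cluster_state: str
    alarm_events: List[str] = field(default_factory=list)


def find_anomalies(status: StorageStatus, expected_state: str,
                   expected_alarms: Set[str]) -> List[str]:
    """List every deviation: a wrong cluster state or an unexpected alarm event."""
    problems = []
    if status.cluster_state != expected_state:
        problems.append(
            f"cluster state {status.cluster_state!r}, expected {expected_state!r}")
    for event in status.alarm_events:
        if event not in expected_alarms:
            problems.append(f"unexpected alarm event {event!r}")
    return problems


def check_and_collect(status: StorageStatus, expected_state: str,
                      expected_alarms: Set[str],
                      trigger_dump: Callable[[List[str]], None]) -> bool:
    """Trigger dump collection when the status is abnormal; return True if triggered."""
    problems = find_anomalies(status, expected_state, expected_alarms)
    if problems:
        trigger_dump(problems)  # collect while the fault information is still present
        return True
    return False
```

Alarms that a fault injection is expected to raise go into `expected_alarms`, so only unexpected events trigger collection.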

Those skilled in the art will clearly understand that the techniques in the embodiments of the present invention can be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, and includes a number of instructions that cause a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, etc.) to perform all or some of the steps of the methods described in the embodiments of the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or of other forms.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

The above discloses only preferred embodiments of the present invention, but the present invention is not limited thereto. Any non-inventive variation that a person skilled in the art can conceive of, and any improvements and refinements made without departing from the principles of the present invention, shall fall within the scope of protection of the present invention.

Claims

1. A storage system monitoring and fault collection method, characterized by comprising the following steps:

building a monitoring server such that the monitoring server communicates with the storage system;

logging in to the storage system, accessing the storage system periodically, and querying the storage system status;

triggering dump file collection when the storage system status is abnormal;

analyzing and locating the cause of the storage system fault based on the data in the collected dump file.

2. The storage system monitoring and fault collection method according to claim 1, wherein the monitoring server is connected to each controller of the storage system through a serial port;

the method further comprises the following steps:

if the storage system cannot be logged in to, making a login attempt at every preset interval;

if login still fails after a preset number of attempts, entering the chassis management service of each controller;

under the chassis management service, querying specified information through commands and recording it;

analyzing and locating the cause of the storage system fault based on the recorded specified information.

3. The storage system monitoring and fault collection method according to claim 2, wherein, when login to the storage system succeeds, the period of accessing the storage system is the same as the fault injection period of the storage system.

4. The storage system monitoring and fault collection method according to claim 3, wherein the queried storage system status includes a cluster status and an alarm event;

an abnormal storage system status includes the cluster status not matching expectations or an unexpected alarm event being generated.

5. A storage system monitoring and fault collection device, characterized in that a monitoring server is set up so that the monitoring server communicates with the storage system;

the device comprises:

a login module, configured to log in to the storage system;

a status query module, configured to periodically access the storage system and query the storage system status;

a file collection trigger module, configured to trigger dump file collection when the storage system status is abnormal;

a first fault analysis and location module, configured to analyze and locate the cause of the storage system fault based on the data in the collected dump file.

6. The storage system monitoring and fault collection device according to claim 5, wherein the monitoring server is connected to each controller of the storage system through a serial port;

if the login module cannot log in to the storage system, a login attempt is made at every preset interval;

the device further comprises:

a chassis management service entry module, configured to enter the chassis management service of each controller if login still fails after a preset number of attempts;

a specified information query and record module, configured to query specified information through commands under the chassis management service and record it;

a second fault analysis and location module, configured to analyze and locate the cause of the storage system fault based on the recorded specified information.

7. The storage system monitoring and fault collection device according to claim 6, wherein the period at which the status query module accesses the storage system is the same as the fault injection period of the storage system.

8. The storage system monitoring and fault collection device according to claim 7, wherein the storage system status queried by the fault query module includes a cluster status and an alarm event;

9. A terminal, characterized by comprising:

a memory, configured to store a storage system monitoring and fault collection program;

a processor, configured to implement the steps of the storage system monitoring and fault collection method according to any one of claims 1-4 when executing the storage system monitoring and fault collection program.

10. A computer-readable storage medium, characterized in that a storage system monitoring and fault collection program is stored on the readable storage medium, and when the storage system monitoring and fault collection program is executed by a processor, the steps of the storage system monitoring and fault collection method according to any one of claims 1-4 are implemented.