CN112069014A - Storage system fault simulation method, device, equipment and medium - Google Patents

Publication number
CN112069014A
Authority
CN
China
Prior art keywords
node
nodes
capacity
preset
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010889696.3A
Other languages
Chinese (zh)
Other versions
CN112069014B (en)
Inventor
孙京本
李佩
刘如意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010889696.3A
Publication of CN112069014A
Application granted
Publication of CN112069014B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/26 Functional testing
    • G06F11/261 Functional testing by simulating additional hardware, e.g. fault simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a storage system fault simulation method, device, equipment and medium. The method comprises the following steps: constructing a cluster to be tested, wherein the cluster to be tested comprises a plurality of nodes, the plurality of nodes comprise a main node and at least one non-main node, and each node is configured with at least one first node storage pool; controlling the plurality of nodes to simulate IO read-write to each first node storage pool; monitoring a first usage capacity of each first node storage pool; controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity; and updating the state change of each node in the plurality of nodes to a node log configured for each node, wherein the node log provides a basis for evaluating the stress resistance of the storage system installed on the node. The method can simulate more fault types at low cost, with high efficiency and high accuracy, so as to judge whether the storage system is highly available and whether it has the stress resistance required by the corresponding services.

Description

Storage system fault simulation method, device, equipment and medium
Technical Field
The present invention relates to the field of storage cluster technologies, and in particular, to a method, an apparatus, a device, and a medium for simulating a storage system fault.
Background
A storage system is the part of a computer made up of the various storage devices that hold programs and data, the control units, and the hardware and software (algorithms) that manage information scheduling. A computer's main memory cannot simultaneously satisfy the requirements of fast access, large capacity and low cost, so the computer needs a multi-level memory hierarchy, ranging from slow to fast and from large to small, which, together with a well-chosen control and scheduling algorithm, forms a storage system with acceptable performance at reasonable cost.
In practical applications, data is distributed across the nodes of a storage cluster to increase the storage capacity that the storage system supports. A storage cluster aggregates the storage space of a plurality of physical volumes (such as magnetic disks or hard disks) into a storage pool that offers clients a unified access interface, through which the clients can access and use the storage space on the storage cluster.
After a storage cluster is built, its availability needs to be tested. The availability test method provided in the related art simulates storage cluster failures manually, that is, the state of a node is changed by hand in order to find the defects and shortcomings of the storage cluster, and the storage cluster is then improved accordingly. However, because the storage space of the storage cluster is large, one complete simulation occupies the storage cluster for an extremely long time; moreover, the number of nodes in the cluster may be large, and many human errors may occur when nodes are manually taken offline and brought online. Manual fault simulation therefore has the drawbacks of high labor cost, low efficiency, low accuracy and a limited range of simulated fault types, so a method and a device capable of automatically simulating storage cluster faults are needed.
Disclosure of Invention
The embodiments of the application provide a storage system fault simulation method, device, equipment and medium. They solve the technical problems of the prior art, in which storage cluster faults are simulated manually at high labor cost, with low efficiency and low accuracy, and with a limited range of fault types, and they achieve the technical effect of simulating more fault types at low cost, with high efficiency and high accuracy.
In a first aspect, the present application provides a storage system fault simulation method, including:
constructing a cluster to be tested; the cluster to be tested comprises a plurality of nodes; the plurality of nodes comprise a main node and at least one non-main node; each node is configured with at least one first node storage pool;
controlling a plurality of nodes to simulate IO read-write to each first node storage pool;
monitoring a first usage capacity of each first node storage pool;
controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity;
updating the state change of each node in the plurality of nodes to a node log configured for each node, wherein the node log provides a basis for evaluating the stress resistance of the storage system installed on the node; the state change of each node is generated by at least one node in the plurality of nodes simulating and executing the preset action.
Further, the preset actions comprise an offline action and an online action; controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity specifically comprises:
judging whether the variation of the first usage capacity exceeds a first preset capacity;
when the variation of the first usage capacity exceeds the first preset capacity, selecting one non-main node from the non-main nodes as a first non-main node, and triggering the first non-main node to simulate and execute the offline action; and, after a preset time threshold, triggering the first non-main node to simulate and execute the online action.
Further, the preset action comprises a restart action; controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity specifically comprises:
judging whether the first usage capacity exceeds a second preset capacity;
when the first usage capacity exceeds the second preset capacity, triggering the main node to simulate and execute the restart action, and selecting one non-main node from the non-main nodes as the new main node of the cluster to be tested; and, after the original main node has restarted, adding it to the cluster to be tested again as a non-main node.
Further, the preset action comprises an update action; controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity specifically comprises:
judging whether the first usage capacity exceeds a third preset capacity;
when the first usage capacity exceeds the third preset capacity, simulating and executing the update action on each node, wherein the update action updates the node storage pool of each node; recording the updated first node storage pool of each node as a second node storage pool; controlling the plurality of nodes to simulate IO read-write to each second node storage pool; monitoring a second usage capacity of each second node storage pool; and controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the second usage capacity.
In a second aspect, the present application provides a storage system fault simulation apparatus, comprising:
the building module is used for building a cluster to be tested; the cluster to be tested comprises a plurality of nodes; the plurality of nodes comprise a main node and at least one non-main node; each node is configured with at least one first node storage pool;
the control module is used for controlling the plurality of nodes to simulate IO read-write to each first node storage pool;
a monitoring module, configured to monitor a first usage capacity of each first node storage pool;
the execution module is used for controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first use capacity;
the updating module is used for updating the state change of each node in the plurality of nodes to a node log configured for each node, wherein the node log provides a basis for evaluating the stress resistance of the storage system installed on the node; the state change of each node is generated by at least one node in the plurality of nodes simulating and executing the preset action.
Further, the preset actions comprise an offline action and an online action; wherein, the execution module specifically includes:
the first judgment submodule is used for judging whether the variation of the first use capacity exceeds a first preset capacity or not;
the selection submodule is used for selecting one non-main node from the non-main nodes as a first non-main node when the variation of the first usage capacity exceeds the first preset capacity, and triggering the first non-main node to simulate and execute the offline action; and, after a preset time threshold, triggering the first non-main node to simulate and execute the online action.
Further, the preset action comprises a restart action; wherein, the execution module specifically includes:
the second judgment submodule is used for judging whether the first use capacity exceeds a second preset capacity or not;
the restarting submodule is used for triggering the main node to simulate and execute the restart action when the first usage capacity exceeds the second preset capacity, and selecting one non-main node from the non-main nodes as the new main node of the cluster to be tested; and, after the original main node has restarted, adding it to the cluster to be tested again as a non-main node.
Further, the preset action comprises an updating action; the execution module specifically comprises:
the third judgment submodule is used for judging whether the first use capacity exceeds a third preset capacity or not;
the updating submodule is used for simulating and executing the update action on each node when the first usage capacity exceeds the third preset capacity, wherein the update action updates the node storage pool of each node; recording the updated first node storage pool of each node as a second node storage pool; controlling the plurality of nodes to simulate IO read-write to each second node storage pool; monitoring a second usage capacity of each second node storage pool; and controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the second usage capacity.
In a third aspect, the present application provides an electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the storage system fault simulation method.
In a fourth aspect, the present application provides a non-transitory computer readable storage medium storing instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the storage system fault simulation method.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
1. The application simulates IO read-write to the node storage pools, monitors the usage capacity of the node storage pools, and makes nodes execute preset actions according to the usage capacity so as to simulate the corresponding faults. This avoids manual simulation of cluster faults, allows more fault types to be simulated at low cost, with high efficiency and high accuracy, and enables a more comprehensive analysis of the availability of the storage system.
2. The application can automatically simulate various node faults and record the resulting problems and state changes in the node logs. It avoids the need in the related art for staff to carry out fault simulation manually and to monitor the whole fault simulation process of the storage system. This frees up labor, saves staff a large amount of time, and avoids the human errors that staff introduce during fault simulation, thereby improving the efficiency and accuracy of storage system fault simulation.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart illustrating the steps of a storage system fault simulation method provided in the present application;
FIG. 2 is a schematic structural diagram of a storage system fault simulation apparatus provided in the present application;
FIG. 3 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
The embodiment of the application provides a storage system fault simulation method, and solves the technical problems that in the prior art, the manpower cost is high, the efficiency is low, the accuracy is low, and the types of simulated faults are limited in a mode of manually simulating the faults of a storage cluster.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
A storage system fault simulation method comprises the following steps: constructing a cluster to be tested, wherein the cluster to be tested comprises a plurality of nodes, the plurality of nodes comprise a main node and at least one non-main node, and each node is configured with at least one first node storage pool; controlling the plurality of nodes to simulate IO read-write to each first node storage pool; monitoring a first usage capacity of each first node storage pool; controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity; and updating the state change of each node in the plurality of nodes to a node log configured for each node, wherein the node log provides a basis for evaluating the stress resistance of the storage system installed on the node, and the state change of each node is generated by at least one node in the plurality of nodes simulating and executing the preset action. This achieves the technical effect of simulating more fault types at low cost, with high efficiency and high accuracy.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The application provides a storage system fault simulation method as shown in fig. 1, which includes:
Step S11, constructing a cluster to be tested; the cluster to be tested comprises a plurality of nodes; the plurality of nodes comprise a main node and at least one non-main node; each node is configured with at least one first node storage pool.
Nodes are the basis for constructing a cluster. Each node has a storage system installed on it (here and below, "storage system" mainly refers to the algorithm or software running on the node), and the storage system drives the operation of the node and the cluster. Each cluster includes exactly one master node and at least one non-master node. The master node is the channel for data interaction between the cluster and external devices.
The storage cluster is configured with a corresponding storage space, the storage space can be divided into a plurality of storage pools, each storage pool can be divided into a plurality of storage volumes, and the storage volumes correspond to corresponding hosts, so that corresponding storage services are provided for the hosts. Since the technical scheme provided by the application is more closely related to the storage pool, and only has an indirect relationship with the storage volume, the application only refers to the part related to the storage pool and does not refer to the part related to the storage volume when describing the technical scheme.
Each storage pool corresponds to one master control node, and one master control node may correspond to one, two or more storage pools; the master control node has first-priority control over its storage pools. For convenience of description, each storage pool is referred to as a node storage pool.
To illustrate the relationship between the node storage pool and the nodes, an example is proposed, as shown in table 1, in which a cluster includes node 1, node 2, node 3, node 4, and node 5, where node 1 is the master node. The storage space is divided into a node storage pool 0, a node storage pool 1, a node storage pool 2, a node storage pool 3, a node storage pool 4, a node storage pool 5 and a node storage pool 6. The node 1 is a master control node of the node storage pool 0 and the node storage pool 1, the node 2 is a master control node of the node storage pool 2, and so on.
Node 1 has first priority control over node storage pool 0 and node storage pool 1, and when node 1 receives an access command with respect to node storage pool 0, the corresponding IO read-write contents are stored in node storage pool 0 (or read from node storage pool 0).
When node 1 receives an access instruction for node storage pool 2, it forwards the access instruction to node 2, the master control node of node storage pool 2, and node 2 stores the corresponding IO read-write content in node storage pool 2 (or reads it from node storage pool 2).
That is, each node may receive an access instruction for any node storage pool, but it is the master control node of that node storage pool that finally completes the access operation.
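The forwarding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `POOL_OWNER` and `route_access` are hypothetical, and the ownership map contains only the three assignments that the text states explicitly (node 1 controls pools 0 and 1, node 2 controls pool 2).

```python
# Hypothetical ownership map: pool name -> master control node.
# Only the assignments stated in the description are listed.
POOL_OWNER = {
    "pool0": "node1",
    "pool1": "node1",
    "pool2": "node2",
}


def route_access(receiving_node, pool):
    """Return the nodes an access instruction passes through.

    Any node may receive the instruction, but the pool's master control
    node is always the last hop and performs the actual read or write.
    """
    owner = POOL_OWNER[pool]
    if receiving_node == owner:
        return [receiving_node]
    return [receiving_node, owner]
```

For instance, an instruction for pool 2 that arrives at node 1 is routed through node 1 and completed by node 2, while an instruction for pool 0 is completed by node 1 directly.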
TABLE 1 (provided as an image in the original publication; not reproduced here)
Returning to step S11, in order to test the availability of the storage system, or in order to test the performance of the storage system, the test needs to be completed depending on the cluster. Therefore, step S11 is to construct a cluster to be tested.
The size of the node storage pool corresponding to each node can be set according to specific situations. The number of node storage pools corresponding to each node may also be set according to specific situations.
Step S12, controlling the plurality of nodes to simulate IO read-write to each first node storage pool.
IO read-write is simulated to each first node storage pool by client traffic software. (The term "first node storage pool" is used only to distinguish it from other node storage pools mentioned later; "first" and "second" carry no other meaning.) All nodes and node storage pools in the cluster to be tested are used as normal nodes and storage pools, so that when a problem occurs during use, it can be attributed to the storage system on the node, and the storage system can then be improved afterwards.
Step S13, monitoring a first usage capacity of each first node storage pool;
Each first node storage pool has a fixed capacity, denoted as the base capacity. When the client traffic software performs IO read-write operations on a first node storage pool, storage space in that pool is occupied; the occupied storage space is recorded as the first usage capacity, and it grows gradually as IO read-write to the pool continues. The first usage capacity of each first node storage pool is monitored, and step S14 is performed.
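A minimal Python sketch of the quantities involved in step S13 follows. The class and function names (`NodeStoragePool`, `monitor`) are hypothetical, and capping the usage at the base capacity is an assumption of the sketch rather than something the description specifies.

```python
from dataclasses import dataclass


@dataclass
class NodeStoragePool:
    name: str
    base_capacity: int      # fixed capacity of the pool (e.g. in bytes)
    used_capacity: int = 0  # the "first usage capacity"

    def write(self, nbytes):
        """Simulated IO write: occupy space, capped at the base capacity
        (the cap is an assumption of this sketch)."""
        self.used_capacity = min(self.base_capacity,
                                 self.used_capacity + nbytes)

    def usage_ratio(self):
        """Fraction of the base capacity currently in use."""
        return self.used_capacity / self.base_capacity


def monitor(pools):
    """Return a snapshot {pool name: usage ratio} for the monitoring step."""
    return {pool.name: pool.usage_ratio() for pool in pools}
```

Expressing usage as a ratio of the base capacity matches how the description states its example thresholds (10%, 50% and 90% of the base capacity).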
Step S14, controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity;
The preset action that each node should simulate is determined according to the first usage capacity, so that the nodes simulate faults that may occur when the cluster is in normal use. After one or more nodes fail, the problems that may exist in the storage system are analyzed according to whether the storage system can drive the other nodes in the cluster to respond properly, which provides a basis for later debugging of the storage system. The technical solution provided by the application aims only to simulate storage system faults: during the simulation, the storage system is not modified in response to a fault; only the faults and the state changes of each node are recorded. After the fault simulation is finished, staff improve the storage system according to these records.
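The record-only behaviour described above can be sketched as a per-node log of state changes. The description does not specify a log format, so the field names and the `NodeLogs` class below are hypothetical.

```python
from collections import defaultdict


class NodeLogs:
    """Hypothetical per-node log: one list of state-change records per node."""

    def __init__(self):
        self._logs = defaultdict(list)

    def record(self, node, old_state, new_state, reason):
        """Append a state change; the simulation records it and does nothing else."""
        self._logs[node].append(
            {"node": node, "from": old_state, "to": new_state, "reason": reason}
        )

    def entries(self, node):
        """Return a copy of the records for one node, oldest first."""
        return list(self._logs[node])
```

After a simulation run, staff would read `entries(node)` for each node to judge whether the storage system handled each simulated fault properly.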
The first usage capacity in step S14 means the first usage capacity of any node storage pool of any node.
Step S14 may include the following three cases:
In the first case:
The preset actions comprise an offline action and an online action; step S14 specifically includes:
Step S1401, judging whether the variation of the first usage capacity exceeds a first preset capacity;
The change in the first usage capacity of any node storage pool is monitored. When the first usage capacity has increased by the first preset capacity, that is, when the variation of the first usage capacity exceeds the first preset capacity, step S1402 is executed. For example, the first preset capacity is 10% of the base capacity.
Step S1402, when the variation of the first usage capacity exceeds the first preset capacity, selecting one non-master node from the non-master nodes as a first non-master node, and triggering the first non-master node to simulate and execute the offline action; and, after a preset time threshold, triggering the first non-master node to simulate and execute the online action.
When the first usage capacity has increased by the first preset capacity, that is, when the variation of the first usage capacity exceeds the first preset capacity, one non-master node is selected from all the non-master nodes and triggered to execute the offline action. Step S1402 simulates an offline failure of a non-master node in the cluster. After the non-master node goes offline, whether the storage system can handle the failure is judged according to whether other nodes can take over the node storage pools of the offline non-master node, and from this it is judged whether the availability of the storage system in this respect meets the requirement.
After a preset time threshold (for example, 10 minutes), the first non-master node is triggered to come online. Whether the storage system can cope with this situation is judged according to whether the first non-master node can take over its node storage pools again after coming online, and from this it is judged whether the availability of the storage system in this respect meets the requirement.
It should be noted that step S1402 is executed each time the first usage capacity increases by the first preset capacity, that is, each time the variation of the first usage capacity exceeds the first preset capacity. The purpose of step S1402 is to keep triggering non-master nodes to go offline and come online, so as to detect whether the storage system's control of non-master nodes needs improvement.
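The repeated triggering in the first case can be sketched as follows. This is an illustrative sketch only: the function name, the tick-based time model (one usage sample per monitoring tick, with the preset time threshold expressed in ticks) and the seeded random choice of the non-master node are all assumptions, since the description does not say how the node is selected.

```python
import random


def simulate_offline_online(usage_samples, step, non_masters, online_delay, seed=0):
    """Return (tick, node, action) events for the offline/online case.

    usage_samples: usage capacity of one pool, one value per monitoring tick.
    step: the first preset capacity, in the same unit as the samples.
    online_delay: the preset time threshold, expressed in ticks (assumption).
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    events = []
    baseline = usage_samples[0] if usage_samples else 0
    for tick, used in enumerate(usage_samples):
        if used - baseline > step:
            node = rng.choice(non_masters)
            events.append((tick, node, "offline"))
            events.append((tick + online_delay, node, "online"))
            baseline = used  # measure the next increase from this point
    return events
```

Resetting the baseline after every trigger is what makes the action fire once per increase of the first preset capacity, rather than once only.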
In the second case:
The preset action comprises a restart action; step S14 specifically includes:
Step S1411, judging whether the first usage capacity exceeds a second preset capacity;
Step S1412, when the first usage capacity exceeds the second preset capacity, triggering the master node to simulate and execute the restart action, and selecting one non-master node from the non-master nodes as the new master node of the cluster to be tested; and, after the original master node has restarted, adding it to the cluster to be tested again as a non-master node.
The first usage capacity is monitored. When the first usage capacity of any node storage pool of any node exceeds the second preset capacity, step S1412 is executed to trigger the master node to restart, that is, to go offline and then come online, which simulates a master node failure.
In the cluster, the master node is the only channel for interaction between external devices and the cluster, so IO read errors can occur when the master node goes offline. To avoid such errors, another non-master node takes over the tasks of the master node and becomes the new master node, which prevents the cluster from being paralyzed.
The offline master node must come online again after it recovers; when it does, it rejoins the cluster as a non-master node.
Once the first usage capacity exceeds the second preset capacity, it remains above the second preset capacity. That is, each time step S1411 is repeated, the result is that the first usage capacity exceeds the second preset capacity, so step S1412 is executed repeatedly. The aim is to make the master node simulate the failure condition again and again, and thereby to detect whether the availability of the storage system with respect to master node control meets the requirement.
For example, as shown in Table 1, if the first usage capacity of node storage pool 2 exceeds the second preset capacity (e.g., 50% of the base capacity), node 1, the master node, is restarted; node 2 becomes the new master node, and node 1 rejoins the cluster as a non-master node. Step S1411 is then executed again; the first usage capacity of node storage pool 2 still exceeds the second preset capacity, so step S1412 is executed: node 2, now the master node, is restarted, node 5 becomes the new master node, and node 2 rejoins the cluster as a non-master node. Step S1411 is executed once more; the first usage capacity of node storage pool 2 still exceeds the second preset capacity, so step S1412 is executed: node 5, now the master node, is restarted, node 3 becomes the new master node, and node 5 rejoins the cluster as a non-master node.
That is, steps S1411 to S1412 are executed repeatedly: master node failure is simulated continuously, and each restart of the master node causes a new master node to be chosen, so as to test whether the storage system has defects.
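The master rotation in the example above can be replayed with a short Python sketch. The function names and the `pick` callback are hypothetical; the description does not say how the new master node is chosen, so the selection policy is passed in from outside.

```python
def restart_master(master, non_masters, pick):
    """Simulate one master restart.

    pick: callable that chooses the new master from the non-master nodes
    (the selection policy is an assumption; the text does not specify one).
    Returns (new_master, new_non_masters).
    """
    new_master = pick(non_masters)
    if new_master not in non_masters:
        raise ValueError("new master must be one of the non-master nodes")
    # After restarting, the old master rejoins the cluster as a non-master node.
    rejoined = [n for n in non_masters if n != new_master] + [master]
    return new_master, rejoined


def simulate_restarts(master, non_masters, pick, rounds):
    """Repeat the restart step and return the sequence of master nodes."""
    history = [master]
    for _ in range(rounds):
        master, non_masters = restart_master(master, non_masters, pick)
        history.append(master)
    return history, non_masters
```

With a scripted `pick` that returns node 2, node 5 and node 3 in turn, three rounds starting from node 1 reproduce the master sequence node 1, node 2, node 5, node 3 given in the example.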
In the third case:
The preset action comprises an update action; step S14 specifically includes:
Step S1421, judging whether the first usage capacity exceeds a third preset capacity;
Step S1422, when the first usage capacity exceeds the third preset capacity, simulating and executing the update action on each node, wherein the update action updates the node storage pool of each node; recording the updated first node storage pool of each node as a second node storage pool;
Whether the first usage capacity exceeds the third preset capacity is monitored; when it does, the update action is simulated on each node. Updating the node storage pool of a node means deleting each of its node storage pools and creating a new node storage pool for it.
Executing step S1422 means that the cluster to be tested has completed one round of testing. One round is far from sufficient to verify the performance of the cluster to be tested, so multiple rounds of testing are needed. Once step S1422 has been executed, the cluster to be tested has completed updating and can enter the next round of testing, i.e., step S1423 is executed.
That is, after step S1422 is performed, steps S12 to S14 are repeated, and the number of repetitions may be set as required. For example, when the services that the storage system supports demand high accuracy or availability, steps S12 to S14 may be repeated more times; when the demand is lower, they may be repeated fewer times. The more times steps S12 to S14 are repeated, the more problems the testing is likely to expose, the finer the resulting adjustments and improvements to the storage system, and the better its availability.
Step S1423, controlling a plurality of nodes to simulate IO read-write to each second node storage pool;
After step S1422 is performed, the node storage pool of each node has been updated, so IO read/write operations must be restarted on each second node storage pool. Step S1423 is similar to step S12 and is not described again here.
Step S1424, monitoring a second usage capacity of each second-node storage pool;
Step S1424 is similar to step S13 and is not described again here.
Step S1425, controlling at least one node of the plurality of nodes to simulate and execute a preset action according to the second usage capacity.
Step S1425 is similar to step S14, and is not described here.
Steps S1421 to S1425 simulate whether the storage system can reconfigure a node storage pool for each node when a node storage pool fails once the first usage capacity reaches the third preset capacity (for example, 90% of the base capacity); in other words, they simulate whether the storage system can reconfigure the node storage pools when the remaining storage space of a node storage pool is small or insufficient.
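The check in step S1421 and the pool replacement in step S1422 can be sketched as follows. This is only an illustrative sketch, not the implementation of the application: the `Pool` and `Node` classes, the pool-id generator, and the function names are hypothetical, and the 90% threshold is the example value given in the text.

```python
from dataclasses import dataclass, field
from itertools import count

THIRD_PRESET_RATIO = 0.90  # example from the text: 90% of the base capacity
_next_pool_id = count()    # hypothetical pool-id generator


@dataclass
class Pool:
    base_capacity: int  # total capacity, arbitrary units
    used: int = 0       # currently used capacity
    pool_id: int = field(default_factory=lambda: next(_next_pool_id))


@dataclass
class Node:
    name: str
    pools: list = field(default_factory=list)


def exceeds_third_preset(nodes):
    """S1421: True if any pool's usage exceeds the third preset capacity."""
    return any(p.used > THIRD_PRESET_RATIO * p.base_capacity
               for n in nodes for p in n.pools)


def update_node_pools(nodes):
    """S1422: delete each node's pools and create new, empty pools."""
    for n in nodes:
        # Each old pool is dropped and replaced by a fresh pool of equal size.
        n.pools = [Pool(p.base_capacity) for p in n.pools]
```

After `update_node_pools` returns, the new pools play the role of the second node storage pools, and IO simulation and monitoring would start again on them.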
The three conditions related to step S14 may be simulated individually, or in a mixture of any two conditions, or in a mixture of three conditions, and the specific simulation conditions may be set according to specific requirements.
To illustrate the above three cases, an example is now provided, which is as follows:
As shown in Table 2, the cluster to be tested includes node 1 and node 2, where node 1 is the master node; node 1 is configured with node storage pool 0 (denoted pool 0) and node 2 is configured with node storage pool 1 (denoted pool 1).
IO read-write is simulated on pool 0 and pool 1, and the first usage capacity of pool 0 and pool 1 is monitored.
In the first case: when the variation of the first usage capacity of pool 0 or pool 1 exceeds the first preset capacity (for example, 5% of the base capacity), a non-master node is taken offline. For convenience of description, assume that the variation of the first usage capacity of pool 0 exceeds the first preset capacity; node 2 is then taken offline, and node 1 acts as the control node of pool 1.
After a preset time threshold (for example, 10 minutes), node 2 is brought back online, and its control of pool 1 is restored.
The offline and online actions of node 2 are performed each time the variation of the first usage capacity of pool 0 or pool 1 exceeds the first preset capacity.
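The first case can be sketched as a small trigger object. This is a hedged illustration under stated assumptions: the "variation" is interpreted here as the growth in used capacity since the last trigger, the clock is a plain number of seconds passed in by the caller rather than a real timer, and the class and method names are hypothetical.

```python
class OfflineOnlineTrigger:
    """Takes a non-master node offline when usage grows by more than the
    first preset capacity, and brings it back online after a time threshold."""

    def __init__(self, base_capacity, first_preset_ratio=0.05, downtime_s=600):
        # Example values from the text: 5% of base capacity, 10 minutes.
        self.first_preset = first_preset_ratio * base_capacity
        self.downtime_s = downtime_s
        self.last_trigger_used = 0.0  # usage recorded at the last trigger
        self.offline_since = None     # None means the node is online
        self.events = []              # (time, action) pairs

    def observe(self, used, now_s):
        """Feed one monitoring sample (used capacity, current time in seconds)."""
        # Bring the node back online once the time threshold has elapsed.
        if (self.offline_since is not None
                and now_s - self.offline_since >= self.downtime_s):
            self.offline_since = None
            self.events.append((now_s, "online"))
        # Trigger a new offline action when the variation exceeds the preset.
        if (self.offline_since is None
                and used - self.last_trigger_used > self.first_preset):
            self.last_trigger_used = used
            self.offline_since = now_s
            self.events.append((now_s, "offline"))
```

Passing the time in explicitly keeps the sketch deterministic; a real harness would presumably use a scheduler or timer instead.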
In the second case: when the first usage capacity of pool 0 or pool 1 exceeds the second preset capacity (for example, 50% of the base capacity), node 1 (the master node) is triggered to restart. When node 1 goes offline, node 2 becomes the new master node and takes control of pool 0 and pool 1.
Once the first usage capacity of pool 0 or pool 1 has exceeded the second preset capacity, this condition persists until the node storage pools are deleted. Therefore, after node 2 becomes the master node, node 2 is in turn triggered to restart; when node 2 goes offline, node 1 becomes the new master node and takes control of pool 0 and pool 1. It can be seen that node 1 and node 2 alternately become the master node until the node storage pools are deleted.
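The alternation of the master role in the second case can be sketched as follows. The sketch makes several assumptions that are not stated in the text: one restart is triggered per monitoring sample, the first available non-master node is chosen as the new master, and the `Cluster` class and function names are hypothetical. The 50% and 90% thresholds are the example values from the text.

```python
class Cluster:
    def __init__(self, node_names):
        self.nodes = list(node_names)
        self.master = self.nodes[0]  # the first node starts as master
        self.log = []

    def restart_master(self):
        """Restart the current master and promote a non-master node."""
        old = self.master
        candidates = [n for n in self.nodes if n != old]
        new = candidates[0]  # assumption: pick the first non-master node
        self.log.append(f"{old}: offline (restart)")
        self.master = new
        self.log.append(f"{new}: became master")
        self.log.append(f"{old}: rejoined as non-master")
        return new


def run_restart_case(cluster, used_ratios, second=0.50, third=0.90):
    """Return the sequence of masters elected while usage stays between
    the second and third preset capacities."""
    masters = []
    for ratio in used_ratios:
        if ratio > third:
            break  # the pools would be deleted here, ending this case
        if ratio > second:
            masters.append(cluster.restart_master())
    return masters
```

With two nodes, the returned list alternates between them, which matches the behavior described for node 1 and node 2.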
In the third case: when the first usage capacity of pool 0 or pool 1 exceeds the third preset capacity (for example, 90% of the base capacity), pool 0 of node 1 and pool 1 of node 2 are deleted, and new node storage pools are configured for node 1 and node 2, as shown in Table 3.
Steps S12 to S14 are then performed again on the configuration in Table 3.
TABLE 2
Node 1    Node 2
Pool 0    Pool 1
TABLE 3
Node 1    Node 2
Pool 3    Pool 4
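Because the three cases may be simulated individually or in any combination, the decision of which preset actions to trigger can be sketched as a single function. This is an illustrative sketch only: the function name, the action labels, and the representation of capacities as ratios of the base capacity are assumptions, and the default thresholds are the example values from the text (5%, 50%, 90%).

```python
def select_actions(used_ratio, delta_ratio,
                   enabled=("offline_online", "restart", "update"),
                   first=0.05, second=0.50, third=0.90):
    """Return the preset actions to simulate for one monitoring sample.

    used_ratio  -- used capacity divided by base capacity
    delta_ratio -- variation of used capacity divided by base capacity
    enabled     -- which of the three cases are part of this simulation
    """
    actions = []
    if "offline_online" in enabled and delta_ratio > first:
        actions.append("offline_online")  # first case: non-master offline/online
    if "restart" in enabled and used_ratio > second:
        actions.append("restart")         # second case: master restart
    if "update" in enabled and used_ratio > third:
        actions.append("update")          # third case: recreate the pools
    return actions
```

For example, `select_actions(0.60, 0.06)` selects the offline/online and restart actions, while passing `enabled=("update",)` restricts the simulation to the third case.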
Therefore, the present application monitors the usage capacity of the node storage pools and executes preset actions on the nodes according to that usage capacity; that is, cluster faults are simulated automatically based on usage capacity. This avoids simulating cluster faults manually and saves workers a large amount of time. More fault types can be simulated at low cost and with high efficiency and accuracy, so the availability of the storage system can be judged more comprehensively and the storage system can be adjusted as required.
While steps S11 to S14 are executed, step S15 is also executed:
step S15, updating the state change of each node in the plurality of nodes to a node log configured for each node, wherein the node log provides a basis for evaluating the stress resistance of the storage system installed on the node; the state change of each node is generated when at least one node in the plurality of nodes simulates and executes a preset action.
The purpose of step S15 is to record every state change of each node in the cluster, such as going offline, coming online, a master node becoming a non-master node, a non-master node becoming a master node, and updating the node storage pool. While the node actions are performed, i.e., during the execution of steps S11 to S15, all state changes of the nodes are recorded in the node logs.
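A per-node log of state changes, as described for step S15, could look like the following sketch. The `NodeLog` class, its field names, and the JSON-lines output format are assumptions made for illustration; the text does not specify a log format.

```python
import json
from datetime import datetime, timezone


class NodeLog:
    """Collects the state changes of a single node (hypothetical format)."""

    def __init__(self, node_name):
        self.node_name = node_name
        self.entries = []

    def record(self, event, detail=""):
        """Append one state change, e.g. 'offline' or 'became master'."""
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "node": self.node_name,
            "event": event,
            "detail": detail,
        })

    def dump(self):
        """Serialize the log as one JSON object per line."""
        return "\n".join(json.dumps(entry) for entry in self.entries)
```

Each simulated action (offline, online, master change, pool update) would call `record` on the log of the affected node, so the log can later be reviewed when evaluating the storage system.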
The present application can automatically simulate various node faults and record the problems that occur and the changes in state information in the node logs. It avoids the need in the related art for workers to carry out fault simulation manually and to monitor the entire fault simulation process of the storage system, which frees up labor and saves a large amount of workers' time. It also avoids human error introduced by workers during fault simulation, thereby improving the efficiency and accuracy of storage system fault simulation.
In addition, if a system interruption error occurs in the storage system of a node while steps S11 to S15 are being executed, the error is written to the node log. Steps S11 to S15 may then be executed again directly, or the storage system may first be adjusted according to the error recorded in the node log and steps S11 to S15 executed after the adjustment, thereby providing a basis for testing the performance of the storage system.
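The handling of a system interruption error can be sketched as a retry loop around one test round. This is a minimal sketch under assumptions: `SystemInterruptError`, `run_with_retry`, and the `max_attempts` limit are hypothetical names and choices, and the node log is modeled as a plain list of strings.

```python
class SystemInterruptError(Exception):
    """Stands in for a system interruption error of the storage system."""


def run_with_retry(run_round, node_log, max_attempts=3):
    """Run one round of steps S11 to S15, logging errors and re-running.

    run_round -- callable that performs one test round and may raise
                 SystemInterruptError
    node_log  -- list that receives one text entry per attempt
    """
    for attempt in range(1, max_attempts + 1):
        try:
            run_round()
        except SystemInterruptError as exc:
            # Record the error so it can guide later adjustment of the system.
            node_log.append(f"attempt {attempt}: system interruption error: {exc}")
        else:
            node_log.append(f"attempt {attempt}: completed")
            return True
    return False
```

If the loop returns `False`, the logged errors would be the input for adjusting the storage system before the steps are executed again.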
Based on the same inventive concept, another embodiment of the present application provides a storage system fault simulation apparatus as shown in fig. 2, the apparatus including:
the building module 21 is used for building a cluster to be tested; the cluster to be tested comprises a plurality of nodes; the plurality of nodes comprise a master node and at least one non-master node; each node is configured with at least one first node storage pool;
the control module 22 is configured to control the plurality of nodes to simulate IO reading and writing to each first node storage pool;
a monitoring module 23, configured to monitor a first usage capacity of each first node storage pool;
the execution module 24 is configured to control at least one node of the plurality of nodes to simulate and execute a preset action according to the first usage capacity;
an updating module 25, configured to update the state change of each node in the plurality of nodes to a node log configured for each node, where the node log provides a basis for evaluating the stress resistance of the storage system installed on the node; the state change of each node is generated when at least one node in the plurality of nodes simulates and executes a preset action.
Specifically, the preset actions include an offline action and an online action; the execution module 24 specifically includes:
the first judgment submodule is used for judging whether the variation of the first use capacity exceeds a first preset capacity or not;
the selection submodule is used for selecting one non-master node from the non-master nodes as a first non-master node when the variation of the first usage capacity exceeds the first preset capacity, and triggering the first non-master node to simulate and execute the offline action; and triggering the first non-master node to simulate and execute the online action after a preset time threshold.
Specifically, the preset action comprises a restart action; the execution module 24 specifically includes:
the second judgment submodule is used for judging whether the first usage capacity exceeds a second preset capacity;
the restarting submodule is used for triggering the master node to simulate and execute the restart action when the first usage capacity exceeds the second preset capacity, and selecting one non-master node from the non-master nodes as the new master node of the cluster to be tested; after the original master node has restarted, it rejoins the cluster to be tested as a non-master node.
Specifically, the preset action comprises an updating action; the execution module 24 specifically includes:
the third judgment submodule is used for judging whether the first usage capacity exceeds a third preset capacity;
the updating submodule is used for simulating and executing an update action on each node when the first usage capacity exceeds the third preset capacity, where the update action updates the node storage pool of each node; recording the updated first node storage pool of each node as a second node storage pool; controlling the plurality of nodes to simulate IO read-write to each second node storage pool; monitoring a second usage capacity of each second node storage pool; and controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the second usage capacity.
Based on the same inventive concept, another embodiment of the present application provides an electronic device as shown in fig. 3, including:
a processor 31;
a memory 32 for storing instructions executable by the processor 31;
wherein the processor 31 is configured to execute the instructions to implement the storage system fault simulation method.
Based on the same inventive concept, another embodiment of the present application provides a non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to execute the storage system fault simulation method.
Since the electronic device described in this embodiment is the electronic device used to implement the storage system fault simulation method of the embodiments of the present application, a person skilled in the art can, based on the method described herein, understand the specific implementation of the electronic device and its variations; how the electronic device implements the method is therefore not described in detail here. Any electronic device used by those skilled in the art to implement the storage system fault simulation method of the embodiments of the present application falls within the scope of protection of the present application.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method of storage system fault simulation, the method comprising:
constructing a cluster to be tested; wherein the cluster to be tested comprises a plurality of nodes; the plurality of nodes comprise a master node and at least one non-master node; each node is configured with at least one first node storage pool;
controlling the plurality of nodes to simulate IO read-write to each first node storage pool;
monitoring a first usage capacity of each first node storage pool;
controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity;
updating the state change of each node in the plurality of nodes to a node log configured for each node, wherein the node log provides a basis for evaluating the stress resistance of the storage system installed on the node; wherein the state change of each node is generated when at least one node in the plurality of nodes simulates and executes the preset action.
2. The method of claim 1, wherein the preset actions include an offline action and an online action; and controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity specifically comprises:
judging whether the variation of the first usage capacity exceeds a first preset capacity;
when the variation of the first usage capacity exceeds the first preset capacity, selecting one non-master node from the non-master nodes as a first non-master node, and triggering the first non-master node to simulate and execute the offline action; and triggering the first non-master node to simulate and execute the online action after a preset time threshold.
3. The method of claim 1, wherein the preset action comprises a restart action; and controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity specifically comprises:
judging whether the first usage capacity exceeds a second preset capacity;
when the first usage capacity exceeds the second preset capacity, triggering the master node to simulate and execute the restart action, and selecting one non-master node from the non-master nodes as the new master node of the cluster to be tested; and after the original master node has restarted, having it rejoin the cluster to be tested as a non-master node.
4. The method of claim 1, wherein the preset action comprises an update action; and controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity specifically comprises:
judging whether the first usage capacity exceeds a third preset capacity;
when the first usage capacity exceeds the third preset capacity, simulating and executing the update action on each node, wherein the update action updates the node storage pool of each node; recording the updated first node storage pool of each node as a second node storage pool; controlling the plurality of nodes to simulate IO read-write to each second node storage pool; monitoring a second usage capacity of each second node storage pool; and controlling at least one node in the plurality of nodes to simulate and execute the preset action according to the second usage capacity.
5. A storage system fault simulation apparatus, the apparatus comprising:
the building module is used for building a cluster to be tested; wherein the cluster to be tested comprises a plurality of nodes; the plurality of nodes comprise a master node and at least one non-master node; each node is configured with at least one first node storage pool;
the control module is used for controlling the plurality of nodes to simulate IO read-write to each first node storage pool;
a monitoring module, configured to monitor a first usage capacity of each first node storage pool;
the execution module is used for controlling at least one node in the plurality of nodes to simulate and execute a preset action according to the first usage capacity;
the updating module is used for updating the state change of each node in the plurality of nodes to a node log configured for each node, wherein the node log provides a basis for evaluating the stress resistance of the storage system installed on the node; wherein the state change of each node is generated when at least one node in the plurality of nodes simulates and executes the preset action.
6. The apparatus of claim 5, wherein the preset actions comprise an offline action and an online action; wherein the execution module specifically comprises:
the first judgment submodule is used for judging whether the variation of the first usage capacity exceeds a first preset capacity;
the selection submodule is used for selecting one non-master node from the non-master nodes as a first non-master node when the variation of the first usage capacity exceeds the first preset capacity, and triggering the first non-master node to simulate and execute the offline action; and triggering the first non-master node to simulate and execute the online action after a preset time threshold.
7. The apparatus of claim 5, wherein the preset action comprises a restart action; wherein the execution module specifically comprises:
the second judgment submodule is used for judging whether the first usage capacity exceeds a second preset capacity;
the restarting submodule is used for triggering the master node to simulate and execute the restart action when the first usage capacity exceeds the second preset capacity, and selecting one non-master node from the non-master nodes as the new master node of the cluster to be tested; and after the original master node has restarted, having it rejoin the cluster to be tested as a non-master node.
8. The apparatus of claim 5, wherein the preset action comprises an update action; wherein the execution module specifically comprises:
a third judgment submodule, configured to judge whether the first usage capacity exceeds a third preset capacity;
an update submodule, configured to simulate and execute the update action on each node when the first usage capacity exceeds the third preset capacity, where the update action updates the node storage pool of each node; record the updated first node storage pool of each node as a second node storage pool; control the plurality of nodes to simulate IO read-write to each second node storage pool; monitor a second usage capacity of each second node storage pool; and control at least one node in the plurality of nodes to simulate and execute the preset action according to the second usage capacity.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the storage system fault simulation method as claimed in any one of claims 1 to 4.
10. A non-transitory computer-readable storage medium having instructions stored therein which, when executed by a processor of an electronic device, enable the electronic device to perform the storage system fault simulation method of any one of claims 1 to 4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010889696.3A CN112069014B (en) 2020-08-28 2020-08-28 Storage system fault simulation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN112069014A true CN112069014A (en) 2020-12-11
CN112069014B CN112069014B (en) 2022-12-27

Family

ID=73660562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010889696.3A Active CN112069014B (en) 2020-08-28 2020-08-28 Storage system fault simulation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112069014B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780442A (en) * 2022-06-22 2022-07-22 杭州悦数科技有限公司 Testing method and device for distributed system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130339605A1 (en) * 2012-06-19 2013-12-19 International Business Machines Corporation Uniform storage collaboration and access
CN106888116A (en) * 2016-12-30 2017-06-23 北京同有飞骥科技股份有限公司 A kind of dispatching method of dual controller cluster shared resource
CN107800784A (en) * 2017-10-19 2018-03-13 郑州云海信息技术有限公司 A kind of remote disk accesses the environmental structure method of testing of simulation local disk
US20190384675A1 (en) * 2018-06-19 2019-12-19 International Business Machines Corporation Dynamically Directing Data in a Deduplicated Backup System

Also Published As

Publication number Publication date
CN112069014B (en) 2022-12-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant