CN114416501A

CN114416501A - Storage double-activity and test system and method

Info

Publication number: CN114416501A
Application number: CN202111595424.3A
Authority: CN
Inventors: 李迎军
Original assignee: Agricultural Bank Of China Ltd Yunnan Branch
Current assignee: Agricultural Bank Of China Ltd Yunnan Branch
Priority date: 2021-12-23
Filing date: 2021-12-23
Publication date: 2022-04-29

Abstract

The invention relates to a storage dual-active and test system and method. The system comprises a host, a first site storage system, a second site storage system and an arbitration server. The host sends the data to be written to both the first site storage system and the second site storage system, and the two storage systems are connected through a dual-active replication link. When an entire data center fails or the link between the arrays fails, the arrays send arbitration requests to the arbitration server, and the arbitration server comprehensively judges which end wins. The party that wins the arbitration continues to provide service, and the other party stops providing service; in arbitration server mode, the preferred site wins the arbitration preferentially. The invention eliminates all single points of failure, provides a fully redundant architecture for the whole system, achieves zero data loss without service interruption, and is compatible with the original storage system.

Description

Storage double-activity and test system and method

Technical Field

The invention belongs to the field of storage systems, and particularly relates to a storage dual-active and test system and method.

Background

With the continuous improvement of the informatization level of banks, a large number of new systems and applications have gone online, which brings great challenges to front-line production operation and maintenance. To enhance data security and to reduce the risk that a storage device failure causes prolonged downtime and data loss for application systems, we attempted to configure dual-active LUNs on the unallocated, unused storage space of two OceanStor 5600 V3 SAN storage arrays. This improves security and high availability from the bottom layer and guarantees data availability and service continuity.

Disclosure of Invention

To solve the above problems, the invention provides a storage dual-active and test system and method that eliminate all single points of failure, provide a fully redundant architecture for the whole system, achieve zero data loss without service interruption, are compatible with the original storage system, require little change to the original networking, make full use of resources, reduce upgrade cost, and are simple to manage.

The technical scheme of the invention is as follows:

A storage dual-active and test system comprises a host, a first site storage system, a second site storage system and an arbitration server. The host sends the data to be written to both the first site storage system and the second site storage system; the first site storage system and the second site storage system are connected through a dual-active replication link.

When an entire data center fails or the link between the arrays fails, the arrays send arbitration requests to the arbitration server, and the arbitration server comprehensively judges which end wins. The party that wins the arbitration continues to provide service, and the other party stops providing service. In arbitration server mode, the preferred site wins the arbitration preferentially.

The invention also relates to a storage dual-active and test method, which comprises the following steps:

performing a single-link fault test between the service host and the storage: unplugging one link between the service host and the storage device to simulate a single-link fault in the current network environment, and observing whether the service runs normally;

performing a full-link fault test between the service host and the storage: unplugging all links between a single storage and the service host, and observing whether the service runs normally; after the links are recovered, observing whether the dual-active volumes are synchronized and whether the service is normal;

performing a single-link fault test between the arrays: unplugging one dual-active replication link between the storages to simulate the fault, and observing whether the VMware service runs normally;

performing a full-link fault test between the arrays: unplugging all dual-active replication links between the arrays to simulate the fault, and observing whether the VMware service runs normally.

Further, the method also comprises an IP single-link fault test between the array and the arbitration server: unplugging one link between controller A of the array and the arbitration server to simulate the fault; observing whether the VMware service runs normally; re-plugging the arbitration link to recover from the fault; and observing whether the VMware service runs normally.

Further, the method also comprises:

an all-IP-link fault test between the array and the arbitration server: issuing a test service on the VMware platform and unplugging all links between the storage array and the arbitration server to simulate the fault; observing whether the VMware service runs normally; re-plugging the arbitration links to recover from the fault; and observing whether the Swingbench service runs normally.

Further, the method also comprises:

an arbitration server fault test: restarting the virtual machine of the arbitration server, which returns to normal after the arbitration server has powered on; and observing whether the VMware service runs normally.

Further, the method also comprises service verification: checking whether the original service is normal and whether the original data in the storage can be read and written normally.

Compared with the prior art, the invention has the following beneficial effects:

The invention eliminates all single points of failure, provides a fully redundant architecture for the whole system, achieves zero data loss without service interruption, and is compatible with the original storage system.

The invention requires little change to the original networking, is compatible with the original storage system, makes full use of resources, reduces upgrade cost, and is simple to manage. Fault switchover is fully automatic, which leaves enough time to resolve problems online. The storage systems are managed in a unified way, which reduces management difficulty and operation and maintenance cost.

Furthermore, when a single storage fails, no data is lost and the service is not interrupted; the original network architecture is not affected; no software needs to be installed at the host layer; heterogeneous arrays are virtualized, which protects the existing investment; and the system is easy to expand and easy to maintain.

Drawings

FIG. 1 is a schematic block diagram of the system of the present invention;

FIG. 2 is a block diagram of the arbitration principle of the present invention;

FIG. 3 is a schematic diagram of a link full fault test between a service host and a storage according to the present invention;

FIG. 4 is a schematic diagram of the array and arbitration server full IP link failure test of the present invention;

FIG. 5 is a schematic diagram of the arbitration server fault test of the present invention.

Detailed Description

The technical solutions in the embodiments will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the examples without making any creative effort, shall fall within the protection scope of the present application.

Unless otherwise defined, technical or scientific terms used in the embodiments of the present application should have the ordinary meaning as understood by those having ordinary skill in the art. The use of "first," "second," and similar terms in the present embodiments does not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. "mounted," "connected," and "coupled" are to be construed broadly and may, for example, be fixedly coupled, detachably coupled, or integrally coupled; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. "Upper," "lower," "left," "right," "lateral," "vertical," and the like are used solely in relation to the orientation of the components in the figures, and these directional terms are relative terms that are used for descriptive and clarity purposes and that can vary accordingly depending on the orientation in which the components in the figures are placed.

Dual-active is a resource-saving disaster recovery scheme for computer systems. As shown in FIG. 1 and FIG. 2, in this embodiment the storage dual-active and test system includes a host, a first site storage system, a second site storage system and an arbitration server. The host sends the data to be written to the first site storage system and the second site storage system over the FC/SAN, and the data may pass through a Macco switch. The first site storage system and the second site storage system are connected through a dual-active replication link (FC/IP). The host may be a VMware vSphere cluster of H3C servers; the first site storage system (site A) and the second site storage system (site B) serve the cluster at the same time. The first site storage system, the second site storage system and the arbitration server are connected over IP.

The arbitration process of this embodiment is as follows:

In arbitration server mode, once an entire data center fails or the link between the arrays fails, the arrays send arbitration requests to the arbitration server, and the arbitration server comprehensively judges which end wins. The party that wins the arbitration continues to provide service, and the other party stops providing service. In arbitration server mode, the preferred site wins the arbitration preferentially.
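The arbitration rule above can be sketched as follows. This is a minimal illustration, not the arbitration server's actual logic: the source does not specify how the server "comprehensively judges", so the sketch assumes the decision depends only on which sites' requests reached the server and on the configured preferred site. The names `arbitrate`, `service_state`, `requesting_sites` and `preferred_site` are hypothetical.

```python
from typing import Optional, Set


def arbitrate(requesting_sites: Set[str], preferred_site: str) -> Optional[str]:
    """Return the site that wins arbitration, or None if no request arrived.

    requesting_sites: sites whose arbitration request reached the server
                      (assumption: a failed data center sends no request).
    preferred_site:   the site configured to win preferentially.
    """
    if not requesting_sites:
        return None
    if preferred_site in requesting_sites:
        # Both ends are alive (e.g. only the inter-array link failed),
        # or only the preferred site is alive: the preferred site wins.
        return preferred_site
    # The preferred site is silent, so a remaining requester wins.
    # sorted() only makes the choice deterministic in this sketch.
    return sorted(requesting_sites)[0]


def service_state(site: str, winner: Optional[str]) -> str:
    """The winner continues to provide service; every other site stops."""
    return "serving" if site == winner else "stopped"


if __name__ == "__main__":
    # Inter-array link fault: both arrays ask, preferred site A wins.
    winner = arbitrate({"A", "B"}, preferred_site="A")
    print(winner, service_state("A", winner), service_state("B", winner))
    # Data center A is down: only B asks, so B wins.
    print(arbitrate({"B"}, preferred_site="A"))
```

With two sites, the fallback branch is reached only when the preferred site's data center is down, in which case the surviving site keeps serving.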

When one storage system fails, the dual-active pair enters the to-be-synchronized state. If the LUN in data center A fails, the LUN in data center B continues to run the service.
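The behaviour of the pair when one storage system fails can be modelled with a small in-memory sketch. It assumes, beyond what the source states, that a host write succeeds as long as at least one site stores it and that the pair is marked "to be synchronized" whenever the two sites diverge. All class and method names (`SiteStorage`, `DualActivePair`, `host_write`) are hypothetical and do not correspond to any storage vendor API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class PairState(Enum):
    NORMAL = "normal"
    TO_BE_SYNCHRONIZED = "to be synchronized"


@dataclass
class SiteStorage:
    """One site's storage system, reduced to an in-memory block map."""
    name: str
    online: bool = True
    blocks: Dict[int, bytes] = field(default_factory=dict)

    def write(self, lba: int, data: bytes) -> bool:
        if not self.online:
            return False
        self.blocks[lba] = data
        return True


@dataclass
class DualActivePair:
    site_a: SiteStorage
    site_b: SiteStorage
    state: PairState = PairState.NORMAL

    def host_write(self, lba: int, data: bytes) -> bool:
        """Send the write to both sites (assumed acceptance rule: the write
        succeeds if at least one site stored it)."""
        results = [self.site_a.write(lba, data), self.site_b.write(lba, data)]
        if not all(results):
            # One copy is now stale, so the pair must be resynchronized later.
            self.state = PairState.TO_BE_SYNCHRONIZED
        return any(results)


if __name__ == "__main__":
    pair = DualActivePair(SiteStorage("A"), SiteStorage("B"))
    pair.host_write(0, b"both sites")
    pair.site_a.online = False          # the LUN in data center A fails
    print(pair.host_write(1, b"only B"), pair.state.value)
```

The sketch leaves out resynchronization after recovery, which the test method checks by observing whether the dual-active volumes are synchronized again.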

As shown in FIG. 3, FIG. 4 and FIG. 5, the storage dual-active test method of this embodiment includes the following steps:

(1) Single-link fault test between the service host and the storage.

Unplug one link between the service host and the storage device to simulate a single-link fault in the current network environment, and observe whether the service runs normally.

(2) Full-link fault test between the service host and the storage.

As shown in FIG. 3, unplug all links between a single storage and the service host, and observe whether the service runs normally. After the links are recovered, observe whether the dual-active volumes are synchronized and whether the service is normal.

(3) Single-link fault test between the arrays.

Unplug one dual-active replication link between the storages to simulate the fault, and observe whether the VMware service runs normally.

(4) Full-link fault test between the arrays.

Unplug all dual-active replication links between the arrays to simulate the fault, and observe whether the VMware service runs normally. The LUN in data center A becomes invalid and the LUN in data center B continues to run the service; check in VMware whether the storage is online.

(5) IP single-link fault test between the array and the arbitration server.

Unplug one link between controller A of the array and the arbitration server to simulate the fault, and observe whether the VMware service runs normally. Re-plug the arbitration link to recover from the fault, and observe again whether the VMware service runs normally.

(6) All-IP-link fault test between the array and the arbitration server.

As shown in FIG. 4, issue a test service on the VMware platform, then unplug all links between the storage array and the arbitration server to simulate the fault, and observe whether the VMware service runs normally. Re-plug the arbitration links to recover from the fault, and observe whether the Swingbench service runs normally.

(7) Arbitration server fault test.

Restart the virtual machine of the arbitration server; the arbitration server returns to normal after it has powered on. Observe whether the VMware service runs normally.

(8) Service verification.

Check whether the original service is normal and whether the original data in the storage can be read and written normally.

As shown in FIG. 5, the main verification method is to check whether the virtual machines deployed on the PC servers associated with the original LUNs of the storage run normally, and whether any storage-configuration-related alarm is raised in vCenter; a test virtual machine is then deployed on the dual-active LUN to test whether the system runs normally.
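The fault-injection steps (1) to (7) can be written down as a small checklist runner. This is only a sketch: the faults in this embodiment are injected by hand (unplugging links, restarting a virtual machine), so `do` merely reports the manual action and `service_ok` is a hypothetical callback standing in for the operator's observation of the VMware or Swingbench service. Step (8), service verification, is a check rather than a fault injection and is therefore left out of the table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class FaultTest:
    step: int
    name: str
    inject: str                    # manual action that simulates the fault
    recover: Optional[str] = None  # manual action that restores the path


# Steps (1)-(7) of the embodiment, transcribed as data.
PLAN: List[FaultTest] = [
    FaultTest(1, "host-storage single link", "unplug one host-storage link"),
    FaultTest(2, "host-storage all links",
              "unplug all links between one storage and the host",
              "restore the links and check dual-active volume sync"),
    FaultTest(3, "inter-array single link", "unplug one replication link"),
    FaultTest(4, "inter-array all links", "unplug all replication links"),
    FaultTest(5, "array-arbitration single IP link",
              "unplug one arbitration link", "re-plug the arbitration link"),
    FaultTest(6, "array-arbitration all IP links",
              "unplug all arbitration links", "re-plug the arbitration links"),
    FaultTest(7, "arbitration server", "restart the arbitration server VM"),
]


def run_plan(plan: List[FaultTest],
             service_ok: Callable[[], bool],
             do: Callable[[str], None] = print) -> Dict[int, bool]:
    """Walk the plan; a step passes only if the service is observed to be
    normal after the fault and, where applicable, after recovery."""
    results: Dict[int, bool] = {}
    for test in plan:
        do(f"[{test.step}] {test.name}: {test.inject}")
        ok = service_ok()
        if test.recover:
            do(f"[{test.step}] {test.name}: {test.recover}")
            ok = service_ok() and ok  # always observe again after recovery
        results[test.step] = ok
    return results


if __name__ == "__main__":
    print(run_plan(PLAN, service_ok=lambda: True))
```

Keeping the plan as data makes it easy to record which steps passed during a drill without changing the runner.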

Specific application examples of this embodiment are as follows:

1. Oracle RAC: dual-active access with load balancing

Load balancing: the two data centers form one RAC cluster and serve the same database service at the same time. Transparent application switchover: when a storage system, server or network fails, the database service is switched transparently and users do not perceive it. Recovery from human error: combined with snapshot technology for Oracle database applications, data damaged by human misoperation can be recovered.

2. VMware/FusionSphere: load balancing combined with DRS

Automatic load balancing: the vSphere cluster DRS function is supported, server resource utilization is monitored in real time, and virtual machines are migrated online flexibly. Virtual machine migration: online migration of virtual machines is supported, which ensures zero data loss and uninterrupted service during system maintenance. Automatic switchover: when a storage system, server or network fails, virtual machines are migrated automatically and quickly, which reduces service interruption time and operation and maintenance complexity.

The service in this embodiment is continuous, efficient, convenient, intelligent and flexible:

The service runs continuously 7x24 with RPO = 0 and RTO ≈ 0; maintenance does not interrupt the service; access is dual-active; heterogeneous takeover is compatible with equipment of multiple brands; storage devices are managed in a unified way; data is accessed at the nearest site; and service load is balanced automatically.
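The RPO and RTO figures can be read as two time intervals measured during a fault test. The sketch below uses the usual definitions (data-loss window and outage duration); the function names and the timestamps are hypothetical illustrations, not measurements from this embodiment.

```python
from datetime import datetime, timedelta


def data_loss_window(last_acked_write: datetime,
                     last_write_on_survivor: datetime) -> timedelta:
    """Achieved recovery point: how far the surviving copy lags behind the
    last write acknowledged to the host. Zero means no data was lost."""
    return max(last_acked_write - last_write_on_survivor, timedelta(0))


def outage(failure: datetime, service_restored: datetime) -> timedelta:
    """Achieved recovery time: how long the service was unavailable."""
    return service_restored - failure


if __name__ == "__main__":
    t0 = datetime(2021, 12, 23, 10, 0, 0)   # hypothetical failure time
    # With dual writes, the survivor already holds the last acknowledged write.
    print(data_loss_window(t0, t0))          # prints 0:00:00
    print(outage(t0, t0 + timedelta(seconds=2)))
```

During the fault tests, an operator could record these timestamps for each step and compare the resulting intervals with the RPO and RTO targets.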

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A storage dual-active and test system, characterized in that: the system comprises a host, a first site storage system, a second site storage system and an arbitration server; the host sends the data to be written to both the first site storage system and the second site storage system; the first site storage system and the second site storage system are connected through a dual-active replication link;

2. A storage dual-active and test method, characterized in that the method comprises the following steps:

performing a full-link fault test between the arrays: unplugging all dual-active replication links between the arrays to simulate the fault; and observing whether the VMware service runs normally.

3. The method of claim 2, wherein the method further comprises an IP single-link fault test between the array and the arbitration server: unplugging one link between controller A of the array and the arbitration server to simulate the fault; observing whether the VMware service runs normally; re-plugging the arbitration link to recover from the fault; and observing whether the VMware service runs normally.

4. The method of claim 2, wherein: further comprising:

5. The method of claim 2, wherein: further comprising:

6. The method of claim 2, further comprising service verification: checking whether the original service is normal and whether the original data in the storage can be read and written normally.