CN101557320A

CN101557320A - Disaster tolerance realizing method and communication equipment thereof

Info

Publication number: CN101557320A
Application number: CNA2009101437293A
Authority: CN
Inventors: 陈乾业; 周迪
Original assignee: Hangzhou H3C Technologies Co Ltd
Current assignee: New H3C Technologies Co Ltd
Priority date: 2009-05-25
Filing date: 2009-05-25
Publication date: 2009-10-14
Anticipated expiration: 2029-05-25
Also published as: CN101557320B

Abstract

The invention discloses a disaster tolerance implementation method and communication equipment thereof, applied to a system comprising at least one application server, one production equipment and at least one disaster tolerance equipment, wherein the production equipment performs state detection on its own RAID. The method comprises the following steps: when the production equipment detects that its RAID has failed and is in a critical state, the production equipment synchronizes the information on the RAID to a RAID of the disaster tolerance equipment; and after synchronization is finished, the production equipment redirects read requests or write requests sent by the application server to the RAID of the production equipment to the RAID of the disaster tolerance equipment, where the corresponding operations are performed. The invention improves the data reliability of a communication system and the safety of data storage, prevents data loss caused by failures of the storage resources of the production equipment, and reduces the RPO and RTO of the communication system.

Description

Disaster tolerance implementation method and communication equipment thereof

Technical field

The present invention relates to the field of communication technology, and in particular to a disaster tolerance implementation method and communication equipment thereof.

Background technology

Disaster tolerance (Disaster Tolerance) means that, when production equipment fails, the services of the production equipment are kept running without interruption while as little of its data as possible is lost. That is, the data of the production equipment is replicated to disaster tolerance equipment (Disaster Preparation Center), and after the fault of the production equipment is cleared, the data of the production equipment can be restored from the data backed up on the disaster tolerance equipment.

A disaster tolerance system consists of two or more information technology (Information Technology, IT) systems with identical functions, set up at sites far apart from each other, which can monitor each other's health and switch functions between them. When the system at one site stops working because of an accident (such as a fire or an earthquake), the entire service flow can be switched to the system at another site, so that the system functions continue to operate normally.

Disaster tolerance technology is part of a system's high-availability technology. A disaster tolerance system places more emphasis on handling the impact of the external environment on the system, particularly the impact of catastrophic events on an entire IT node, and provides node-level system recovery.

By degree of protection, disaster tolerance systems can be divided into data disaster tolerance and application disaster tolerance, described as follows:

Data disaster tolerance means setting up a remote data system that holds a usable replica of the local critical application data. When a disaster strikes the local data and the entire application system, at least one usable copy of the critical business data is preserved at the remote site. This data may be a complete real-time replica of the local production data, or it may lag slightly behind the local data, but it must be usable. The main technologies used are data backup and data replication. Data disaster tolerance technology, also called remote data replication technology, can be divided by implementation into synchronous transfer mode and asynchronous transfer mode (vendors may use different terms); there is also a "semi-synchronous" mode. Semi-synchronous transfer mode is essentially the same as synchronous transfer mode, except that when read requests (Read) account for a large proportion of the input/output (Input/Output, I/O) requests, it can improve I/O speed slightly relative to synchronous transfer mode. By the distance involved, data disaster tolerance can further be divided into long-distance and short-distance data disaster tolerance.

Application disaster tolerance means setting up, on the basis of data disaster tolerance, a complete backup application system at a remote site that is equivalent to the local production system (the two may back each other up). Setting up such a system is relatively complex: it requires not only a usable data replica but also resources such as networks, hosts, applications and even IP addresses, along with good coordination among these resources. The main technologies include load balancing and clustering. Data disaster tolerance is the enabling technology for application disaster tolerance, and application disaster tolerance is the goal of data disaster tolerance.

When choosing the structure of a disaster tolerance system, a multi-level wide area network failover scheme should also be established. A local high-availability system is one in which multiple servers run one or more applications and which ensures that, when any server suffers any fault, the applications it runs are not interrupted: the applications and the system can switch rapidly to other servers, that is, local system clustering and hot standby.

To achieve complete application disaster tolerance, a remote disaster tolerance system should include the security mechanisms of the local system and a remote data replication system, and should also have remote failover and fault diagnosis capabilities across the wide area network. That is, once a fault occurs, the system needs strong fault diagnosis and a mechanism for formulating switchover policies, to ensure fast response and prompt service takeover. In practice, the high availability of the wide area network and the high availability of the local system should form a whole, implementing multi-level failover and recovery mechanisms that keep the system reliable and secure at every scope.

The generally accepted criteria for evaluating a disaster tolerance system are the recovery point objective (Recovery Point Objective, RPO) and the recovery time objective (Recovery Time Objective, RTO).

RPO: measured in units of time, the point in time to which the system and data must be recovered when a fault occurs. The RPO indicates the maximum amount of data loss the system can tolerate. The smaller the amount of data loss the system can tolerate, the smaller the RPO value.

RTO: measured in units of time, the required time from when an information system or business function stops after a fault occurs until it must be restored. The RTO indicates the longest service outage the system can tolerate. The more urgent the system's service requirements, the smaller the RTO value.

RPO concerns data loss and RTO concerns service loss. The two are not necessarily related, and both must be determined according to business requirements after risk analysis and business impact analysis.
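As a worked illustration of the two metrics, the following minimal sketch computes the data-loss window and the outage duration actually experienced in one incident, which would then be compared against the RPO and RTO targets. The function names and all timestamps are hypothetical and are not taken from the patent.

```python
from datetime import datetime, timedelta


def data_loss_window(last_synced_at: datetime, failed_at: datetime) -> timedelta:
    """Time span of writes that were never replicated before the fault (compare with RPO)."""
    return max(failed_at - last_synced_at, timedelta(0))


def outage_duration(failed_at: datetime, restored_at: datetime) -> timedelta:
    """Time during which the service was unavailable (compare with RTO)."""
    return max(restored_at - failed_at, timedelta(0))


# Hypothetical incident: last replication at 11:55, fault at 12:00, service back at 12:30.
failed_at = datetime(2009, 5, 25, 12, 0, 0)
loss = data_loss_window(datetime(2009, 5, 25, 11, 55, 0), failed_at)
outage = outage_duration(failed_at, datetime(2009, 5, 25, 12, 30, 0))
```

In this hypothetical incident the two values are independent of each other, which matches the statement that RPO and RTO are not necessarily related.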

The prior art provides RAID (Redundant Array of Inexpensive Disks) technology, which combines multiple disks into an array so that disk data can be read and written quickly, accurately and safely, as a means of improving data read/write speed and safety. The main function of RAID is to improve the availability and storage capacity of network data by selectively distributing data across multiple disks, thereby improving the data throughput of the entire network system.

In concrete application scenarios, a RAID has the following operating states:

Normal state (Normal): all disks in the RAID are operating normally, and the array can tolerate the failure of one or more disks.

Degraded/critical state (Degrade/Critical): the state after one disk (for example, in a RAID5 array) or multiple disks (for example, in a RAID6 array) in the RAID have failed. The array can still provide normal read/write service, but if one more disk fails, the entire array becomes unavailable.

Rebuilding state (Recover/Rebuild): the transitional state from the degraded/critical state to the normal state, in which a healthy disk has replaced the failed disk in the RAID and the redundant data is being rebuilt.

Failed state (Failed): the number of failed disks in the RAID exceeds the redundancy the array allows, and the entire RAID is unavailable.
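The four operating states above can be sketched as a small classification function. This is only an illustration under simplifying assumptions: the names `RaidState` and `classify` are hypothetical, `redundancy` stands for the number of disk failures the array tolerates (1 for RAID5, 2 for RAID6), and any tolerated failure is treated as the degraded/critical state.

```python
from enum import Enum


class RaidState(Enum):
    NORMAL = "normal"
    CRITICAL = "degraded/critical"
    REBUILDING = "rebuilding"
    FAILED = "failed"


def classify(failed_disks: int, redundancy: int, rebuilding: bool = False) -> RaidState:
    """Map a failed-disk count to one of the four RAID operating states."""
    if failed_disks > redundancy:
        return RaidState.FAILED        # more failures than the array can absorb
    if rebuilding:
        return RaidState.REBUILDING    # replacement disk inserted, redundancy being rebuilt
    if failed_disks > 0:
        return RaidState.CRITICAL      # still serving I/O, redundancy reduced or exhausted
    return RaidState.NORMAL
```

For example, under these assumptions `classify(1, 1)` models a RAID5 array with one failed disk and yields the degraded/critical state that triggers the method described in this document.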

In the prior-art scheme, a RAID is created in the production equipment and provides storage resources externally as the production storage system. The application server connects to the production equipment and accesses business data through the RAID of the production equipment, and the production equipment (or production storage system) replicates the production storage resources to the disaster tolerance equipment (Disaster Preparation Center).

After the production equipment or production storage system fails, the prior art usually adopts one of the following two recovery modes:

1. Replication reversal (recovery): after the fault of the production equipment or production storage system is cleared, the data on the disaster tolerance equipment is restored to the production equipment, and the services of the production equipment are then resumed.

2. Promotion: the resources of the disaster tolerance equipment are promoted directly, and the application server then connects to the storage resources of the disaster tolerance equipment to access business data.

The disaster tolerance implementation methods provided by the prior art have the following problems:

When the RAID of the production equipment fails, the synchronization technique used by existing disaster tolerance technology involves a time lag, so the data within that lag cannot be fully synchronized and data is lost; that is, RPO > 0 for existing disaster tolerance technology. When the "replication reversal" mode is used to recover after the fault is cleared, the switchover time causes a long service interruption, that is, a large RTO. The "promotion" mode requires network connectivity between the application server and the disaster tolerance equipment; however, to protect data integrity, disaster tolerance equipment generally uses a dedicated network, so the applicability of the "promotion" mode is severely limited.

Summary of the invention

The invention provides a disaster tolerance implementation method and communication equipment thereof, in order to improve data reliability in a communication system, avoid the data loss caused when the storage resources of the production equipment fail, reduce the RPO and RTO of the communication system, and improve the safety of data storage in the system.

To achieve the above object, one aspect of the present invention provides a disaster tolerance implementation method, applied to a system comprising at least one application server, one production equipment and at least one disaster tolerance equipment, wherein the production equipment performs state detection on its own RAID, the method comprising:

when the production equipment detects that its own RAID has failed and is in a critical state, the production equipment synchronizes the information on the RAID to the RAID of the disaster tolerance equipment;

after the synchronization is finished, the production equipment redirects read requests or write requests sent by the application server to the RAID of the production equipment to the RAID of the disaster tolerance equipment, where the corresponding operations are performed.

Preferably, while the production equipment is synchronizing the information on the RAID to the RAID of the disaster tolerance equipment, read requests or write requests sent by the application server to the RAID of the production equipment are handled according to the following policy:

when the application server sends a write request to the RAID of the production equipment, the production equipment redirects the write request to the RAID of the disaster tolerance equipment and deletes the address corresponding to the write request from the unsynchronized data list;

when the application server sends a read request to the RAID of the production equipment, the production equipment determines whether the information corresponding to the read request is in the unsynchronized data list; if it is in the unsynchronized data list, the production equipment reads the data from the RAID of the production equipment and returns it to the application server; if it is not in the unsynchronized data list, the production equipment redirects the read request to the RAID of the disaster tolerance equipment, obtains the information corresponding to the read request, and returns the information to the application server.

Preferably, after the synchronization is finished and the production equipment has redirected read requests or write requests sent by the application server to the RAID of the production equipment to the RAID of the disaster tolerance equipment, the production equipment continues to perform state detection on its own RAID, and the method further comprises:

when the production equipment detects that its own RAID has returned to the normal state, the production equipment restores the information synchronized to the RAID of the disaster tolerance equipment to the RAID of the production equipment;

after the restoration is finished, the production equipment directs read requests or write requests sent by the application server to the RAID of the production equipment to the RAID of the production equipment, where the corresponding operations are performed.

Preferably, while the production equipment is restoring the information synchronized to the RAID of the disaster tolerance equipment to the RAID of the production equipment, read requests or write requests sent by the application server to the RAID of the production equipment are handled according to the following policy:

when the application server sends a write request to the RAID of the production equipment, the production equipment writes the information corresponding to the write request to the RAID of the production equipment and deletes the address corresponding to the write request from the unsynchronized data list;

when the application server sends a read request to the RAID of the production equipment, the production equipment determines whether the information corresponding to the read request is in the unsynchronized data list; if it is in the unsynchronized data list, the production equipment redirects the read request to the RAID of the disaster tolerance equipment, obtains the information corresponding to the read request, and returns the information to the application server; if it is not in the unsynchronized data list, the production equipment reads the data from the RAID of the production equipment and returns it to the application server.
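The restoration-phase policy just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the two RAIDs are modelled as plain dictionaries keyed by address, the unsynchronized data list as a set named `unrestored`, and all function and variable names are hypothetical.

```python
def handle_write(addr, data, production, unrestored):
    """A write lands on the production RAID; its address no longer needs restoring."""
    production[addr] = data
    unrestored.discard(addr)


def handle_read(addr, production, disaster, unrestored):
    """Serve the read from whichever side currently holds the latest copy."""
    if addr in unrestored:
        return disaster[addr]    # not restored yet: the disaster tolerance copy is current
    return production[addr]      # restored or freshly written: the local copy is current


# Hypothetical contents at the start of restoration.
production = {7: "stale7", 8: "stale8"}
disaster = {7: "current7", 8: "current8"}
unrestored = {7, 8}

handle_write(8, "fresh8", production, unrestored)
read_7 = handle_read(7, production, disaster, unrestored)
read_8 = handle_read(8, production, disaster, unrestored)
```

Removing the written address from the list is what keeps the background restoration from later overwriting the fresh data with the older copy held on the disaster tolerance side.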

In another aspect, the present invention further provides communication equipment comprising a RAID for storage, the communication equipment being applied as production equipment in a system comprising at least one application server, one production equipment and at least one disaster tolerance equipment, and specifically comprising:

a detection module, configured to detect the operating state of the RAID;

a switching module, electrically connected to the detection module, the RAID and the RAID of the disaster tolerance equipment, configured at least to, when the detection module detects that the RAID has failed and is in a critical state, switch the RAID currently used by the communication equipment from the RAID to the RAID of one disaster tolerance equipment according to a preset selection policy, and synchronize the information on the RAID to the RAID of the disaster tolerance equipment switched to;

a processing module, electrically connected to the switching module, the RAID and the RAID of the disaster tolerance equipment, configured at least to, after the synchronization operation of the switching module is finished, redirect read requests or write requests sent by the application server to the RAID of the disaster tolerance equipment switched to by the switching module, where the corresponding operations are performed.

Preferably, while the switching module is synchronizing the information on the RAID to the RAID of the disaster tolerance equipment, the processing module is further configured to record the progress of the synchronization operation by maintaining an unsynchronized data list, to determine whether the information corresponding to a received read request sent by the application server has completed the synchronization operation, and to handle read requests or write requests sent by the application server according to the following policy:

when the communication equipment receives a write request sent by the application server, the processing module redirects the write request to the RAID of the disaster tolerance equipment switched to by the switching module, and deletes the address of the information corresponding to the write request from the unsynchronized data list;

when the communication equipment receives a read request sent by the application server and the processing module determines that the information corresponding to the read request has completed the synchronization operation, the processing module redirects the read request to the RAID of the disaster tolerance equipment switched to by the switching module.

Preferably, the switching module is further configured to, when the detection module detects that the RAID has returned to the normal state, restore the information synchronized to the RAID of the disaster tolerance equipment to the RAID of the communication equipment;

the processing module is further configured to, after the restoration operation of the switching module is finished, perform on the RAID the operations corresponding to read requests or write requests sent by the application server to the RAID of the communication equipment.

Preferably, while the switching module is restoring the information synchronized to the RAID of the disaster tolerance equipment to the RAID of the communication equipment, the processing module is further configured to record the progress of the restoration operation by maintaining an unsynchronized data list, to determine whether the information corresponding to a received read request sent by the application server has completed the restoration operation, and to handle read requests or write requests sent by the application server according to the following policy:

when the communication equipment receives a write request sent by the application server, the processing module writes the write request to the RAID and deletes the address of the information corresponding to the write request from the unsynchronized data list;

when the communication equipment receives a read request sent by the application server and the processing module determines that the information corresponding to the read request has not completed the restoration operation, the processing module redirects the read request to the RAID of the disaster tolerance equipment.

Compared with prior art, the present invention has the following advantages:

With the present invention, data can be synchronized to the disaster tolerance equipment when the RAID of the production equipment is in a critical state, and after the fault of the production equipment is cleared, the data on the disaster tolerance equipment is restored to the production equipment. This process is transparent to the application server and does not require the corresponding services to be stopped. The invention thus improves data reliability and the safety of data storage in the communication system, avoids the data loss caused when the storage resources of the production equipment fail, and reduces the RPO and RTO of the communication system.

Description of drawings

Fig. 1 is a schematic flowchart of a disaster tolerance implementation method provided by the invention;

Fig. 2 is a schematic flowchart of the disaster tolerance implementation method in an implementation scenario provided by the invention;

Fig. 3 is a schematic flowchart of service processing in the normal state of the system provided by the invention;

Fig. 4 is a schematic flowchart of service processing in the disaster tolerance state of the system provided by the invention;

Fig. 5 is a schematic flowchart of service processing in the restoration state of the system provided by the invention;

Fig. 6 is a schematic diagram of an application scenario of the disaster tolerance implementation method provided by the invention;

Fig. 7 is a schematic structural diagram of communication equipment provided by the invention.

Embodiment

As described in the Background, the disaster tolerance technology used in existing communication systems mostly either promotes the data stored on the disaster tolerance equipment after a fault occurs in order to keep services running, or copies the data on the disaster tolerance equipment back to the production equipment by replication reversal after the fault is cleared. Both approaches lead to a large RPO and RTO, affect service applications, and cause data loss. To remedy this deficiency, the present invention starts the synchronization operation between the production equipment and the disaster tolerance equipment as soon as the storage system (RAID) of the production equipment enters a critical state and, after synchronization is finished, redirects subsequent services to the disaster tolerance equipment. This ensures that when the production equipment fails, the disaster tolerance equipment can continue to provide service with the latest business data. After the fault is cleared, the data on the disaster tolerance equipment is restored to the production equipment, and after restoration is finished, subsequent services are again handled by the production equipment, which effectively reduces the RPO. Furthermore, during the synchronization or restoration operation, read requests or write requests sent by the application server are handled according to a preset processing policy, so that services are not interrupted during synchronization or restoration, which effectively reduces the RTO and improves data reliability and the safety of data storage in the communication system.

To achieve the above object, the present invention proposes a disaster tolerance implementation method in which the synchronization operation between the production equipment and the disaster tolerance equipment is started as soon as the storage system (RAID) of the production equipment enters a critical state, and subsequent services are redirected to the disaster tolerance equipment after synchronization is finished.

The technical scheme proposed by the invention is applied to a system comprising at least one application server, one production equipment and at least one disaster tolerance equipment, wherein the production equipment performs state detection on its own RAID.

The flow is shown schematically in Fig. 1. The method comprises the following steps:

Step S101: the production equipment performs state detection on its own RAID and determines whether the RAID is in a critical state.

If the RAID is determined to be in a critical state, the synchronization flow is triggered and the method proceeds to step S102;

if the RAID is determined not to be in a critical state, state detection continues in a loop.
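The detection loop of step S101 can be sketched as a simple polling function. The patent does not specify how the state is read or how often, so the `read_state` callable, the `"critical"` string and the polling interval are assumptions made only for illustration.

```python
import time
from typing import Callable


def wait_until_critical(
    read_state: Callable[[], str],
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll the RAID state and return once it is critical (then step S102 would start)."""
    while read_state() != "critical":
        sleep(poll_interval)   # not critical: keep looping on state detection
```

The `sleep` parameter is injectable only so that the loop can be exercised without real delays; a caller would start the synchronization flow as soon as the function returns.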

Step S102: the production equipment synchronizes the information on its own RAID to the RAID of the disaster tolerance equipment.

The synchronization operation in this step can be implemented with synchronization techniques from the prior art; variations in the specific synchronization method do not affect the protection scope of the present invention.

It should further be noted that the information on the RAID of the production equipment to be synchronized in this step is specifically the logical resource data on that RAID. The information the production equipment provides to the application server is logical resource data; that is, the target of a read request or write request sent by the application server is the logical resource data on the RAID of the production equipment. Therefore, the specific object of the disaster tolerance, replication or redirection mentioned in this technical scheme is the logical resource data on the RAID of the production equipment, not all the data on the entire RAID. This reduces unnecessary synchronization traffic, saves system resources, shortens the time consumed by synchronization, and further reduces the RPO and RTO of the system's disaster tolerance.

In addition, after the synchronization operation in this step is finished, the information on the RAID of the disaster tolerance equipment is in effect a copy of the logical resource data on the RAID of the production equipment.

In the subsequent embodiments of the present invention, for ease of description, the general term "information" is used in place of "logical resource data"; this change does not affect the protection scope of the present invention.

During the synchronization operation, to record its progress, the production equipment also maintains an unsynchronized data list that records the identification information (for example, address information) corresponding to all data that is currently unsynchronized. After a piece of data in the unsynchronized data list has been synchronized, the corresponding data address is deleted from the unsynchronized data list to indicate that the data at that address has been synchronized. It should be noted that the unsynchronized data list may contain only the identification information of all unsynchronized data, or may also contain other information corresponding to the unsynchronized data; the identification information may be address information or any other content that can identify the data. Such variations do not affect the protection scope of the present invention. With such an unsynchronized data list, the production equipment can determine unambiguously whether a piece of data has been synchronized, and data that has already been synchronized is not synchronized again.
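The bookkeeping role of the unsynchronized data list can be sketched as follows, assuming the list holds only addresses and each RAID is modelled as a dictionary keyed by address. The function `sync_next` and the sample contents are hypothetical; the actual copy mechanism is left to prior-art synchronization techniques, as stated above.

```python
def sync_next(source, target, unsynced):
    """Copy one pending block to the target and delete its address from the list.

    Returns False once the list is empty, i.e. synchronization is finished.
    """
    if not unsynced:
        return False
    addr = min(unsynced)          # deterministic order, chosen only for illustration
    target[addr] = source[addr]
    unsynced.remove(addr)         # the address is now marked as synchronized
    return True


production_raid = {0: "a", 1: "b", 2: "c"}
disaster_raid = {}
unsynced = set(production_raid)   # initially every address is unsynchronized

steps = 0
while sync_next(production_raid, disaster_raid, unsynced):
    steps += 1
```

Because each address is removed as soon as it is copied, no block is copied twice, and an empty list signals that the synchronization operation is complete.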

In practice, the synchronization operation takes a certain amount of time, and the time increases further when the amount of data to be synchronized is large. Without further service safeguards during this process, a large RTO would result and seriously affect normal service. Therefore, to reduce the RTO of the disaster tolerance process, the present invention handles read requests or write requests sent by the application server during the synchronization operation according to the following preset processing policy. Here, a read request or write request is specifically a read I/O instruction or write I/O instruction sent by the application server to the production equipment. These instructions are implemented on the basis of prior-art instruction interaction protocols, and through them the application server performs the corresponding read and write operations on the data stored on the production equipment (or disaster tolerance equipment).

1. When the application server sends a write request to the RAID of the production equipment, the production equipment redirects the write request to the RAID of the disaster tolerance equipment and deletes the address corresponding to the write request from the unsynchronized data list. This marks the data as no longer needing synchronization, which ensures that newly written data is not overwritten by old data through the synchronization operation; that is, the data on the disaster tolerance equipment is the latest data.

Note that if the data at the address corresponding to the write request has already been synchronized when the production equipment redirects the write request to the RAID of the disaster tolerance equipment, that is, the unsynchronized data list no longer holds that address, the disaster tolerance equipment simply writes the data at the corresponding address according to the redirected write request, overwriting the data previously synchronized to the disaster tolerance equipment. This directly ensures that the data on the disaster tolerance equipment is the latest data.

2, when application server sends read request to the RAID of production equipment, production equipment judges that at first the pairing information of read request is whether in data in synchronization tabulation not.

If this information is not in data in synchronization is tabulated, promptly also not synchronously to disaster tolerance equipment, the data message in the production equipment should be up-to-date information to this information, so the data that production equipment directly reads on the local RAID return to application server.

If this information is not in data in synchronization is tabulated, promptly this information is synchronously to disaster tolerance equipment, data message in the disaster tolerance equipment should be up-to-date information, so, production equipment is redirected to read request the RAID of disaster tolerance equipment, obtain the pairing information of this read request, and return this information and give application server.
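The two policies above can be sketched in a few lines of Python. This is only an illustrative model, not the patented implementation: the class name `SyncPhaseHandler`, the use of dictionaries to stand in for the two RAIDs, and the `sync_one` helper are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPhaseHandler:
    """Hypothetical model of I/O routing while data is copied to the disaster tolerance RAID."""
    local: dict                                   # address -> data on the production RAID
    remote: dict                                  # address -> data on the disaster tolerance RAID
    unsynced: set = field(default_factory=set)    # the "unsynchronized data list"

    def sync_one(self, addr):
        """Background copy of one address; skipped if a write already superseded it."""
        if addr in self.unsynced:
            self.remote[addr] = self.local[addr]
            self.unsynced.discard(addr)

    def write(self, addr, data):
        """Policy 1: redirect the write and delete the address from the list."""
        self.remote[addr] = data
        self.unsynced.discard(addr)

    def read(self, addr):
        """Policy 2: read locally while unsynchronized, otherwise redirect."""
        if addr in self.unsynced:
            return self.local[addr]
        return self.remote[addr]


h = SyncPhaseHandler(local={0: "a", 1: "b"}, remote={}, unsynced={0, 1})
h.write(0, "A")      # written to the disaster tolerance RAID only
h.sync_one(0)        # no effect: address 0 is no longer in the list
first = h.read(0)    # redirected to the disaster tolerance RAID
second = h.read(1)   # still unsynchronized, served from the local RAID
```

Because `write` removes the address before any later background copy runs, the stale local value can never overwrite the newer redirected write, which is the property policy 1 is meant to guarantee.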

After the synchronization operation is complete, the procedure proceeds to step S103.

Step S103, the production equipment redirects the read requests or write requests that the application server sends to the RAID of the production equipment to the RAID of the disaster tolerance equipment, where the corresponding operations are performed.

Because the synchronization operation of step S102 is complete, the data in the disaster tolerance equipment is guaranteed to be the latest. The production equipment therefore redirects all read requests and write requests from all application servers to the disaster tolerance equipment, so that the services requested by the application servers continue on the latest data in the disaster tolerance equipment.

At this point, the RAID of the production equipment, which is in the critical state, no longer carries any service. Even if it fails completely, service continues unaffected, and because the data has been synchronized to the disaster tolerance equipment, no data in the system is lost and data integrity is preserved.

In practice, after step S103 is complete, the system or the production equipment itself may also raise an alarm to the administrator to prompt repair of the RAID fault. After the fault is cleared, the technical solution of the present invention further includes a process of restoring data from the disaster tolerance equipment to the production equipment, implemented as follows:

After step S103 is complete, the production equipment continues to monitor the state of its own RAID. If the RAID fault is cleared, that is, the RAID returns to the normal state or the production equipment has been fitted with a new RAID, the production equipment restores the information previously synchronized to the RAID of the disaster tolerance equipment back onto the RAID of the production equipment.

This recovery operation is essentially the reverse of the synchronization operation: data flows from the production equipment to the disaster tolerance equipment during synchronization, and from the disaster tolerance equipment to the production equipment during recovery.

Note that, as in the synchronization operation, the production equipment maintains an unsynchronized data list during the recovery operation, recording all data that has not yet been restored. Once the data corresponding to an entry in the list has been restored, its address is deleted from the list to indicate that the data at that address has been recovered.
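As a rough illustration of how the list tracks recovery progress, the following Python sketch copies every listed address back to the local store and removes it from the list once restored. The function name `run_recovery` and the dictionary-based stores are hypothetical, and the sketch deliberately ignores concurrent application I/O.

```python
def run_recovery(local, remote, unrecovered):
    """Restore every address still in the list from remote to local.

    local, remote: dicts standing in for the production and disaster
    tolerance RAIDs (an assumption of this sketch).
    unrecovered: the set playing the role of the unsynchronized data list.
    Returns the addresses restored, in the order they were processed.
    """
    restored = []
    for addr in sorted(unrecovered):      # sorted() takes a snapshot, so the set can shrink safely
        local[addr] = remote[addr]        # copy the data back to the production RAID
        unrecovered.discard(addr)         # mark this address as recovered
        restored.append(addr)
    return restored


local_raid = {}
remote_raid = {2: "c", 1: "b"}
pending = {1, 2}
order = run_recovery(local_raid, remote_raid, pending)
```

An empty list after the loop is the condition under which the recovery operation can be considered finished.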

In addition, to reduce the RTO of the disaster tolerance process, during the recovery operation the production equipment handles the read requests and write requests sent by the application server according to the following preset processing policy.

1. When the application server sends a write request to the RAID of the production equipment, the production equipment writes the information corresponding to the write request to the RAID of the production equipment and deletes the address corresponding to the write request from the unsynchronized data list. This marks the data as no longer needing recovery, which ensures that newly written data is not overwritten by old data during the recovery operation; that is, the data in the production equipment is the latest data.

Note that if the data at the address of the write request has already been restored when the production equipment receives the write request from the application server (that is, the unsynchronized data list no longer holds this address), the production equipment simply writes the data at the corresponding local address according to the write request, overwriting the data previously restored to the production equipment. This directly ensures that the data in the production equipment is the latest data.

2. When the application server sends a read request to the RAID of the production equipment, the production equipment first determines whether the information corresponding to the read request is in the unsynchronized data list.

If the information is in the unsynchronized data list, it has not yet been restored to the production equipment, and the data in the disaster tolerance equipment is the latest. The production equipment therefore redirects the read request to the RAID of the disaster tolerance equipment, obtains the information corresponding to the read request, and returns it to the application server.

If the information is not in the unsynchronized data list, it has already been restored to the production equipment, and the data in the production equipment is the latest. The production equipment therefore reads the data directly from the local RAID and returns it to the application server.
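The recovery-phase policy mirrors the synchronization-phase policy with the roles of the two RAIDs swapped. The sketch below is a minimal illustration under assumed names (`recovery_write`, `recovery_read`) and with dictionaries standing in for the RAIDs; none of these identifiers come from the patent.

```python
def recovery_write(local, unrecovered, addr, data):
    """Policy 1: write to the local RAID and delete the address from the list."""
    local[addr] = data
    unrecovered.discard(addr)


def recovery_read(local, remote, unrecovered, addr):
    """Policy 2: redirect while the address is still listed, otherwise read locally."""
    source = remote if addr in unrecovered else local
    return source[addr]


local_raid = {}
remote_raid = {0: "x", 1: "y"}
pending = {0, 1}                                          # nothing restored yet

recovery_write(local_raid, pending, 0, "X")               # new data lands on the local RAID
r0 = recovery_read(local_raid, remote_raid, pending, 0)   # served locally
r1 = recovery_read(local_raid, remote_raid, pending, 1)   # redirected to the disaster tolerance RAID
```

After the write, address 0 is skipped by any later background restore, so the older copy on the disaster tolerance RAID cannot overwrite it.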

With the technical solution provided by the present invention, data can be synchronized to the disaster tolerance equipment when the RAID of the production equipment enters the critical state, and restored from the disaster tolerance equipment to the production equipment after the fault is cleared. The whole process is transparent to the application server and does not require the service to stop. This improves data reliability and storage security in the communication system, avoids the data loss that a failure of the production equipment's storage resources would otherwise cause, and reduces the RPO and RTO of the communication system.

The technical solution of the present invention is described below with reference to a concrete implementation scenario. As shown in Figure 2, the flow is as follows:

Step S201, a disk failure occurs in the redundant array (RAID) of the production equipment, and the RAID enters the critical state (Critical).

Step S202, the production equipment checks which resources on the RAID have the disaster tolerance mechanism enabled and triggers synchronization of those data resources with the disaster tolerance equipment, copying them to the RAID of the disaster tolerance equipment.

In a typical disaster tolerance environment, the production equipment triggers a data synchronization with the disaster tolerance equipment at fixed intervals or each time a certain amount of data has changed.

The synchronization in step S202 differs from this periodic synchronization: it is performed when the RAID of the production equipment enters the critical state, to ensure that the data stored on the disaster tolerance equipment is up to date and thus to avoid data loss caused by a RAID failure on the production equipment.

As described in the previous embodiment, during the synchronization operation of step S202 the production equipment also maintains an unsynchronized data list to track the progress of the synchronization; the list records all data not yet synchronized.

If the application server issues a write I/O, that is, a write request, to the production equipment during the synchronization, step S203 is performed;

If the application server issues a read I/O, that is, a read request, to the production equipment during the synchronization, step S204 is performed;

If the application server issues neither a write I/O nor a read I/O to the production equipment during this process, step S207 is performed directly after the synchronization operation is complete.

Step S203, the production equipment redirects the data of the write request directly to the disaster tolerance equipment.

At the same time, if the data at the address of the write request has not yet been synchronized, the production equipment marks the newly written address as no longer requiring synchronization, that is, it deletes the address of the write request from the unsynchronized data list. This ensures that the data on the disaster tolerance equipment is the latest.

Step S204, the production equipment determines whether the address being read is in the unsynchronized data list, that is, whether the data has already been synchronized to the disaster tolerance equipment.

If the address is in the unsynchronized data list, the data has not yet been synchronized to the disaster tolerance equipment, so the data on the production equipment is the latest, and step S205 is performed;

If the address is not in the unsynchronized data list, the data has already been synchronized to the disaster tolerance equipment, so the data on the disaster tolerance equipment is the latest, and step S206 is performed.

Step S205, the production equipment reads the data directly from the local RAID and returns it to the application server.

Step S206, the production equipment redirects the read I/O to the disaster tolerance equipment; the disaster tolerance equipment reads the data and returns it to the production equipment, which then returns it to the application server.

Step S207, the production equipment redirects all read and write I/O from the application server to the storage of the disaster tolerance equipment.

With this step, the application server no longer reads from or writes to the local RAID, which avoids data loss when the local RAID enters the failed state.

After this step, and until the RAID in the production equipment is repaired or replaced, the RAID of the disaster tolerance equipment performs the operations requested by the redirected read and write requests it receives from the production equipment. Throughout, the application server interacts only with the production equipment, so the change in the actual target of the operations does not affect the service.

When the administrator restores the local redundant array to normal (for example by replacing a disk or a disk enclosure), step S208 is performed.

Step S208, the data resources synchronized to the disaster tolerance equipment in the preceding steps are restored to the production equipment.

As described in the previous embodiment, during the recovery operation of step S208 the production equipment also maintains an unsynchronized data list to track the progress of the recovery; the list records all data not yet restored.

If the application server issues a write I/O, that is, a write request, to the production equipment during the recovery, step S209 is performed;

If the application server issues a read I/O, that is, a read request, to the production equipment during the recovery, step S210 is performed;

If the application server issues neither a write I/O nor a read I/O to the production equipment during this process, step S213 is performed directly after the recovery operation is complete.

Step S209, the production equipment writes the data of the write I/O directly to the local RAID.

At the same time, if the data at the address of the write request has not yet been restored, the production equipment marks the newly written address as no longer requiring recovery, that is, it deletes the address of the write request from the unsynchronized data list. This ensures that the data on the production equipment is the latest.

Step S210, the production equipment determines whether the address being read is in the unsynchronized data list, that is, whether the data has already been restored to the production equipment.

If the address is in the unsynchronized data list, the data has not yet been restored to the production equipment, so the data on the disaster tolerance equipment is the latest, and step S211 is performed;

If the address is not in the unsynchronized data list, the data has already been restored to the production equipment, so the data on the production equipment is the latest, and step S212 is performed;

Step S211, the production equipment redirects the read I/O to the disaster tolerance equipment; the disaster tolerance equipment reads the data and returns it to the production equipment, which then returns it to the application server.

Step S212, the production equipment reads the data directly from the local RAID and returns it to the application server.

Step S213, the production equipment performs all read and write I/O from the application server on the local RAID.

With this step, the RAID of the production equipment again becomes the target of the application server's operations, and the service returns to the normal state.
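The flow of steps S201 to S213 can be viewed as a small state machine with four phases. The sketch below is one possible way to model it; the phase names, event names and the `next_phase` function are assumptions made for illustration and do not appear in the source.

```python
from enum import Enum, auto


class Phase(Enum):
    NORMAL = auto()        # I/O served by the local RAID (after S213)
    SYNCING = auto()       # S202 to S206: copying to the disaster tolerance RAID
    REDIRECTED = auto()    # S207: all I/O redirected to the disaster tolerance RAID
    RECOVERING = auto()    # S208 to S212: copying back to the local RAID


# Hypothetical event names mapped to the transitions described in the flow.
TRANSITIONS = {
    (Phase.NORMAL, "raid_critical"): Phase.SYNCING,
    (Phase.SYNCING, "sync_done"): Phase.REDIRECTED,
    (Phase.REDIRECTED, "raid_repaired"): Phase.RECOVERING,
    (Phase.RECOVERING, "recovery_done"): Phase.NORMAL,
}


def next_phase(phase, event):
    """Return the phase after an event; events that do not apply leave the phase unchanged."""
    return TRANSITIONS.get((phase, event), phase)


phase = Phase.NORMAL
visited = []
for event in ["raid_critical", "sync_done", "raid_repaired", "recovery_done"]:
    phase = next_phase(phase, event)
    visited.append(phase)
```

Keeping the transitions in a table makes it easy to see that each phase has exactly one exit event, which matches the linear flow of Figure 2.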

To explain the technical solution in further detail, the solution is described below for three situations in a concrete implementation scenario: the normal state, the disaster tolerance state and the recovery state.

Figure 3 is a flow chart of service processing in the normal state of the system, which includes the following steps:

Step S301, the application server sends a number of service read and write requests to the production equipment.

These service read and write requests cause the corresponding data read and write operations to be performed on the production equipment.

According to a preset rule, step S302 is performed when a preset replication period has elapsed or when the amount of data changed by these service read and write operations reaches a preset threshold.

Step S302, the production equipment triggers the synchronization flow and synchronizes the data currently on the production equipment to the disaster tolerance equipment.

After the synchronization is complete, the flow returns to step S301 and repeats, so that the system performs periodic data synchronization.
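The trigger rule of steps S301 and S302 (a replication period or a data change threshold, whichever is reached first) might be modeled as follows. The class `ReplicationTrigger`, its field names and the units (seconds, bytes) are assumptions of this sketch; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass
class ReplicationTrigger:
    """Hypothetical model of the periodic synchronization rule in the normal state."""
    period_s: float            # preset replication period, in seconds (assumed unit)
    change_threshold: int      # preset threshold on changed data, in bytes (assumed unit)
    last_sync: float = 0.0     # time of the last completed synchronization
    changed: int = 0           # bytes changed since the last synchronization

    def record_write(self, nbytes):
        """Account for data changed by a service write (step S301)."""
        self.changed += nbytes

    def should_sync(self, now):
        """True when the period has elapsed or the change threshold is reached."""
        return (now - self.last_sync >= self.period_s
                or self.changed >= self.change_threshold)

    def mark_synced(self, now):
        """Reset the counters after step S302 completes."""
        self.last_sync = now
        self.changed = 0


t = ReplicationTrigger(period_s=60.0, change_threshold=1000)
t.record_write(400)
before = t.should_sync(10.0)       # neither condition met yet
t.record_write(700)
by_volume = t.should_sync(10.0)    # 1100 bytes changed, threshold reached
t.mark_synced(10.0)
by_time = t.should_sync(70.0)      # 60 seconds since the last synchronization
```

Time is passed in explicitly rather than read from a clock so that the rule can be exercised deterministically.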

Figure 4 is a flow chart of service processing in the disaster tolerance state of the system, which includes the following steps:

Step S401, the redundant array (RAID) of the production equipment enters the critical state.

Step S402, the production equipment triggers the data synchronization operation with the disaster tolerance equipment.

Step S403, the production equipment copies data to the disaster tolerance equipment.

This completes the synchronization trigger flow between the production equipment and the disaster tolerance equipment.

After the trigger flow, the synchronization operation flow begins. During it, depending on the read and write requests that the application server sends to the production equipment, the flow further includes the following steps:

Step S404, the application server sends a write request to the production equipment.

Step S405, the production equipment redirects the write request and sends it to the disaster tolerance equipment.

Step S406, the disaster tolerance equipment returns a write-success acknowledgement message to the production equipment.

Step S407, the production equipment forwards the acknowledgement message to the application server and, at the same time, deletes the address of the write request from the unsynchronized data list (the queue of addresses awaiting synchronization).

Steps S404 to S407 above are the handling flow for a write request during synchronization. If the application server sends no write request to the production equipment during synchronization, steps S404 to S407 do not occur in the actual flow; such a variation does not affect the protection scope of the present invention.

Step S408, the application server sends a read request to the production equipment.

Step S409, the production equipment determines whether the address corresponding to the read request is contained in the unsynchronized data list.

If the result is yes, the flow proceeds to step S410;

If the result is no, the flow proceeds to step S411.

Step S410, the production equipment reads the data from the local RAID according to the read request and returns it to the application server.

Step S411, the production equipment redirects the read request and sends it to the disaster tolerance equipment.

Step S412, the disaster tolerance equipment returns the data to the production equipment according to the read request.

Step S413, the production equipment forwards the returned data to the application server.

Steps S408 to S413 above are the handling flow for a read request during synchronization. If the application server sends no read request to the production equipment during synchronization, steps S408 to S413 do not occur in the actual flow; such a variation does not affect the protection scope of the present invention.

This processing flow optionally includes the above handling of read requests or write requests. In either case, after the synchronization operation started in step S403 is complete, the flow further includes the following steps:

Step S414, the application server sends a service read request or write request to the production equipment.

Step S415, the production equipment redirects the service read request or write request and sends it to the disaster tolerance equipment.

Step S416, the disaster tolerance equipment returns the result of the service read request or write request to the production equipment.

Step S417, the production equipment forwards the result to the application server.

Figure 5 is a flow chart of service processing in the recovery state of the system, which includes the following steps:

Step S501, the redundant array (RAID) of the production equipment returns to the normal state.

Step S502, the production equipment triggers the data restore operation with the disaster tolerance equipment (the recovery operation described in the previous embodiment).

Step S503, the disaster tolerance equipment starts the recovery flow and copies data to the production equipment.

This completes the recovery trigger flow between the production equipment and the disaster tolerance equipment.

After the recovery trigger flow, the recovery operation flow begins. During it, depending on the read and write requests that the application server sends to the production equipment, the flow further includes the following steps:

Step S504, the application server sends a write request to the production equipment.

Step S505, the production equipment performs the write operation directly on the local RAID according to the write request and, at the same time, deletes the corresponding address from the unsynchronized data list (the queue of addresses awaiting synchronization).

Step S506, the production equipment returns a write-success acknowledgement message to the application server.

Steps S504 to S506 above are the handling flow for a write request during recovery. If the application server sends no write request to the production equipment during recovery, steps S504 to S506 do not occur in the actual flow; such a variation does not affect the protection scope of the present invention.

Step S507, the application server sends a read request to the production equipment.

Step S508, the production equipment determines whether the address corresponding to the read request is contained in the unsynchronized data list.

If the result is no, the flow proceeds to step S509;

If the result is yes, the flow proceeds to step S510.

Step S509, the production equipment reads the data from the local RAID according to the read request and returns it to the application server.

Step S510, the production equipment redirects the read request and sends it to the disaster tolerance equipment.

Step S511, the disaster tolerance equipment returns the data to the production equipment according to the read request.

Step S512, the production equipment forwards the returned data to the application server.

Steps S507 to S512 above are the handling flow for a read request during recovery. If the application server sends no read request to the production equipment during recovery, steps S507 to S512 do not occur in the actual flow; such a variation does not affect the protection scope of the present invention.

This processing flow optionally includes the above handling of read requests or write requests. In either case, after the recovery operation started in step S503 is complete, the flow further includes the following steps:

Step S513, the application server sends a service read request or write request to the production equipment.

Step S514, the production equipment performs the operation according to the service read request or write request and returns the result to the application server.

After the above recovery operation, the system's service is back to normal, and the application server's service read and write operations on the production equipment again follow the flow shown in Figure 3.

Further, to implement the above technical solution, as shown in Figure 6, the present invention also provides a schematic diagram of an application scenario of the disaster tolerance method, comprising at least one application server 61, one production equipment 62 and at least one disaster tolerance equipment 63, where production equipment 62 and disaster tolerance equipment 63 include RAID 620 and RAID 630, respectively, for storing data.

Production equipment 62 is electrically connected to application server 61 and to disaster tolerance equipment 63. When RAID 620 of production equipment 62 enters the critical state or returns to the normal state, production equipment 62 performs the synchronization or recovery of information with disaster tolerance equipment 63; after the synchronization operation is complete, it redirects the read requests or write requests sent by application server 61 to RAID 630 of disaster tolerance equipment 63, where the corresponding operations are performed.

In a concrete application scenario, as shown in Figure 7, production equipment 62 is a communication equipment 62 that includes RAID 620 for storage and further includes the following modules:

Detection module 621, configured to detect the working state of RAID 620.

Handover module 622, electrically connected to detection module 621, RAID 620 and RAID 630 of disaster tolerance equipment 63, configured at least to, when detection module 621 detects that RAID 620 has a fault and is in the critical state, switch the RAID currently used by communication equipment 62 from RAID 620 to RAID 630 of a disaster tolerance equipment 63 according to a preset selection policy, and synchronize the information on RAID 620 to RAID 630 of that disaster tolerance equipment 63.

The system may contain more than one disaster tolerance equipment 63, and RAID 630 of each is electrically connected to handover module 622 of communication equipment 62. Therefore, when detection module 621 detects that RAID 620 is in the critical state, handover module 622 must select at least one RAID 630 among the connected disaster tolerance equipment 63 to replace RAID 620 as the currently used RAID for continued information storage. The concrete selection policy can be set according to the requirements of the application scenario, for example choosing the equipment with the most available resources or the one physically closest to communication equipment 62; variations of the selection policy do not affect the protection scope of the present invention.
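The two example selection policies mentioned above (most available resources, nearest physical distance) could be expressed as interchangeable functions. The sketch below is illustrative only: the `Candidate` record, its fields `free_gb` and `distance_km`, and the function names are assumptions, since the source does not define how resources or distance are measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of one connected disaster tolerance equipment."""
    name: str
    free_gb: int          # available storage resources (assumed metric)
    distance_km: float    # physical distance to the communication equipment (assumed metric)


def most_free(candidates):
    """Policy: pick the equipment with the most available resources."""
    return max(candidates, key=lambda c: c.free_gb)


def nearest(candidates):
    """Policy: pick the equipment physically closest to the communication equipment."""
    return min(candidates, key=lambda c: c.distance_km)


def select_target(candidates, policy=most_free):
    """Apply a preset selection policy to the connected candidates."""
    if not candidates:
        raise ValueError("no disaster tolerance equipment is connected")
    return policy(candidates)


candidates = [Candidate("dr-a", 500, 120.0), Candidate("dr-b", 800, 300.0)]
by_capacity = select_target(candidates)
by_distance = select_target(candidates, policy=nearest)
```

Passing the policy as a function keeps the handover logic independent of the policy chosen, in line with the statement that the policy may vary by application scenario.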

Processing module 623, electrically connected to handover module 622, RAID 620 and RAID 630 of disaster tolerance equipment 63, configured at least to, after the synchronization operation of handover module 622 is complete, redirect the read requests or write requests sent by application server 61 to RAID 630 of the disaster tolerance equipment 63 to which handover module 622 switched, where the corresponding operations are performed.

In a concrete application scenario, while handover module 622 synchronizes the information on RAID 620 to RAID 630 of disaster tolerance equipment 63, processing module 623 is further configured to record the progress of the synchronization operation by maintaining an unsynchronized data list, to determine whether the information corresponding to a read request received from application server 61 has completed the synchronization operation, and to handle the read requests or write requests sent by application server 61 according to the following policy:

When communication equipment 62 receives a write request sent by application server 61, processing module 623 redirects the write request to RAID 630 of the disaster tolerance equipment 63 to which handover module 622 switched, and deletes the address corresponding to the write request from the unsynchronized data list;

When communication equipment 62 receives a read request sent by application server 61 and processing module 623 determines that the information corresponding to the read request has completed the synchronization operation, processing module 623 redirects the read request to RAID 630 of the disaster tolerance equipment 63 to which handover module 622 switched.

After the disaster tolerance process triggered by RAID 620 entering the critical state has been carried out, a concrete application scenario also includes the corresponding recovery process, in which RAID 620 leaves the critical state and returns to the normal working state. In this process, handover module 622 is further configured to restore the information synchronized to RAID 630 of disaster tolerance equipment 63 onto RAID 620 of communication equipment 62 when detection module 621 detects that RAID 620 has returned to the normal state;

Processing module 623 is further configured to, after the recovery operation of handover module 622 is complete, perform on RAID 620 of communication equipment 62 the operations corresponding to the read requests or write requests that application server 61 sends to RAID 620 of communication equipment 62.

Similarly to the disaster tolerance process, while handover module 622 restores the information synchronized to RAID 630 of disaster tolerance equipment 63 onto RAID 620 of communication equipment 62, processing module 623 is further configured to record the progress of the recovery operation by maintaining an unsynchronized data list, to determine whether the information corresponding to a read request received from application server 61 has completed the recovery operation, and to handle the read requests or write requests sent by application server 61 according to the following policy:

When communication equipment 62 receives a write request sent by application server 61, processing module 623 writes the data of the write request to RAID 620 and deletes the address of the information corresponding to the write request from the unsynchronized data list;

When communication equipment 62 receives a read request sent by application server 61 and processing module 623 determines that the information corresponding to the read request has not completed the recovery operation, processing module 623 redirects the read request to RAID 630 of disaster tolerance equipment 63.

Note that RAID 620 may be a module inside communication equipment 62 or an independent storage device outside communication equipment 62; such a variation does not affect the protection scope of the present invention.

The above modules may be located in one device or distributed across several devices. They may be merged into a single module or further split into several submodules.

With the communication system and equipment provided by the present invention, data can be synchronized to the disaster tolerance equipment when the RAID of the production equipment enters the critical state, and restored from the disaster tolerance equipment to the production equipment after the fault is cleared. The whole process is transparent to the application server and does not require the service to stop. This improves data reliability and storage security in the communication system, avoids the data loss that a failure of the production equipment's storage resources would otherwise cause, and reduces the RPO and RTO of the communication system.

Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to implement the present invention.

Those skilled in the art will appreciate that the modules of the device in an implementation scenario may be distributed in the device as described for that scenario, or may be changed accordingly and located in one or more devices different from that scenario. The modules of the above implementation scenarios may be merged into a single module or further split into several submodules.

The sequence numbers used above are for description only and do not indicate the relative merit of the implementation scenarios.

From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented in hardware, or in software together with a necessary general-purpose hardware platform. On this understanding, the technical solution of the present invention can be embodied as a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive or a portable hard drive) and which includes instructions that cause a computer device (such as a personal computer, a server or a network device) to perform the method described in each implementation scenario of the present invention.

The above discloses only several concrete implementation scenarios of the present invention. The present invention is not limited to them, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present invention.

Claims

1, a kind of implementation method of disaster tolerance is applied to comprise in the system of at least one application server, a production equipment and at least one disaster tolerance equipment that it is characterized in that, described production equipment carries out state-detection to the RAID of self, and described method comprises:

2. The method as claimed in claim 1, characterized in that, while the production equipment is synchronizing the information on the RAID to the RAID of the disaster tolerance equipment, read requests or write requests sent by the application server to the RAID of the production equipment are handled according to the following strategy, wherein the production equipment records the progress of the synchronization by maintaining a list of unsynchronized data:

3. The method as claimed in claim 1, characterized in that, after the production equipment redirects the read request or write request sent by the application server to the RAID of the production equipment to the RAID of the disaster tolerance equipment so that the corresponding operation is carried out, the production equipment continues to perform state detection on its own RAID, and the method further comprises:

4. The method as claimed in claim 3, characterized in that, while the production equipment is restoring the information synchronized to the RAID of the disaster tolerance equipment back to the RAID of the production equipment, read requests or write requests sent by the application server to the RAID of the production equipment are handled according to the following strategy, wherein the production equipment records the progress of the restoration by maintaining a list of unsynchronized data:

5. A communication equipment comprising a RAID for storage, characterized in that the communication equipment is applied, as the production equipment, to a system comprising at least one application server, one production equipment and at least one disaster tolerance equipment, and specifically comprises:

a detection module, configured to detect the working state of the RAID;

6. The communication equipment as claimed in claim 5, characterized in that, while the handover module is synchronizing the information on the RAID to the RAID of the disaster tolerance equipment, the processing module is further configured to record the progress of the synchronization operation by maintaining a list of unsynchronized data, and to determine whether the information corresponding to a read request received from the application server has completed the synchronization operation, read requests or write requests sent by the application server being handled according to the following strategy:

7. The communication equipment as claimed in claim 5, characterized in that

the handover module is further configured to, when the detection module detects that the RAID has returned to a normal state, restore the information synchronized to the RAID of the disaster tolerance equipment back to the RAID of the communication equipment;

8. The communication equipment as claimed in claim 7, characterized in that, while the handover module is restoring the information synchronized to the RAID of the disaster tolerance equipment back to the RAID of the communication equipment, the processing module is further configured to record the progress of the recovery operation by maintaining a list of unsynchronized data, and to determine whether the information corresponding to a read request received from the application server has completed the recovery operation, read requests or write requests sent by the application server being handled according to the following strategy: