CN103354503A

CN103354503A - Cloud storage system capable of automatically detecting and replacing failure nodes and method thereof

Info

Publication number: CN103354503A
Application number: CN2013101937604A
Authority: CN
Inventors: 陈清华; 杜国娟
Original assignee: ZHEJIANG SANLOGIC TECHNOLOGY Co Ltd
Current assignee: ZHEJIANG SANLOGIC TECHNOLOGY Co Ltd
Priority date: 2013-05-23
Filing date: 2013-05-23
Publication date: 2013-10-16

Abstract

The invention discloses a cloud storage system capable of automatically detecting and replacing failed nodes, and a method thereof, and aims to provide a cloud storage system with a self-repairing capability. The cloud storage system comprises a storage cluster, a monitoring and management server, and a plurality of standby servers. The storage cluster comprises a plurality of storage servers; the monitoring and management server is connected to all of the standby servers and to all of the storage servers in the storage cluster; and the monitoring and management server is provided with an input/output interface that communicates with the outside. Users store data in, or read data from, the storage servers through the input/output interface of the monitoring and management server. Meanwhile, the monitoring and management server monitors the health of each storage server, and if a storage server breaks down, it is replaced by a standby server, thereby ensuring normal operation of the cloud storage system. The cloud storage system disclosed by the invention is applicable to all cloud storage architectures.

Description

A cloud storage system capable of automatically detecting and replacing failed nodes, and a method thereof

Technical field

The present invention relates to cloud storage systems, and in particular to a cloud storage system capable of automatically detecting and replacing failed nodes, and a method thereof.

Background technology

Cloud storage is a new concept extended and developed from the concept of cloud computing. It refers to a system that, through functions such as cluster applications, grid technology, or distributed file systems, uses application software to bring together a large number of storage devices of various types in a network so that they work cooperatively and jointly provide data storage and business access functions to the outside. Cloud storage technology is a direction of future IT development.

Because a cloud storage system is large in scale and has many nodes, storage node failures inevitably occur.

On June 23, 2010, the State Intellectual Property Office of the People's Republic of China published patent document CN101753617A, entitled "A cloud storage system and method". That system comprises a global scheduling layer and a cloud storage layer, wherein: the global scheduling layer is used to locate, according to a received access request and the resource it requests, the position in the cloud storage layer where the resource resides; the global scheduling layer consists of one or more servers; and the cloud storage layer consists of at least one cloud storage node. Using a global scheduling layer together with a cloud storage layer makes it possible to exploit both the advantages of the conventional storage architecture offered by the global scheduling layer and the strong scalability and low cost of the cloud storage layer. However, when a node (server) in the cloud storage layer fails, the failure is difficult to handle effectively; it affects subsequent use and may even cause irreparable loss.

Summary of the invention

The present invention mainly solves the technical problem in the prior art that a failed node is difficult to handle and affects subsequent use. It provides a cloud storage system, and a method thereof, that can automatically detect and replace failed nodes and thereby ensure normal operation of the cloud storage system.

The present invention solves the above technical problem mainly through the following technical solution: a cloud storage system capable of automatically detecting and replacing failed nodes comprises a storage cluster and a monitoring and management server, and further comprises several standby servers. The storage cluster comprises several storage servers; the monitoring and management server is connected to every standby server and to every storage server in the storage cluster; and the monitoring and management server is provided with an input/output interface for communicating with the outside.

Each storage server is a storage node. Users store data in, or read data from, the storage servers through the input/output interface of the monitoring and management server. At the same time, the monitoring and management server monitors the health of each storage server; if a storage server fails, a standby server replaces it, ensuring normal operation of the cloud storage system.

Preferably, the cloud storage system further comprises an alarm server connected to the monitoring and management server. After a storage server fails, the monitoring and management server raises an alarm through the alarm server, notifying the administrators to repair the failed server.

Preferably, the alarm server comprises a wireless communication unit. Through the wireless communication unit the alarm server is freed from the constraint of cables and can raise alarms remotely. The wireless communication unit may support a mobile communication network and/or a wireless local area network (WLAN).

Preferably, each standby server comprises a power control module connected to the monitoring and management server. The power control module can put the standby server into a sleep state or a wake-up state. While a standby server has not been activated as a replacement, it stays in the sleep state, in which only a minimal current flows, so power consumption is low and energy is saved. When a standby server is needed to take over the work of a failed server, the power control module wakes it up and supplies the current required for normal operation, ensuring that storage work proceeds normally.

Preferably, the cloud storage system further comprises a cache server connected to the monitoring and management server. When a user stores a file in the cloud storage system, the file is first held temporarily in the cache server and, once it has been completely received, is sent on to the designated storage server for storage. If the storage server receiving the data fails during the storage process, the complete file can be stored again in the standby server that replaces it, which reduces the risk of losing all or part of the file.

Preferably, the cloud storage system further comprises a firewall connected in series on the input/output interface of the monitoring and management server. The firewall protects the cloud storage system from external attacks.

A method by which a cloud storage system automatically detects and replaces failed nodes comprises the following steps:

Step 1: the monitoring and management server checks the state of each storage server; when it finds that a storage server has failed, it proceeds to step 2. The storage server that has failed is the failed server;

Step 2: the monitoring and management server wakes up a standby server and raises the grade of the woken standby server to the same grade as the failed server;

Step 3: the monitoring and management server lowers the grade of the failed server to the fault grade, which is lower than the grade of any normally operating storage server;

Step 4: the monitoring and management server checks the state of the woken standby server. If it is normal, subsequent data that would have been stored in the failed server is sent to the woken standby server for storage. If it is abnormal, the woken standby server is set as the new failed server, and steps 2 to 4 are repeated.
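Steps 2 to 4 can be illustrated with a minimal Python sketch. The `Server` class, the numeric `FAULT_GRADE` value, and the `is_healthy` callback are assumptions made for illustration only; the source does not specify how grades are represented or how the state check is performed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

FAULT_GRADE = 0  # assumed value; must be below every working grade


@dataclass
class Server:
    name: str
    grade: int
    awake: bool = True


def replace_failed(failed: Server, standbys: List[Server],
                   is_healthy: Callable[[Server], bool]) -> Optional[Server]:
    """Apply steps 2 to 4 until a healthy replacement is found."""
    for spare in standbys:
        if spare.awake:
            continue                     # already in use, not a sleeping spare
        spare.awake = True               # step 2: wake the standby server
        spare.grade = failed.grade       # step 2: promote to the failed server's grade
        failed.grade = FAULT_GRADE       # step 3: demote the failed server
        if is_healthy(spare):            # step 4: check the woken standby
            return spare                 # later writes go to this server
        failed = spare                   # step 4: it becomes the new failed server
    failed.grade = FAULT_GRADE           # no spare left to take over the grade
    return None
```

Returning `None` when every standby server has been tried is a choice of this sketch; the source does not say what happens when no healthy standby server remains.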

Preferably, after the failed server has been replaced by a standby server, an alarm is raised through the alarm server.

Preferably, when the cloud storage system starts for the first time, the monitoring and management module first checks whether all storage servers, standby servers, the cache server, and the firewall are normal. Once all are normal, it puts the standby servers into the sleep state and proceeds to step 1. If any of the checked devices is not operating normally, the system enters a standby state.

The substantial effects of the present invention are that a failed server can be replaced in time, ensuring that the cloud storage system operates normally; failures can be reported to administrators in time; and the risk of losing all or part of a file is reduced.

Description of drawings

Fig. 1 is a structural schematic diagram of a cloud storage system of the present invention;

Fig. 2 is a flowchart of a method of the present invention for detecting and replacing a failed server;

In the figures: 1, monitoring and management server; 2, storage server; 3, standby server; 4, alarm server; 5, firewall; 6, cache server.

Embodiment

The technical solution of the present invention is described in further detail below through an embodiment and with reference to the accompanying drawings.

Embodiment: the cloud storage system of this embodiment, which can automatically detect and replace failed nodes, is shown in Fig. 1. It comprises a storage cluster, a monitoring and management server 1, an alarm server 4, a firewall 5, a cache server 6, and two standby servers 3. The storage cluster comprises several storage servers 2. The monitoring and management server 1 is connected to every storage server 2 and every standby server 3, and also to the alarm server 4 and the cache server 6. The firewall 5 is connected in series on the input/output interface of the monitoring and management server 1. All external data entering or leaving the cloud storage system must first pass through the filtering of the firewall 5, which prevents external attacks from damaging the cloud storage system.

The storage servers 2, standby servers 3, and cache server 6 are collectively referred to as storage-class servers. The connections between the monitoring and management server 1 and each storage server 2, each standby server 3, and the cache server 6 carry both data signals and control signals. Data signals are the file data stored in, or read from, the cloud storage system. Control signals are the signals that control the operation of each server and the status signals fed back by each storage-class server. The signals that each storage-class server feeds back to the monitoring and management server 1 include a heartbeat signal, from which the monitoring and management server 1 can obtain the health of that storage-class server.

Each standby server 3 comprises a power control module connected to the monitoring and management server 1. The power control module can put the standby server 3 into a sleep state or a wake-up state. While a standby server 3 has not been activated as a replacement, it stays in the sleep state, in which only a minimal current flows, so power consumption is low and energy is saved. When a standby server 3 is needed to take over the work of a failed server, the power control module wakes it up and supplies the current required for normal operation, ensuring that storage work proceeds normally.

When a user stores a file in the cloud storage system, the file is first held temporarily in the cache server 6 and, once it has been completely received, is sent on to the designated storage server 2 for storage. If the storage server 2 receiving the data fails during the storage process, the complete file can be stored again in the standby server 3 that replaces it, which reduces the risk of losing all or part of the file.
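The staging behaviour of the cache server can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function `store_via_cache`, the dictionary used as the cache, and the two `send` callbacks (which return `True` on success) are hypothetical names introduced here.

```python
from typing import Callable, Dict

Sender = Callable[[str, bytes], bool]  # returns True when the write succeeded


def store_via_cache(name: str, data: bytes, cache: Dict[str, bytes],
                    primary: Sender, replacement: Sender) -> str:
    """Stage a file in the cache, then forward it to a storage server."""
    cache[name] = data                       # the whole file is received first
    for target, send in (("storage", primary), ("standby", replacement)):
        if send(name, cache[name]):          # forward the complete cached copy
            del cache[name]                  # staged copy is no longer needed
            return target
    return "cached"                          # both writes failed; keep the copy
```

Because the complete file stays in the cache until a write succeeds, a failure of the designated storage server in mid-transfer does not leave a partial file as the only copy.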

A method by which the cloud storage system automatically detects and replaces failed nodes is as follows:

The automatic replacement module is divided into two parts, a monitoring server side and a data server side (data servers include standby data servers):

Data server side:

The module on the data server side has the following functions: periodically checking the running status of the data server; periodically sending heartbeat messages to the monitoring server; and sending to the monitoring server the role and tasks that this server performs.

The running status of the data server covers key information such as system and CPU temperature detection, disk array status detection, hard disk S.M.A.R.T. information detection, and network status detection.

The system temperature and CPU temperature are obtained from the sensors on the motherboard. If a temperature exceeds the configured threshold, abnormality information is sent to the monitoring server, which decides how to handle it.
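The threshold comparison can be sketched as below. How the sensor values are read is outside the scope of this sketch; the sensor names and threshold values in the usage example are assumptions, not figures from the source.

```python
from typing import Dict, List


def temperature_alerts(readings: Dict[str, float],
                       thresholds: Dict[str, float]) -> List[str]:
    """Return one message per sensor whose reading exceeds its threshold."""
    return [
        f"{sensor}: {value:.1f} C exceeds {thresholds[sensor]:.1f} C"
        for sensor, value in sorted(readings.items())
        if sensor in thresholds and value > thresholds[sensor]
    ]
```

For example, with readings `{"cpu": 91.0, "system": 40.0}` and thresholds `{"cpu": 85.0, "system": 60.0}`, only the CPU sensor produces a message, which the data server would then send to the monitoring server.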

Hard disk S.M.A.R.T. information: the hard disks in the system are checked at a configured frequency. The health of a hard disk can be judged from its S.M.A.R.T. information. When a hard disk's health is very low and it is at risk of damage, the administrator is notified to replace it.

Array status detection: each data node builds the hard disks in the system into a disk array in RAID 5 or RAID 6 mode, with a redundant hot spare disk configured in the array. In this mode the array can still work when one disk in the array fails (RAID 6 tolerates two failed disks). The system uses the hot spare to replace the failed disk and, by sending a message to the management node, notifies the administrator to replace the failed disk. The system can automatically add the new disk to the array as a hot spare. After the hot spare replaces the failed disk, the array enters degraded mode and uses its redundancy algorithm to rebuild the data of the replaced disk onto the disk that has taken its place. In this situation the administrator may be advised to reduce the load on the node, which speeds up the rebuild and reduces the risk of array failure. If other disks fail during this process, the array stops working entirely. The node is then judged to have failed, and the monitoring node starts the node replacement operation to maintain normal operation of the whole storage cluster.
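The relationship between RAID level, number of failed disks, and array state can be summarized in a short sketch. The state names and the function `array_state` are illustrative assumptions; the fault tolerance figures (one disk for RAID 5, two for RAID 6) are the standard ones stated in the text.

```python
# Number of simultaneous disk failures each RAID level tolerates.
RAID_TOLERANCE = {"raid5": 1, "raid6": 2}


def array_state(level: str, failed_disks: int) -> str:
    """Classify an array as normal, degraded (rebuilding), or failed."""
    tolerance = RAID_TOLERANCE[level]
    if failed_disks == 0:
        return "normal"
    if failed_disks <= tolerance:
        return "degraded"    # array still works; hot spare rebuild can proceed
    return "failed"          # node is judged faulty; node replacement starts
```

In this sketch a RAID 5 array with a second failure during rebuild is classified as `failed`, which corresponds to the point at which the monitoring node would start the node replacement operation.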

The role that each node server plays in the storage cluster must be saved in the monitoring server. Once a node fails, the monitoring server uses this role information to have a standby node take over the failed node's role and tasks. The role information includes how the disk array in the node is organized and in which logical volumes the node serves as a brick. Any change to this information on a data node is immediately synchronized to the monitoring node and saved.
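One way to picture the role information kept by the monitoring server is the following sketch. The `RoleInfo` fields and the `RoleRegistry` class are hypothetical; the source only states that the array organization and the brick assignments in logical volumes are stored and kept in sync.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class RoleInfo:
    raid_level: str                        # how the node's disk array is organized
    bricks: Tuple[Tuple[str, int], ...]    # (logical volume, brick index) pairs


class RoleRegistry:
    """Role information held on the monitoring server, keyed by node name."""

    def __init__(self) -> None:
        self._roles: Dict[str, RoleInfo] = {}

    def sync(self, node: str, info: RoleInfo) -> None:
        self._roles[node] = info           # called whenever a data node changes

    def role_of(self, node: str) -> Optional[RoleInfo]:
        return self._roles.get(node)

    def hand_over(self, failed: str, spare: str) -> RoleInfo:
        info = self._roles.pop(failed)     # the failed node no longer holds the role
        self._roles[spare] = info          # the standby node takes it over unchanged
        return info
```

Keeping the record on the monitoring server rather than on the data node is what allows the hand-over to proceed even when the failed node can no longer be reached.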

When a network failure occurs on a data node, the node cannot communicate with the monitoring node to report its system status. The node server raises an alarm by means such as a flashing indicator light. If the monitoring node has not received the heartbeat message of this data node server by the deadline, it considers the node to have failed and starts the node replacement procedure.

Monitoring server side:

The monitoring server receives the heartbeat messages sent by the data node servers and records the time of each heartbeat message. A heartbeat message is sent once every 2 minutes and contains information such as the running status of the data node server. The heartbeat messages of all nodes are stored together in a status file on the monitoring server.
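Heartbeat bookkeeping on the monitoring server can be sketched as follows. The 2-minute interval comes from the text; the deadline of three missed intervals (`MISSED_LIMIT`) and the in-memory table standing in for the status file are assumptions of this sketch.

```python
from typing import Dict, List

HEARTBEAT_INTERVAL_S = 120   # one heartbeat every 2 minutes, as stated in the text
MISSED_LIMIT = 3             # assumed deadline: three missed intervals


class HeartbeatTable:
    """Records the arrival time of the latest heartbeat from each data node."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, float] = {}

    def record(self, node: str, now: float) -> None:
        self._last_seen[node] = now

    def overdue(self, now: float) -> List[str]:
        """Nodes whose last heartbeat is older than the deadline."""
        limit = HEARTBEAT_INTERVAL_S * MISSED_LIMIT
        return sorted(n for n, seen in self._last_seen.items()
                      if now - seen > limit)
```

Passing the current time in as a parameter, rather than reading the clock inside the class, keeps the sketch easy to test; nodes returned by `overdue` are the candidates for the node replacement procedure.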

The monitoring server stores the role information of all servers in the cluster, which takes one of four states: normal operation, standby, fault, and role not set. In a newly built storage cluster, all data nodes send heartbeat messages to the monitoring node. The administrator needs to register the role of each node server according to the allocation plan: working server or standby server. A server set as standby enters the standby state. A working server creates the corresponding disk array from the node's disks as required and is configured into the cluster's storage logical volumes. The array information of each data node and its allocation information in the cluster's logical storage volumes are also saved to the monitoring server synchronously.
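The four role states and the administrator's registration step can be sketched like this. The enum member names and the helper functions `on_first_heartbeat` and `register` are assumptions for illustration.

```python
from enum import Enum
from typing import Dict


class NodeState(Enum):
    WORKING = "normal operation"
    STANDBY = "standby"
    FAULT = "fault"
    UNSET = "role not set"


def on_first_heartbeat(cluster: Dict[str, NodeState], node: str) -> None:
    """A node seen for the first time has no role until one is registered."""
    cluster.setdefault(node, NodeState.UNSET)


def register(cluster: Dict[str, NodeState], node: str, role: NodeState) -> None:
    """Administrator assigns a working or standby role to an unassigned node."""
    if role not in (NodeState.WORKING, NodeState.STANDBY):
        raise ValueError("only working or standby roles can be registered")
    if cluster.get(node) is not NodeState.UNSET:
        raise ValueError(f"{node} is unknown or already has a role")
    cluster[node] = role
```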

The monitoring server analyzes the heartbeat messages sent by the data nodes and, according to the status information of each data node reported in the heartbeat messages and the configured alert levels, emails a report of the cluster's status to the administrator.

If a data node is detected to have failed and cannot work, the monitoring node starts the node replacement procedure, shown in Fig. 2. The steps of the replacement procedure are as follows:

Step 1: the monitoring and management server checks the state of each storage server; when it finds that a storage server has failed, it proceeds to step 2. The storage server that has failed is the failed server;

The monitoring server reconfigures the array of the replacement server according to the disk array information of the replaced node, so that it matches the array mode of the failed node, and the replacement server takes over the role that the failed node played in the cluster's logical volumes. The cluster rebuilds the data of the replaced node according to its algorithm, restoring the data as it was stored before the failure. At the same time, the monitoring server notifies the administrator so that the failed node can be repaired promptly and a new standby server can be added, ensuring that the automatic failure replacement mechanism keeps working.

While the cloud storage system is running, state detection of the storage servers continues at all times, so that a failed server can be replaced immediately and losses and impact are kept to a minimum.

When the system detects that several storage servers have failed, it first counts the failed storage servers. It then wakes the corresponding number of standby servers and loads the storage node program on them, so that they join the system in the role of storage servers and substitute for the failed storage servers. Finally, it performs state detection on the storage servers again to confirm that they are healthy.

The basic steps are as follows:

1. The system detects storage server failures;

2. The system counts the failed storage servers;

3. The system wakes the corresponding number of standby servers and loads the storage node program on them;

4. The failed storage servers are replaced;

5. The state of the storage servers is checked again;

6. Once they are confirmed healthy, the system continues to run.
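The six steps above can be sketched for the case of several simultaneous failures. Server names are plain strings here, and the `is_healthy` callback and the decision to raise an error when there are too few standby servers are assumptions; the source does not describe that case.

```python
from typing import Callable, Dict, List


def replace_failed_servers(failed: List[str], sleeping: List[str],
                           is_healthy: Callable[[str], bool]) -> Dict[str, str]:
    """Map each failed storage server to a newly woken standby server."""
    count = len(failed)                          # step 2: count the failures
    if count > len(sleeping):
        raise RuntimeError("not enough standby servers")
    woken = sleeping[:count]                     # step 3: wake that many spares
    del sleeping[:count]                         # they leave the sleeping pool
    mapping = dict(zip(failed, woken))           # step 4: one spare per failure
    unhealthy = [s for s in woken if not is_healthy(s)]   # step 5: re-check
    if unhealthy:
        raise RuntimeError(f"replacement unhealthy: {unhealthy}")
    return mapping                               # step 6: system continues
```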

After the failed server has been replaced by a standby server, an alarm is raised through the alarm server. The alarm server comprises a wireless communication unit, through which it is freed from the constraint of cables and can raise alarms remotely. The wireless communication unit may support a mobile communication network and/or a wireless local area network (WLAN).

When the cloud storage system starts for the first time, the monitoring and management module first checks whether all storage servers, standby servers, the cache server, and the firewall are normal. Once all are normal, it puts the standby servers into the sleep state and proceeds to step 1. If any of the checked devices is not operating normally, the system enters a standby state and raises an alarm through the alarm server.
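The first-start check amounts to a simple decision, sketched below. The function name `first_start`, the device names, and the returned mode strings are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def first_start(device_ok: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Decide the system mode after the initial check of every device."""
    faulty = sorted(name for name, ok in device_ok.items() if not ok)
    if faulty:
        return ("standby", faulty)       # hold, and report these devices by alarm
    return ("monitoring", [])            # spares go to sleep; step 1 begins
```

For example, `first_start({"storage-1": True, "standby-1": True, "cache": False, "firewall": True})` returns `("standby", ["cache"])`, and the listed devices would be reported through the alarm server.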

The specific embodiment described herein merely illustrates the spirit of the present invention. Those skilled in the art may make various modifications or additions to the described embodiment, or substitute similar approaches for it, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.

Although terms such as cloud storage, standby server, and monitoring and management server are used frequently herein, the possibility of using other terms is not excluded. These terms are used only to describe and explain the essence of the present invention more conveniently; construing them as any additional limitation would be contrary to the spirit of the present invention.

Claims

1. A cloud storage system capable of automatically detecting and replacing failed nodes, comprising a storage cluster and a monitoring and management server, characterized in that it further comprises several standby servers, the storage cluster comprises several storage servers, the monitoring and management server is connected to every standby server and to every storage server in the storage cluster, and the monitoring and management server is provided with an input/output interface for communicating with the outside.

2. The cloud storage system capable of automatically detecting and replacing failed nodes according to claim 1, characterized in that it further comprises an alarm server, the alarm server being connected to the monitoring and management server.

3. The cloud storage system capable of automatically detecting and replacing failed nodes according to claim 2, characterized in that the alarm server comprises a wireless communication unit.

4. The cloud storage system capable of automatically detecting and replacing failed nodes according to claim 1 or 2, characterized in that each standby server comprises a power control module, the power control module being connected to the monitoring and management server.

5. The cloud storage system capable of automatically detecting and replacing failed nodes according to claim 3, characterized in that it further comprises a cache server, the cache server being connected to the monitoring and management server.

6. The cloud storage system capable of automatically detecting and replacing failed nodes according to claim 1, characterized in that it further comprises a firewall, the firewall being connected in series on the input/output interface of the monitoring and management server.

7. A method by which a cloud storage system automatically detects and replaces failed nodes, characterized in that it comprises the following steps: step 1, the monitoring and management server checks the state of each storage server; when it finds that a storage server has failed, it proceeds to step 2, the storage server that has failed being the failed server;

8. The method by which a cloud storage system automatically detects and replaces failed nodes according to claim 7, characterized in that, after the failed server has been replaced by a standby server, an alarm is raised through the alarm server.

9. The method by which a cloud storage system automatically detects and replaces failed nodes according to claim 7, characterized in that, when the cloud storage system starts for the first time, the monitoring and management module first checks whether all storage servers, standby servers, the cache server, and the firewall are normal; once all are normal, it puts the standby servers into the sleep state and proceeds to step 1; if any of the checked devices is not operating normally, the system enters a standby state.