CN105490847A - Real-time detecting and processing method of node failure in private cloud storage system - Google Patents

Real-time detecting and processing method of node failure in private cloud storage system

Info

Publication number
CN105490847A
Authority
CN
China
Prior art keywords
data
storage node
services
network
storage
Prior art date
Legal status
Granted
Application number
CN201510897964.5A
Other languages
Chinese (zh)
Other versions
CN105490847B (en)
Inventor
刘树发
温晋英
杨连群
王莹
宋津旭
王鹏
李翔宇
卢鑫刚
Current Assignee
TIANJIN CITY CHUZHI TECHNOLOGY Co Ltd
Original Assignee
TIANJIN CITY CHUZHI TECHNOLOGY Co Ltd
Priority date
Filing date
Publication date
Application filed by TIANJIN CITY CHUZHI TECHNOLOGY Co Ltd
Priority to CN201510897964.5A
Publication of CN105490847A
Application granted
Publication of CN105490847B
Expired - Fee Related
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668 Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Hardware Redundancy (AREA)
  • Computer And Data Communications (AREA)

Abstract

The present invention relates to a method for real-time detection and handling of node failures in a private cloud storage system, characterized in that the storage nodes are connected to one another by a data synchronization network and to a cloud computing server through a data service network; a management terminal arranged on one of the storage nodes checks the working state of all storage nodes, and each storage node also performs a self-check. The method manages the various data services in the private cloud storage system effectively: when a server breaks down, the data services are restored automatically, which simplifies operation for the user and reduces labor costs on the user side. Automatic restoration avoids the data service interruptions that individual equipment failures would otherwise cause, thereby reducing the losses incurred when application business that depends on those services is interrupted.

Description

Method for real-time detection and handling of node failures in a private cloud storage system
Technical field
The invention belongs to the technical field of fault handling in cloud storage systems, and in particular relates to a method for real-time detection and handling of node failures in a private cloud storage system.
Background art
Cloud storage is a new concept extended and developed from cloud computing. It is an emerging network storage technology in which functions such as cluster applications, network technology and distributed file systems are used to aggregate, through application software, a large number of storage devices of different types across a network into one cooperating system that jointly provides data storage and business access. The core of such a system is the combination of application software with the storage devices: the application software transforms the storage devices into storage services. Compared with a conventional storage device, a cloud storage system therefore involves not just hardware but a complex system of many parts, such as network equipment, storage devices, servers, application software and a public access interface; each part is centered on the storage devices, and together they provide data storage and business access services through the application software. In places such as schools, enterprises, government offices, information centers and data centers, dependence on data deepens by the day, and data has become the foundation on which numerous business activities rely.
A structure that provides storage services only to a limited set of users is called a private cloud storage system. It is a cloud storage service scheme customized for government departments or enterprise clients; it not only delivers closely tailored, high-quality service, but also reduces security risks to a certain degree. However, for data service faults and equipment faults, expecting users to locate the fault manually and handle it accordingly is impractical. For a private cloud storage system, how to locate and handle data service faults and equipment faults in a user-friendly way has therefore become a problem in need of a solution.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a method for real-time detection and handling of node failures in a private cloud storage system that monitors in real time and applies different handling measures to different failure conditions.
The technical solution adopted by the present invention is described in the embodiments and claims below.
The advantages and positive effects of the present invention are:
In the present invention, the storage nodes are connected by a data synchronization network and are connected to the cloud computing servers by a data service network; a management end is set up on one of the storage nodes and checks the working state of all storage nodes, while each storage node checks its own storage status, data service network state, data synchronization network state, data service state and independent IP state. Detection thus combines a whole-system view with per-node self-checks, and a handling method is provided for every state that can arise at each step. The method manages the various data services in the private cloud storage system effectively; when a server fails, the data services are restored automatically, which simplifies operation for the user and reduces labor costs on the user side. Automatic restoration avoids the data service interruptions that individual equipment failures would otherwise cause, and thereby reduces the losses incurred when application business that depends on those services is interrupted.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the present invention.
Detailed description
The present invention is further described below with reference to embodiments. The following embodiments are illustrative, not restrictive, and do not limit the protection scope of the present invention.
A method for real-time detection and handling of node failures in a private cloud storage system, as shown in Fig. 1. The innovation of the present invention lies in the following: the system comprises multiple storage nodes capable of providing various data services and multiple cloud computing servers; internal data exchange between the storage nodes is carried out over a data synchronization network; the storage nodes provide data services to the cloud computing servers over a data service network; a management end is set up on one of the storage nodes. The method comprises an initialization procedure, a management-end detection and handling procedure, and a storage-node detection and handling procedure.
The initialization procedure comprises the following steps (a sketch of the resulting configuration follows this list):
(1) the management end saves in advance the storage configuration, network configuration and data-service configuration of all storage nodes;
(2) each storage node saves only its own storage configuration, network configuration and data-service configuration;
(3) for each data service, any two storage nodes are selected to mirror each other, and an independent IP address is assigned;
(4) the detection intervals of the management end and the storage nodes are set.
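To make the initialization concrete, here is a minimal data-model sketch in Python. All names (ServiceConfig, NodeConfig, MANAGEMENT_VIEW) and the 30-second interval are illustrative assumptions, not taken from the patent.

```python
# Hypothetical data model for the initialization steps above.
from dataclasses import dataclass, field

@dataclass
class ServiceConfig:
    name: str             # data service identifier
    primary_node: str     # storage node originally hosting the service
    mirror_node: str      # node mirroring the service (step 3)
    independent_ip: str   # independent IP address assigned to the service

@dataclass
class NodeConfig:
    node_id: str
    storage_config: dict  # this node's storage configuration
    network_config: dict  # this node's network configuration
    services: list[ServiceConfig] = field(default_factory=list)

# Step (1): the management end holds the configuration of all nodes;
# step (2): each node would hold only its own NodeConfig;
# step (4): the detection interval is set here (30 s is an assumed value).
MANAGEMENT_VIEW = {
    "detect_interval_s": 30,
    "nodes": {},      # node_id -> NodeConfig, for every storage node
    "services": {},   # service name -> ServiceConfig (mirror pair + IP)
}
```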
The management-end detection and handling procedure comprises the following steps:
(1) the connection status of each storage node is checked automatically in turn at the set detection interval;
(2) when a storage node does not respond, it is marked unavailable, which indicates that its host machine is down or its network connection is broken; the data services originally configured on that node are provided instead by the corresponding mirrored storage nodes; proceed to step (4); all data services of the node may be configured on one other storage node, the two mirroring each other, or distributed across several other storage nodes, each service mirrored on another node;
When a storage node responds normally, proceed to the next step;
Whether a node responds is determined either by pinging the storage node directly or by checking whether the corresponding program on the node is running normally; the corresponding program is the one that performs the state checks in the storage-node detection and handling procedure described below;
(3) the storage status of the node is obtained; the storage status here refers to the feedback records collected by the management end itself, which originate from the reports a storage node sends to the management end when it encounters abnormal states during its own detection and handling procedure;
When the storage status is abnormal, the node is marked unavailable, the data services on it are stopped, and the data services originally configured on it are provided instead by the corresponding mirrored storage nodes; at the same time, proceed to step (4); the same mirroring options as in step (2) apply;
(4) continue by detecting the next storage node, until all storage nodes have been checked;
(5) after the management end records that a storage node is unavailable, it notifies the system administrator by mail or another known means; the administrator may attempt recovery or contact technicians, and once the node is restored to an available state, the data services on it are started; self-recovery may mean restarting the equipment of a downed node or, when the network connection is broken, checking the network cable, the network card, or network devices such as switches and routers. A sketch of this detection loop follows.
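The following Python sketch shows one way the management-end loop could be implemented under the assumptions above. ping() shells out to the system ping command; notify_admin, failover_to_mirror and stop_services are illustrative stubs, not anything specified by the patent.

```python
import subprocess
import time

def notify_admin(message: str) -> None:
    """Stand-in for the mail notification of step (5)."""
    print(f"[mail to admin] {message}")

def failover_to_mirror(node_id: str, service: str) -> None:
    """Stub: the mirrored storage node takes over this service."""
    print(f"start service '{service}' on the mirror of {node_id}")

def stop_services(node_id: str) -> None:
    """Stub: stop every data service on the given node."""
    print(f"stop all services on {node_id}")

def ping(host: str) -> bool:
    """Step (2): one ICMP echo decides whether the node responds."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def management_loop(nodes: list, storage_feedback: dict, interval_s: int = 30):
    while True:
        for node in nodes:                              # step (1): poll in turn
            if not ping(node["address"]):               # step (2): no response
                node["available"] = False
                for svc in node["services"]:
                    failover_to_mirror(node["id"], svc)
                notify_admin(f"{node['id']} unavailable: down or unreachable")
            elif storage_feedback.get(node["id"]) == "abnormal":  # step (3)
                node["available"] = False
                stop_services(node["id"])
                for svc in node["services"]:
                    failover_to_mirror(node["id"], svc)
                notify_admin(f"{node['id']} reports abnormal storage")
        time.sleep(interval_s)  # wait for the next detection round
```

The design point visible here is that the management end never repairs a node itself: it only marks the node unavailable, moves its services to the mirrors, and escalates to a human, as in step (5).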
The storage-node detection and handling procedure comprises the following steps:
(1) the storage status of this node is checked at the set detection interval;
(2) when the storage device of this node does not respond, information about the storage device is fed back to the management end, and this detection round ends;
The storage device may be an ordinary hard disk, a disk array, or other equipment used to store data;
When the storage device responds normally, proceed to step (3);
A non-response falls into one of three cases (a code sketch follows this list):
(1) the disk label is missing when the system scans: the system attempts to remount the storage (the system carries a remount program which, when run, tries to reconnect the storage device) and feeds back to the management end if the remount fails;
(2) disk fault: fed back to the management end directly;
(3) inconsistent partitions: fed back to the management end directly.
Partition inconsistency covers two cases:
1. the storage device does not respond or its partition has been deleted;
2. the partition has been modified.
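A sketch of how the three non-response cases might be classified on a Linux storage node follows. The /dev/disk/by-label path, the mount -a remount attempt, and the input flags are assumptions for illustration, not details given by the patent.

```python
import os
import subprocess

def classify_storage_state(label: str, disk_healthy: bool,
                           seen_parts: list, expected_parts: list) -> str:
    """Classify the non-response cases of step (2)."""
    # Case (1): the disk label is missing on scan -> try one remount.
    if not os.path.exists(f"/dev/disk/by-label/{label}"):
        if subprocess.call(["mount", "-a"]) == 0:
            return "ok"                        # remount reconnected the device
        return "feedback: remount failed"      # report to the management end
    # Case (2): disk fault -> report directly.
    if not disk_healthy:
        return "feedback: disk fault"
    # Case (3): inconsistent partitions -> report directly
    # (covers a deleted partition as well as a modified one).
    if seen_parts != expected_parts:
        return "feedback: partitions inconsistent"
    return "ok"
```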
(3) the data service network of this node is checked; when the data service network is disconnected, all data services on this node are suspended, and this detection round ends;
When the data service network of this node is normal, proceed to step (4);
The data service network is considered disconnected when several preset storage nodes are probed over it and none of them can be reached.
(4) the data synchronization network of this node is checked; when the data synchronization network is disconnected, this detection and handling procedure ends immediately without any further action;
When the data synchronization network is normal, proceed to step (5);
The data synchronization network is likewise considered disconnected when several preset storage nodes are probed over it and none of them can be reached; a sketch of this reachability test follows.
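Both network checks reduce to the same reachability test: probe a preset list of storage nodes and treat the network as disconnected only if none answers. A minimal sketch, assuming a TCP connect on an arbitrary port (22 here) stands in for the probe:

```python
import socket

def network_up(probe_hosts: list, port: int = 22, timeout: float = 2.0) -> bool:
    """Treat the network as up if any preset storage node is reachable."""
    for host in probe_hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True      # one reachable peer is enough
        except OSError:
            continue             # this peer unreachable, try the next
    return False                 # none reachable: consider disconnected
```

Requiring every probe target to fail distinguishes "my own link is down" from "one peer is down", which is exactly the distinction steps (3) and (4) need.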
(5) the state of a data service on this node is checked; when the data service is in a disabled state (the disabled state covers: 1. the data service is set as unused, its data being abandoned legacy data, or the service was shut down normally; 2. the non-response state referred to in step (2)), proceed to step (7);
When the data service is found stopped, the state of the mirrored copy of the service is checked: if the mirrored copy is running, proceed to step (7); if the mirrored copy is not running, start the data service on this node and proceed to step (6);
When the data service is running, proceed to step (6);
(6) the independent IP of this data service is checked; when the independent IP has been lost, it is restored, then proceed to the next step;
When the independent IP is normal, proceed to the next step;
(7) return to step (5) to check the next data service, until the data services of all storage devices have been checked; steps (5) to (7) are sketched in code below.
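The per-service logic of steps (5) to (7) amounts to a small state machine. The sketch below assumes hypothetical helpers (service_state, start_service, ip_configured, restore_ip) and distinguishes a deliberately disabled service from one that merely stopped.

```python
def service_state(node: str, service: str) -> str:
    """Stub: would query the real service; returns disabled/stopped/running."""
    return "running"

def start_service(node: str, service: str) -> None:
    print(f"start service '{service}' on {node}")

def ip_configured(ip: str) -> bool:
    """Stub: would check whether the address is up on an interface."""
    return True

def restore_ip(ip: str) -> None:
    print(f"re-add independent IP {ip}")

def check_data_services(node: str, services: list) -> None:
    for svc in services:                                  # step (7): iterate
        state = service_state(node, svc["name"])
        if state == "disabled":                           # step (5): deliberate
            continue
        if state == "stopped":
            if service_state(svc["mirror_node"], svc["name"]) == "running":
                continue                                  # mirror already serves
            start_service(node, svc["name"])              # take over from mirror
        if not ip_configured(svc["independent_ip"]):      # step (6)
            restore_ip(svc["independent_ip"])
```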
Embodiment 1
In a school laboratory the number of servers is limited: only two serve as storage nodes, and large-capacity hard disks installed directly in the servers are used as the storage media.
In this situation the management end is set up on one of the two storage nodes. If the storage node that is currently providing a data service goes down while the other storage node keeps running, the storage-node part of the detection proceeds as follows:
The downed storage node cannot run and therefore cannot perform any detection.
The normal storage node detects in turn that its storage device responds, its data service network is normal and its data synchronization network is normal. When it detects that a data service is in the stopped state, it checks the state of that service on the mirrored storage node; because that node is down, the mirrored service is not running, so this node starts the data service itself, ensuring that the service continues to be provided.
If the management-end program is running on the normal node, it also marks the downed storage node as unavailable and notifies the system administrator.
Embodiment 2
In a data center, dedicated storage devices are used as the storage media and are connected to the servers. If the connection between a storage device and its storage node fails, the storage-node part of the detection proceeds as follows:
On the faulty node, the storage device is detected as non-responsive; this is fed back to the management end, which stops all data services on that node.
The normal storage node detects in turn that its storage device responds, its data service network is normal and its data synchronization network is normal. When it detects that a data service is in the stopped state, it checks the state of that service on the mirrored storage node; because the service there has been stopped, this node starts the data service itself, ensuring that the service continues to be provided.

Claims (5)

1. A method for real-time detection and handling of node failures in a private cloud storage system, characterized in that: the system comprises multiple storage nodes capable of providing various data services and multiple cloud computing servers; internal data exchange between the storage nodes is carried out over a data synchronization network; the storage nodes provide data services to the cloud computing servers over a data service network; a management end is set up on one of the storage nodes; the method comprises an initialization procedure, a management-end detection and handling procedure, and a storage-node detection and handling procedure;
The initialization procedure comprises the following steps:
(1) the management end saves in advance the storage configuration, network configuration and data-service configuration of all storage nodes;
(2) each storage node saves only its own storage configuration, network configuration and data-service configuration;
(3) for each data service, any two storage nodes are selected to mirror each other, and an independent IP address is assigned;
(4) the detection intervals of the management end and the storage nodes are set;
The management-end detection and handling procedure comprises the following steps:
(1) the connection status of each storage node is checked automatically in turn at the set detection interval;
(2) when a storage node does not respond, it is marked unavailable, which indicates that its host machine is down or its network connection is broken; the data services originally configured on that node are provided instead by the corresponding mirrored storage nodes; proceed to step (4); all data services of the node may be configured on one other storage node, the two mirroring each other, or distributed across several other storage nodes, each service mirrored on another node;
When a storage node responds normally, proceed to the next step;
(3) the storage status of the node is obtained; the storage status here refers to the feedback records collected by the management end itself, which originate from the reports a storage node sends to the management end when it encounters abnormal states during its own detection and handling procedure;
When the storage status is abnormal, the node is marked unavailable, the data services on it are stopped, and the data services originally configured on it are provided instead by the corresponding mirrored storage nodes; at the same time, proceed to step (4); all data services of the node may be configured on one other storage node, the two mirroring each other, or distributed across several other storage nodes, each service mirrored on another node;
(4) continue by detecting the next storage node, until all storage nodes have been checked;
(5) after the management end records that a storage node is unavailable, it notifies the system administrator by mail or another known means; the administrator may attempt recovery or contact technicians, and once the node is restored to an available state, the data services on it are started; self-recovery may mean restarting the equipment of a downed node or, when the network connection is broken, checking the network cable, the network card, or network devices such as switches and routers;
The storage-node detection and handling procedure comprises the following steps:
(1) the storage status of this node is checked at the set detection interval;
(2) when the storage device of this node does not respond, information about the storage device is fed back to the management end, and this detection round ends; the storage device may be an ordinary hard disk, a disk array, or other equipment used to store data;
When the storage device responds normally, proceed to step (3);
(3) the data service network of this node is checked; when the data service network is disconnected, all data services on this node are suspended, and this detection round ends;
When the data service network of this node is normal, proceed to step (4);
(4) the data synchronization network of this node is checked; when the data synchronization network is disconnected, this detection and handling procedure ends immediately without any further action;
When the data synchronization network is normal, proceed to step (5);
(5) the state of a data service on this node is checked; when the data service is in a disabled state (the disabled state covers: 1. the data service is set as unused, its data being abandoned legacy data, or the service was shut down normally; 2. the non-response state referred to in step (2)), proceed to step (7);
When the data service is found stopped, the state of the mirrored copy of the service is checked: if the mirrored copy is running, proceed to step (7); if the mirrored copy is not running, start the data service on this node and proceed to step (6);
When the data service is running, proceed to step (6);
(6) the independent IP of this data service is checked; when the independent IP has been lost, it is restored, then proceed to the next step;
When the independent IP is normal, proceed to the next step;
(7) return to step (5) to check the next data service, until the data services of all storage devices have been checked.
2. The method for real-time detection and handling of node failures in a private cloud storage system according to claim 1, characterized in that: in the management-end detection and handling procedure, whether a node responds in step (2) is determined either by pinging the storage node directly or by checking whether the corresponding program on the node is running normally.
3. The method for real-time detection and handling of node failures in a private cloud storage system according to claim 1, characterized in that: in the storage-node detection and handling procedure, a non-response in step (2) falls into one of three cases:
(1) the disk label is missing when the system scans: the system attempts to remount the storage and feeds back to the management end if the remount fails;
(2) disk fault: fed back to the management end directly;
(3) inconsistent partitions: fed back to the management end directly;
Partition inconsistency covers two cases:
1. the storage device does not respond or its partition has been deleted;
2. the partition has been modified.
4. The method for real-time detection and handling of node failures in a private cloud storage system according to claim 1, characterized in that: in the storage-node detection and handling procedure, the data service network in step (3) is considered disconnected when several preset storage nodes are probed over the data service network and none of them can be reached.
5. The method for real-time detection and handling of node failures in a private cloud storage system according to claim 1, characterized in that: in the storage-node detection and handling procedure, the data synchronization network in step (4) is considered disconnected when several preset storage nodes are probed over the data synchronization network and none of them can be reached.
CN201510897964.5A 2015-12-08 2015-12-08 Method for real-time detection and handling of node failures in a private cloud storage system Expired - Fee Related CN105490847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510897964.5A CN105490847B (en) 2015-12-08 2015-12-08 Method for real-time detection and handling of node failures in a private cloud storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510897964.5A CN105490847B (en) 2015-12-08 2015-12-08 Method for real-time detection and handling of node failures in a private cloud storage system

Publications (2)

Publication Number Publication Date
CN105490847A true CN105490847A (en) 2016-04-13
CN105490847B CN105490847B (en) 2019-03-29

Family

ID=55677591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510897964.5A Expired - Fee Related CN105490847B (en) 2015-12-08 2015-12-08 A kind of private cloud storage system interior joint failure real-time detection and processing method

Country Status (1)

Country Link
CN (1) CN105490847B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106331642A (en) * 2016-08-31 2017-01-11 浙江大华技术股份有限公司 Method and device for processing data in video cloud system
WO2018214887A1 (en) * 2017-05-23 2018-11-29 杭州海康威视数字技术股份有限公司 Data storage method, storage server, storage medium and system
CN109361777A (en) * 2018-12-18 2019-02-19 广东浪潮大数据研究有限公司 Synchronous method, synchronization system and the relevant apparatus of distributed type assemblies node state
CN111866054A (en) * 2019-12-16 2020-10-30 北京小桔科技有限公司 Cloud host building method and device, electronic equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1529426A (en) * 2003-10-10 2004-09-15 清华大学 SAN dual-node mirroring method and system based on FCP protocol
CN101022363A (en) * 2007-03-23 2007-08-22 杭州华为三康技术有限公司 Network storage equipment fault protecting method and device
KR101280754B1 (en) * 2010-01-04 2013-07-05 아바야 인코포레이티드 Packet mirroring between primary and secondary virtualized software images for improved system failover performance
CN103354503A (en) * 2013-05-23 2013-10-16 浙江闪龙科技有限公司 Cloud storage system capable of automatically detecting and replacing failure nodes and method thereof
CN103685481A (en) * 2013-11-29 2014-03-26 深圳市安云信息科技有限公司 Cloud storage clustering system and cloud storage method
CN104699566A (en) * 2013-12-16 2015-06-10 杭州海康威视数字技术股份有限公司 Data redundant backup method, data redundant backup system and storage node server based on cloud storage

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1529426A (en) * 2003-10-10 2004-09-15 清华大学 SAN dual-node mirroring method and system based on FCP protocol
CN101022363A (en) * 2007-03-23 2007-08-22 杭州华为三康技术有限公司 Network storage equipment fault protecting method and device
KR101280754B1 (en) * 2010-01-04 2013-07-05 아바야 인코포레이티드 Packet mirroring between primary and secondary virtualized software images for improved system failover performance
CN103354503A (en) * 2013-05-23 2013-10-16 浙江闪龙科技有限公司 Cloud storage system capable of automatically detecting and replacing failure nodes and method thereof
CN103685481A (en) * 2013-11-29 2014-03-26 深圳市安云信息科技有限公司 Cloud storage clustering system and cloud storage method
CN104699566A (en) * 2013-12-16 2015-06-10 杭州海康威视数字技术股份有限公司 Data redundant backup method, data redundant backup system and storage node server based on cloud storage

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106331642A (en) * 2016-08-31 2017-01-11 浙江大华技术股份有限公司 Method and device for processing data in video cloud system
CN106331642B (en) * 2016-08-31 2020-05-26 浙江大华技术股份有限公司 Data processing method and device in video cloud system
WO2018214887A1 (en) * 2017-05-23 2018-11-29 杭州海康威视数字技术股份有限公司 Data storage method, storage server, storage medium and system
CN108933798A (en) * 2017-05-23 2018-12-04 杭州海康威视数字技术股份有限公司 Data storage method, storage server and system
US11218541B2 (en) 2017-05-23 2022-01-04 Hangzhou Hikvision Digital Technology Co., Ltd. Data storage method, storage server, and storage medium and system
CN109361777A (en) * 2018-12-18 2019-02-19 广东浪潮大数据研究有限公司 Synchronous method, synchronization system and the relevant apparatus of distributed type assemblies node state
CN109361777B (en) * 2018-12-18 2021-08-10 广东浪潮大数据研究有限公司 Synchronization method, synchronization system and related device for distributed cluster node states
CN111866054A (en) * 2019-12-16 2020-10-30 北京小桔科技有限公司 Cloud host building method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN105490847B (en) 2019-03-29

Similar Documents

Publication Publication Date Title
CN107465721B (en) Global load balancing method and system based on double-active architecture and scheduling server
US7278055B2 (en) System and method for virtual router failover in a network routing system
US6952766B2 (en) Automated node restart in clustered computer system
CN102710457B (en) A kind of N+1 backup method of cross-network segment and device
CN105302661A (en) System and method for implementing virtualization management platform high availability
US20030158933A1 (en) Failover clustering based on input/output processors
CN102394914A (en) Cluster brain-split processing method and device
GB2407887A (en) Automatically modifying fail-over configuration of back-up devices
JP2005301975A (en) Heartbeat apparatus via remote mirroring link on multi-site and its use method
CN111176888B (en) Disaster recovery method, device and system for cloud storage
CN103729280A (en) High availability mechanism for virtual machine
CN111949444A (en) Data backup and recovery system and method based on distributed service cluster
CN109286529A (en) A kind of method and system for restoring RabbitMQ network partition
CN105490847A (en) Real-time detecting and processing method of node failure in private cloud storage system
CN103780417A (en) Database failure transfer method based on cloud hard disk and device thereof
CN110333986B (en) Method for guaranteeing availability of redis cluster
CN110677282B (en) Hot backup method of distributed system and distributed system
CN106850255A (en) A kind of implementation method of multi-computer back-up
CN104506372A (en) Method and system for realizing host-backup server switching
CN110971662A (en) Two-node high-availability implementation method and device based on Ceph
CN115878384A (en) Distributed cluster based on backup disaster recovery system and construction method
US10721135B1 (en) Edge computing system for monitoring and maintaining data center operations
CN109189854B (en) Method and node equipment for providing continuous service
JPH0728667A (en) Fault-tolerant computer system
CN111767166A (en) Data backup method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190329

Termination date: 20191208